Making proteomics data accessible and reusable: Current state of ...

20
930 Proteomics 2015, 15, 930–949 DOI 10.1002/pmic.201400302 REVIEW Making proteomics data accessible and reusable: Current state of proteomics databases and repositories Yasset Perez-Riverol, Emanuele Alpi, Rui Wang, Henning Hermjakob and Juan Antonio Vizca´ ıno European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK Received: July 1, 2014 Revised: August 6, 2014 Accepted: August 22, 2014 Compared to other data-intensive disciplines such as genomics, public deposition and storage of MS-based proteomics, data are still less developed due to, among other reasons, the inherent complexity of the data and the variety of data types and experimental workflows. In order to address this need, several public repositories for MS proteomics experiments have been developed, each with different purposes in mind. The most established resources are the Global Proteome Machine Database (GPMDB), PeptideAtlas, and the PRIDE database. Additionally, there are other useful (in many cases recently developed) resources such as ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), Chorus, MaxQB, PeptideAtlas SRM Experiment Library (PASSEL), Model Organism Protein Expression Database (MOPED), and the Human Proteinpedia. In addition, the ProteomeXchange consortium has been recently developed to enable better integration of public repositories and the coordinated sharing of proteomics information, maximizing its benefit to the scientific community. Here, we will review each of the major proteomics resources independently and some tools that enable the integration, mining and reuse of the data. We will also discuss some of the major challenges and current pitfalls in the integration and sharing of the data. Keywords: Bioinformatics / Databases / MS / Repositories Additional supporting information may be found in the online version of this article at the publisher’s web-site Correspondence: Dr. Juan Antonio Vizca´ ıno, European Molecu- lar Biology Laboratory, European Bioinformatics Institute (EMBL- EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK E-mail: [email protected] Fax: +44 1223 494 484 Abbreviations: C-HPP, chromosome-based Human Proteome Project; COPaKB, Cardiac Organellar Protein Atlas Knowledge- base; FDR, false discovery rate; HPM, human proteome map; HPRD, human protein reference database; MassIVE, Mass Spec- trometry Interactive Virtual Environment; MOPED, Model Or- ganism Protein Expression Database; NCBI, National Center for Biotechnology Information; PASSEL, PeptideAtlas SRM Experi- ment Library; PSM, peptide spectrum match; PX, ProteomeX- change; TIQAM, targeted identification for quantitative analysis by MRM; TPP, trans-proteomic pipeline 1 Introduction In the age of systems biology and data integration, proteomics data represent a crucial component to understand the “whole picture” of life. Proteomics technologies—particularly MS- based protein identification and quantification approaches— have matured immensely through cumulative advances in high-throughput analytical methodologies [1–5], sample preparation [6], improved instrumentation [7], and the avail- ability of protein sequence databases [8] and computational analysis tools [1, 9]. Therefore, with the development of more powerful and sensitive analytical methods and instrumenta- tion, the identification and quantification of a high proportion of the expressed proteins in a given condition is now achiev- able in an average experiment [10, 11]. In parallel, as a result, Colour Online: See the article online to view Figs. 1–3 in colour. C 2014 The Authors. PROTEOMICS published by Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim. www.proteomics-journal.com This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Transcript of Making proteomics data accessible and reusable: Current state of ...

Page 1: Making proteomics data accessible and reusable: Current state of ...

930 Proteomics 2015 15 930ndash949DOI 101002pmic201400302

REVIEW

Making proteomics data accessible and reusable

Current state of proteomics databases and repositories

Yasset Perez-Riverol Emanuele Alpi Rui Wang Henning Hermjakoband Juan Antonio Vizcaıno

European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) Wellcome Trust GenomeCampus Hinxton Cambridge UK

Received July 1 2014Revised August 6 2014

Accepted August 22 2014

Compared to other data-intensive disciplines such as genomics public deposition and storageof MS-based proteomics data are still less developed due to among other reasons the inherentcomplexity of the data and the variety of data types and experimental workflows In orderto address this need several public repositories for MS proteomics experiments have beendeveloped each with different purposes in mind The most established resources are the GlobalProteome Machine Database (GPMDB) PeptideAtlas and the PRIDE database Additionallythere are other useful (in many cases recently developed) resources such as ProteomicsDB MassSpectrometry Interactive Virtual Environment (MassIVE) Chorus MaxQB PeptideAtlas SRMExperiment Library (PASSEL) Model Organism Protein Expression Database (MOPED) andthe Human Proteinpedia In addition the ProteomeXchange consortium has been recentlydeveloped to enable better integration of public repositories and the coordinated sharing ofproteomics information maximizing its benefit to the scientific community Here we willreview each of the major proteomics resources independently and some tools that enable theintegration mining and reuse of the data We will also discuss some of the major challengesand current pitfalls in the integration and sharing of the data

Keywords

Bioinformatics Databases MS Repositories

Additional supporting information may be found in the online version of this article atthe publisherrsquos web-site

Correspondence Dr Juan Antonio Vizcaıno European Molecu-lar Biology Laboratory European Bioinformatics Institute (EMBL-EBI) Wellcome Trust Genome Campus Hinxton CambridgeCB10 1SD UKE-mail juanebiacukFax +44 1223 494 484

Abbreviations C-HPP chromosome-based Human ProteomeProject COPaKB Cardiac Organellar Protein Atlas Knowledge-base FDR false discovery rate HPM human proteome mapHPRD human protein reference database MassIVE Mass Spec-trometry Interactive Virtual Environment MOPED Model Or-ganism Protein Expression Database NCBI National Center forBiotechnology Information PASSEL PeptideAtlas SRM Experi-ment Library PSM peptide spectrum match PX ProteomeX-change TIQAM targeted identification for quantitative analysisby MRM TPP trans-proteomic pipeline

1 Introduction

In the age of systems biology and data integration proteomicsdata represent a crucial component to understand the ldquowholepicturerdquo of life Proteomics technologiesmdashparticularly MS-based protein identification and quantification approachesmdashhave matured immensely through cumulative advancesin high-throughput analytical methodologies [1ndash5] samplepreparation [6] improved instrumentation [7] and the avail-ability of protein sequence databases [8] and computationalanalysis tools [19] Therefore with the development of morepowerful and sensitive analytical methods and instrumenta-tion the identification and quantification of a high proportionof the expressed proteins in a given condition is now achiev-able in an average experiment [1011] In parallel as a result

Colour Online See the article online to view Figs 1ndash3 in colour

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

This is an open access article under the terms of the Creative Commons Attribution License which permits use distribution and reproduction inany medium provided the original work is properly cited

Proteomics 2015 15 930ndash949 931

the size of data produced in proteomics laboratories has in-creased by several orders of magnitude [12]

Compared to other data-intensive fields such as genomicsdeposition and storage of original proteomics data in publicresources have been less common [13] This is regrettablesince proteome studies are usually more complex than itscounterpart genomics ones In fact data interpretation in pro-teomics can be considerably more complex than in genomicsdue to the wide variety of analytical approaches [14 15]bioinformatics tools and pipelines [16 17] and the relatedstatistical analysis [1819] However thanks to the guidelinespromoted by several scientific journals and funding agencies[20] there is a growing consensus in the community about theneed for the public dissemination of proteomics data whichis already facilitating the assessment reuse comparativeanalyses and extraction of new findings from published data[13 21]

The complexity of proteomics data is heightened by alter-native splicing PTMs and protein degradation events andis further amplified by the interconnectivity of proteins intocomplexes and signaling networks that are highly divergent intime and space [1] In order to address this complexity newanalytical and bioinformatics methodologies are developedevery year [22 23] which complicate the data standardiza-tion and deposition Additionally the audience interested inproteomics data is very heterogeneous It includes biologistselucidating the mechanisms of regulation of specific proteinsMS researchers improving the current analytical methods orcomputational biologists developing new software tools forthe analysis and interpretation of the data [24]

Data sharing in proteomics requires substantial invest-ment and infrastructure Several public repositories havebeen developed each with different purposes in mind Well-established databases for proteomics data are the Global Pro-teome Machine Database (GPMDB) [25] PeptideAtlas [26]and the PRIDE database [27] Additionally at present thereare other resources (many of them recently developed) suchas ProteomicsDB [28] MassIVE (Mass Spectrometry Inter-active Virtual Environment) Chorus MaxQB [29] PASSEL(PeptideAtlas SRM Experiment Library) [30] MOPED (ModelOrganism Protein Expression Database) [31] PaxDb [32]Human Proteinpedia [33] and the human proteome map(HPM) [34] Furthermore there are several more specializedresources that will only be cited briefly in this review It isimportant to mention here that no single proteomics dataresource will be ideally suited to all possible use cases andall potential users Regrettably two widely used resourceswere discontinued due to lack of funding Peptidome [35]and Tranche [36] This had a negative impact on the effortspromoting data sharing in the field as it was perceived by thecommunity that effort invested in data sharing was lost

Recently the ProteomeXchange (PX) consortium [37] hasbeen formed to enable a better integration of public repos-itories maximizing its benefits to the scientific commu-nity through the implementation of standardized submissionand dissemination pipelines for proteomics information By

Figure 1 Hierarchy of proteomics data repositories anddatabases according to the different data types stored raw MSdata repositories resources that store peptideprotein identifi-cation and quantification results and protein knowledge basesSome resources are duplicated in different levels because theycan be included in more than one category

August 2014 PRIDE PeptideAtlas PASSEL and MassIVEare the active members of the consortium

The aim of this review is to provide an up-to-date overviewof the current state of proteomics data repositories anddatabases providing a solid starting point for those who wantto perform data submission andor data mining There are afew comparable reviews available in the literature [2438ndash41]but there is a need for an update since this has been quite adynamic field over the past few years In this manuscript wewill not include a thorough review about protein knowledgebases such as the Universal Protein resource (UniProt) [42]and neXtProt [43] but we will explain how MS proteomicsinformation is made available in these resources

2 Organization of proteomics repositoriesand databases

The information generated in a typical proteomics experi-ment can be organized in three different levels [44] (i) rawdata (ii) processed results including peptideprotein identi-fication and quantification values and (iii) the resulting bio-logical conclusions Technical andor biological metadata canbe provided for each level independently

These three categories enable the classification of the ex-isting MS proteomics repositories according to their level ofspecialization (Fig 1) In our view these three levels of infor-mation should be captured and properly annotated in publicdatabases and repositories ideally using data standards whenavailable In fact the development of proteomics resources ismore feasible due the maturity of some data standard formats[45ndash47] and open source tools [9] which facilitate public datadeposition

Some of the first open formats developed includedmzXML (for MS data) [48] pepXML and protXML (for

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

932 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

peptideprotein identifications) that were developed as part ofthe trans-proteomic pipeline (TPP) [49] In the context of theProteomics Standards Initiative (PSI) several standard dataformats have been developed over the last few years whichreflect the variety in data types within the field and thereforethose that can be supported by proteomics resources Themain formats developed have been mzML (for MS data) [45]mzIdentML (for peptide and protein identification results)[46] mzQuantML (for capturing a detailed trace of each stageof quantitative analysis) [47] TraML (for representing inputtransitions in SRM approaches) [50] and the recent mzTab(for capturing a simpler summary of the final results) [51]The aims and functionalities of the existing resources willbe explored in detail in the following sections (Table 1)

3 Resources

31 The PX consortium

The PX consortium [37] (httpwwwproteomexchangeorg)was created to promote the collaboration and integration ofmajor stakeholders in the domain of MS proteomics repos-itories comprising among others primary (PRIDE [27] andPASSEL [30]) and secondary resources (PeptideAtlas) pro-teomics researchers and representatives from journals reg-ularly publishing proteomics data Recently in June 2014MassIVE joined the consortium The aim of PX is to pro-vide a common framework for the cooperation of proteomicsresources by defining and implementing consistent harmo-nized user-friendly data deposition and dissemination pro-cedures In addition another important goal is to enable andprovide ldquomutual backuprdquo if one of the resources has fundingissues

The consortiumrsquos members have agreed in providing a suf-ficient set of common experimental and technical metadataThis information is stored using the PX XML format [37]Finally all the submitted datasets get a unique and universalidentifier (PXD identifier)

311 Data submission and format support

By August 2014 two major workflows are fully supportedin PX MSMS and SRM approaches In the first stable im-plementation of the data workflow PRIDE acts as the ini-tial submission point for MSMS data whereas PASSEL hasthe equivalent role for SRM data Both workflows will be ex-plained in detail below At the moment of writing MassIVEhas just joined PX aiming to have an equivalent role to PRIDE

There are two different PX MSMS submission modesldquoCompleterdquo and ldquoPartialrdquo To perform a ldquocompleterdquo submis-sion means that after all the files have been submitted itis possible for the receiving repository to connect directlythe processed identification results with the mass spectraThis can be achieved if the processed identification results

are available in a format supported by the receiving reposi-tory (eg mzIdentML) and if peak list files are included inthe submission ldquoCompleterdquo submissions get a Digital ObjectIdentifier (DOI) to facilitate its traceability

On the other hand after performing a ldquopartialrdquo submissionthe connection between the spectra and the identification re-sults cannot be done in a straightforward way In this case theprocessed results are not available in a supported format bythe receiving repository and the corresponding search engineoutput files (in heterogeneous formats) are made available fordownload For both types of submissions metadata and rawdata are always stored for each dataset

Although ldquopartialrdquo submissions are searchable by theirmetadata peptide and protein identifications cannot be cap-tured by the receiving repository which decreases the abilityof reviewers to check the data and can make data reuse bythird parties challenging For instance ldquopartialrdquo submissionsdo not qualify for the requirements on spectra annotationfrom the journal MCP (Molecular and Cellular ProteomicsmdashhttpwwwmcponlineorgsitemiscParisReport_Finalxhtml)

Finally it needs to be highlighted that all PX memberssupport private review of the data during the manuscriptreview process The submitted data remains private beforemanuscript publication and login details are provided to fa-cilitate access for reviewers and journal editors during themanuscript review process

312 Data mining and visualization

ProteomeCentral (httpproteomecentralproteomexchangeorg) is the centralized portal for accessing all PX datasetsindependently from the original resource where data werestored ProteomeCentral provides the ability to search meta-data associated with datasets in the participating repositories(PRIDE and MassIVE for MSMS data PASSEL for SRMdata or reprocessed original PX datasets in PeptideAtlas) Itis then possible to query the archive and identify datasets ofinterest using biological and technical metadata keywordstags or publication information To monitor the releaseof new public PX datasets researchers can subscribe to aRich Site Summary feed (httpgroupsgooglecomgroupproteomexchangefeedrss_v2_0_msgsxml) Next the maincharacteristics of the individual members of the consortiumwill be explained

32 PRIDE

The PRIDE database [27] (httpwwwebiacukpride) wasinitially developed at the European Bioinformatics Institute(EBI Cambridge UK) to store the experimental data includedin publications supporting the manuscript review processThe main data types stored in PRIDE are peptideproteinidentifications (including PTMs) peptideprotein expression

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 933

Ta

ble

1

Mai

nch

arac

teri

stic

so

fth

em

ajo

rM

S-b

ased

pro

teo

mic

sre

po

sito

ries

and

dat

abas

es

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PR

IDE

(Ju

ne

2014

)X

ndashH

igh

leve

l41

835

Pro

tein

acce

ssio

ns

269

806

Un

iqu

ep

epti

de

seq

uen

ces

Ap

pro

xim

atel

y10

1m

illio

nsp

ectr

a

Ap

pro

xim

atel

y45

0sp

ecie

sX

PR

IDE

Insp

ecto

rP

RID

EC

onv

erte

r2

Pep

tid

eSh

aker

htt

p

ww

we

bi

acu

kp

rid

ew

sar

chiv

eh

ttp

w

ww

eb

iac

uk

pri

de

Pep

tid

eAtl

as(H

um

an

Au

gu

st20

13)

XX

Med

ium

leve

l14

018

Pro

tein

s33

801

3Pe

pti

des

Ap

pro

xim

atel

y25

8m

illio

nsp

ectr

a

Mu

sm

usc

ulu

sC

and

ida

alb

ican

sC

and

ida

Cae

no

rhab

dit

isel

egan

sD

roso

ph

ilam

elan

oga

ster

H

alo

bac

teri

um

Eq

uu

scab

allu

sR

attu

sn

orv

egic

us

Sac

char

om

yces

cere

visi

ae

Dan

iore

rio

Su

sscr

ofa

M

yco

bac

teri

um

tub

ercu

losi

s

ndashT

IQA

M

TIQ

AM

ndashDig

esto

rT

IQA

Mndash

Pep

tid

eAtl

as

TIQ

AM

ndashVie

wer

A

TAQ

SP

IPE

2PA

BS

T

htt

p

ww

wp

epti

dea

tlas

org

GP

MD

B(M

ay20

14)

ndashX

Hig

hle

vel

136

373

Pro

tein

acce

ssio

ns

178

669

8Pe

pti

des

Ap

pro

xim

atel

y10

20m

illio

nsp

ectr

a

Bo

sta

uru

sC

anis

fam

iliar

isH

om

osa

pie

ns

Mm

usc

ulu

sR

no

rveg

icu

sG

allu

sga

llus

Dr

erio

Xen

op

us

tro

pic

alis

An

op

hel

esga

mb

iae

Ap

ism

ellif

era

Dm

elan

oga

ster

Ce

lega

ns

Ory

zasa

tiva

Ara

bid

op

sis

thal

ian

aS

acch

aro

myc

esce

revi

siae

Bac

illu

san

thra

cis

Am

esE

sch

eric

hia

coli

(K12

)La

cto

cocc

us

lact

is(I

l140

3)

Mt

ub

ercu

losi

sS

hig

ella

dys

ente

riae

S

alm

on

ella

typ

hi

Sal

mo

nel

laty

ph

imu

riu

m(L

T2)

ndashndash

htt

p

rest

th

egp

mo

rg1

htt

p

gp

md

bt

heg

pm

org

Mas

sIV

E(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

p

mas

sive

ucs

de

du

Ch

oru

s(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

ps

ch

oru

spro

ject

org

Pro

teo

mic

sDB

(May

2014

)X

ndashH

igh

leve

l18

097

Pro

tein

s73

940

6Pe

pti

des

Ap

pro

xim

atel

y70

mill

ion

spec

tra

Hs

apie

ns

Xndash

ndashh

ttp

sw

ww

pro

teo

mic

sdb

org

MO

PE

D(M

ay20

14)

ndashndash

Med

ium

leve

l17

141

Pro

tein

s25

000

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

15m

illio

nsp

ectr

a

Hs

apie

ns

Mm

usc

ulu

sC

ele

gan

sS

cer

evis

iae

Xndash

ndashh

ttp

sw

ww

pro

tein

spir

eo

rg

MO

PE

D

Hu

man

Pro

tein

ped

ia(M

ay20

14)

Xndash

Low

leve

l15

231

Pro

tein

s1

960

352

Pep

tid

esA

pp

roxi

mat

ely

5m

illio

nsp

ectr

a

Hs

apie

ns

ndashndash

ndashh

ttp

w

ww

hu

man

pro

tein

ped

iao

rg

Max

QB

(May

2014

)ndash

ndashM

ediu

mle

vel

1473

2Pr

ote

ins

370

551

Pep

tid

esA

pp

roxi

mat

ely

20m

illio

nsp

ectr

a

Mm

usc

ulu

sH

sap

ien

sS

cer

evis

iae

Xndash

ndashh

ttp

m

axq

bb

ioch

emm

pg

de

mxd

b

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta

ble

1

Co

nti

nu

ed

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PaxD

b(M

ay20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s14

345

6Pe

pti

des

Ap

pro

xim

atel

y24

mill

ion

spec

tra

At

hal

ian

aM

mu

scu

lus

Hs

apie

ns

Sc

erev

isia

eD

mel

ano

gast

erE

co

li(K

12)

Mic

rocy

stis

aeru

gin

osa

C

ele

gan

sLe

pto

spir

ain

terr

oga

ns

sero

var

Co

pen

hag

eni

Mt

ub

ercu

losi

s(H

37R

v)S

trep

toco

ccu

sp

yog

enes

M1

GA

SS

chiz

osa

cch

aro

myc

esp

om

be

Bt

auru

sS

dys

ente

riae

Bs

ub

tilis

G

gal

lus

Th

erm

oco

ccu

sga

mm

ato

lera

ns

(EJ3

)M

yco

pla

sma

pn

eum

on

iae

Hal

ob

acte

riu

m(N

RC

ndash1)

St

yph

imu

riu

mLT

2R

no

rveg

icu

sD

ein

oco

ccu

sd

eser

ti(V

CD

115)

S

hig

ella

flex

ner

iS

ynec

ho

cyst

is

Am

ellif

era

Cf

amili

aris

Su

ssc

rofa

X

tro

pic

alis

Os

ativ

a

Xndash

htt

p

pax

ndashdb

org

ap

isea

rch

q=

hu

man

htt

p

pax

-db

org

HP

M(J

un

e20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s29

300

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

25m

illio

nsp

ectr

a

Hs

apie

ns

Xndash

ndashh

ttp

h

um

anp

rote

om

emap

org

Xt

he

feat

ure

or

char

acte

rist

icis

sup

po

rted

-t

he

feat

ure

isn

ot

sup

po

rted

i

tw

asn

ot

po

ssib

leto

retr

ieve

the

corr

esp

on

din

gin

form

atio

n

values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)

321 Data submission and format support

As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE

The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time

The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 935

Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month

mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others

The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps

(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg

the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the

PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or

FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets

In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so

322 Data mining and visualization

PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers

The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 2: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 931

the size of data produced in proteomics laboratories has in-creased by several orders of magnitude [12]

Compared to other data-intensive fields such as genomicsdeposition and storage of original proteomics data in publicresources have been less common [13] This is regrettablesince proteome studies are usually more complex than itscounterpart genomics ones In fact data interpretation in pro-teomics can be considerably more complex than in genomicsdue to the wide variety of analytical approaches [14 15]bioinformatics tools and pipelines [16 17] and the relatedstatistical analysis [1819] However thanks to the guidelinespromoted by several scientific journals and funding agencies[20] there is a growing consensus in the community about theneed for the public dissemination of proteomics data whichis already facilitating the assessment reuse comparativeanalyses and extraction of new findings from published data[13 21]

The complexity of proteomics data is heightened by alter-native splicing PTMs and protein degradation events andis further amplified by the interconnectivity of proteins intocomplexes and signaling networks that are highly divergent intime and space [1] In order to address this complexity newanalytical and bioinformatics methodologies are developedevery year [22 23] which complicate the data standardiza-tion and deposition Additionally the audience interested inproteomics data is very heterogeneous It includes biologistselucidating the mechanisms of regulation of specific proteinsMS researchers improving the current analytical methods orcomputational biologists developing new software tools forthe analysis and interpretation of the data [24]

Data sharing in proteomics requires substantial invest-ment and infrastructure Several public repositories havebeen developed each with different purposes in mind Well-established databases for proteomics data are the Global Pro-teome Machine Database (GPMDB) [25] PeptideAtlas [26]and the PRIDE database [27] Additionally at present thereare other resources (many of them recently developed) suchas ProteomicsDB [28] MassIVE (Mass Spectrometry Inter-active Virtual Environment) Chorus MaxQB [29] PASSEL(PeptideAtlas SRM Experiment Library) [30] MOPED (ModelOrganism Protein Expression Database) [31] PaxDb [32]Human Proteinpedia [33] and the human proteome map(HPM) [34] Furthermore there are several more specializedresources that will only be cited briefly in this review It isimportant to mention here that no single proteomics dataresource will be ideally suited to all possible use cases andall potential users Regrettably two widely used resourceswere discontinued due to lack of funding Peptidome [35]and Tranche [36] This had a negative impact on the effortspromoting data sharing in the field as it was perceived by thecommunity that effort invested in data sharing was lost

Recently the ProteomeXchange (PX) consortium [37] hasbeen formed to enable a better integration of public repos-itories maximizing its benefits to the scientific commu-nity through the implementation of standardized submissionand dissemination pipelines for proteomics information By

Figure 1 Hierarchy of proteomics data repositories anddatabases according to the different data types stored raw MSdata repositories resources that store peptideprotein identifi-cation and quantification results and protein knowledge basesSome resources are duplicated in different levels because theycan be included in more than one category

August 2014 PRIDE PeptideAtlas PASSEL and MassIVEare the active members of the consortium

The aim of this review is to provide an up-to-date overviewof the current state of proteomics data repositories anddatabases providing a solid starting point for those who wantto perform data submission andor data mining There are afew comparable reviews available in the literature [2438ndash41]but there is a need for an update since this has been quite adynamic field over the past few years In this manuscript wewill not include a thorough review about protein knowledgebases such as the Universal Protein resource (UniProt) [42]and neXtProt [43] but we will explain how MS proteomicsinformation is made available in these resources

2 Organization of proteomics repositoriesand databases

The information generated in a typical proteomics experi-ment can be organized in three different levels [44] (i) rawdata (ii) processed results including peptideprotein identi-fication and quantification values and (iii) the resulting bio-logical conclusions Technical andor biological metadata canbe provided for each level independently

These three categories enable the classification of the ex-isting MS proteomics repositories according to their level ofspecialization (Fig 1) In our view these three levels of infor-mation should be captured and properly annotated in publicdatabases and repositories ideally using data standards whenavailable In fact the development of proteomics resources ismore feasible due the maturity of some data standard formats[45ndash47] and open source tools [9] which facilitate public datadeposition

Some of the first open formats developed includedmzXML (for MS data) [48] pepXML and protXML (for

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

932 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

peptideprotein identifications) that were developed as part ofthe trans-proteomic pipeline (TPP) [49] In the context of theProteomics Standards Initiative (PSI) several standard dataformats have been developed over the last few years whichreflect the variety in data types within the field and thereforethose that can be supported by proteomics resources Themain formats developed have been mzML (for MS data) [45]mzIdentML (for peptide and protein identification results)[46] mzQuantML (for capturing a detailed trace of each stageof quantitative analysis) [47] TraML (for representing inputtransitions in SRM approaches) [50] and the recent mzTab(for capturing a simpler summary of the final results) [51]The aims and functionalities of the existing resources willbe explored in detail in the following sections (Table 1)

3 Resources

31 The PX consortium

The PX consortium [37] (httpwwwproteomexchangeorg)was created to promote the collaboration and integration ofmajor stakeholders in the domain of MS proteomics repos-itories comprising among others primary (PRIDE [27] andPASSEL [30]) and secondary resources (PeptideAtlas) pro-teomics researchers and representatives from journals reg-ularly publishing proteomics data Recently in June 2014MassIVE joined the consortium The aim of PX is to pro-vide a common framework for the cooperation of proteomicsresources by defining and implementing consistent harmo-nized user-friendly data deposition and dissemination pro-cedures In addition another important goal is to enable andprovide ldquomutual backuprdquo if one of the resources has fundingissues

The consortiumrsquos members have agreed in providing a suf-ficient set of common experimental and technical metadataThis information is stored using the PX XML format [37]Finally all the submitted datasets get a unique and universalidentifier (PXD identifier)

311 Data submission and format support

By August 2014 two major workflows are fully supportedin PX MSMS and SRM approaches In the first stable im-plementation of the data workflow PRIDE acts as the ini-tial submission point for MSMS data whereas PASSEL hasthe equivalent role for SRM data Both workflows will be ex-plained in detail below At the moment of writing MassIVEhas just joined PX aiming to have an equivalent role to PRIDE

There are two different PX MSMS submission modesldquoCompleterdquo and ldquoPartialrdquo To perform a ldquocompleterdquo submis-sion means that after all the files have been submitted itis possible for the receiving repository to connect directlythe processed identification results with the mass spectraThis can be achieved if the processed identification results

are available in a format supported by the receiving reposi-tory (eg mzIdentML) and if peak list files are included inthe submission ldquoCompleterdquo submissions get a Digital ObjectIdentifier (DOI) to facilitate its traceability

On the other hand after performing a ldquopartialrdquo submissionthe connection between the spectra and the identification re-sults cannot be done in a straightforward way In this case theprocessed results are not available in a supported format bythe receiving repository and the corresponding search engineoutput files (in heterogeneous formats) are made available fordownload For both types of submissions metadata and rawdata are always stored for each dataset

Although ldquopartialrdquo submissions are searchable by theirmetadata peptide and protein identifications cannot be cap-tured by the receiving repository which decreases the abilityof reviewers to check the data and can make data reuse bythird parties challenging For instance ldquopartialrdquo submissionsdo not qualify for the requirements on spectra annotationfrom the journal MCP (Molecular and Cellular ProteomicsmdashhttpwwwmcponlineorgsitemiscParisReport_Finalxhtml)

Finally it needs to be highlighted that all PX memberssupport private review of the data during the manuscriptreview process The submitted data remains private beforemanuscript publication and login details are provided to fa-cilitate access for reviewers and journal editors during themanuscript review process

312 Data mining and visualization

ProteomeCentral (httpproteomecentralproteomexchangeorg) is the centralized portal for accessing all PX datasetsindependently from the original resource where data werestored ProteomeCentral provides the ability to search meta-data associated with datasets in the participating repositories(PRIDE and MassIVE for MSMS data PASSEL for SRMdata or reprocessed original PX datasets in PeptideAtlas) Itis then possible to query the archive and identify datasets ofinterest using biological and technical metadata keywordstags or publication information To monitor the releaseof new public PX datasets researchers can subscribe to aRich Site Summary feed (httpgroupsgooglecomgroupproteomexchangefeedrss_v2_0_msgsxml) Next the maincharacteristics of the individual members of the consortiumwill be explained

32 PRIDE

The PRIDE database [27] (httpwwwebiacukpride) wasinitially developed at the European Bioinformatics Institute(EBI Cambridge UK) to store the experimental data includedin publications supporting the manuscript review processThe main data types stored in PRIDE are peptideproteinidentifications (including PTMs) peptideprotein expression

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 933

Ta

ble

1

Mai

nch

arac

teri

stic

so

fth

em

ajo

rM

S-b

ased

pro

teo

mic

sre

po

sito

ries

and

dat

abas

es

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PR

IDE

(Ju

ne

2014

)X

ndashH

igh

leve

l41

835

Pro

tein

acce

ssio

ns

269

806

Un

iqu

ep

epti

de

seq

uen

ces

Ap

pro

xim

atel

y10

1m

illio

nsp

ectr

a

Ap

pro

xim

atel

y45

0sp

ecie

sX

PR

IDE

Insp

ecto

rP

RID

EC

onv

erte

r2

Pep

tid

eSh

aker

htt

p

ww

we

bi

acu

kp

rid

ew

sar

chiv

eh

ttp

w

ww

eb

iac

uk

pri

de

Pep

tid

eAtl

as(H

um

an

Au

gu

st20

13)

XX

Med

ium

leve

l14

018

Pro

tein

s33

801

3Pe

pti

des

Ap

pro

xim

atel

y25

8m

illio

nsp

ectr

a

Mu

sm

usc

ulu

sC

and

ida

alb

ican

sC

and

ida

Cae

no

rhab

dit

isel

egan

sD

roso

ph

ilam

elan

oga

ster

H

alo

bac

teri

um

Eq

uu

scab

allu

sR

attu

sn

orv

egic

us

Sac

char

om

yces

cere

visi

ae

Dan

iore

rio

Su

sscr

ofa

M

yco

bac

teri

um

tub

ercu

losi

s

ndashT

IQA

M

TIQ

AM

ndashDig

esto

rT

IQA

Mndash

Pep

tid

eAtl

as

TIQ

AM

ndashVie

wer

A

TAQ

SP

IPE

2PA

BS

T

htt

p

ww

wp

epti

dea

tlas

org

GP

MD

B(M

ay20

14)

ndashX

Hig

hle

vel

136

373

Pro

tein

acce

ssio

ns

178

669

8Pe

pti

des

Ap

pro

xim

atel

y10

20m

illio

nsp

ectr

a

Bo

sta

uru

sC

anis

fam

iliar

isH

om

osa

pie

ns

Mm

usc

ulu

sR

no

rveg

icu

sG

allu

sga

llus

Dr

erio

Xen

op

us

tro

pic

alis

An

op

hel

esga

mb

iae

Ap

ism

ellif

era

Dm

elan

oga

ster

Ce

lega

ns

Ory

zasa

tiva

Ara

bid

op

sis

thal

ian

aS

acch

aro

myc

esce

revi

siae

Bac

illu

san

thra

cis

Am

esE

sch

eric

hia

coli

(K12

)La

cto

cocc

us

lact

is(I

l140

3)

Mt

ub

ercu

losi

sS

hig

ella

dys

ente

riae

S

alm

on

ella

typ

hi

Sal

mo

nel

laty

ph

imu

riu

m(L

T2)

ndashndash

htt

p

rest

th

egp

mo

rg1

htt

p

gp

md

bt

heg

pm

org

Mas

sIV

E(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

p

mas

sive

ucs

de

du

Ch

oru

s(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

ps

ch

oru

spro

ject

org

Pro

teo

mic

sDB

(May

2014

)X

ndashH

igh

leve

l18

097

Pro

tein

s73

940

6Pe

pti

des

Ap

pro

xim

atel

y70

mill

ion

spec

tra

Hs

apie

ns

Xndash

ndashh

ttp

sw

ww

pro

teo

mic

sdb

org

MO

PE

D(M

ay20

14)

ndashndash

Med

ium

leve

l17

141

Pro

tein

s25

000

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

15m

illio

nsp

ectr

a

Hs

apie

ns

Mm

usc

ulu

sC

ele

gan

sS

cer

evis

iae

Xndash

ndashh

ttp

sw

ww

pro

tein

spir

eo

rg

MO

PE

D

Hu

man

Pro

tein

ped

ia(M

ay20

14)

Xndash

Low

leve

l15

231

Pro

tein

s1

960

352

Pep

tid

esA

pp

roxi

mat

ely

5m

illio

nsp

ectr

a

Hs

apie

ns

ndashndash

ndashh

ttp

w

ww

hu

man

pro

tein

ped

iao

rg

Max

QB

(May

2014

)ndash

ndashM

ediu

mle

vel

1473

2Pr

ote

ins

370

551

Pep

tid

esA

pp

roxi

mat

ely

20m

illio

nsp

ectr

a

Mm

usc

ulu

sH

sap

ien

sS

cer

evis

iae

Xndash

ndashh

ttp

m

axq

bb

ioch

emm

pg

de

mxd

b

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta

ble

1

Co

nti

nu

ed

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PaxD

b(M

ay20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s14

345

6Pe

pti

des

Ap

pro

xim

atel

y24

mill

ion

spec

tra

At

hal

ian

aM

mu

scu

lus

Hs

apie

ns

Sc

erev

isia

eD

mel

ano

gast

erE

co

li(K

12)

Mic

rocy

stis

aeru

gin

osa

C

ele

gan

sLe

pto

spir

ain

terr

oga

ns

sero

var

Co

pen

hag

eni

Mt

ub

ercu

losi

s(H

37R

v)S

trep

toco

ccu

sp

yog

enes

M1

GA

SS

chiz

osa

cch

aro

myc

esp

om

be

Bt

auru

sS

dys

ente

riae

Bs

ub

tilis

G

gal

lus

Th

erm

oco

ccu

sga

mm

ato

lera

ns

(EJ3

)M

yco

pla

sma

pn

eum

on

iae

Hal

ob

acte

riu

m(N

RC

ndash1)

St

yph

imu

riu

mLT

2R

no

rveg

icu

sD

ein

oco

ccu

sd

eser

ti(V

CD

115)

S

hig

ella

flex

ner

iS

ynec

ho

cyst

is

Am

ellif

era

Cf

amili

aris

Su

ssc

rofa

X

tro

pic

alis

Os

ativ

a

Xndash

htt

p

pax

ndashdb

org

ap

isea

rch

q=

hu

man

htt

p

pax

-db

org

HP

M(J

un

e20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s29

300

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

25m

illio

nsp

ectr

a

Hs

apie

ns

Xndash

ndashh

ttp

h

um

anp

rote

om

emap

org

Xt

he

feat

ure

or

char

acte

rist

icis

sup

po

rted

-t

he

feat

ure

isn

ot

sup

po

rted

i

tw

asn

ot

po

ssib

leto

retr

ieve

the

corr

esp

on

din

gin

form

atio

n

values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)

321 Data submission and format support

As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE

The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time

The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 935

Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month

mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others

The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps

(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg

the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the

PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or

FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets

In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so

322 Data mining and visualization

PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers

The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 3: Making proteomics data accessible and reusable: Current state of ...

932 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

peptideprotein identifications) that were developed as part ofthe trans-proteomic pipeline (TPP) [49] In the context of theProteomics Standards Initiative (PSI) several standard dataformats have been developed over the last few years whichreflect the variety in data types within the field and thereforethose that can be supported by proteomics resources Themain formats developed have been mzML (for MS data) [45]mzIdentML (for peptide and protein identification results)[46] mzQuantML (for capturing a detailed trace of each stageof quantitative analysis) [47] TraML (for representing inputtransitions in SRM approaches) [50] and the recent mzTab(for capturing a simpler summary of the final results) [51]The aims and functionalities of the existing resources willbe explored in detail in the following sections (Table 1)

3 Resources

31 The PX consortium

The PX consortium [37] (httpwwwproteomexchangeorg)was created to promote the collaboration and integration ofmajor stakeholders in the domain of MS proteomics repos-itories comprising among others primary (PRIDE [27] andPASSEL [30]) and secondary resources (PeptideAtlas) pro-teomics researchers and representatives from journals reg-ularly publishing proteomics data Recently in June 2014MassIVE joined the consortium The aim of PX is to pro-vide a common framework for the cooperation of proteomicsresources by defining and implementing consistent harmo-nized user-friendly data deposition and dissemination pro-cedures In addition another important goal is to enable andprovide ldquomutual backuprdquo if one of the resources has fundingissues

The consortiumrsquos members have agreed in providing a suf-ficient set of common experimental and technical metadataThis information is stored using the PX XML format [37]Finally all the submitted datasets get a unique and universalidentifier (PXD identifier)

311 Data submission and format support

By August 2014 two major workflows are fully supportedin PX MSMS and SRM approaches In the first stable im-plementation of the data workflow PRIDE acts as the ini-tial submission point for MSMS data whereas PASSEL hasthe equivalent role for SRM data Both workflows will be ex-plained in detail below At the moment of writing MassIVEhas just joined PX aiming to have an equivalent role to PRIDE

There are two different PX MSMS submission modesldquoCompleterdquo and ldquoPartialrdquo To perform a ldquocompleterdquo submis-sion means that after all the files have been submitted itis possible for the receiving repository to connect directlythe processed identification results with the mass spectraThis can be achieved if the processed identification results

are available in a format supported by the receiving reposi-tory (eg mzIdentML) and if peak list files are included inthe submission ldquoCompleterdquo submissions get a Digital ObjectIdentifier (DOI) to facilitate its traceability

On the other hand after performing a ldquopartialrdquo submissionthe connection between the spectra and the identification re-sults cannot be done in a straightforward way In this case theprocessed results are not available in a supported format bythe receiving repository and the corresponding search engineoutput files (in heterogeneous formats) are made available fordownload For both types of submissions metadata and rawdata are always stored for each dataset

Although ldquopartialrdquo submissions are searchable by theirmetadata peptide and protein identifications cannot be cap-tured by the receiving repository which decreases the abilityof reviewers to check the data and can make data reuse bythird parties challenging For instance ldquopartialrdquo submissionsdo not qualify for the requirements on spectra annotationfrom the journal MCP (Molecular and Cellular ProteomicsmdashhttpwwwmcponlineorgsitemiscParisReport_Finalxhtml)

Finally it needs to be highlighted that all PX memberssupport private review of the data during the manuscriptreview process The submitted data remains private beforemanuscript publication and login details are provided to fa-cilitate access for reviewers and journal editors during themanuscript review process

312 Data mining and visualization

ProteomeCentral (httpproteomecentralproteomexchangeorg) is the centralized portal for accessing all PX datasetsindependently from the original resource where data werestored ProteomeCentral provides the ability to search meta-data associated with datasets in the participating repositories(PRIDE and MassIVE for MSMS data PASSEL for SRMdata or reprocessed original PX datasets in PeptideAtlas) Itis then possible to query the archive and identify datasets ofinterest using biological and technical metadata keywordstags or publication information To monitor the releaseof new public PX datasets researchers can subscribe to aRich Site Summary feed (httpgroupsgooglecomgroupproteomexchangefeedrss_v2_0_msgsxml) Next the maincharacteristics of the individual members of the consortiumwill be explained

32 PRIDE

The PRIDE database [27] (httpwwwebiacukpride) wasinitially developed at the European Bioinformatics Institute(EBI Cambridge UK) to store the experimental data includedin publications supporting the manuscript review processThe main data types stored in PRIDE are peptideproteinidentifications (including PTMs) peptideprotein expression

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 933

Ta

ble

1

Mai

nch

arac

teri

stic

so

fth

em

ajo

rM

S-b

ased

pro

teo

mic

sre

po

sito

ries

and

dat

abas

es

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PR

IDE

(Ju

ne

2014

)X

ndashH

igh

leve

l41

835

Pro

tein

acce

ssio

ns

269

806

Un

iqu

ep

epti

de

seq

uen

ces

Ap

pro

xim

atel

y10

1m

illio

nsp

ectr

a

Ap

pro

xim

atel

y45

0sp

ecie

sX

PR

IDE

Insp

ecto

rP

RID

EC

onv

erte

r2

Pep

tid

eSh

aker

htt

p

ww

we

bi

acu

kp

rid

ew

sar

chiv

eh

ttp

w

ww

eb

iac

uk

pri

de

Pep

tid

eAtl

as(H

um

an

Au

gu

st20

13)

XX

Med

ium

leve

l14

018

Pro

tein

s33

801

3Pe

pti

des

Ap

pro

xim

atel

y25

8m

illio

nsp

ectr

a

Mu

sm

usc

ulu

sC

and

ida

alb

ican

sC

and

ida

Cae

no

rhab

dit

isel

egan

sD

roso

ph

ilam

elan

oga

ster

H

alo

bac

teri

um

Eq

uu

scab

allu

sR

attu

sn

orv

egic

us

Sac

char

om

yces

cere

visi

ae

Dan

iore

rio

Su

sscr

ofa

M

yco

bac

teri

um

tub

ercu

losi

s

ndashT

IQA

M

TIQ

AM

ndashDig

esto

rT

IQA

Mndash

Pep

tid

eAtl

as

TIQ

AM

ndashVie

wer

A

TAQ

SP

IPE

2PA

BS

T

htt

p

ww

wp

epti

dea

tlas

org

GP

MD

B(M

ay20

14)

ndashX

Hig

hle

vel

136

373

Pro

tein

acce

ssio

ns

178

669

8Pe

pti

des

Ap

pro

xim

atel

y10

20m

illio

nsp

ectr

a

Bo

sta

uru

sC

anis

fam

iliar

isH

om

osa

pie

ns

Mm

usc

ulu

sR

no

rveg

icu

sG

allu

sga

llus

Dr

erio

Xen

op

us

tro

pic

alis

An

op

hel

esga

mb

iae

Ap

ism

ellif

era

Dm

elan

oga

ster

Ce

lega

ns

Ory

zasa

tiva

Ara

bid

op

sis

thal

ian

aS

acch

aro

myc

esce

revi

siae

Bac

illu

san

thra

cis

Am

esE

sch

eric

hia

coli

(K12

)La

cto

cocc

us

lact

is(I

l140

3)

Mt

ub

ercu

losi

sS

hig

ella

dys

ente

riae

S

alm

on

ella

typ

hi

Sal

mo

nel

laty

ph

imu

riu

m(L

T2)

ndashndash

htt

p

rest

th

egp

mo

rg1

htt

p

gp

md

bt

heg

pm

org

Mas

sIV

E(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

p

mas

sive

ucs

de

du

Ch

oru

s(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

ps

ch

oru

spro

ject

org

Pro

teo

mic

sDB

(May

2014

)X

ndashH

igh

leve

l18

097

Pro

tein

s73

940

6Pe

pti

des

Ap

pro

xim

atel

y70

mill

ion

spec

tra

Hs

apie

ns

Xndash

ndashh

ttp

sw

ww

pro

teo

mic

sdb

org

MO

PE

D(M

ay20

14)

ndashndash

Med

ium

leve

l17

141

Pro

tein

s25

000

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

15m

illio

nsp

ectr

a

Hs

apie

ns

Mm

usc

ulu

sC

ele

gan

sS

cer

evis

iae

Xndash

ndashh

ttp

sw

ww

pro

tein

spir

eo

rg

MO

PE

D

Hu

man

Pro

tein

ped

ia(M

ay20

14)

Xndash

Low

leve

l15

231

Pro

tein

s1

960

352

Pep

tid

esA

pp

roxi

mat

ely

5m

illio

nsp

ectr

a

Hs

apie

ns

ndashndash

ndashh

ttp

w

ww

hu

man

pro

tein

ped

iao

rg

Max

QB

(May

2014

)ndash

ndashM

ediu

mle

vel

1473

2Pr

ote

ins

370

551

Pep

tid

esA

pp

roxi

mat

ely

20m

illio

nsp

ectr

a

Mm

usc

ulu

sH

sap

ien

sS

cer

evis

iae

Xndash

ndashh

ttp

m

axq

bb

ioch

emm

pg

de

mxd

b

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta

ble

1

Co

nti

nu

ed

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PaxD

b(M

ay20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s14

345

6Pe

pti

des

Ap

pro

xim

atel

y24

mill

ion

spec

tra

At

hal

ian

aM

mu

scu

lus

Hs

apie

ns

Sc

erev

isia

eD

mel

ano

gast

erE

co

li(K

12)

Mic

rocy

stis

aeru

gin

osa

C

ele

gan

sLe

pto

spir

ain

terr

oga

ns

sero

var

Co

pen

hag

eni

Mt

ub

ercu

losi

s(H

37R

v)S

trep

toco

ccu

sp

yog

enes

M1

GA

SS

chiz

osa

cch

aro

myc

esp

om

be

Bt

auru

sS

dys

ente

riae

Bs

ub

tilis

G

gal

lus

Th

erm

oco

ccu

sga

mm

ato

lera

ns

(EJ3

)M

yco

pla

sma

pn

eum

on

iae

Hal

ob

acte

riu

m(N

RC

ndash1)

St

yph

imu

riu

mLT

2R

no

rveg

icu

sD

ein

oco

ccu

sd

eser

ti(V

CD

115)

S

hig

ella

flex

ner

iS

ynec

ho

cyst

is

Am

ellif

era

Cf

amili

aris

Su

ssc

rofa

X

tro

pic

alis

Os

ativ

a

Xndash

htt

p

pax

ndashdb

org

ap

isea

rch

q=

hu

man

htt

p

pax

-db

org

HP

M(J

un

e20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s29

300

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

25m

illio

nsp

ectr

a

Hs

apie

ns

Xndash

ndashh

ttp

h

um

anp

rote

om

emap

org

Xt

he

feat

ure

or

char

acte

rist

icis

sup

po

rted

-t

he

feat

ure

isn

ot

sup

po

rted

i

tw

asn

ot

po

ssib

leto

retr

ieve

the

corr

esp

on

din

gin

form

atio

n

values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)

321 Data submission and format support

As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE

The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time

The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 935

Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month

mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others

The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps

(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg

the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the

PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or

FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets

In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so

322 Data mining and visualization

PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers

The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 4: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 933

Ta

ble

1

Mai

nch

arac

teri

stic

so

fth

em

ajo

rM

S-b

ased

pro

teo

mic

sre

po

sito

ries

and

dat

abas

es

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PR

IDE

(Ju

ne

2014

)X

ndashH

igh

leve

l41

835

Pro

tein

acce

ssio

ns

269

806

Un

iqu

ep

epti

de

seq

uen

ces

Ap

pro

xim

atel

y10

1m

illio

nsp

ectr

a

Ap

pro

xim

atel

y45

0sp

ecie

sX

PR

IDE

Insp

ecto

rP

RID

EC

onv

erte

r2

Pep

tid

eSh

aker

htt

p

ww

we

bi

acu

kp

rid

ew

sar

chiv

eh

ttp

w

ww

eb

iac

uk

pri

de

Pep

tid

eAtl

as(H

um

an

Au

gu

st20

13)

XX

Med

ium

leve

l14

018

Pro

tein

s33

801

3Pe

pti

des

Ap

pro

xim

atel

y25

8m

illio

nsp

ectr

a

Mu

sm

usc

ulu

sC

and

ida

alb

ican

sC

and

ida

Cae

no

rhab

dit

isel

egan

sD

roso

ph

ilam

elan

oga

ster

H

alo

bac

teri

um

Eq

uu

scab

allu

sR

attu

sn

orv

egic

us

Sac

char

om

yces

cere

visi

ae

Dan

iore

rio

Su

sscr

ofa

M

yco

bac

teri

um

tub

ercu

losi

s

ndashT

IQA

M

TIQ

AM

ndashDig

esto

rT

IQA

Mndash

Pep

tid

eAtl

as

TIQ

AM

ndashVie

wer

A

TAQ

SP

IPE

2PA

BS

T

htt

p

ww

wp

epti

dea

tlas

org

GP

MD

B(M

ay20

14)

ndashX

Hig

hle

vel

136

373

Pro

tein

acce

ssio

ns

178

669

8Pe

pti

des

Ap

pro

xim

atel

y10

20m

illio

nsp

ectr

a

Bo

sta

uru

sC

anis

fam

iliar

isH

om

osa

pie

ns

Mm

usc

ulu

sR

no

rveg

icu

sG

allu

sga

llus

Dr

erio

Xen

op

us

tro

pic

alis

An

op

hel

esga

mb

iae

Ap

ism

ellif

era

Dm

elan

oga

ster

Ce

lega

ns

Ory

zasa

tiva

Ara

bid

op

sis

thal

ian

aS

acch

aro

myc

esce

revi

siae

Bac

illu

san

thra

cis

Am

esE

sch

eric

hia

coli

(K12

)La

cto

cocc

us

lact

is(I

l140

3)

Mt

ub

ercu

losi

sS

hig

ella

dys

ente

riae

S

alm

on

ella

typ

hi

Sal

mo

nel

laty

ph

imu

riu

m(L

T2)

ndashndash

htt

p

rest

th

egp

mo

rg1

htt

p

gp

md

bt

heg

pm

org

Mas

sIV

E(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

p

mas

sive

ucs

de

du

Ch

oru

s(M

ay20

14)

Xndash

Low

leve

l

ndash

ndashndash

htt

ps

ch

oru

spro

ject

org

Pro

teo

mic

sDB

(May

2014

)X

ndashH

igh

leve

l18

097

Pro

tein

s73

940

6Pe

pti

des

Ap

pro

xim

atel

y70

mill

ion

spec

tra

Hs

apie

ns

Xndash

ndashh

ttp

sw

ww

pro

teo

mic

sdb

org

MO

PE

D(M

ay20

14)

ndashndash

Med

ium

leve

l17

141

Pro

tein

s25

000

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

15m

illio

nsp

ectr

a

Hs

apie

ns

Mm

usc

ulu

sC

ele

gan

sS

cer

evis

iae

Xndash

ndashh

ttp

sw

ww

pro

tein

spir

eo

rg

MO

PE

D

Hu

man

Pro

tein

ped

ia(M

ay20

14)

Xndash

Low

leve

l15

231

Pro

tein

s1

960

352

Pep

tid

esA

pp

roxi

mat

ely

5m

illio

nsp

ectr

a

Hs

apie

ns

ndashndash

ndashh

ttp

w

ww

hu

man

pro

tein

ped

iao

rg

Max

QB

(May

2014

)ndash

ndashM

ediu

mle

vel

1473

2Pr

ote

ins

370

551

Pep

tid

esA

pp

roxi

mat

ely

20m

illio

nsp

ectr

a

Mm

usc

ulu

sH

sap

ien

sS

cer

evis

iae

Xndash

ndashh

ttp

m

axq

bb

ioch

emm

pg

de

mxd

b

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta

ble

1

Co

nti

nu

ed

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PaxD

b(M

ay20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s14

345

6Pe

pti

des

Ap

pro

xim

atel

y24

mill

ion

spec

tra

At

hal

ian

aM

mu

scu

lus

Hs

apie

ns

Sc

erev

isia

eD

mel

ano

gast

erE

co

li(K

12)

Mic

rocy

stis

aeru

gin

osa

C

ele

gan

sLe

pto

spir

ain

terr

oga

ns

sero

var

Co

pen

hag

eni

Mt

ub

ercu

losi

s(H

37R

v)S

trep

toco

ccu

sp

yog

enes

M1

GA

SS

chiz

osa

cch

aro

myc

esp

om

be

Bt

auru

sS

dys

ente

riae

Bs

ub

tilis

G

gal

lus

Th

erm

oco

ccu

sga

mm

ato

lera

ns

(EJ3

)M

yco

pla

sma

pn

eum

on

iae

Hal

ob

acte

riu

m(N

RC

ndash1)

St

yph

imu

riu

mLT

2R

no

rveg

icu

sD

ein

oco

ccu

sd

eser

ti(V

CD

115)

S

hig

ella

flex

ner

iS

ynec

ho

cyst

is

Am

ellif

era

Cf

amili

aris

Su

ssc

rofa

X

tro

pic

alis

Os

ativ

a

Xndash

htt

p

pax

ndashdb

org

ap

isea

rch

q=

hu

man

htt

p

pax

-db

org

HP

M(J

un

e20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s29

300

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

25m

illio

nsp

ectr

a

Hs

apie

ns

Xndash

ndashh

ttp

h

um

anp

rote

om

emap

org

Xt

he

feat

ure

or

char

acte

rist

icis

sup

po

rted

-t

he

feat

ure

isn

ot

sup

po

rted

i

tw

asn

ot

po

ssib

leto

retr

ieve

the

corr

esp

on

din

gin

form

atio

n

values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)

321 Data submission and format support

As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE

The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time

The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 935

Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month

mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others

The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps

(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg

the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the

PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or

FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets

In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so

322 Data mining and visualization

PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers

The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 5: Making proteomics data accessible and reusable: Current state of ...

934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta

ble

1

Co

nti

nu

ed

Rep

osi

tori

esR

awd

ata

Su

pp

ort

for

targ

eted

app

roac

hes

Met

adat

aH

um

anp

rote

inex

pre

ssio

nin

form

atio

n

Sp

ecie

sQ

uan

ti-

fica

tio

nd

ata

Rel

ated

stan

d-a

lon

eto

ols

Web

serv

ices

UR

LU

RL

PaxD

b(M

ay20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s14

345

6Pe

pti

des

Ap

pro

xim

atel

y24

mill

ion

spec

tra

At

hal

ian

aM

mu

scu

lus

Hs

apie

ns

Sc

erev

isia

eD

mel

ano

gast

erE

co

li(K

12)

Mic

rocy

stis

aeru

gin

osa

C

ele

gan

sLe

pto

spir

ain

terr

oga

ns

sero

var

Co

pen

hag

eni

Mt

ub

ercu

losi

s(H

37R

v)S

trep

toco

ccu

sp

yog

enes

M1

GA

SS

chiz

osa

cch

aro

myc

esp

om

be

Bt

auru

sS

dys

ente

riae

Bs

ub

tilis

G

gal

lus

Th

erm

oco

ccu

sga

mm

ato

lera

ns

(EJ3

)M

yco

pla

sma

pn

eum

on

iae

Hal

ob

acte

riu

m(N

RC

ndash1)

St

yph

imu

riu

mLT

2R

no

rveg

icu

sD

ein

oco

ccu

sd

eser

ti(V

CD

115)

S

hig

ella

flex

ner

iS

ynec

ho

cyst

is

Am

ellif

era

Cf

amili

aris

Su

ssc

rofa

X

tro

pic

alis

Os

ativ

a

Xndash

htt

p

pax

ndashdb

org

ap

isea

rch

q=

hu

man

htt

p

pax

-db

org

HP

M(J

un

e20

14)

ndashndash

Low

leve

l10

482

Pro

tein

s29

300

0U

niq

ue

pep

tid

esA

pp

roxi

mat

ely

25m

illio

nsp

ectr

a

Hs

apie

ns

Xndash

ndashh

ttp

h

um

anp

rote

om

emap

org

Xt

he

feat

ure

or

char

acte

rist

icis

sup

po

rted

-t

he

feat

ure

isn

ot

sup

po

rted

i

tw

asn

ot

po

ssib

leto

retr

ieve

the

corr

esp

on

din

gin

form

atio

n

values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)

321 Data submission and format support

As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE

The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time

The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 935

Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month

mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others

The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps

(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg

the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the

PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or

FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets

In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so

322 Data mining and visualization

PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers

The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 6: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 935

Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month

mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others

The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps

(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg

the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the

PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or

FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets

In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so

322 Data mining and visualization

PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers

The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 7: Making proteomics data accessible and reusable: Current state of ...

936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services

Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE

33 PASSEL

PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines

331 Data submission and format support

PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass

spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo

332 Data mining and visualization

The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)

The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)

34 PeptideAtlas

The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]

Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]

In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 8: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 937

341 Data submission and format support

PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework

342 Data mining and visualization

The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins

Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein

identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)

PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)

35 MassIVE

The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset

351 Data submission and format support

The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface

MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 9: Making proteomics data accessible and reusable: Current state of ...

938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset

The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed

352 Data mining and visualization

The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets

Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]

36 Chorus

Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass

spectral viewers as well as a database search engine for pro-tein sequence identification

The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects

37 GPMDB

GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database

371 Data submission and format support

The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database

372 Data mining and visualization

The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins

When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 10: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 939

More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl

38 ProteomicsDB

ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication

381 Data submission and format support

The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]

382 Data mining and visualization

A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be

visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage

39 MaxQB

MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities

391 Data submission and format support

Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface

Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database

392 Data mining and visualization

The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 11: Making proteomics data accessible and reusable: Current state of ...

940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score

310 MOPED

MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)

3101 Data submission and format support

In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae

3102 Data mining and visualization

The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table

displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name

The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks

311 PaxDb

The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions

3111 Data submission and format support

The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]

3112 Data mining and visualization

The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 12: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 941

In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available

312 Human Proteinpedia

Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed

3121 Data submission and format support

Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)

The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually

3122 Data mining and visualization

The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)

313 HPM

The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers

3131 Data submission and format support

MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 13: Making proteomics data accessible and reusable: Current state of ...

942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

resource was developed as a protein expression database anddo not support the submission of new data from externalusers

3132 Data mining and visualization

The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)

314 Other proteomics resources

There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata

Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra

iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets

315 Proteomics information available through

UniProt and neXtProt

UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information

MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]

Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB

neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]

All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout

4 Data reuse from public resources

Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 14: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 943

of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel

Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]

Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]

In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds

Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]

We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows

5 Pitfalls and future challenges

Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task

In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata

The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein

To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation

As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 15: Making proteomics data accessible and reusable: Current state of ...

944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all

With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable

In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-

ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed

Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 16: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 945

identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets

6 Conclusions

In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata

YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)

The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources

7 References

[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48

[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171

[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62

[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434

[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326

[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362

[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015

[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47

[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76

[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549

[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548

[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004

[13] Credit where credit is overdue Nat Biotechnol 2009 27579

[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431

[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553

[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657

[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38

[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855

[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181

[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20

[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419

[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170

[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294

[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 17: Making proteomics data accessible and reusable: Current state of ...

946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242

[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75

[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069

[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587

[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068

[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175

[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343

[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500

[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781

[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581

[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601

[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145

[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226

[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227

[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146

[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881

[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786

[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198

[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298

[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7

[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133

[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381

[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340

[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466

[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159

[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040

[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681

[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107

[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160

[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137

[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689

[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 18: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 947

sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567

[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467

[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press

[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755

[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96

[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435

[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914

[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171

[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9

[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431

[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46

[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434

[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131

[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392

[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658

[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417

[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374

[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on

antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932

[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076

[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223

[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500

[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78

[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176

[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363

[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038

[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28

[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167

[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199

[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485

[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017

[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973

[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096

[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 19: Making proteomics data accessible and reusable: Current state of ...

948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949

[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849

[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850

[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29

[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30

[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513

[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20

[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470

[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805

[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372

[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342

[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127

[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099

[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113

[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126

[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020

[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61

[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction

networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568

[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306

[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772

[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463

[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121

[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989

[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053

[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237

[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123

[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023

[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198

[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270

[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452

[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285

[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom

Page 20: Making proteomics data accessible and reusable: Current state of ...

Proteomics 2015 15 930ndash949 949

[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332

[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181

[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871

[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069

[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772

[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767

[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161

[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488

[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490

[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191

[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148

[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077

[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194

[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438

[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70

[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999

[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114

[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562

[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550

[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138

[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695

[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70

Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom