Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) Data
Making proteomics data accessible and reusable: Current state of ...
Transcript of Making proteomics data accessible and reusable: Current state of ...
930 Proteomics 2015 15 930ndash949DOI 101002pmic201400302
REVIEW
Making proteomics data accessible and reusable
Current state of proteomics databases and repositories
Yasset Perez-Riverol Emanuele Alpi Rui Wang Henning Hermjakoband Juan Antonio Vizcaıno
European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) Wellcome Trust GenomeCampus Hinxton Cambridge UK
Received July 1 2014Revised August 6 2014
Accepted August 22 2014
Compared to other data-intensive disciplines such as genomics public deposition and storageof MS-based proteomics data are still less developed due to among other reasons the inherentcomplexity of the data and the variety of data types and experimental workflows In orderto address this need several public repositories for MS proteomics experiments have beendeveloped each with different purposes in mind The most established resources are the GlobalProteome Machine Database (GPMDB) PeptideAtlas and the PRIDE database Additionallythere are other useful (in many cases recently developed) resources such as ProteomicsDB MassSpectrometry Interactive Virtual Environment (MassIVE) Chorus MaxQB PeptideAtlas SRMExperiment Library (PASSEL) Model Organism Protein Expression Database (MOPED) andthe Human Proteinpedia In addition the ProteomeXchange consortium has been recentlydeveloped to enable better integration of public repositories and the coordinated sharing ofproteomics information maximizing its benefit to the scientific community Here we willreview each of the major proteomics resources independently and some tools that enable theintegration mining and reuse of the data We will also discuss some of the major challengesand current pitfalls in the integration and sharing of the data
Keywords
Bioinformatics Databases MS Repositories
Additional supporting information may be found in the online version of this article atthe publisherrsquos web-site
Correspondence Dr Juan Antonio Vizcaıno European Molecu-lar Biology Laboratory European Bioinformatics Institute (EMBL-EBI) Wellcome Trust Genome Campus Hinxton CambridgeCB10 1SD UKE-mail juanebiacukFax +44 1223 494 484
Abbreviations C-HPP chromosome-based Human ProteomeProject COPaKB Cardiac Organellar Protein Atlas Knowledge-base FDR false discovery rate HPM human proteome mapHPRD human protein reference database MassIVE Mass Spec-trometry Interactive Virtual Environment MOPED Model Or-ganism Protein Expression Database NCBI National Center forBiotechnology Information PASSEL PeptideAtlas SRM Experi-ment Library PSM peptide spectrum match PX ProteomeX-change TIQAM targeted identification for quantitative analysisby MRM TPP trans-proteomic pipeline
1 Introduction
In the age of systems biology and data integration proteomicsdata represent a crucial component to understand the ldquowholepicturerdquo of life Proteomics technologiesmdashparticularly MS-based protein identification and quantification approachesmdashhave matured immensely through cumulative advancesin high-throughput analytical methodologies [1ndash5] samplepreparation [6] improved instrumentation [7] and the avail-ability of protein sequence databases [8] and computationalanalysis tools [19] Therefore with the development of morepowerful and sensitive analytical methods and instrumenta-tion the identification and quantification of a high proportionof the expressed proteins in a given condition is now achiev-able in an average experiment [1011] In parallel as a result
Colour Online See the article online to view Figs 1ndash3 in colour
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
This is an open access article under the terms of the Creative Commons Attribution License which permits use distribution and reproduction inany medium provided the original work is properly cited
Proteomics 2015 15 930ndash949 931
the size of data produced in proteomics laboratories has in-creased by several orders of magnitude [12]
Compared to other data-intensive fields such as genomicsdeposition and storage of original proteomics data in publicresources have been less common [13] This is regrettablesince proteome studies are usually more complex than itscounterpart genomics ones In fact data interpretation in pro-teomics can be considerably more complex than in genomicsdue to the wide variety of analytical approaches [14 15]bioinformatics tools and pipelines [16 17] and the relatedstatistical analysis [1819] However thanks to the guidelinespromoted by several scientific journals and funding agencies[20] there is a growing consensus in the community about theneed for the public dissemination of proteomics data whichis already facilitating the assessment reuse comparativeanalyses and extraction of new findings from published data[13 21]
The complexity of proteomics data is heightened by alter-native splicing PTMs and protein degradation events andis further amplified by the interconnectivity of proteins intocomplexes and signaling networks that are highly divergent intime and space [1] In order to address this complexity newanalytical and bioinformatics methodologies are developedevery year [22 23] which complicate the data standardiza-tion and deposition Additionally the audience interested inproteomics data is very heterogeneous It includes biologistselucidating the mechanisms of regulation of specific proteinsMS researchers improving the current analytical methods orcomputational biologists developing new software tools forthe analysis and interpretation of the data [24]
Data sharing in proteomics requires substantial invest-ment and infrastructure Several public repositories havebeen developed each with different purposes in mind Well-established databases for proteomics data are the Global Pro-teome Machine Database (GPMDB) [25] PeptideAtlas [26]and the PRIDE database [27] Additionally at present thereare other resources (many of them recently developed) suchas ProteomicsDB [28] MassIVE (Mass Spectrometry Inter-active Virtual Environment) Chorus MaxQB [29] PASSEL(PeptideAtlas SRM Experiment Library) [30] MOPED (ModelOrganism Protein Expression Database) [31] PaxDb [32]Human Proteinpedia [33] and the human proteome map(HPM) [34] Furthermore there are several more specializedresources that will only be cited briefly in this review It isimportant to mention here that no single proteomics dataresource will be ideally suited to all possible use cases andall potential users Regrettably two widely used resourceswere discontinued due to lack of funding Peptidome [35]and Tranche [36] This had a negative impact on the effortspromoting data sharing in the field as it was perceived by thecommunity that effort invested in data sharing was lost
Recently the ProteomeXchange (PX) consortium [37] hasbeen formed to enable a better integration of public repos-itories maximizing its benefits to the scientific commu-nity through the implementation of standardized submissionand dissemination pipelines for proteomics information By
Figure 1 Hierarchy of proteomics data repositories anddatabases according to the different data types stored raw MSdata repositories resources that store peptideprotein identifi-cation and quantification results and protein knowledge basesSome resources are duplicated in different levels because theycan be included in more than one category
August 2014 PRIDE PeptideAtlas PASSEL and MassIVEare the active members of the consortium
The aim of this review is to provide an up-to-date overviewof the current state of proteomics data repositories anddatabases providing a solid starting point for those who wantto perform data submission andor data mining There are afew comparable reviews available in the literature [2438ndash41]but there is a need for an update since this has been quite adynamic field over the past few years In this manuscript wewill not include a thorough review about protein knowledgebases such as the Universal Protein resource (UniProt) [42]and neXtProt [43] but we will explain how MS proteomicsinformation is made available in these resources
2 Organization of proteomics repositoriesand databases
The information generated in a typical proteomics experi-ment can be organized in three different levels [44] (i) rawdata (ii) processed results including peptideprotein identi-fication and quantification values and (iii) the resulting bio-logical conclusions Technical andor biological metadata canbe provided for each level independently
These three categories enable the classification of the ex-isting MS proteomics repositories according to their level ofspecialization (Fig 1) In our view these three levels of infor-mation should be captured and properly annotated in publicdatabases and repositories ideally using data standards whenavailable In fact the development of proteomics resources ismore feasible due the maturity of some data standard formats[45ndash47] and open source tools [9] which facilitate public datadeposition
Some of the first open formats developed includedmzXML (for MS data) [48] pepXML and protXML (for
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
932 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
peptideprotein identifications) that were developed as part ofthe trans-proteomic pipeline (TPP) [49] In the context of theProteomics Standards Initiative (PSI) several standard dataformats have been developed over the last few years whichreflect the variety in data types within the field and thereforethose that can be supported by proteomics resources Themain formats developed have been mzML (for MS data) [45]mzIdentML (for peptide and protein identification results)[46] mzQuantML (for capturing a detailed trace of each stageof quantitative analysis) [47] TraML (for representing inputtransitions in SRM approaches) [50] and the recent mzTab(for capturing a simpler summary of the final results) [51]The aims and functionalities of the existing resources willbe explored in detail in the following sections (Table 1)
3 Resources
31 The PX consortium
The PX consortium [37] (httpwwwproteomexchangeorg)was created to promote the collaboration and integration ofmajor stakeholders in the domain of MS proteomics repos-itories comprising among others primary (PRIDE [27] andPASSEL [30]) and secondary resources (PeptideAtlas) pro-teomics researchers and representatives from journals reg-ularly publishing proteomics data Recently in June 2014MassIVE joined the consortium The aim of PX is to pro-vide a common framework for the cooperation of proteomicsresources by defining and implementing consistent harmo-nized user-friendly data deposition and dissemination pro-cedures In addition another important goal is to enable andprovide ldquomutual backuprdquo if one of the resources has fundingissues
The consortiumrsquos members have agreed in providing a suf-ficient set of common experimental and technical metadataThis information is stored using the PX XML format [37]Finally all the submitted datasets get a unique and universalidentifier (PXD identifier)
311 Data submission and format support
By August 2014 two major workflows are fully supportedin PX MSMS and SRM approaches In the first stable im-plementation of the data workflow PRIDE acts as the ini-tial submission point for MSMS data whereas PASSEL hasthe equivalent role for SRM data Both workflows will be ex-plained in detail below At the moment of writing MassIVEhas just joined PX aiming to have an equivalent role to PRIDE
There are two different PX MSMS submission modesldquoCompleterdquo and ldquoPartialrdquo To perform a ldquocompleterdquo submis-sion means that after all the files have been submitted itis possible for the receiving repository to connect directlythe processed identification results with the mass spectraThis can be achieved if the processed identification results
are available in a format supported by the receiving reposi-tory (eg mzIdentML) and if peak list files are included inthe submission ldquoCompleterdquo submissions get a Digital ObjectIdentifier (DOI) to facilitate its traceability
On the other hand after performing a ldquopartialrdquo submissionthe connection between the spectra and the identification re-sults cannot be done in a straightforward way In this case theprocessed results are not available in a supported format bythe receiving repository and the corresponding search engineoutput files (in heterogeneous formats) are made available fordownload For both types of submissions metadata and rawdata are always stored for each dataset
Although ldquopartialrdquo submissions are searchable by theirmetadata peptide and protein identifications cannot be cap-tured by the receiving repository which decreases the abilityof reviewers to check the data and can make data reuse bythird parties challenging For instance ldquopartialrdquo submissionsdo not qualify for the requirements on spectra annotationfrom the journal MCP (Molecular and Cellular ProteomicsmdashhttpwwwmcponlineorgsitemiscParisReport_Finalxhtml)
Finally it needs to be highlighted that all PX memberssupport private review of the data during the manuscriptreview process The submitted data remains private beforemanuscript publication and login details are provided to fa-cilitate access for reviewers and journal editors during themanuscript review process
312 Data mining and visualization
ProteomeCentral (httpproteomecentralproteomexchangeorg) is the centralized portal for accessing all PX datasetsindependently from the original resource where data werestored ProteomeCentral provides the ability to search meta-data associated with datasets in the participating repositories(PRIDE and MassIVE for MSMS data PASSEL for SRMdata or reprocessed original PX datasets in PeptideAtlas) Itis then possible to query the archive and identify datasets ofinterest using biological and technical metadata keywordstags or publication information To monitor the releaseof new public PX datasets researchers can subscribe to aRich Site Summary feed (httpgroupsgooglecomgroupproteomexchangefeedrss_v2_0_msgsxml) Next the maincharacteristics of the individual members of the consortiumwill be explained
32 PRIDE
The PRIDE database [27] (httpwwwebiacukpride) wasinitially developed at the European Bioinformatics Institute(EBI Cambridge UK) to store the experimental data includedin publications supporting the manuscript review processThe main data types stored in PRIDE are peptideproteinidentifications (including PTMs) peptideprotein expression
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 933
Ta
ble
1
Mai
nch
arac
teri
stic
so
fth
em
ajo
rM
S-b
ased
pro
teo
mic
sre
po
sito
ries
and
dat
abas
es
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PR
IDE
(Ju
ne
2014
)X
ndashH
igh
leve
l41
835
Pro
tein
acce
ssio
ns
269
806
Un
iqu
ep
epti
de
seq
uen
ces
Ap
pro
xim
atel
y10
1m
illio
nsp
ectr
a
Ap
pro
xim
atel
y45
0sp
ecie
sX
PR
IDE
Insp
ecto
rP
RID
EC
onv
erte
r2
Pep
tid
eSh
aker
htt
p
ww
we
bi
acu
kp
rid
ew
sar
chiv
eh
ttp
w
ww
eb
iac
uk
pri
de
Pep
tid
eAtl
as(H
um
an
Au
gu
st20
13)
XX
Med
ium
leve
l14
018
Pro
tein
s33
801
3Pe
pti
des
Ap
pro
xim
atel
y25
8m
illio
nsp
ectr
a
Mu
sm
usc
ulu
sC
and
ida
alb
ican
sC
and
ida
Cae
no
rhab
dit
isel
egan
sD
roso
ph
ilam
elan
oga
ster
H
alo
bac
teri
um
Eq
uu
scab
allu
sR
attu
sn
orv
egic
us
Sac
char
om
yces
cere
visi
ae
Dan
iore
rio
Su
sscr
ofa
M
yco
bac
teri
um
tub
ercu
losi
s
ndashT
IQA
M
TIQ
AM
ndashDig
esto
rT
IQA
Mndash
Pep
tid
eAtl
as
TIQ
AM
ndashVie
wer
A
TAQ
SP
IPE
2PA
BS
T
htt
p
ww
wp
epti
dea
tlas
org
GP
MD
B(M
ay20
14)
ndashX
Hig
hle
vel
136
373
Pro
tein
acce
ssio
ns
178
669
8Pe
pti
des
Ap
pro
xim
atel
y10
20m
illio
nsp
ectr
a
Bo
sta
uru
sC
anis
fam
iliar
isH
om
osa
pie
ns
Mm
usc
ulu
sR
no
rveg
icu
sG
allu
sga
llus
Dr
erio
Xen
op
us
tro
pic
alis
An
op
hel
esga
mb
iae
Ap
ism
ellif
era
Dm
elan
oga
ster
Ce
lega
ns
Ory
zasa
tiva
Ara
bid
op
sis
thal
ian
aS
acch
aro
myc
esce
revi
siae
Bac
illu
san
thra
cis
Am
esE
sch
eric
hia
coli
(K12
)La
cto
cocc
us
lact
is(I
l140
3)
Mt
ub
ercu
losi
sS
hig
ella
dys
ente
riae
S
alm
on
ella
typ
hi
Sal
mo
nel
laty
ph
imu
riu
m(L
T2)
ndashndash
htt
p
rest
th
egp
mo
rg1
htt
p
gp
md
bt
heg
pm
org
Mas
sIV
E(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
p
mas
sive
ucs
de
du
Ch
oru
s(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
ps
ch
oru
spro
ject
org
Pro
teo
mic
sDB
(May
2014
)X
ndashH
igh
leve
l18
097
Pro
tein
s73
940
6Pe
pti
des
Ap
pro
xim
atel
y70
mill
ion
spec
tra
Hs
apie
ns
Xndash
ndashh
ttp
sw
ww
pro
teo
mic
sdb
org
MO
PE
D(M
ay20
14)
ndashndash
Med
ium
leve
l17
141
Pro
tein
s25
000
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
15m
illio
nsp
ectr
a
Hs
apie
ns
Mm
usc
ulu
sC
ele
gan
sS
cer
evis
iae
Xndash
ndashh
ttp
sw
ww
pro
tein
spir
eo
rg
MO
PE
D
Hu
man
Pro
tein
ped
ia(M
ay20
14)
Xndash
Low
leve
l15
231
Pro
tein
s1
960
352
Pep
tid
esA
pp
roxi
mat
ely
5m
illio
nsp
ectr
a
Hs
apie
ns
ndashndash
ndashh
ttp
w
ww
hu
man
pro
tein
ped
iao
rg
Max
QB
(May
2014
)ndash
ndashM
ediu
mle
vel
1473
2Pr
ote
ins
370
551
Pep
tid
esA
pp
roxi
mat
ely
20m
illio
nsp
ectr
a
Mm
usc
ulu
sH
sap
ien
sS
cer
evis
iae
Xndash
ndashh
ttp
m
axq
bb
ioch
emm
pg
de
mxd
b
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta
ble
1
Co
nti
nu
ed
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PaxD
b(M
ay20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s14
345
6Pe
pti
des
Ap
pro
xim
atel
y24
mill
ion
spec
tra
At
hal
ian
aM
mu
scu
lus
Hs
apie
ns
Sc
erev
isia
eD
mel
ano
gast
erE
co
li(K
12)
Mic
rocy
stis
aeru
gin
osa
C
ele
gan
sLe
pto
spir
ain
terr
oga
ns
sero
var
Co
pen
hag
eni
Mt
ub
ercu
losi
s(H
37R
v)S
trep
toco
ccu
sp
yog
enes
M1
GA
SS
chiz
osa
cch
aro
myc
esp
om
be
Bt
auru
sS
dys
ente
riae
Bs
ub
tilis
G
gal
lus
Th
erm
oco
ccu
sga
mm
ato
lera
ns
(EJ3
)M
yco
pla
sma
pn
eum
on
iae
Hal
ob
acte
riu
m(N
RC
ndash1)
St
yph
imu
riu
mLT
2R
no
rveg
icu
sD
ein
oco
ccu
sd
eser
ti(V
CD
115)
S
hig
ella
flex
ner
iS
ynec
ho
cyst
is
Am
ellif
era
Cf
amili
aris
Su
ssc
rofa
X
tro
pic
alis
Os
ativ
a
Xndash
htt
p
pax
ndashdb
org
ap
isea
rch
q=
hu
man
htt
p
pax
-db
org
HP
M(J
un
e20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s29
300
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
25m
illio
nsp
ectr
a
Hs
apie
ns
Xndash
ndashh
ttp
h
um
anp
rote
om
emap
org
Xt
he
feat
ure
or
char
acte
rist
icis
sup
po
rted
-t
he
feat
ure
isn
ot
sup
po
rted
i
tw
asn
ot
po
ssib
leto
retr
ieve
the
corr
esp
on
din
gin
form
atio
n
values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)
321 Data submission and format support
As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE
The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time
The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 935
Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month
mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others
The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps
(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg
the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the
PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or
FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets
In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so
322 Data mining and visualization
PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers
The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 931
the size of data produced in proteomics laboratories has in-creased by several orders of magnitude [12]
Compared to other data-intensive fields such as genomicsdeposition and storage of original proteomics data in publicresources have been less common [13] This is regrettablesince proteome studies are usually more complex than itscounterpart genomics ones In fact data interpretation in pro-teomics can be considerably more complex than in genomicsdue to the wide variety of analytical approaches [14 15]bioinformatics tools and pipelines [16 17] and the relatedstatistical analysis [1819] However thanks to the guidelinespromoted by several scientific journals and funding agencies[20] there is a growing consensus in the community about theneed for the public dissemination of proteomics data whichis already facilitating the assessment reuse comparativeanalyses and extraction of new findings from published data[13 21]
The complexity of proteomics data is heightened by alter-native splicing PTMs and protein degradation events andis further amplified by the interconnectivity of proteins intocomplexes and signaling networks that are highly divergent intime and space [1] In order to address this complexity newanalytical and bioinformatics methodologies are developedevery year [22 23] which complicate the data standardiza-tion and deposition Additionally the audience interested inproteomics data is very heterogeneous It includes biologistselucidating the mechanisms of regulation of specific proteinsMS researchers improving the current analytical methods orcomputational biologists developing new software tools forthe analysis and interpretation of the data [24]
Data sharing in proteomics requires substantial invest-ment and infrastructure Several public repositories havebeen developed each with different purposes in mind Well-established databases for proteomics data are the Global Pro-teome Machine Database (GPMDB) [25] PeptideAtlas [26]and the PRIDE database [27] Additionally at present thereare other resources (many of them recently developed) suchas ProteomicsDB [28] MassIVE (Mass Spectrometry Inter-active Virtual Environment) Chorus MaxQB [29] PASSEL(PeptideAtlas SRM Experiment Library) [30] MOPED (ModelOrganism Protein Expression Database) [31] PaxDb [32]Human Proteinpedia [33] and the human proteome map(HPM) [34] Furthermore there are several more specializedresources that will only be cited briefly in this review It isimportant to mention here that no single proteomics dataresource will be ideally suited to all possible use cases andall potential users Regrettably two widely used resourceswere discontinued due to lack of funding Peptidome [35]and Tranche [36] This had a negative impact on the effortspromoting data sharing in the field as it was perceived by thecommunity that effort invested in data sharing was lost
Recently the ProteomeXchange (PX) consortium [37] hasbeen formed to enable a better integration of public repos-itories maximizing its benefits to the scientific commu-nity through the implementation of standardized submissionand dissemination pipelines for proteomics information By
Figure 1 Hierarchy of proteomics data repositories anddatabases according to the different data types stored raw MSdata repositories resources that store peptideprotein identifi-cation and quantification results and protein knowledge basesSome resources are duplicated in different levels because theycan be included in more than one category
August 2014 PRIDE PeptideAtlas PASSEL and MassIVEare the active members of the consortium
The aim of this review is to provide an up-to-date overviewof the current state of proteomics data repositories anddatabases providing a solid starting point for those who wantto perform data submission andor data mining There are afew comparable reviews available in the literature [2438ndash41]but there is a need for an update since this has been quite adynamic field over the past few years In this manuscript wewill not include a thorough review about protein knowledgebases such as the Universal Protein resource (UniProt) [42]and neXtProt [43] but we will explain how MS proteomicsinformation is made available in these resources
2 Organization of proteomics repositoriesand databases
The information generated in a typical proteomics experi-ment can be organized in three different levels [44] (i) rawdata (ii) processed results including peptideprotein identi-fication and quantification values and (iii) the resulting bio-logical conclusions Technical andor biological metadata canbe provided for each level independently
These three categories enable the classification of the ex-isting MS proteomics repositories according to their level ofspecialization (Fig 1) In our view these three levels of infor-mation should be captured and properly annotated in publicdatabases and repositories ideally using data standards whenavailable In fact the development of proteomics resources ismore feasible due the maturity of some data standard formats[45ndash47] and open source tools [9] which facilitate public datadeposition
Some of the first open formats developed includedmzXML (for MS data) [48] pepXML and protXML (for
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
932 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
peptideprotein identifications) that were developed as part ofthe trans-proteomic pipeline (TPP) [49] In the context of theProteomics Standards Initiative (PSI) several standard dataformats have been developed over the last few years whichreflect the variety in data types within the field and thereforethose that can be supported by proteomics resources Themain formats developed have been mzML (for MS data) [45]mzIdentML (for peptide and protein identification results)[46] mzQuantML (for capturing a detailed trace of each stageof quantitative analysis) [47] TraML (for representing inputtransitions in SRM approaches) [50] and the recent mzTab(for capturing a simpler summary of the final results) [51]The aims and functionalities of the existing resources willbe explored in detail in the following sections (Table 1)
3 Resources
31 The PX consortium
The PX consortium [37] (httpwwwproteomexchangeorg)was created to promote the collaboration and integration ofmajor stakeholders in the domain of MS proteomics repos-itories comprising among others primary (PRIDE [27] andPASSEL [30]) and secondary resources (PeptideAtlas) pro-teomics researchers and representatives from journals reg-ularly publishing proteomics data Recently in June 2014MassIVE joined the consortium The aim of PX is to pro-vide a common framework for the cooperation of proteomicsresources by defining and implementing consistent harmo-nized user-friendly data deposition and dissemination pro-cedures In addition another important goal is to enable andprovide ldquomutual backuprdquo if one of the resources has fundingissues
The consortiumrsquos members have agreed in providing a suf-ficient set of common experimental and technical metadataThis information is stored using the PX XML format [37]Finally all the submitted datasets get a unique and universalidentifier (PXD identifier)
311 Data submission and format support
By August 2014 two major workflows are fully supportedin PX MSMS and SRM approaches In the first stable im-plementation of the data workflow PRIDE acts as the ini-tial submission point for MSMS data whereas PASSEL hasthe equivalent role for SRM data Both workflows will be ex-plained in detail below At the moment of writing MassIVEhas just joined PX aiming to have an equivalent role to PRIDE
There are two different PX MSMS submission modesldquoCompleterdquo and ldquoPartialrdquo To perform a ldquocompleterdquo submis-sion means that after all the files have been submitted itis possible for the receiving repository to connect directlythe processed identification results with the mass spectraThis can be achieved if the processed identification results
are available in a format supported by the receiving reposi-tory (eg mzIdentML) and if peak list files are included inthe submission ldquoCompleterdquo submissions get a Digital ObjectIdentifier (DOI) to facilitate its traceability
On the other hand after performing a ldquopartialrdquo submissionthe connection between the spectra and the identification re-sults cannot be done in a straightforward way In this case theprocessed results are not available in a supported format bythe receiving repository and the corresponding search engineoutput files (in heterogeneous formats) are made available fordownload For both types of submissions metadata and rawdata are always stored for each dataset
Although ldquopartialrdquo submissions are searchable by theirmetadata peptide and protein identifications cannot be cap-tured by the receiving repository which decreases the abilityof reviewers to check the data and can make data reuse bythird parties challenging For instance ldquopartialrdquo submissionsdo not qualify for the requirements on spectra annotationfrom the journal MCP (Molecular and Cellular ProteomicsmdashhttpwwwmcponlineorgsitemiscParisReport_Finalxhtml)
Finally it needs to be highlighted that all PX memberssupport private review of the data during the manuscriptreview process The submitted data remains private beforemanuscript publication and login details are provided to fa-cilitate access for reviewers and journal editors during themanuscript review process
312 Data mining and visualization
ProteomeCentral (httpproteomecentralproteomexchangeorg) is the centralized portal for accessing all PX datasetsindependently from the original resource where data werestored ProteomeCentral provides the ability to search meta-data associated with datasets in the participating repositories(PRIDE and MassIVE for MSMS data PASSEL for SRMdata or reprocessed original PX datasets in PeptideAtlas) Itis then possible to query the archive and identify datasets ofinterest using biological and technical metadata keywordstags or publication information To monitor the releaseof new public PX datasets researchers can subscribe to aRich Site Summary feed (httpgroupsgooglecomgroupproteomexchangefeedrss_v2_0_msgsxml) Next the maincharacteristics of the individual members of the consortiumwill be explained
32 PRIDE
The PRIDE database [27] (httpwwwebiacukpride) wasinitially developed at the European Bioinformatics Institute(EBI Cambridge UK) to store the experimental data includedin publications supporting the manuscript review processThe main data types stored in PRIDE are peptideproteinidentifications (including PTMs) peptideprotein expression
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 933
Ta
ble
1
Mai
nch
arac
teri
stic
so
fth
em
ajo
rM
S-b
ased
pro
teo
mic
sre
po
sito
ries
and
dat
abas
es
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PR
IDE
(Ju
ne
2014
)X
ndashH
igh
leve
l41
835
Pro
tein
acce
ssio
ns
269
806
Un
iqu
ep
epti
de
seq
uen
ces
Ap
pro
xim
atel
y10
1m
illio
nsp
ectr
a
Ap
pro
xim
atel
y45
0sp
ecie
sX
PR
IDE
Insp
ecto
rP
RID
EC
onv
erte
r2
Pep
tid
eSh
aker
htt
p
ww
we
bi
acu
kp
rid
ew
sar
chiv
eh
ttp
w
ww
eb
iac
uk
pri
de
Pep
tid
eAtl
as(H
um
an
Au
gu
st20
13)
XX
Med
ium
leve
l14
018
Pro
tein
s33
801
3Pe
pti
des
Ap
pro
xim
atel
y25
8m
illio
nsp
ectr
a
Mu
sm
usc
ulu
sC
and
ida
alb
ican
sC
and
ida
Cae
no
rhab
dit
isel
egan
sD
roso
ph
ilam
elan
oga
ster
H
alo
bac
teri
um
Eq
uu
scab
allu
sR
attu
sn
orv
egic
us
Sac
char
om
yces
cere
visi
ae
Dan
iore
rio
Su
sscr
ofa
M
yco
bac
teri
um
tub
ercu
losi
s
ndashT
IQA
M
TIQ
AM
ndashDig
esto
rT
IQA
Mndash
Pep
tid
eAtl
as
TIQ
AM
ndashVie
wer
A
TAQ
SP
IPE
2PA
BS
T
htt
p
ww
wp
epti
dea
tlas
org
GP
MD
B(M
ay20
14)
ndashX
Hig
hle
vel
136
373
Pro
tein
acce
ssio
ns
178
669
8Pe
pti
des
Ap
pro
xim
atel
y10
20m
illio
nsp
ectr
a
Bo
sta
uru
sC
anis
fam
iliar
isH
om
osa
pie
ns
Mm
usc
ulu
sR
no
rveg
icu
sG
allu
sga
llus
Dr
erio
Xen
op
us
tro
pic
alis
An
op
hel
esga
mb
iae
Ap
ism
ellif
era
Dm
elan
oga
ster
Ce
lega
ns
Ory
zasa
tiva
Ara
bid
op
sis
thal
ian
aS
acch
aro
myc
esce
revi
siae
Bac
illu
san
thra
cis
Am
esE
sch
eric
hia
coli
(K12
)La
cto
cocc
us
lact
is(I
l140
3)
Mt
ub
ercu
losi
sS
hig
ella
dys
ente
riae
S
alm
on
ella
typ
hi
Sal
mo
nel
laty
ph
imu
riu
m(L
T2)
ndashndash
htt
p
rest
th
egp
mo
rg1
htt
p
gp
md
bt
heg
pm
org
Mas
sIV
E(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
p
mas
sive
ucs
de
du
Ch
oru
s(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
ps
ch
oru
spro
ject
org
Pro
teo
mic
sDB
(May
2014
)X
ndashH
igh
leve
l18
097
Pro
tein
s73
940
6Pe
pti
des
Ap
pro
xim
atel
y70
mill
ion
spec
tra
Hs
apie
ns
Xndash
ndashh
ttp
sw
ww
pro
teo
mic
sdb
org
MO
PE
D(M
ay20
14)
ndashndash
Med
ium
leve
l17
141
Pro
tein
s25
000
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
15m
illio
nsp
ectr
a
Hs
apie
ns
Mm
usc
ulu
sC
ele
gan
sS
cer
evis
iae
Xndash
ndashh
ttp
sw
ww
pro
tein
spir
eo
rg
MO
PE
D
Hu
man
Pro
tein
ped
ia(M
ay20
14)
Xndash
Low
leve
l15
231
Pro
tein
s1
960
352
Pep
tid
esA
pp
roxi
mat
ely
5m
illio
nsp
ectr
a
Hs
apie
ns
ndashndash
ndashh
ttp
w
ww
hu
man
pro
tein
ped
iao
rg
Max
QB
(May
2014
)ndash
ndashM
ediu
mle
vel
1473
2Pr
ote
ins
370
551
Pep
tid
esA
pp
roxi
mat
ely
20m
illio
nsp
ectr
a
Mm
usc
ulu
sH
sap
ien
sS
cer
evis
iae
Xndash
ndashh
ttp
m
axq
bb
ioch
emm
pg
de
mxd
b
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta
ble
1
Co
nti
nu
ed
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PaxD
b(M
ay20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s14
345
6Pe
pti
des
Ap
pro
xim
atel
y24
mill
ion
spec
tra
At
hal
ian
aM
mu
scu
lus
Hs
apie
ns
Sc
erev
isia
eD
mel
ano
gast
erE
co
li(K
12)
Mic
rocy
stis
aeru
gin
osa
C
ele
gan
sLe
pto
spir
ain
terr
oga
ns
sero
var
Co
pen
hag
eni
Mt
ub
ercu
losi
s(H
37R
v)S
trep
toco
ccu
sp
yog
enes
M1
GA
SS
chiz
osa
cch
aro
myc
esp
om
be
Bt
auru
sS
dys
ente
riae
Bs
ub
tilis
G
gal
lus
Th
erm
oco
ccu
sga
mm
ato
lera
ns
(EJ3
)M
yco
pla
sma
pn
eum
on
iae
Hal
ob
acte
riu
m(N
RC
ndash1)
St
yph
imu
riu
mLT
2R
no
rveg
icu
sD
ein
oco
ccu
sd
eser
ti(V
CD
115)
S
hig
ella
flex
ner
iS
ynec
ho
cyst
is
Am
ellif
era
Cf
amili
aris
Su
ssc
rofa
X
tro
pic
alis
Os
ativ
a
Xndash
htt
p
pax
ndashdb
org
ap
isea
rch
q=
hu
man
htt
p
pax
-db
org
HP
M(J
un
e20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s29
300
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
25m
illio
nsp
ectr
a
Hs
apie
ns
Xndash
ndashh
ttp
h
um
anp
rote
om
emap
org
Xt
he
feat
ure
or
char
acte
rist
icis
sup
po
rted
-t
he
feat
ure
isn
ot
sup
po
rted
i
tw
asn
ot
po
ssib
leto
retr
ieve
the
corr
esp
on
din
gin
form
atio
n
values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)
321 Data submission and format support
As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE
The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time
The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 935
Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month
mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others
The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps
(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg
the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the
PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or
FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets
In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so
322 Data mining and visualization
PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers
The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
932 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
peptideprotein identifications) that were developed as part ofthe trans-proteomic pipeline (TPP) [49] In the context of theProteomics Standards Initiative (PSI) several standard dataformats have been developed over the last few years whichreflect the variety in data types within the field and thereforethose that can be supported by proteomics resources Themain formats developed have been mzML (for MS data) [45]mzIdentML (for peptide and protein identification results)[46] mzQuantML (for capturing a detailed trace of each stageof quantitative analysis) [47] TraML (for representing inputtransitions in SRM approaches) [50] and the recent mzTab(for capturing a simpler summary of the final results) [51]The aims and functionalities of the existing resources willbe explored in detail in the following sections (Table 1)
3 Resources
31 The PX consortium
The PX consortium [37] (httpwwwproteomexchangeorg)was created to promote the collaboration and integration ofmajor stakeholders in the domain of MS proteomics repos-itories comprising among others primary (PRIDE [27] andPASSEL [30]) and secondary resources (PeptideAtlas) pro-teomics researchers and representatives from journals reg-ularly publishing proteomics data Recently in June 2014MassIVE joined the consortium The aim of PX is to pro-vide a common framework for the cooperation of proteomicsresources by defining and implementing consistent harmo-nized user-friendly data deposition and dissemination pro-cedures In addition another important goal is to enable andprovide ldquomutual backuprdquo if one of the resources has fundingissues
The consortiumrsquos members have agreed in providing a suf-ficient set of common experimental and technical metadataThis information is stored using the PX XML format [37]Finally all the submitted datasets get a unique and universalidentifier (PXD identifier)
311 Data submission and format support
By August 2014 two major workflows are fully supportedin PX MSMS and SRM approaches In the first stable im-plementation of the data workflow PRIDE acts as the ini-tial submission point for MSMS data whereas PASSEL hasthe equivalent role for SRM data Both workflows will be ex-plained in detail below At the moment of writing MassIVEhas just joined PX aiming to have an equivalent role to PRIDE
There are two different PX MSMS submission modesldquoCompleterdquo and ldquoPartialrdquo To perform a ldquocompleterdquo submis-sion means that after all the files have been submitted itis possible for the receiving repository to connect directlythe processed identification results with the mass spectraThis can be achieved if the processed identification results
are available in a format supported by the receiving reposi-tory (eg mzIdentML) and if peak list files are included inthe submission ldquoCompleterdquo submissions get a Digital ObjectIdentifier (DOI) to facilitate its traceability
On the other hand after performing a ldquopartialrdquo submissionthe connection between the spectra and the identification re-sults cannot be done in a straightforward way In this case theprocessed results are not available in a supported format bythe receiving repository and the corresponding search engineoutput files (in heterogeneous formats) are made available fordownload For both types of submissions metadata and rawdata are always stored for each dataset
Although ldquopartialrdquo submissions are searchable by theirmetadata peptide and protein identifications cannot be cap-tured by the receiving repository which decreases the abilityof reviewers to check the data and can make data reuse bythird parties challenging For instance ldquopartialrdquo submissionsdo not qualify for the requirements on spectra annotationfrom the journal MCP (Molecular and Cellular ProteomicsmdashhttpwwwmcponlineorgsitemiscParisReport_Finalxhtml)
Finally it needs to be highlighted that all PX memberssupport private review of the data during the manuscriptreview process The submitted data remains private beforemanuscript publication and login details are provided to fa-cilitate access for reviewers and journal editors during themanuscript review process
312 Data mining and visualization
ProteomeCentral (httpproteomecentralproteomexchangeorg) is the centralized portal for accessing all PX datasetsindependently from the original resource where data werestored ProteomeCentral provides the ability to search meta-data associated with datasets in the participating repositories(PRIDE and MassIVE for MSMS data PASSEL for SRMdata or reprocessed original PX datasets in PeptideAtlas) Itis then possible to query the archive and identify datasets ofinterest using biological and technical metadata keywordstags or publication information To monitor the releaseof new public PX datasets researchers can subscribe to aRich Site Summary feed (httpgroupsgooglecomgroupproteomexchangefeedrss_v2_0_msgsxml) Next the maincharacteristics of the individual members of the consortiumwill be explained
32 PRIDE
The PRIDE database [27] (httpwwwebiacukpride) wasinitially developed at the European Bioinformatics Institute(EBI Cambridge UK) to store the experimental data includedin publications supporting the manuscript review processThe main data types stored in PRIDE are peptideproteinidentifications (including PTMs) peptideprotein expression
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 933
Ta
ble
1
Mai
nch
arac
teri
stic
so
fth
em
ajo
rM
S-b
ased
pro
teo
mic
sre
po
sito
ries
and
dat
abas
es
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PR
IDE
(Ju
ne
2014
)X
ndashH
igh
leve
l41
835
Pro
tein
acce
ssio
ns
269
806
Un
iqu
ep
epti
de
seq
uen
ces
Ap
pro
xim
atel
y10
1m
illio
nsp
ectr
a
Ap
pro
xim
atel
y45
0sp
ecie
sX
PR
IDE
Insp
ecto
rP
RID
EC
onv
erte
r2
Pep
tid
eSh
aker
htt
p
ww
we
bi
acu
kp
rid
ew
sar
chiv
eh
ttp
w
ww
eb
iac
uk
pri
de
Pep
tid
eAtl
as(H
um
an
Au
gu
st20
13)
XX
Med
ium
leve
l14
018
Pro
tein
s33
801
3Pe
pti
des
Ap
pro
xim
atel
y25
8m
illio
nsp
ectr
a
Mu
sm
usc
ulu
sC
and
ida
alb
ican
sC
and
ida
Cae
no
rhab
dit
isel
egan
sD
roso
ph
ilam
elan
oga
ster
H
alo
bac
teri
um
Eq
uu
scab
allu
sR
attu
sn
orv
egic
us
Sac
char
om
yces
cere
visi
ae
Dan
iore
rio
Su
sscr
ofa
M
yco
bac
teri
um
tub
ercu
losi
s
ndashT
IQA
M
TIQ
AM
ndashDig
esto
rT
IQA
Mndash
Pep
tid
eAtl
as
TIQ
AM
ndashVie
wer
A
TAQ
SP
IPE
2PA
BS
T
htt
p
ww
wp
epti
dea
tlas
org
GP
MD
B(M
ay20
14)
ndashX
Hig
hle
vel
136
373
Pro
tein
acce
ssio
ns
178
669
8Pe
pti
des
Ap
pro
xim
atel
y10
20m
illio
nsp
ectr
a
Bo
sta
uru
sC
anis
fam
iliar
isH
om
osa
pie
ns
Mm
usc
ulu
sR
no
rveg
icu
sG
allu
sga
llus
Dr
erio
Xen
op
us
tro
pic
alis
An
op
hel
esga
mb
iae
Ap
ism
ellif
era
Dm
elan
oga
ster
Ce
lega
ns
Ory
zasa
tiva
Ara
bid
op
sis
thal
ian
aS
acch
aro
myc
esce
revi
siae
Bac
illu
san
thra
cis
Am
esE
sch
eric
hia
coli
(K12
)La
cto
cocc
us
lact
is(I
l140
3)
Mt
ub
ercu
losi
sS
hig
ella
dys
ente
riae
S
alm
on
ella
typ
hi
Sal
mo
nel
laty
ph
imu
riu
m(L
T2)
ndashndash
htt
p
rest
th
egp
mo
rg1
htt
p
gp
md
bt
heg
pm
org
Mas
sIV
E(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
p
mas
sive
ucs
de
du
Ch
oru
s(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
ps
ch
oru
spro
ject
org
Pro
teo
mic
sDB
(May
2014
)X
ndashH
igh
leve
l18
097
Pro
tein
s73
940
6Pe
pti
des
Ap
pro
xim
atel
y70
mill
ion
spec
tra
Hs
apie
ns
Xndash
ndashh
ttp
sw
ww
pro
teo
mic
sdb
org
MO
PE
D(M
ay20
14)
ndashndash
Med
ium
leve
l17
141
Pro
tein
s25
000
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
15m
illio
nsp
ectr
a
Hs
apie
ns
Mm
usc
ulu
sC
ele
gan
sS
cer
evis
iae
Xndash
ndashh
ttp
sw
ww
pro
tein
spir
eo
rg
MO
PE
D
Hu
man
Pro
tein
ped
ia(M
ay20
14)
Xndash
Low
leve
l15
231
Pro
tein
s1
960
352
Pep
tid
esA
pp
roxi
mat
ely
5m
illio
nsp
ectr
a
Hs
apie
ns
ndashndash
ndashh
ttp
w
ww
hu
man
pro
tein
ped
iao
rg
Max
QB
(May
2014
)ndash
ndashM
ediu
mle
vel
1473
2Pr
ote
ins
370
551
Pep
tid
esA
pp
roxi
mat
ely
20m
illio
nsp
ectr
a
Mm
usc
ulu
sH
sap
ien
sS
cer
evis
iae
Xndash
ndashh
ttp
m
axq
bb
ioch
emm
pg
de
mxd
b
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta
ble
1
Co
nti
nu
ed
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PaxD
b(M
ay20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s14
345
6Pe
pti
des
Ap
pro
xim
atel
y24
mill
ion
spec
tra
At
hal
ian
aM
mu
scu
lus
Hs
apie
ns
Sc
erev
isia
eD
mel
ano
gast
erE
co
li(K
12)
Mic
rocy
stis
aeru
gin
osa
C
ele
gan
sLe
pto
spir
ain
terr
oga
ns
sero
var
Co
pen
hag
eni
Mt
ub
ercu
losi
s(H
37R
v)S
trep
toco
ccu
sp
yog
enes
M1
GA
SS
chiz
osa
cch
aro
myc
esp
om
be
Bt
auru
sS
dys
ente
riae
Bs
ub
tilis
G
gal
lus
Th
erm
oco
ccu
sga
mm
ato
lera
ns
(EJ3
)M
yco
pla
sma
pn
eum
on
iae
Hal
ob
acte
riu
m(N
RC
ndash1)
St
yph
imu
riu
mLT
2R
no
rveg
icu
sD
ein
oco
ccu
sd
eser
ti(V
CD
115)
S
hig
ella
flex
ner
iS
ynec
ho
cyst
is
Am
ellif
era
Cf
amili
aris
Su
ssc
rofa
X
tro
pic
alis
Os
ativ
a
Xndash
htt
p
pax
ndashdb
org
ap
isea
rch
q=
hu
man
htt
p
pax
-db
org
HP
M(J
un
e20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s29
300
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
25m
illio
nsp
ectr
a
Hs
apie
ns
Xndash
ndashh
ttp
h
um
anp
rote
om
emap
org
Xt
he
feat
ure
or
char
acte
rist
icis
sup
po
rted
-t
he
feat
ure
isn
ot
sup
po
rted
i
tw
asn
ot
po
ssib
leto
retr
ieve
the
corr
esp
on
din
gin
form
atio
n
values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)
321 Data submission and format support
As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE
The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time
The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 935
Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month
mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others
The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps
(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg
the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the
PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or
FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets
In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so
322 Data mining and visualization
PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers
The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 933
Ta
ble
1
Mai
nch
arac
teri
stic
so
fth
em
ajo
rM
S-b
ased
pro
teo
mic
sre
po
sito
ries
and
dat
abas
es
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PR
IDE
(Ju
ne
2014
)X
ndashH
igh
leve
l41
835
Pro
tein
acce
ssio
ns
269
806
Un
iqu
ep
epti
de
seq
uen
ces
Ap
pro
xim
atel
y10
1m
illio
nsp
ectr
a
Ap
pro
xim
atel
y45
0sp
ecie
sX
PR
IDE
Insp
ecto
rP
RID
EC
onv
erte
r2
Pep
tid
eSh
aker
htt
p
ww
we
bi
acu
kp
rid
ew
sar
chiv
eh
ttp
w
ww
eb
iac
uk
pri
de
Pep
tid
eAtl
as(H
um
an
Au
gu
st20
13)
XX
Med
ium
leve
l14
018
Pro
tein
s33
801
3Pe
pti
des
Ap
pro
xim
atel
y25
8m
illio
nsp
ectr
a
Mu
sm
usc
ulu
sC
and
ida
alb
ican
sC
and
ida
Cae
no
rhab
dit
isel
egan
sD
roso
ph
ilam
elan
oga
ster
H
alo
bac
teri
um
Eq
uu
scab
allu
sR
attu
sn
orv
egic
us
Sac
char
om
yces
cere
visi
ae
Dan
iore
rio
Su
sscr
ofa
M
yco
bac
teri
um
tub
ercu
losi
s
ndashT
IQA
M
TIQ
AM
ndashDig
esto
rT
IQA
Mndash
Pep
tid
eAtl
as
TIQ
AM
ndashVie
wer
A
TAQ
SP
IPE
2PA
BS
T
htt
p
ww
wp
epti
dea
tlas
org
GP
MD
B(M
ay20
14)
ndashX
Hig
hle
vel
136
373
Pro
tein
acce
ssio
ns
178
669
8Pe
pti
des
Ap
pro
xim
atel
y10
20m
illio
nsp
ectr
a
Bo
sta
uru
sC
anis
fam
iliar
isH
om
osa
pie
ns
Mm
usc
ulu
sR
no
rveg
icu
sG
allu
sga
llus
Dr
erio
Xen
op
us
tro
pic
alis
An
op
hel
esga
mb
iae
Ap
ism
ellif
era
Dm
elan
oga
ster
Ce
lega
ns
Ory
zasa
tiva
Ara
bid
op
sis
thal
ian
aS
acch
aro
myc
esce
revi
siae
Bac
illu
san
thra
cis
Am
esE
sch
eric
hia
coli
(K12
)La
cto
cocc
us
lact
is(I
l140
3)
Mt
ub
ercu
losi
sS
hig
ella
dys
ente
riae
S
alm
on
ella
typ
hi
Sal
mo
nel
laty
ph
imu
riu
m(L
T2)
ndashndash
htt
p
rest
th
egp
mo
rg1
htt
p
gp
md
bt
heg
pm
org
Mas
sIV
E(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
p
mas
sive
ucs
de
du
Ch
oru
s(M
ay20
14)
Xndash
Low
leve
l
ndash
ndashndash
htt
ps
ch
oru
spro
ject
org
Pro
teo
mic
sDB
(May
2014
)X
ndashH
igh
leve
l18
097
Pro
tein
s73
940
6Pe
pti
des
Ap
pro
xim
atel
y70
mill
ion
spec
tra
Hs
apie
ns
Xndash
ndashh
ttp
sw
ww
pro
teo
mic
sdb
org
MO
PE
D(M
ay20
14)
ndashndash
Med
ium
leve
l17
141
Pro
tein
s25
000
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
15m
illio
nsp
ectr
a
Hs
apie
ns
Mm
usc
ulu
sC
ele
gan
sS
cer
evis
iae
Xndash
ndashh
ttp
sw
ww
pro
tein
spir
eo
rg
MO
PE
D
Hu
man
Pro
tein
ped
ia(M
ay20
14)
Xndash
Low
leve
l15
231
Pro
tein
s1
960
352
Pep
tid
esA
pp
roxi
mat
ely
5m
illio
nsp
ectr
a
Hs
apie
ns
ndashndash
ndashh
ttp
w
ww
hu
man
pro
tein
ped
iao
rg
Max
QB
(May
2014
)ndash
ndashM
ediu
mle
vel
1473
2Pr
ote
ins
370
551
Pep
tid
esA
pp
roxi
mat
ely
20m
illio
nsp
ectr
a
Mm
usc
ulu
sH
sap
ien
sS
cer
evis
iae
Xndash
ndashh
ttp
m
axq
bb
ioch
emm
pg
de
mxd
b
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta
ble
1
Co
nti
nu
ed
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PaxD
b(M
ay20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s14
345
6Pe
pti
des
Ap
pro
xim
atel
y24
mill
ion
spec
tra
At
hal
ian
aM
mu
scu
lus
Hs
apie
ns
Sc
erev
isia
eD
mel
ano
gast
erE
co
li(K
12)
Mic
rocy
stis
aeru
gin
osa
C
ele
gan
sLe
pto
spir
ain
terr
oga
ns
sero
var
Co
pen
hag
eni
Mt
ub
ercu
losi
s(H
37R
v)S
trep
toco
ccu
sp
yog
enes
M1
GA
SS
chiz
osa
cch
aro
myc
esp
om
be
Bt
auru
sS
dys
ente
riae
Bs
ub
tilis
G
gal
lus
Th
erm
oco
ccu
sga
mm
ato
lera
ns
(EJ3
)M
yco
pla
sma
pn
eum
on
iae
Hal
ob
acte
riu
m(N
RC
ndash1)
St
yph
imu
riu
mLT
2R
no
rveg
icu
sD
ein
oco
ccu
sd
eser
ti(V
CD
115)
S
hig
ella
flex
ner
iS
ynec
ho
cyst
is
Am
ellif
era
Cf
amili
aris
Su
ssc
rofa
X
tro
pic
alis
Os
ativ
a
Xndash
htt
p
pax
ndashdb
org
ap
isea
rch
q=
hu
man
htt
p
pax
-db
org
HP
M(J
un
e20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s29
300
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
25m
illio
nsp
ectr
a
Hs
apie
ns
Xndash
ndashh
ttp
h
um
anp
rote
om
emap
org
Xt
he
feat
ure
or
char
acte
rist
icis
sup
po
rted
-t
he
feat
ure
isn
ot
sup
po
rted
i
tw
asn
ot
po
ssib
leto
retr
ieve
the
corr
esp
on
din
gin
form
atio
n
values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)
321 Data submission and format support
As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE
The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time
The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 935
Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month
mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others
The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps
(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg
the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the
PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or
FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets
In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so
322 Data mining and visualization
PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers
The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
934 Y Perez-Riverol et al Proteomics 2015 15 930ndash949Ta
ble
1
Co
nti
nu
ed
Rep
osi
tori
esR
awd
ata
Su
pp
ort
for
targ
eted
app
roac
hes
Met
adat
aH
um
anp
rote
inex
pre
ssio
nin
form
atio
n
Sp
ecie
sQ
uan
ti-
fica
tio
nd
ata
Rel
ated
stan
d-a
lon
eto
ols
Web
serv
ices
UR
LU
RL
PaxD
b(M
ay20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s14
345
6Pe
pti
des
Ap
pro
xim
atel
y24
mill
ion
spec
tra
At
hal
ian
aM
mu
scu
lus
Hs
apie
ns
Sc
erev
isia
eD
mel
ano
gast
erE
co
li(K
12)
Mic
rocy
stis
aeru
gin
osa
C
ele
gan
sLe
pto
spir
ain
terr
oga
ns
sero
var
Co
pen
hag
eni
Mt
ub
ercu
losi
s(H
37R
v)S
trep
toco
ccu
sp
yog
enes
M1
GA
SS
chiz
osa
cch
aro
myc
esp
om
be
Bt
auru
sS
dys
ente
riae
Bs
ub
tilis
G
gal
lus
Th
erm
oco
ccu
sga
mm
ato
lera
ns
(EJ3
)M
yco
pla
sma
pn
eum
on
iae
Hal
ob
acte
riu
m(N
RC
ndash1)
St
yph
imu
riu
mLT
2R
no
rveg
icu
sD
ein
oco
ccu
sd
eser
ti(V
CD
115)
S
hig
ella
flex
ner
iS
ynec
ho
cyst
is
Am
ellif
era
Cf
amili
aris
Su
ssc
rofa
X
tro
pic
alis
Os
ativ
a
Xndash
htt
p
pax
ndashdb
org
ap
isea
rch
q=
hu
man
htt
p
pax
-db
org
HP
M(J
un
e20
14)
ndashndash
Low
leve
l10
482
Pro
tein
s29
300
0U
niq
ue
pep
tid
esA
pp
roxi
mat
ely
25m
illio
nsp
ectr
a
Hs
apie
ns
Xndash
ndashh
ttp
h
um
anp
rote
om
emap
org
Xt
he
feat
ure
or
char
acte
rist
icis
sup
po
rted
-t
he
feat
ure
isn
ot
sup
po
rted
i
tw
asn
ot
po
ssib
leto
retr
ieve
the
corr
esp
on
din
gin
form
atio
n
values the analyzed mass spectra (both as raw data andpeak lists) and the related technicalbiological metadataIn PRIDE data are stored as originally analyzed by the re-searchers (the authorrsquos analysis view on the data) supportingmany popular search enginesanalysis workflows Most ofthe terms information and metadata supporting the PRIDEdata are based on ontologies or controlled vocabularies [52]In this context the ldquoOntology Lookup Servicerdquo [53] was devel-oped as a spin-off of PRIDE to enable the querying browsingand navigation of biomedical ontologies Since the inceptionof PX the number of datasets submitted to PRIDE has grownconsiderably (Fig 2) In parallel the size of the experiments interms of spectrum numbers has also increased significantly(Fig 2)
321 Data submission and format support
As a member of the PX consortium PRIDE supports bothldquocompleterdquo and ldquopartialrdquo submissions For ldquocompleterdquo sub-missions processed identification results need to be pro-vided in PRIDE XML (PRIDE original data format) or thePSI standard mzIdentML (version 11) format If mzIdentMLis used the corresponding peak list files referenced by themzIdentML files need to be included as well A ldquocompleterdquosubmission ensures that the processed results data can be in-tegrated in PRIDE and visualized using tools such as PRIDEInspector [54] (see below) Therefore for performing a ldquocom-pleterdquo submission the output files from the analysis soft-ware need to be converted or exported to either mzIdentML(see httpwwwpsidevinfotools-implementing-mzidentmland httpwwwebiacukpridehelparchivesubmissionmzidentml) or PRIDE XML (using PRIDE Converter 2 [55] orother external tools) It is important to highlight that the re-cently implemented support for mzIdentML makes possiblethat data from a growing number of toolsanalysis software(that were never supported via PRIDE XML) can now be fullysupported by PRIDE
The PRIDE Converter 2 tool suite (httppride-converter-2googlecodecom) [55] is an open source andplatform-independent software that allows users to convertseveral search engine output files to PRIDE XML The toolsuite consists of four different applications PRIDE Converter2 PRIDE mzTab generator PRIDE XML merger and PRIDEXML filter All of these tools can be launched using the graph-ical user interface or from the command line PRIDE Con-verter 2 currently supports the output from MASCOT (dat)[56] XTandem (xml) [57] OMSSA (csv) among others Italso supports different mass spectra file formats (mzML dtamgf mzData mzXML and pkl) As mentioned above thereare other tools not developed by the PRIDE team that canalso export to PRIDE XML [37] It is expected that as the mzI-dentML format becomes more and more popular the use ofPRIDE XML will decrease in time
The ldquopartialrdquo submission route is aimed at situationswhere processed identification results cannot be generated in
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 935
Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month
mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others
The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps
(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg
the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the
PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or
FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets
In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so
322 Data mining and visualization
PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers
The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 935
Figure 2 Bubble chart representation of the size of the PX complete submissions to PRIDE (until May 2014) The x-axis includes monthswith at least one submission since PX submissions started (from March 2012) The y-axis corresponds to the number of PX ldquocompleterdquopublic datasets submitted to PRIDE in each specific month The size of each bubble represents the total number of mass spectra includedin all the datasets in a given month
mzIdentMLPRIDE XML if there is no converterexporteravailable However the ldquopartialrdquo submission mechanism alsoenables any data from any proteomics workflow to be storedin PRIDE As a consequence some datasets are already avail-able coming from workflows such as top-down proteomicsMS imaging or SWATH-MS among others
The PRIDEPX submission process has been recently de-scribed in detail [58] The PX submission tool (httpwwwproteomexchangeorgsubmission) [37] is an open-sourcestandalone tool that provides a user-friendly graphical userinterface for performing the actual data submission througha series of steps
(i) Select all the files needed for submission(ii) Interactively group-related different types of files (eg
the corresponding raw and processed results files)(iii) Ensure a minimum level of metadata (according to the
PX guidelines)(iv) Transfer the files via the Aspera (from version 21) or
FTP file transfer protocols Aspera (httpasperasoftcom) can perform up to 50 times faster than FTP en-abling a convenient way of transferring large datasets
In addition to the PX submission tool datasets containinga high number of files can also be submitted using a com-mand line based alternative (httpwwwebiacukpridehelparchiveaspera) [58] Each dataset becomes publiclyavailable on acceptance or publication of the correspondingmanuscript or when the authors tell PRIDE to do so
322 Data mining and visualization
PRIDE Inspector (httppride-toolsuitegooglecodecom)[54] is an open source standalone tool that can be used toefficiently browse and visualize MS proteomics data PRIDEInspector can be used by researchers before submission andalso by journal editors and reviewers during the manuscriptreview process The latest version available at the moment ofwriting (version 21) supports identification results in PRIDEXML and mzIdentML (used in PX ldquocompleterdquo submissions)It also supports spectra files in a variety of formats (mgf pklms2 mzXML mzData and mzML) PRIDE Inspector en-ables users to visualize and check the data at different levelsIt has different panels devoted to experimental metadata pro-tein peptide and spectrum-centric information Finally theldquosummary chartsrdquo tab provides eight different charts that canbe used to evaluate some aspects of the quality of the datasetApart from the visualization functionality it can access someof the most popular protein databases (UniProt knowledge-base (UniProtKB) Ensembl [59] and the National Center forBiotechnology Information (NCBI) nonredundant database)to retrieve the most up-to-date protein sequences and namesfor the reported protein identifiers
The PRIDE Archive website (httpwwwebiacukpridearchive) provides the web interface to query andretrieve the information in PRIDE It was launched on Jan-uary 2014 and at the moment of writing is still under itera-tive development so new features are being added constantlyBy August 2014 the current version allows querying PRIDE
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
936 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
using keywords publication species tissues diseases mod-ifications instruments peptide sequences and protein iden-tifiers When a specific dataset is selected the users are di-rected to the dataset summary page which also lists the assays(equivalent to the old PRIDE experiment numbers) related tothe project in the case of ldquocompleterdquo submissions Users candownload all the files via FTP or Aspera andor visualizethe results using PRIDE Inspector In the coming monthsvisualization for peptideprotein identifications will be avail-able in the PRIDE web At present the BioMart interface(httpwwwebiacukpridelegacyprideMartdo) is the eas-iest way to access this information However it is plannedthat it will be soon replaced by new PRIDE web services
Additionally PRIDE data can be accessed using theldquoPRIDE Clusterrdquo webpage (httpwwwebiacukpridecluster) [60] It includes public identified spectra in PRIDEthat have been clustered using the ldquoPRIDE Clusterrdquo algo-rithm (httpscodegooglecomppride-spectra-clustering)The PRIDE Cluster resource currently provides two mainmethods for accessing its data (i) retrieve all clusters thatcontain a given peptide identification and (ii) retrieve all clus-ters with a consensus spectrum similar to a queried spec-trum In addition spectral libraries for several species arealso provided ldquoPRIDE Clusterrdquo is the first quality controlstep attempted in a highly heterogeneous MS proteomicsrepository such as PRIDE
33 PASSEL
PASSEL [30] (httpwwwpeptideatlasorgpassel) supportsthe submission of datasets generated by SRM approaches bystoring the experimental results and the corresponding rawdata The submitted raw data are automatically reprocessedin a uniform manner using mQuest a component of themProphet software suite [61] and the results are loaded intothe database [62] The original files and the correspondingreprocessed results are made available to the communityAdditionally the measured transitions are incorporated intothe SRMAtlas [62] catalog of transitions Detailed metadataand structured sample information are in accordance withthe PX guidelines
331 Data submission and format support
PASSEL uses a web interface for the data submissionprocess (httpwwwpeptideatlasorgsubmit) and is nowsupporting the following data types (httpwwwpeptideatlasorgupload) (i) study metadata such as dataset titlesubmitter contact information sample source sample prepa-ration and instrument used (ii) transition lists describingwhich transitions were measured for each peptide and tar-geted ions along with optional supporting information (col-lision energy expected retention time and expected relativeintensities) This information is available in a tab-separatedfile or in the standard TraML format [50] and (iii) mass
spectrometer output files in mzML [45] or mzXML [48] for-mat If these are not available the vendor formats wiff (ABSCIEX) raw (Thermo) or d (Agilent) are also supported Allthe files supplied by the users are uploaded to a specially cre-ated FTP account and they can be browsed using the ldquoPASSELExperiment Browserrdquo
332 Data mining and visualization
The ldquoPASSEL Experiment Browserrdquo enables filtering basedon fields such as research contact information organismsample and instrument type As mentioned above it is alsopossible to download the original files provided (httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELExperiments)
The ldquoPASSEL Data Browserrdquo provides a description ofeach selected transition group and the data collected for ita link to visualize the trace group using the ldquoChromavisrdquochromatogram viewer and further links to the informationavailable in SRMAtlas [62] and PeptideAtlas [26] in order toextract or compare spectral features of the targeted peptides(httpdbsystemsbiologynetsbeamscgiPeptideAtlasGetSELTransitions)
34 PeptideAtlas
The PeptideAtlas project (httpwwwpeptideatlasorg) [266364] was originally created to serve as the end-point for theTPP processing software [49] In recent years PeptideAtlashas grown as a data reprocessing resource and it has served asa research database for the development of spectral libraries[65] and SRM-related tools [66 67] For instance PeptideAt-las provides information about proteotypic peptides usingdetectability scores [68]
Nowadays PeptideAtlas is one of the biggest and well-curated protein expression data resources The initial Pep-tideAtlas publication reported the identification of 27 ofthe human genes in Ensembl with a protein false discoveryrate (FDR) likely 10 or higher [63] Over the years the Pep-tideAtlas team has developed different tools to control the as-signment of incorrect identifications such as PeptideProphet[69] and ProteinProphet [70] and more recently MAYU [71]to control the protein FDR when different datasets are com-bined This platform and the statistically accurate protocolused to curate the proteinpeptide identification data haveturned PeptideAtlas into a very reliable protein expressiondatabase In 2013 the generated 1 protein-level FDR Hu-man PeptideAtlas had at least one peptide for around 14 000different UniProtKBSwiss-Prot entries [26]
In the context of PX partners have recently started to trackthe reanalysis of original PX datasets by providing RPXDidentifiers to those and linking them to the original repro-cessed datasets By August 2014 several PX datasets rean-alyzed by PeptideAtlas are already publicly available in Pro-teomeCentral
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 937
341 Data submission and format support
PeptideAtlas reprocessed data are organized into differentbuilds [72] each includes data from a single proteome or sub-proteome (see species in Table 1) Each build is generatedwith raw MSMS spectra submitted using the ldquosubmissionformrdquo (httpwwwpeptideatlasorgupload) or the data de-posited into another public repository such as PRIDE Thesespectra are searched against a sequence database a spec-tral library or both Peptide and protein identifications aremapped to a comprehensive reference protein database (forthe latest human builds the searched database is a combina-tion of UniProtKBSwiss-Prot Ensembl and sequences fromthe International Protein Index (IPI)) and postprocessed us-ing the TPP [72] It also annotates each protein and peptidewith supporting data such as genome mappings sequencealignments links to different databases such as GPMDBor the Human Protein Atlas [73] uniqueness of peptidendashprotein mappings observability of peptides predicted ob-servable peptides estimated protein abundances and cross-references to other databases such as RefSeq UniGene andUniProt All the processed results are loaded into SBEAMS(systems biology experiment analysis management system)proteomics that is a proteomics analysis database built as amodule under the SBEAMS framework
342 Data mining and visualization
The PeptideAtlas web search interface can be used to searchfor proteins by protein accession peptide sequence genename keyword or phrase If a specific protein is requestedthe ldquoprotein view pagerdquo summarizes all the informationavailable for that protein The top section provides basic infor-mation about the protein including alternative names as wellas the total number of corresponding spectra (observations)and distinct peptides The following two sections ldquosequencemotifsrdquo and ldquosequencerdquo summarize the peptide coverage ofthe protein Finally a similar diagram to a genome browserview summarizes all the peptides that map either uniquelyor redundantly to the protein including information on seg-ments unlikely to be observed by MS Information about sig-nal peptides and transmembrane domains is also providedwhere available One of the best features of PeptideAtlas isthe Cytoscape [74] plug-in that allows the user to view thedistinct peptides for a particular protein as a network withassociated proteins
Recently PeptideAtlas implemented the ldquoPeptideAt-las Chromosome Explorerrdquo (httpwwwpeptideatlasorgpeptideatlasExplorer) to summarize and classify the iden-tified human proteome using a chromosome-oriented viewThe present version shows the number of protein obser-vations and the UniProt entries by chromosome usinga circular histogram plot PeptideAtlas is actively involved inthe HUPO chromosome-based Human Proteome Project (C-HPP) [75] and provides a dedicated website to access protein
identifications per chromosome (httpwwwpeptideatlasorghupoc-hpp)
PeptideAtlas heavily supports targeted proteomics work-flows in different ways (i) SRMAtlas (httpwwwsrmatlasorg) is a compendium of targeted proteomics assays to detectand quantify proteins using SRMMRM-based proteomicsworkflows (ii) TIQAM (targeted identification for quanti-tative analysis by MRM) [76] which is a desktop applica-tion to facilitate the selection of peptide and transitions Itconsists of three applications ldquoTIQAM-Digestorrdquo ldquoTIQAM-PeptideAtlasrdquo and ldquoTIQAM-Viewerrdquo (iii) the automated andtargeted analysis with quantitative SRM tool (ATAQS) [77]is a software pipeline tool that contains modules to designmanage analyze and validate an MRM assay ATAQS usesFireGoose [78] to connect to various web services such asPeptideAtlas (used to select spectra) TIQAM (to generate insilico peptides for a given protein [76]) PIPE2 (to generatea list of proteins to design an MRM assay and for othervarious analysis tasks) and PABST (peptide atlas best SRMtransition used to generate optimal transitions)
35 MassIVE
The MassIVE data repository (httpmassiveucsdedu) is acommunity resource developed by the Center for Compu-tational Mass Spectrometry (University of California SanDiego) to promote the global-free exchange of MS data Mas-sIVE provides a location for researchers to access public rawdatasets and accompanying files often alongside publicationOne of the key features of MassIVE (still under develop-ment) compared with other resources is that its function-ality is aimed at networking and providing a social platformfor researchers MassIVE users are not only able to browsedownload but also comment on datasets These commentscan be accompanied with new data or new analyses that en-rich the original dataset
351 Data submission and format support
The submission to MassIVE is based on two main steps(i) upload the data files to ProteoSAFe (httpmassiveucsdeduProteoSAFe) and (ii) invoke the MassIVE datasetsubmission workflow for those files The MassIVE teamstrongly recommends that submitters use FTP to upload andorganize their dataset files as opposed to the ProteoSAFe fileupload web interface
MassIVE dataset files are organized into the following cat-egories (i) license filesmdashspecifying how and under whichconditions the dataset files may be downloaded and used(ii) spectrum filesmdashany mass spectrum files constitutingthe main body of a dataset (in mzXML andor raw bi-nary format) (iii) result filesmdashthe output of any search en-gine (iv) sequence databasesmdashany protein or other sequencedatabases that were searched against (fasta format) (v) spec-tral librariesmdashany spectral library files that were searched
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
938 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
if applicable (vi) methods and protocols for any open-format files containing explanations or discussions of theexperimental procedures used to obtain or analyze a givendataset
The MassIVE submission web presents an input form in-cluding all the files to be uploaded and relevant metadataThe metadata fields include the species instruments PTMsand contact information When a dataset is submitted theProteoSAFe validation is performed
352 Data mining and visualization
The ldquoMassIVE datasetsrdquo browser (httpmassiveucsdeduProteoSAFedatasetsjsp) presents a list of all publicdatasets in a tabular format Users can sort and filter byany column Alternatively users can select specific datasetsby checking the boxes next to them MassIVE tried to res-cue as many datasets as possible from the defunct Trancherepository For this imported data the Tranche hash is alsodisplayed enabling the users to search for specific Tranchedatasets
Once data are submitted to the repository it can be sharedwith others (eg reviewers) using password-protected accessto the datasets or be made publicly available by the submitterIn the near future it is planned that anyone with access to agiven dataset will be able to reanalyze it using the many work-flows available in ProteoSAFe After the reanalysis the resultscan then be used to either enrich existing datasets or to createnew ones MassIVE enables on-site reanalysis using a varietyof data workflows standard database searches (MS-GF+) [79]proteogenomics searches against genomicstranscriptomicssequences (ENOSI) [80ndash82] discovery of unexpected modifi-cations (MODa) [83] identification of mixture spectra usingspectral library (MSPLIT) [84] and database (MixDB) search[85] de novo sequencing of peptides (PepNovo) [86] and pro-teins (Meta-SPS) [87] molecular spectral networks (includingboth peptides and metabolites) and top-down protein identi-fication (MS-Align+) [88]
36 Chorus
Chorus (httpchorusprojectorg) is a cloud-based applica-tion that provides scientists the ability to store analyze andshare their MS data regardless of the original raw file formatgenerated The Chorus teamrsquos aim is to create a completecatalogue of the mass spectrometric data in a way that canbe openly accessed and make it freely accessible to boththe scientific community and the general public Chorus wasoriginally announced in 2013 and is enabled by Amazon WebServices Inc (httpawsamazoncom) It is a global cloud-based computing environment that provides customized dataanalysis tools that convert proprietary data formats to a com-mon ldquoMap Reducerdquo format for processing large datasets us-ing parallel and distributed algorithms on the cloud Theinitial data analysis tools include chromatographic and mass
spectral viewers as well as a database search engine for pro-tein sequence identification
The registered users should first request laboratory mem-bership or create their own laboratory After that they cancreate a project an instrument and upload raw files Oncethis is done the users can create experiments and organizethem into the created projects
37 GPMDB
GPMDB [25] (httpgpmdbthegpmorg) is one of the mostwell-known protein expression databases The GPMDBpipeline reprocesses the MSMS data provided by users orraw data stored in other repositories such as PX using thepopular open source search engine XTandem [57] Peptideand protein identifications are generated and stored in XMLfiles which are indexed in a MySQL database
371 Data submission and format support
The data submission process supports MS data in differentformats (dta pkl mgf mzXML and mzData) and it can takeplace via the ldquosimple search pagerdquo Once the data have beenprocessed using XTandem users can choose whether to sub-mit their data or not to GPMDB In addition to XTandemusers can also use a spectral search engine called XHunter(httpxhunterthegpmorg) [89] or the proteotypic peptideprofiler XP3 (httpp3thegpmorg) [90] to analyze theirdata All the identified peptides are matched to the Ensembl[59] genome database
372 Data mining and visualization
The GPMDB main entry point is a web-based search interfacewhere users can search by keywords data sources proteinidentifiers or gene names Also the current GPMDB imple-mentation provides another three alternative pages to quicklyaccess and search the data (i) dataset search by accession (alsocalled gpm httpgpmdbthegpmorggpmnumhtml)(ii) sequence search (also called sequence httpgpmdbthegpmorgseqhtml) (iii) access based on GO terms [91]chromosome Kyoto Encyclopedia of Genes and Genomes(KEGG) pathways [92] BRaunschweig ENzyme DAtabase(BRENDA) [93] tissue ontology (httpgpmdbthegpmorggoindexhtml) and (iv) the amino acid polymorphisms andPTMs search for proteins
When a search is performed the main ldquoprotein tablerdquo listsall the matched proteins and different features are shownsuch as protein accession number of observations in GP-MDB log(e) and an evidence code that rates the current evi-dence for the observation of each protein The ldquoprotein viewrdquoshows all the peptide identifications for a specific protein andthe corresponding sequence coverage to specify the reliabilityof the identification
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 939
More recently GPMDB has been actively involved in theC-HPP project [94] In this context human protein identi-fication information in GPMDB is now summarized into acollection of spreadsheets called ldquoGuide to the Human Pro-teomerdquo This guide contains the information organized intoseparate spreadsheets for each chromosome as well as forthe mitochondrial DNA It also contains protein accessionnumbers HGNC (HUGO (Human Genome Organization)Genome Nomenclature Committee) [95] gene names andchromosomal coordinates taken from Emsembl
38 ProteomicsDB
ProteomicsDB (httpwwwproteomicsdborg) is a humanprotein expression database that stores protein and peptideidentifications and quantification values The resource wasjointly developed by the Technical University of Munich thecompany SAP (Walldorf Germany) and the SAP InnovationCenter It was announced in 2013 and it has been recentlyhighlighted as the main output of a study drafting the hu-man proteome [28] It contains information of more than 62projects and more than 300 experiments The proteins iden-tified in the resource map to over 18 000 human genes rep-resenting around 90 of the human proteome However theFDR calculations when different datasets are combined werenot taken into account in the analysis [18] as PeptideAtlasdoes [63 71] It currently contains approximately 70 millionspectra from human cancer cell lines tissues and body flu-ids ProteomicsDB enables real-time analysis and is basedon the SAP HANA platform (httpwwwsaphanacom) forrapid data mining and visualization It has been built to en-able public sharing of datasets as well as to enable users toaccess and review data prior to publication
381 Data submission and format support
The submission pipeline is well structured in three differentlevels (i) projects that contain the general information suchas publication title and an experiment summary (ii) experi-ments that include a name description and scope (the scoperepresents a definition of the experiment type eg PTM orfull tissue proteome) and (iii) experiment filesmdashwhen theusers upload a set of files each file is sent to a verificationprocess before it is definitely accepted or rejected A largenumber of the experiments in the database were down-loaded from repositories such as Tranche PRIDEPX andPeptideAtlas [37] and later reprocessed using two parallelpipelines based on AndromedaMaxQuant [96] and MAS-COT and quantified using MaxQuant [97]
382 Data mining and visualization
A user-friendly web interface allows users browse the humanproteome including protein-level information such asprotein function and expression Protein expression can be
visualized across the complete human body This novelfeature is integrated into the ldquoProtein expression tabrdquo andcomprises the human body map showing the expressionof a given protein in more than 30 tissues organs andbody fluids Also the protein expression of cell lines canbe projected onto the tissue of origin and experimentaldetails for any sample of interest can be readily obtainedThe ldquochromosome viewrdquo shows those proteins identified ineach chromosome region including the description of theproteins length unique peptides unique PSMs (peptidespectrum match) shared PSMs and sequence coverage
39 MaxQB
MaxQB (httpmaxqbbiochemmpgdemxdb) [29] was re-leased in 2012 as a database that stores and displays collec-tions of large proteomics projects and allows joint analysisand comparison MaxQB serves as a generic repository andanalysis platform for high-resolution bottom-up experimentsIt stores details about protein and peptide identifications to-gether with the corresponding high- or low-resolution frag-ment spectra and quantitative information such as proteinratios or label-free derived intensities
391 Data submission and format support
Differently to other databases MaxQB only stores experimentdata generated locally by the Mann group This feature re-duces greatly the data heterogeneity and facilitates the dataanalysis To enable smooth upload of data MaxQB is tightlyintegrated with MaxQuant [97] When a MaxQuant analysisis performed the user of MaxQuant is asked whether shehewants to upload the data or not This new feature will allowthe integration of new datasets from others into MaxQB Al-ternatively the data can be manually uploaded through theweb interface
Additional metadata information should be provided in thesubmission process such as the project name experimentname and workflow parameters All the data are stored in a re-lational database Several human protein sequence databases(UniProt including variants Ensembl and IPI) were uploadedto MaxQB to build a reference protein database
392 Data mining and visualization
The MaxQB web interface allows the users query the databaseby different fields such as gene name organism and sourcedatabase Alternatively an advance query builder can be usedif the user is not familiar with the query syntax Apart of pro-tein detail information and sequence coverage MaxQB canalso display expression information within any of the pro-teomes compared with all other quantified proteins in thatproteome The expression of a given protein is estimatedby the sum of its peptide signals after normalization of the
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
940 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
total proteome signals to each other in MaxQuant The iBAQalgorithm [98 99] based on spectral counting is now imple-mented in MaxQuant and can also be used to estimate pro-tein quantification values One of the major conclusions thatMaxQB data analysis provides is that the peptide rank ordercan be used as a component of the protein identification score
310 MOPED
MOPED (httpmopedproteinspireorg) [31 100] is an ex-panding resource that enables rapid browsing of proteinexpression information coming from humans and severalmodel organisms MOPED provides protein-level expressiondata meta-analysis capabilities and quantitative data fromstandard analyses In order to address the metadata diversitythe MOPED database developed a multi-ldquoomicsrdquo metadatachecklist that has been used for collecting metadata whichhas been made available to the community through DELSA(wwwdelsaglobalorg)
3101 Data submission and format support
In order to start a dataset submission MOPED requiresa minimum set of metadata that must be included at theexperiment level Users must then supply a brief exper-imental description the source organism and the jour-nal reference The MOPED pipeline [101] is based on re-analyzing MS data from other public repositories usingSPIRE (Systematic Protein Investigative Research Environ-ment httpproteinspireorg) [102] SPIRE integrates opensource search tools such XTandem OMSSA and variousstatistical models into its pipeline MOPED estimates proteinabsolute expression and concentration values using spectralcounting Since the probability of identifying peptides de-pends on peptide properties such as the peptide sequencethe APEX quantitative approach weights spectral counts us-ing estimates of these probabilities to improve the accuracyof the provided absolute expression values [101] Each identi-fied protein links to various protein and pathway databasesincluding GeneCards [103] UniProt KEGG and Reactome[104] The current MOPED database contains data from fourof the most studied organisms human mouse Caenorhab-ditis elegans and Saccharomyces cerevisiae
3102 Data mining and visualization
The web interface contains three main panels ldquoprotein abso-lute expressionrdquo ldquoprotein relative expressionrdquo and ldquogene rel-ative expressionrdquo Each of the panels contains the ldquoMOPEDsearch boxrdquo that supports queries by keywords tissues con-ditions and pathways After a search is performed the ldquopro-tein ID and expression summaryrdquo section displays expres-sion data Each protein row in the expression summary table
displays the protein accession description concentration or-ganism localization sequence coverage spectral count andgene name
The interface also features a ldquovisualization panelrdquo knownas the chord diagram It can break down proteins in exper-iments by organism tissue localization and condition Theabsolute and relative expression matrix in these panels showsthe expression of the identified proteins per condition In thefuture MOPED plans to integrate proteomics with transcrip-tomics and metabolomics data through biological pathwaysand networks
311 PaxDb
The first version of the PaxDb resource (httppax-dborg)[32] was published and released in 2012 PaxDb is a databasededicated to integrate information on absolute protein abun-dance levels based on a deep coverage of the proteomeconsistent data postprocessing and enabling comparabilityacross different organisms PaxDb is a meta-resource sinceit takes the information previously published in data reposi-tories such as PRIDE and PeptideAtlas and does not acceptany direct data submissions
3111 Data submission and format support
The current PaxDb pipeline analyzes the spectral countinginformation from PeptideAtlas buildsmdashthat is which pep-tides have been identified and how often over the wholebuild This part of PaxDbrsquos data import is entirely based onthe original scoring and quality cutoffs implemented by Pep-tideAtlas In the case of identified peptide sequences reportedfrom MSMS approaches (eg PRIDE datasets) the currentpipeline remaps each peptide to the corresponding proteinbased on sequence matches using reference genomes fromthe STRING database [105] Protein abundance values areconverted for each dataset into protein abundance estimatesusing a consistent value across different techniques Insteadof using ldquomolar concentrationrdquo or ldquomolecules per cellrdquo thesystem expresses all abundances in ldquoparts per millionrdquo Thismeans that each protein is listed relatively to all other proteinmolecules in the same sample In the case of biochemicalbiophysical or label-free MS experiments parts per millionvalues are directly computed by rescaling the provided abun-dance estimates by their total sum In the case of spectralcounting data the method is based on the likelihood of de-tectability as described earlier [106]
3112 Data mining and visualization
The PaxDb website allows an ad hoc query of a protein familyof interest allowing multiple proteins requests simultane-ously as well as browsing and comparing complete datasets
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 941
In addition the user queries are searched against the anno-tations of all proteins in PaxDb using a full-text search Foreach organism a distinct summary page provides informa-tion on the data origin its coverage and estimated qualityand the distribution of abundance values of each dataset asa histogram Additionally the most abundant proteins in theorganism are also listed The PaxDb protein view shows abrief description of the protein functional role as annotatedby UniProt andor by other model organism databases Thespecies browser allows the navigation through all the pro-teins for a specific model organism Furthermore it showsall the protein abundances in a specific dataset and also forthe whole of the PaxDb datasets (addition of all datasets for agiven specific species) All PaxDb data are freely available
312 Human Proteinpedia
Human Proteinpedia (httpwwwhumanproteinpediaorg)[33] is a public human proteome repository for sharingprotein expression data derived from multiple experimen-tal platforms It incorporates diverse features of the humanproteome including proteinndashprotein interactions enzymendashsubstrate relationships PTMs subcellular localization andexpression of proteins in various human tissues and celllines in diverse biological conditions (including diseasestates) The Human Proteinpedia database falls neatly in theannotation-oriented database category [38] It complementsthe curated human protein reference database (HPRD) [107]with community-provided annotation It collects submittedproteomics data and uses the findings to directly annotate theprotein entries in HPRD with observed PTMs and subcellularlocalization [108] In contrast with the resources previouslydescribed Human Proteinpedia does not only contain MS-based experiments but also other approaches such as coim-munoprecipitation Western blotting fluorescence immuno-histochemistry and protein and peptide microarrays All thedata should be annotated by the users (httppdashprdorg)and it is not reprocessed
3121 Data submission and format support
Users can provide data annotations in four different waysafter registering in the system (i) data on individual pro-teins along with experimental evidence through the use ofweb forms (ii) upload data via the web in a batch mode(iii) sending data through FTPe-mail to the support teamand (iv) distributed annotation service servers setup by thecontributing laboratories for the data upload [109] By June2014 the current version includes data from more than 249laboratories including 2710 distinct experiments more than15 000 proteins almost 2 million peptides around 5 millionspectra and 2906 annotations related to subcellular localiza-tion (Table 1)
The major differences with other repositories are as fol-lows (i) it does not exclusively contain MS-derived dataas mentioned already (ii) data from proteomics experi-ments are viewed in the context of a proteinndashprotein in-teraction resource (HPRD) (iii) it restricts the data to thatderived from human tissues or cell lines and (iv) data an-notation related to various protein features can be donemanually
3122 Data mining and visualization
The query page in Human Proteinpedia (httpwwwhumanproteinpediaorgquery) enables the search by genesymbol protein name accession number type of proteinfeature and type of experimental platform annotated Aftera query the ldquoprotein detail viewrdquo includes a table where eachrow shows a ldquoplatformrdquo (the type of the experiment egMS or immunohistochemistry) and related metadata Inthe case of MS experiments each platform contains a tablecontaining all the peptides identified for each protein All thedata are freely available (httpwwwhumanproteinpediaorgdownload)
313 HPM
The HPM (httpwwwhumanproteomemaporg) [34] hasjust been developed as an output of a recent draft study ofthe human proteome The original proteomics study [34] wascarried out on 30 histologically normal human tissues andprimary cells using high-resolution MS The generated tan-dem mass spectra correspond to proteins encoded by 17 294genes accounting for approximately 84 of the annotatedprotein-coding genes in the human genome However FDRcalculations when different datasets are combined are notdiscussed in enough detail in the original manuscript [18]The aim of the HPM is to make possible to review navigateand visualize the protein expression information evidencesof gene families protein complexes signaling pathways andbiomarkers
3131 Data submission and format support
MSMS data obtained from all different experiments weresearched against the Human RefSeq database using SE-QUEST [110] and MASCOT [56] through the Proteome Dis-coverer platform (Thermo Fisher Scientific) Then q val-ues were estimated using the Percolator algorithm withinthe Proteome Discoverer suite Protein and peptide identi-fications obtained from SEQUEST and MASCOT were con-verted into MySQL tables NCBI RefSeq annotations wereused as additional information about the genes from vari-ous public resources Normalized spectral counts [34] wereused to represent expression of proteins and peptides The
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
942 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
resource was developed as a protein expression database anddo not support the submission of new data from externalusers
3132 Data mining and visualization
The web portal was developed using a three-tier web archi-tecture with presentation application and persistence lay-ers Users are able to search using gene or protein identi-fiers and the information is presented in a tabular formatFor each peptide a high-resolution MSMS spectrum fromthe best scoring identification is shown on the spectrumviewer page using the ldquoLorikeetrdquo JQuery plugin (httpscodegooglecomplorikeet)
314 Other proteomics resources
There are other less widely used resources that will not beexplained here in detail First of all the Cardiac Organel-lar Protein Atlas Knowledgebase (COPaKB httpwwwheartproteomeorgcopa) [111] is a centralized platform ofhigh-quality cardiac proteomics data bioinformatics toolsand relevant cardiovascular phenotypes Currently COPaKBfeatures eight organellar modules comprising 4203 MSMSexperiments from human mouse Drosophila and C elegansas well as expression data of 10 924 proteins in the human my-ocardium COPaKB has an attractive web interface which pro-vides a number of effective workflows to guide cardiovascularinvestigators from the actual proteomics data to the system-atic biomedical interpretation The COPaKB team continuesto develop innovative workflows to help providing a better un-derstanding of protein functions in cardiovascular diseasesCOPaKB will cover additional modules on organelles andcells from cardiovascular-relevant model systems The con-tent of each module will expand with the available publicdata
Pep2pro (httpfgcz-pep2prouzhch) [112113] is a com-prehensive analysis database specifically suitable for perform-ing flexible data analysis Pep2pro is a further development ofthe ldquoAtProteomerdquo resource and provides data from Arabidop-sis thaliana The current pipeline employs PepSplice [114] andTPP tools for peptideprotein identification the characteriza-tion of whole genome hits and the PTMs The database isorganized in assemblies similarly to the PeptideAtlas buildsThe Arabidopsis assembly from 2011 contained more than14 522 protein identifications 141 235 identified peptidesand around 2 million spectra
iProX (Integrated proteome resources httpwwwiproxorg) is a repository jointly developed by Beijing ProteomeResearch Center and other institutions in China It has beenrecently developed reusing part of the PRIDE source code Atthe moment of writing it is not fully operational but containsalready some stored datasets
315 Proteomics information available through
UniProt and neXtProt
UniProt (httpwwwuniprotorg) [115] is among the mostused of the protein sequence and functional annotationproviders Among the UniProt databases is the UniProtKBwhich provides a broad range of protein sequence datasetsfor a large number of species specifically tailored for an ef-fective coverage of the sequence space while maintaining ahigh-quality level of sequence annotations and mappings tothe genomics and proteomics information
MS proteomics data deposited in the main publicrepositories is flowing into UniProtKB to enrich pro-tein sequence annotations at the level of the evidencesupporting the existence of a protein (isoforms and variant-containing sequences included) This information is thus pro-vided to users mainly in two different ways (i) indirectlyvia the UniProtKB protein existence values (httpwwwuniprotorgmanualprotein_existence) that are starting to beassigned also on the PSM-level content publicly providedby proteomics repositories (together with the accompany-ing statistical assessment for each PSM) and (ii) directlythrough explicit links to relevant cross-referenced resources(httpwwwuniprotorgdatabase) which cover PRIDE Pep-tideAtlas MaxQB and PaxDb but also others such as Phos-phoSitePlus [116]
Organism-specific mappings of the peptides reported inthe repositories to the UniProtKB sequences are also sharedwith groups producing genome builds to improve the corre-sponding gene annotations In the future it is planned thatPTMs [117] from proteomics repositories will also be inte-grated in UniProtKB
neXtProt (httpwwwnextprotorg) [43] is a web-basedprotein knowledge platform to support research uniquely onhuman proteins The set of manually curated annotations ex-tracted from UniProtKBSwiss-Prot for human which con-stitutes the heart of the resource is constantly complementedwith quality-filtered carefully selected high-throughput ex-periments from different scientific research areas con-cerning abundance distribution subcellular localizationinteractions and cellular functions neXtProt is part of theconsortium which is driving the C-HPP project ThusneXtProt labels as ldquomissing proteinsrdquo those ones which sofar lack experimental evidence [43]
All the relevant proteomics MS-related information(like for instance PTM-related MS-based papers on N-glycosylation phosphorylation S-nitrosylation ubiquitina-tion and sumoylation) has been integrated into neXtProt andis available via the web interface in the ldquoProteomics viewrdquo ofthe ldquoProtein perspectiverdquo layout
4 Data reuse from public resources
Since the current volume of proteomics data deposition israpidly increasing new approaches based on the reanalysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 943
of the data andor new uses of the stored data are being devel-oped The same public dataset can be analyzed using differentpipelines to discover confirm or highlight new biological ev-idences Resources such as GPMDB and PeptideAtlas havebeen doing this for many years already emphasizing controlof the number of false-positives at both peptide and proteinlevel
Data availability per se in repositories has enabled data re-analysis by third parties triggering a discussion in the fieldabout controversial datasets [118ndash122] Furthermore targetedreanalysis of public data by individual groups is now startingto flourish Remarkably the draft of the human proteomereported recently and available in ProteomicsDB includes abig proportion of reanalyzed public datasets (roughly 40 ofthe MS runs) [28] In addition new PTMs have also beendescribed as a result of reanalysis of public datasets with adifferent purpose in mind For instance two recent studiesdescribed new PTMs after reanalyzing available phosphopro-teomics studies [123124] Some proteogenomics studies havealso made use of public datasets [125]
Data from repositories is often used in the design ofSRMMRM transitions for targeted proteomics approaches[126] A mentioned already GPMDB and PeptideAtlas pro-vide already this functionality Data from PRIDE is being usedby the MRMAID resource with the same purpose in mind[127] Another popular data reuse is the building of spectrallibraries Several repositories build their own libraries (egPeptideAtlas PRIDE) that can be used in spectral searches Inaddition meta-analysis of data in PRIDE (without reprocess-ing the data) has already happened enabling the extraction ofnew knowledge [128ndash132] Finally data in proteomics repos-itories can also be used to improve the content of the proteinsequence databases [133]
In this context PeptideShaker (httppeptide-shakergooglecodecom) is a recently developed Java application[134 135] that allows the postprocessing and visualizationof protein identification experiments and it greatly facilitatesthe reprocessing of PRIDE datasets by using the ldquoPRIDE Re-shakerdquo functionality One of the key features of PeptideShakeris the validation of the results using different values such asFDR and false-negative rate [136] These values are displayedin the ldquoFDRfalse-negative rate plotrdquo and the availability ofthese metrics make possible the generation of a ldquocostbenefitrdquocurve also known as receiver operating characteristic curvewhich enables the users to optimize the quality thresholds
Finally it is also important to highlight that not only thepublicly available raw data files are used for reanalysis pur-poses The availability of output files from pipelinessoftwarecan also help third parties to develop new tools and demon-strate its utility This was the case for the recently developedSpliceVista visualization tool [137]
We strongly encourage proteomics researchers to uploadraw data and processed results to public repositories Beforestarting a new project public data can be used to guide newresearch This is already common practice in the case of tar-geted proteomics workflows
5 Pitfalls and future challenges
Data sharing and dissemination is a nontrivial task since it re-quires substantial investment in infrastructure and softwaredevelopment [138139] As mentioned before one of the aimsof the PX consortium is to provide backup when one of the re-sources has funding problems As a proof of concept PRIDEand the defunct Peptidome joined forces to transfer all datafrom Peptidome to PRIDE [140] In parallel as mentionedalready MassIVE has tried to rescue as many datasets orig-inally stored in Tranche as possible but it has been quite achallenging task
In June 2012 PRIDE started to accept and handle raw dataas part of the PX data workflow PX and other resources haveproven that public data dissemination in a decentralized andwell-structured mode is possible and actually important forthe proteomics community In our opinion the next stepwould be to foster a higher integration for these resources Atpresent the main layer of integration among PX resourceshappens at the level of the metadata
The alternative is to access knowledge bases such asUniProt or neXtProt but the amount of data coming fromMS proteomics repositories in these resources is still lim-ited Ideally it should be possible for users to look forall available data for a given protein or peptide (includ-ing PTMs) in a more straightforward way At present sci-entists need to access the different proteomics resourceswhen they want to get all existing information about a givenprotein
To illustrate this idea we studied the protein expressionevidences for the UniProtKBSwiss-Prot human proteomeincluding only canonical sequences (release 2014_05 20265entries) in seven resources that have a uniform data process-ing pipeline GPMDB PeptideAtlas MaxQB Human Pro-teinpedia PaxDb ProteomicsDB and the HPM This provedto be a far from trivial task (see all the retrieved data in Sup-porting Information including the protein lists taken fromeach resource) A total of 7880 proteins represent the ldquocorerdquoproteome since they are stored in all the resources (Fig 3A)In addition 4244 3074 2008 1135 815 and 824 proteins wereidentified in six five four three two and only one resourcerespectively (Fig 3A) Only 285 proteins were not found inany of these seven resources In fact most of the described re-sources shared more than 60 of the protein identificationsamong them (Supporting Information) In our opinion thisredundancy can be used as a ldquoreliabilityrdquo measure of the evi-dence of protein expression For instance the 13 016 proteinsidentified in GPMDB PeptideAtlas and ProteomicsDB aremore likely to be true-positives than the 1171 proteins iden-tified only by ProteomicsDB (Fig 3B) At the same time theldquoexclusiverdquo identifications in ProteomicsDB are new valuableevidences but probably need further investigation
As mentioned already the identification of false-positivesin big resources andor when different datasets are combinedis still a problem [18 26] and protein expression resourcesshould be as thorough as possible in the statistical analysis
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
944 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
Figure 3 (A) Number of UniProtKBSwiss-Prot human proteins (release 2014_05 20265 entries) observed in different proteomics resourcesthat have a uniform data processing pipeline (GPMDB ProteomicsDB PeptideAtlas HPM PaxDb MaxQB Human Proteinpedia PRIDE isnot included) (B) Venn diagram representing the human protein identifications observed in GPMDB PeptideAtlas and ProteomicsDB (C)Area chart showing the distribution of the number of PRIDE assays for those proteins present in three two and one proteomics resourcesor for those proteins not identified at all
With the identification of a high proportion of the proteins inthe human proteome [283463] new analysis tools should beimplemented to retrieve protein expression evidences fromdifferent resources and to detecthighlight those identifica-tions considered as the most andor least reliable
In our opinion it is also needed to increase the reuse andreanalysis of the available data since they can provide newvaluable scientific knowledge To illustrate this idea Fig 3Cshows the distribution of the number of PRIDE assays forthose proteins present in three two and one proteomics re-sources (Fig 3A) or for those proteins not identified at allIt can be observed that when a protein is identified in threeresources those proteins are likely to be identified in aver-age in 13 different PRIDE assays More interesting is thecase of the least observed proteins For example if a proteinis observed in only one resource it is observed in nine differ-
ent PRIDE assays in average In addition those 285 proteinswithout any evidence can be observed in an average of fivedifferent PRIDE assays Therefore the number of times onegiven protein is reported in PRIDE (especially if not presentin other resources) could be used to select datasets to bereanalyzed
Finally another challenge is that at present studies inte-grating different ldquoomicsrdquo technologies are becoming moreand more popular This type of studies poses a challenge fortraditional repositories (which are usually field-specific) andresearchers since it is not straightforward to link data from dif-ferent approaches for instance MS proteomics and RNAseqdata obtained in the same study Some big institutes suchas the EBI and NCBI have implemented specific databasesfor BioSamples [141] to enable the linking between differentstudies performed using the same sample The use of sample
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 945
identifiers is now starting to facilitate the connection betweendifferent ldquoomicsrdquo datasets
6 Conclusions
In recent years there has been a big progress in thedevelopment of proteomics repositories and protein expres-sion databases However more integration between thosedatabases is required More tools such as PRIDE Inspectorand PeptideShaker should be developed to make easier thevisualization reuse and reanalysis of the data in repositoriesSimultaneously in order to engage users it seems advisableto monitor download activity of public datasets and measurethe community interest on them Since proteomics hasbecome such a fundamental part of biological researchit is expected that the amount of information available inproteomics resources will keep growing in the coming yearsWe encourage researchers to make use of this plethora ofdata
YPR is supported by the BBSRC ldquoPROCESSrdquo grant (refer-ence BBK01997X1) RW is supported by the BBSRC ldquoQuanti-tative Proteomicsrdquo grant (reference BBI00095X1) JAV is sup-ported by the Wellcome Trust (grant number WT101477MA) andthe EU FP7 grants ldquoProteomeXchangerdquo (grant number 260558)and PRIME-XS (grant number 262067)
The authors thank Michael MacCoss (Chorus) ChristophSchaab (MaxQB) Eugene Kolker and Roger Higdon (MOPED)Advani Jayshree (Human Proteinpedia) Ronald Beavis (GP-MDB) Mingcong Wang (PaxDb) and Nuno Bandeira (Mas-sIVE) for providing the original data used in Fig 3 and in theSupporting Information and for the general feedback about theirresources
7 References
[1] Altelaar A F Munoz J Heck A J Next-generation pro-teomics towards an integrative view of proteome dynam-ics Nat Rev Genet 2013 14 35ndash48
[2] Betancourt L H De Bock P J Staes A TimmermanE et al SCX charge state selective separation of trypticpeptides combined with 2D-RP-HPLC allows for detailedproteome mapping J Proteomics 2013 91 164ndash171
[3] Branca R M Orre L M Johansson H J Granholm Vet al HiRIEF LC-MS enables deep proteome coverage andunbiased proteogenomics Nat Methods 2014 11 59ndash62
[4] Ramos Y Gutierrez E Machado Y Sanchez A et al Pro-teomics based on peptide fractionation by SDS-free PAGEJ Proteome Res 2008 7 2427ndash2434
[5] Ramos Y Garcia Y Perez-Riverol Y Leyva A et al Pep-tide fractionation by acid pH SDS-free electrophoresis Elec-trophoresis 2011 32 1323ndash1326
[6] Wisniewski J R Zougman A Nagaraj N Mann M Uni-versal sample preparation method for proteome analysisNat Methods 2009 6 359ndash362
[7] Michalski A Damoc E Hauschild J P Lange O et alMass spectrometry-based proteomics using Q Exactivea high-performance benchtop quadrupole Orbitrap massspectrometer Mol Cell Proteomics 2011 10 M111 011015
[8] UniProt C Update on activities at the Universal ProteinResource (UniProt) in 2013 Nucl Acids Res 2013 41 D43ndashD47
[9] Perez-Riverol Y Wang R Hermjakob H Muller M et alOpen source libraries and frameworks for mass spectrom-etry based proteomics a developerrsquos perspective BiochimBiophys Acta 2014 1844 63ndash76
[10] Beck M Schmidt A Malmstroem J Claassen M et alThe quantitative proteome of a human cell line Mol SystBiol 2011 7 549
[11] Nagaraj N Wisniewski J R Geiger T Cox J et al Deepproteome and transcriptome mapping of a human cancercell line Mol Syst Biol 2011 7 548
[12] Csordas A Ovelleiro D Wang R Foster J M et alPRIDE quality control in a proteomics data repositoryDatabase 2012 2012 bas004
[13] Credit where credit is overdue Nat Biotechnol 2009 27579
[14] DeSouza L V Siu K W Mass spectrometry-based quan-tification Clin Biochem 2013 46 421ndash431
[15] Neilson K A Ali N A Muralidharan S Mirzaei M et alLess label more free approaches in label-free quantitativemass spectrometry Proteomics 2011 11 535ndash553
[16] Allmer J Algorithms for the de novo sequencing of pep-tides from tandem mass spectra Expert Rev Proteomics2011 8 645ndash657
[17] Hoopmann M R Moritz R L Current algorithmicsolutions for peptide-based proteomics data genera-tion and identification Curr Opin Biotechnol 2013 2431ndash38
[18] Ezkurdia I Vazquez J Valencia A Tress M Analyzingthe first drafts of the human proteome J Proteome Res2014 13 3854ndash3855
[19] Gupta N Pevzner P A False discovery rates of proteinidentifications a strike against the two-peptide rule J Pro-teome Res 2009 8 4173ndash4181
[20] Kinsinger C R Apffel J Baker M Bian X et al Recom-mendations for mass spectrometry data quality metrics foropen access data (corollary to the Amsterdam principles)Proteomics 2012 12 11ndash20
[21] Anonymous A home for raw proteomics data Nat Meth-ods 2012 9 419
[22] Distler U Kuharev J Navarro P Levin Y et al Drifttime-specific collision energies enable deep-coverage data-independent acquisition proteomics Nat Methods 201411 167ndash170
[23] Lanucara F Holman S W Gray C J Eyers C E Thepower of ion mobility-mass spectrometry for structuralcharacterization and the study of conformational dynam-ics Nat Chem 2014 6 281ndash294
[24] Riffle M Eng J K Proteomics data repositories Pro-teomics 2009 9 4653ndash4663
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
946 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[25] Craig R Cortens J P Beavis R C Open source systemfor analyzing validating and storing protein identificationdata J Proteome Res 2004 3 1234ndash1242
[26] Farrah T Deutsch E W Omenn G S Sun Z et alState of the human proteome in 2013 as viewed throughPeptideAtlas comparing the kidney urine and plasmaproteomes for the biology- and disease-driven Human Pro-teome Project J Proteome Res 2014 13 60ndash75
[27] Vizcaino J A Cote R G Csordas A Dianes J A et alThe PRoteomics IDEntifications (PRIDE) database and as-sociated tools status in 2013 Nucl Acids Res 2013 41D1063ndashD1069
[28] Wilhelm M Schlegl J Hahne H Moghaddas GholamiA et al Mass-spectrometry-based draft of the human pro-teome Nature 2014 509 582ndash587
[29] Schaab C Geiger T Stoehr G Cox J Mann MAnalysis of high accuracy quantitative proteomics datain the MaxQB database Mol Cell Proteomics 2012 11M111014068
[30] Farrah T Deutsch E W Kreisberg R Sun Z et al PAS-SEL the PeptideAtlas SRM experiment library Proteomics2012 12 1170ndash1175
[31] Montague E Stanberry L Higdon R Janko Iet al MOPED 25mdashan integrated multi-omics resourcemulti-omics profiling expression database now includestranscriptomics data Omics J Integr Biol 2014 18335ndash343
[32] Wang M Weiss M Simonovic M Haertinger G et alPaxDb a database of protein abundance averages acrossall three domains of life Mol Cell Proteomics 2012 11492ndash500
[33] Kandasamy K Keerthikumar S Goel R Mathivanan Set al Human Proteinpedia a unified discovery resource forproteomics research Nucl Acids Res 2009 37 D773ndashD781
[34] Kim M S Pinto S M Getnet D Nirujogi R S et al Adraft map of the human proteome Nature 2014 509 575ndash581
[35] Slotta D J Barrett T Edgar R NCBI peptidome a newpublic repository for mass spectrometry peptide identifica-tions Nat Biotechnol 2009 27 600ndash601
[36] Smith B E Hill J A Gjukich M A Andrews PC Tranche distributed repository and ProteomeCom-monsorg Methods Mol Biol 2011 696 123ndash145
[37] Vizcaino J A Deutsch E W Wang R Csordas Aet al ProteomeXchange provides globally coordinatedproteomics data submission and dissemination NatBiotechnol 2014 32 223ndash226
[38] Martens L Proteomics databases and repositories Meth-ods Mol Biol 2011 694 213ndash227
[39] Vizcaino J A Foster J M Martens L Proteomics datarepositories providing a safe haven for your data and act-ing as a springboard for further research J Proteomics2010 73 2136ndash2146
[40] Mead J A Bianco L Bessant C Recent developmentsin public proteomic MS repositories and pipelines Pro-teomics 2009 9 861ndash881
[41] Mead J A Shadforth I P Bessant C Public proteomicMS repositories and pipelines available tools and biologi-cal applications Proteomics 2007 7 2769ndash2786
[42] The UniProtConsortium Activities at the Universal ProteinResource (UniProt) Nucl Acids Res 2014 42 D191ndashD198
[43] Gaudet P Argoud-Puy G Cusin I Duek P et alneXtProt organizing protein knowledge in the context ofHuman Proteome Projects J Proteome Res 2013 12 293ndash298
[44] Olsen J V Mann M Effective representation and storageof mass spectrometry-based proteomic data sets for thescientific community Sci Signal 2011 4 pe7
[45] Martens L Chambers M Sturm M Kessner D et almzMLmdasha community standard for mass spectrometry dataMol Cell Proteomics 2011 10 R110000133
[46] Jones A R Eisenacher M Mayer G Kohlbacher Oet al The mzIdentML data standard for mass spectrometry-based proteomics results Mol Cell Proteomics 2012 11M111014381
[47] Walzer M Qi D Mayer G Uszkoreit J et al ThemzQuantML data standard for mass spectrometry-basedquantitative studies in proteomics Mol Cell Proteomics2013 12 2332ndash2340
[48] Pedrioli P G Eng J K Hubley R Vogelzang M et al Acommon open representation of mass spectrometry dataand its application to proteomics research Nat Biotechnol2004 22 1459ndash1466
[49] Deutsch E W Mendoza L Shteynberg D Farrah T et alA guided tour of the trans-proteomic pipeline Proteomics2010 10 1150ndash1159
[50] Deutsch E W Chambers M Neumann S Levander Fet al TraMLmdasha standard format for exchange of selectedreaction monitoring transition lists Mol Cell Proteomics2012 11 R111015040
[51] Griss J Jones A R Sachsenberg T Walzer M et alThe mzTab data exchange format communicating MS-based proteomics and metabolomics experimental re-sults to a wider audience Mol Cell Proteomics 2014 piimcpO113036681
[52] Mayer G Jones A R Binz P A Deutsch E W et alControlled vocabularies and ontologies in proteomicsoverview principles and practice Biochim Biophys Acta2014 1844 98ndash107
[53] Cote R Reisinger F Martens L Barsnes H et al Theontology lookup service bigger and better Nucl Acids Res2010 38 W155ndashW160
[54] Wang R Fabregat A Rios D Ovelleiro D et al PRIDEinspector a tool to visualize and validate MS proteomicsdata Nat Biotechnol 2012 30 135ndash137
[55] Cote R G Griss J Dianes J A Wang R et al ThePRoteomics IDEntification (PRIDE) converter 2 frameworkan improved suite of tools to facilitate data submission tothe PRIDE database and the ProteomeXchange consortiumMol Cell Proteomics 2012 11 1682ndash1689
[56] Perkins D N Pappin D J Creasy D M Cottrell JS Probability-based protein identification by searching
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 947
sequence databases using mass spectrometry data Elec-trophoresis 1999 20 3551ndash3567
[57] Craig R Beavis R C TANDEM matching proteins withtandem mass spectra Bioinformatics 2004 20 1466ndash1467
[58] Ternent T Csordas A Qi D Gomez-Baena G et alStandardization and guidelines how to submit MS pro-teomics data to ProteomeXchange via the PRIDE databaseProteomics 2014 in press
[59] Flicek P Amode M R Barrell D Beal K et al Ensembl2014 Nucl Acids Res 2014 42 D749ndashD755
[60] Griss J Foster J M Hermjakob H Vizcaino J APRIDE cluster building a consensus of proteomics dataNat Methods 2013 10 95ndash96
[61] Reiter L Rinner O Picotti P Huttenhain R et almProphet automated data processing and statistical val-idation for large-scale SRM experiments Nat Methods2011 8 430ndash435
[62] Picotti P Lam H Campbell D Deutsch E W et al Adatabase of mass spectrometric assays for the yeast pro-teome Nat Methods 2008 5 913ndash914
[63] Farrah T Deutsch E W Hoopmann M R Hallows J Let al The state of the human proteome in 2012 as viewedthrough PeptideAtlas J Proteome Res 2013 12 162ndash171
[64] Desiere F Deutsch E W Nesvizhskii A I Mallick P et alIntegration with the human genome of peptide sequencesobtained by high-throughput mass spectrometry GenomeBiol 2005 6 R9
[65] Lam H Aebersold R Building and searching tandemmass (MSMS) spectral libraries for peptide identificationin proteomics Methods 2011 54 424ndash431
[66] Picotti P Rinner O Stallmach R Dautel F et al High-throughput generation of selected reaction-monitoring as-says for proteins and proteomes Nat Methods 2010 743ndash46
[67] Deutsch E W Lam H Aebersold R PeptideAtlas aresource for target selection for emerging targeted pro-teomics workflows EMBO Rep 2008 9 429ndash434
[68] Mallick P Schirle M Chen S S Flory M R et al Compu-tational prediction of proteotypic peptides for quantitativeproteomics Nat Biotechnol 2007 25 125ndash131
[69] Keller A Nesvizhskii A I Kolker E Aebersold R Em-pirical statistical model to estimate the accuracy of peptideidentifications made by MSMS and database search AnalChem 2002 74 5383ndash5392
[70] Nesvizhskii A I Keller A Kolker E Aebersold R Astatistical model for identifying proteins by tandem massspectrometry Anal Chem 2003 75 4646ndash4658
[71] Reiter L Claassen M Schrimpf S P Jovanovic M et alProtein identification false discovery rates for very largeproteomics data sets generated by tandem mass spectrom-etry Mol Cell Proteomics 2009 8 2405ndash2417
[72] Farrah T Deutsch E W Aebersold R Using the Hu-man Plasma PeptideAtlas to study human plasma proteinsMethods Mol Biol 2011 728 349ndash374
[73] Uhlen M Bjorling E Agaton C Szigyarto C A et al Ahuman protein atlas for normal and cancer tissues based on
antibody proteomics Mol Cell Proteomics 2005 4 1920ndash1932
[74] Saito R Smoot M E Ono K Ruscheinski J et al Atravel guide to cytoscape plugins Nat Methods 2012 91069ndash1076
[75] Paik Y K Jeong S K Omenn G S Uhlen M et alThe chromosome-centric Human Proteome Project for cat-aloging proteins encoded in the genome Nat Biotechnol2012 30 221ndash223
[76] Lange V Malmstrom J A Didion J King N L et alTargeted quantitative analysis of Streptococcus pyogenesvirulence factors by multiple reaction monitoring Mol CellProteomics 2008 7 1489ndash1500
[77] Brusniak M Y Kwok S T Christiansen M CampbellD et al ATAQS a computational software tool for highthroughput transition optimization and validation for se-lected reaction monitoring mass spectrometry BMC Bioin-form 2011 12 78
[78] Shannon P T Reiss D J Bonneau R Baliga N S TheGaggle an open-source software system for integratingbioinformatics software and data sources BMC Bioinform2006 7 176
[79] Kim S Gupta N Pevzner P A Spectral probabilitiesand generating functions of tandem mass spectra a strikeagainst decoy databases J Proteome Res 2008 7 3354ndash3363
[80] Castellana N E Payne S H Shen Z Stanke Met al Discovery and revision of Arabidopsis genes byproteogenomics Proc Natl Acad Sci USA 2008 10521034ndash21038
[81] Woo S Cha S W Merrihew G He Y et al Pro-teogenomic database construction driven from largescale RNA-seq data J Proteome Res 2014 1321ndash28
[82] Castellana N E Shen Z He Y Walley J W et al An au-tomated proteogenomic method uses mass spectrometryto reveal novel genes in Zea mays Mol Cell Proteomics2014 13 157ndash167
[83] Na S Bandeira N Paek E Fast multi-blind modificationsearch through tandem mass spectrometry Mol Cell Pro-teomics 2012 11 M111010199
[84] Wang J Perez-Santiago J Katz J E Mallick P Ban-deira N Peptide identification from mixture tandem massspectra Mol Cell Proteomics 2010 9 1476ndash1485
[85] Wang J Bourne P E Bandeira N Peptide identificationby database search of mixture tandem mass spectra MolCell Proteomics 2011 10 M111010017
[86] Frank A Pevzner P PepNovo de novo peptide sequenc-ing via probabilistic network modeling Anal Chem 200577 964ndash973
[87] Guthals A Clauser K R Bandeira N Shotgun protein se-quencing with meta-contig assembly Mol Cell Proteomics2012 11 1084ndash1096
[88] Liu X Sirotkin Y Shen Y Anderson G et al Proteinidentification using top-down Mol Cell Proteomics 201211 M111008524
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
948 Y Perez-Riverol et al Proteomics 2015 15 930ndash949
[89] Craig R Cortens J C Fenyo D Beavis R C Using an-notated peptide mass spectrum libraries for protein identi-fication J Proteome Res 2006 5 1843ndash1849
[90] Craig R Cortens J P Beavis R C The use of proteotypicpeptide libraries for protein identification Rapid CommunMass Spectrom 2005 19 1844ndash1850
[91] Ashburner M Ball C A Blake J A Botstein D et alGene ontology tool for the unification of biology TheGene Ontology Consortium Nat Genet 2000 25 25ndash29
[92] Kanehisa M Goto S KEGG Kyoto encyclopedia of genesand genomes Nucl Acids Res 2000 28 27ndash30
[93] Gremse M Chang A Schomburg I Grote A et al TheBRENDA tissue ontology (BTO) the first all-integrating on-tology of all organisms for enzyme sources Nucl AcidsRes 2011 39 D507ndashD513
[94] Lane L Bairoch A Beavis R C Deutsch E W et alMetrics for the Human Proteome Project 2013ndash2014 andstrategies for finding missing proteins J Proteome Res2014 13 15ndash20
[95] Wain H M Bruford E A Lovering R C Lush M J et alGuidelines for human gene nomenclature Genomics 200279 464ndash470
[96] Cox J Neuhauser N Michalski A Scheltema R A et alAndromeda a peptide search engine integrated into theMaxQuant environment J Proteome Res 2011 10 1794ndash1805
[97] Cox J Mann M MaxQuant enables high peptide identi-fication rates individualized ppb-range mass accuraciesand proteome-wide protein quantification Nat Biotechnol2008 26 1367ndash1372
[98] Schwanhausser B Busse D Li N Dittmar G et alGlobal quantification of mammalian gene expression con-trol Nature 2011 473 337ndash342
[99] Schwanhausser B Busse D Li N Dittmar G et al Corri-gendum global quantification of mammalian gene expres-sion control Nature 2013 495 126ndash127
[100] Kolker E Higdon R Haynes W Welch D et al MOPEDmodel organism protein expression database Nucl AcidsRes 2012 40 D1093ndashD1099
[101] Higdon R Stewart E Stanberry L Haynes W et alMOPED enables discoveries through consistently pro-cessed proteomics data J Proteome Res 2014 13 107ndash113
[102] Kolker E Higdon R Morgan P Sedensky M et al SPIREsystematic protein investigative research environment JProteomics 2011 75 122ndash126
[103] Safran M Dalah I Alexander J Rosen N et alGeneCards Version 3 the human gene integrator Database2010 2010 baq020
[104] DrsquoEustachio P Reactome knowledgebase of human bio-logical pathways and processes Methods Mol Biol 2011694 49ndash61
[105] Szklarczyk D Franceschini A Kuhn M Simonovic Met al The STRING database in 2011 functional interaction
networks of proteins globally integrated and scored NuclAcids Res 2011 39 D561ndashD568
[106] Weiss M Schrimpf S Hengartner M O Lercher M Jvon Mering C Shotgun proteomics data from multipleorganisms reveals remarkable quantitative conservation ofthe eukaryotic core proteome Proteomics 2010 10 1297ndash1306
[107] Keshava Prasad T S Goel R Kandasamy K Keerthiku-mar S et al Human protein reference databasemdash2009 up-date Nucl Acids Res 2009 37 D767ndashD772
[108] Goel R Harsha H C Pandey A Prasad T S Humanprotein reference database and Human Proteinpedia as re-sources for phosphoproteome analysis Mol BioSyst 20128 453ndash463
[109] Muthusamy B Thomas J K Prasad T S K Pandey AAccess guide to human proteinpedia Curr Protoc Bioin-formatics 2013 Chapter 1 Unit 121
[110] Eng J K McCormack A L Yates J R An approach tocorrelate tandem mass spectral data of peptides with aminoacid sequences in a protein database J Am Soc MassSpectrom 1994 5 976ndash989
[111] Zong N C Li H Li H Lam M P et al In-tegration of cardiac proteome biology and medicineby a specialized knowledgebase Circ Res 2013 1131043ndash1053
[112] Baerenfaller K Hirsch-Hoffmann M Svozil J Hull Ret al pep2pro a new tool for comprehensive proteomedata analysis to reveal information about organ-specificproteomes in Arabidopsis thaliana Integr Biol 2011 3225ndash237
[113] Hirsch-Hoffmann M Gruissem W Baerenfaller Kpep2pro the high-throughput proteomics data processinganalysis and visualization tool Front Plant Sci 2012 3123
[114] Roos F F Jacob R Grossmann J Fischer B et al Pep-Splice cache-efficient search algorithms for comprehen-sive identification of tandem mass spectra Bioinformatics2007 23 3016ndash3023
[115] UniProt C Activities at the Universal Protein Resource(UniProt) Nucl Acids Res 2014 42 D191ndashD198
[116] Hornbeck P V Kornhauser J M Tkachev S Zhang Bet al PhosphoSitePlus a comprehensive resource for in-vestigating the structure and function of experimentallydetermined post-translational modifications in man andmouse Nucl Acids Res 2012 40 D261ndashD270
[117] Olsen J V Mann M Status of large-scale analysis of post-translational modifications by mass spectrometry MolCell Proteomics 2013 12 3444ndash3452
[118] Asara J M Schweitzer M H Freimark L M PhillipsM Cantley L C Protein sequences from mastodon andTyrannosaurus rex revealed by mass spectrometry Science2007 316 280ndash285
[119] Asara J M Garavelli J S Slatter D A Schweitzer MH et al Interpreting sequences from mastodon and T rexScience 2007 317 1324ndash1325
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom
Proteomics 2015 15 930ndash949 949
[120] Bern M Phinney B S Goldberg D Reanalysis of Tyran-nosaurus rex mass spectra J Proteome Res 2009 8 4328ndash4332
[121] Bromenshenk J J Henderson C B Wick C H StanfordM F et al Iridovirus and microsporidian linked to honeybee colony decline PloSOne 2010 5 e13181
[122] Foster L J Bromenshenk et al (PLoS One 20115(10)e13181) have claimed to have found peptides from aninvertebrate iridovirus in bees Mol Cell Proteomics 201211 A1100063871
[123] Hahne H Kuster B Discovery of O-GlcNAc-6-phosphatemodified proteins in large-scale phosphoproteomics dataMol Cell Proteomics 2012 11 1063ndash1069
[124] Matic I Ahel I Hay R T Reanalysis of phosphopro-teomics data uncovers ADP-ribosylation sites Nat Meth-ods 2012 9 771ndash772
[125] Brosch M Saunders G I Frankish A Collins M Oet al Shotgun proteomics aids discovery of novel protein-coding genes alternative splicing and ldquoresurrectedrdquo pseu-dogenes in the mouse genome Genome Res 2011 21756ndash767
[126] Mohammed Y Domanski D Jackson A M SmithD S et al PeptidePicker a scientific workflow withweb interface for selecting appropriate peptides for tar-geted proteomics experiments J Proteomics 2014 106C151ndash161
[127] Fan J Mohareb F Bond N J Lilley K S Bessant CMRMaid 20 mining PRIDE for evidence-based SRM tran-sitions OMICS 2012 16 483ndash488
[128] Griss J Cote R G Gerner C Hermjakob H Vizcaino JA Published and perished The influence of the searchedprotein database on the long-term storage of proteomicsdata Mol Cell Proteomics 2011 10 M111008490
[129] Klie S Martens L Vizcaino J A Cote R et al Ana-lyzing large-scale proteomics projects with latent semanticindexing J Proteome Res 2008 7 182ndash191
[130] Mueller M Vizcaino J A Jones P Cote R et alAnalysis of the experimental detection of central nervoussystem-related genes in human brain and cerebrospinalfluid datasets Proteomics 2008 8 1138ndash1148
[131] Gonnelli G Hulstaert N Degroeve S Martens L To-wards a human proteomics atlas Anal Bioanal Chem2012 404 1069ndash1077
[132] Foster J M Degroeve S Gatto L Visser Met al A posteriori quality control for the curation andreuse of public proteomics data Proteomics 2011 112182ndash2194
[133] Griss J Martin M OrsquoDonovan C Apweiler R et alConsequences of the discontinuation of the InternationalProtein Index (IPI) database and its substitution by theUniProtKB ldquocomplete proteomerdquo sets Proteomics 2011 114434ndash4438
[134] Barsnes H Vaudel M Colaert N Helsens K et al Com-pomics utilities an open-source Java library for computa-tional proteomics BMC Bioinformatics 2011 12 70
[135] Vaudel M Barsnes H Berven F S Sickmann AMartens L SearchGUI an open-source graphical userinterface for simultaneous OMSSA and XTandemsearches Proteomics 2011 11 996ndash999
[136] Vaudel M Burkhart J M Sickmann A Martens L Za-hedi R P Peptide identification quality control Proteomics2011 11 2105ndash2114
[137] Zhu Y Hultin-Rosenberg L Forshed J Branca R Met al SpliceVista a tool for splice variant identification andvisualization in shotgun proteomics data Mol Cell Pro-teomics 2014 13 1552ndash1562
[138] Martens L Resilience in the proteomics data ecosystemhow the field cares for its data Proteomics 2013 13 1548ndash1550
[139] Perez-Riverol Y Hermjakob H Kohlbacher O MartensL et al Computational proteomics pitfalls and challengesHavanaBioinfo 2012 workshop report J Proteomics 201387 134ndash138
[140] Csordas A Wang R Rios D Reisinger F et al FromPeptidome to PRIDE public proteomics data migration at alarge scale Proteomics 2013 13 1692ndash1695
[141] Gostev M Faulconbridge A Brandizi M Fernandez-Banet J et al The BioSample database (BioSD) at theEuropean Bioinformatics Institute Nucl Acids Res 201240 D64ndashD70
Ccopy 2014 The Authors PROTEOMICS published by Wiley-VCH Verlag GmbH amp Co KGaA Weinheim wwwproteomics-journalcom