ProteomeGRID: towards a high-throughput Proteomics...



ProteomeGRID

The main objective of ProteomeGRID is to provide a Grid-enabled high-throughput proteomics pipeline, encompassing 2-DE acquisition and differential protein expression analysis, through to robotic excision of interesting protein spots. The system provides the upstream automation for current efforts in large-scale mass spectrometric protein identification. To realise this goal, it is vital to provide a generalised framework that can benefit the entire proteomics community.

ProteomeGRID builds on the proTurbo cluster image computing engine, with Grid-enabled versions of novel automatic 2-DE image analysis algorithms, and workflow middleware based upon the Open Grid Services Architecture (OGSA - http://www.globus.org/). ProteomeGRID is detailed in Figure 3.

HIGH-THROUGHPUT WITH proTurbo

The first step towards this goal is to overcome the computational and communications burden entailed by the image analysis of 2-DE gels, using OGSA-enabled cluster computing. We have developed the high-throughput proTurbo framework, which utilises Condor (http://www.cs.wisc.edu/condor/) cluster management, with JPEG-LS lossless image compression, for task farming massive batches of 2-D gels. Spare capacity is harvested from idle office machines when their users are away, and a novel probabilistic eager scheduler has been developed to maintain high throughput in response to the likelihood of the owners returning.
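The scheduler itself is described in [3]. Purely as an illustration of why replicating tasks helps when a worker may be reclaimed by its owner, the following minimal sketch computes how many copies of a task would be needed so that the chance of every copy being evicted stays below a chosen limit. The function name `replicas_needed`, the 5% limit and the assumption that evictions on different machines are independent are ours for illustration; this is not the proTurbo implementation.

```python
def replicas_needed(p_evict: float, max_loss: float = 0.05) -> int:
    """Smallest number of copies k such that p_evict**k <= max_loss.

    Hypothetical model: each copy runs on a different worker and each worker
    is evicted independently with probability p_evict.
    """
    if not 0.0 <= p_evict < 1.0:
        raise ValueError("p_evict must be in [0, 1)")
    if not 0.0 < max_loss < 1.0:
        raise ValueError("max_loss must be in (0, 1)")
    k = 1
    # Keep adding copies until the chance that all of them are lost is small enough.
    while p_evict ** k > max_loss:
        k += 1
    return k


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3, 0.5):
        print(f"eviction probability {p:.1f}: {replicas_needed(p)} copies")
```

Under this simple model, the number of copies grows only slowly with the eviction probability, which is one reason replication can be an inexpensive safeguard for the last few tasks in a batch.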

Our results [3] show a 4:1 lossless and 9:1 near-lossless image compression ratio, and so network overhead did not affect other users. With 40 workers we observed a 32× speedup, corresponding to 80% resource efficiency (Figure 2ab), and the eager scheduler reduced the impact of evictions by 58% (Figure 2cd).
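The efficiency figure follows directly from the speedup and the number of workers. As a small worked check (the helper name `parallel_efficiency` is ours, chosen for illustration):

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Resource efficiency = achieved speedup / number of workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


if __name__ == "__main__":
    # Figures quoted in the text: 32x speedup on 40 workers.
    print(f"{parallel_efficiency(32, 40):.0%}")  # 80%
```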

Until recently, 2-DE bioinformatics has relied on user-assisted spot detection and point pattern matching algorithms, requiring many hours of the biochemist’s time per gel pair. The complexity of new automated algorithms [1] requires Grid infrastructure for the processing, archival, standardisation and retrieval of proteomic data and metadata. Particular emphasis needs to be placed on large-scale image mining and statistical cross-validation for reliable, fully automated differential expression analysis, and on the development of a statistical 2-DE object model and ontology that underpins the emerging Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) General Proteomics Standard (GPS) [2].

INTRODUCTION

The genome is the genetic blueprint of an organism. However, sequencing these genes does not provide enough data to develop new therapies. Genes encode the construction of proteins, the ‘chemical building blocks’ that give structure and function to living things. Cells have the same genome but differ in which genes are active and, correspondingly, which proteins are made. The proteome is complex – the human proteome contains approximately 5 million different proteins.

ProteomeGRID: towards a high-throughput Proteomics pipeline

Andrew W. Dowsey (1), Michael J. Dunn (2) and Guang-Zhong Yang (1)
(1) Royal Society / Wolfson MIC Lab, Imperial College London.
(2) Conway Institute of Biomolecular and Biomedical Research, University College Dublin.

Large-scale Statistical Expression Analysis (SEA)

In traditional techniques, a symbolic representation of the spots is determined at the very early stages of the processing pipeline. Once such a description is reached, intrinsic errors, dependent on the spot modelling and matching algorithms used, persist throughout the subsequent processing steps.

In practice, if the analyses were modelled statistically and performed simultaneously between sets of technical replicates, then small, individually insignificant expression changes over one pair could become significant when reinforced by the same consistent changes in the others. This permits the creation of a database of probabilistic baselines, or ‘norms’, which characterise confidence levels for protein expression under different experimental settings. Such evidence will build up a statistical formation model of 2-DE, which can then be mined to discover intrinsic trends, increase the acuity of new results and provide valuable feedback to 2-DE scientists on the sensitivity of their experiments. We call this Statistical Expression Analysis (SEA).
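To illustrate how weak but consistent evidence from several replicate pairs can add up, the sketch below combines per-pair p-values with Fisher's method. This is a standard textbook technique used here as a stand-in; it is not the SEA model itself, and it assumes the replicate tests are independent. The function name and the example p-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For an even number of
    degrees of freedom the tail probability has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (X/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Four replicate pairs, each showing the same change at p = 0.08:
    # none is significant at the 5% level alone, but together they are.
    print(fisher_combined_p([0.08, 0.08, 0.08, 0.08]))
```

With a single p-value the function returns that value unchanged, which is a convenient sanity check on the closed form.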

Figure 3: In ProteomeGRID, gels are first submitted to the advanced image normalisation stage, which includes image registration with bias field correction. Normalised gels are then analysed for differential expression by combining probabilities intra-set (between duplicate gels), inter-set (between control and sample sets) and also from the statistical norms retrieved from the database. The disseminated results will then add further statistical acuity to the norms through a feedback loop. It is expected that the results will provide generic interfacing to robotic spot cutting and automated mass spectrometry.

Figure 4: (far left) The ProteomeGRID client visualising the processing of a large gel set sent to the proTurbo server. (left) Illustrating the challenges facing 2-DE image analysis. (a) Control gel. (b) The control gel (magenta) overlaid on a treated sample gel (green). The geometric distortions are clearly evident. (c) The deformation mesh calculated by the fully automated MIR algorithm. (d) The control gel overlaid on the transformed sample gel. An expression stain bias towards the sample gel now becomes noticeable. (e) MIR searches for the global maximum and avoids local maxima by iteratively refining a coarse mesh on progressively higher sampled images, using a BFGS optimiser on a cross-correlation similarity measure.

Figure 2: (a) 3540 image registration tasks, performed by 5 to 40 worker machines. Each run was repeated 40 times, and the 5th, 25th, 50th, 75th and 95th percentiles are also shown. (b) Timeline showing the experiment, performed repeatedly over a work day from midnight to midnight. (c) The probabilistic task replication is illustrated by a 42-task experiment run on 40 workers. 500 runs were performed for each task failure probability between 0 and 50%. (d) The same test performed without task replication.

General purpose computation on graphics processing units

The image analysis has been implemented on the consumer nVidia GeForce 6800 Ultra graphics card. Whilst GPUs are clocked considerably slower than CPUs (450 MHz vs 3.4 GHz), their ability to process up to 64 values in parallel makes them significantly faster for 3D graphics and visualisation.

For this reason, there is now a sizeable research community developing scientific, database and imaging applications on GPUs (http://www.gpgpu.org/). We have seen a 6× speedup in performance compared to similarly priced CPUs. Furthermore, typical office machines rarely exploit the GPU’s potential, so proTurbo can schedule jobs almost continually, even whilst the computer is in use.

• 2-DE exhibits physical deformation. Prevalent warping algorithms are designed to correct or induce optical distortion, and so cause apparent protein over-expression under dilation and under-expression under contraction (Figure 4a-c). To preserve volume-invariance, the intensity of each transformed pixel must be normalised by the Jacobian of the mapping at that point.

• Excluding artefacts, the volume and distribution of protein in the sample and reference gels should be the same. Each parameter in the warping should affect the same volume of protein. Currently we subdivide the gels recursively (Figure 3d) by their centre of mass (Figure 4b), as a fast approximation.

• The major remaining systematic artefact is the inaccuracy in quantification due to staining, loading and focusing errors [4]. We compensate for these anomalies by fitting a piecewise B-spline bias field that maximises the similarity between the gels, every time the registration picks a new warping (Figure 4ef).
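The volume-invariance requirement in the first point above can be illustrated in one dimension. The sketch below resamples an idealised spot through a uniform 2× dilation, once naively and once with each output pixel scaled by the Jacobian of the sampling map. It is a simplified illustration under our own assumptions (1-D signal, linear interpolation, a finite-difference Jacobian, hypothetical function names), not the MIR implementation. Here the code multiplies by the Jacobian of the inverse (output-to-source) map, which is equivalent to dividing by the Jacobian of the forward warp.

```python
def sample(signal, x):
    """Linearly interpolate `signal` at real-valued position x (zero outside)."""
    if x < 0 or x > len(signal) - 1:
        return 0.0
    i = int(x)
    j = min(i + 1, len(signal) - 1)
    frac = x - i
    return (1 - frac) * signal[i] + frac * signal[j]


def warp(signal, inverse_map, out_len, preserve_volume):
    """Resample `signal`: output pixel x reads the source at inverse_map(x).

    With preserve_volume=True each output value is scaled by the local
    Jacobian |d inverse_map / dx|, estimated by a central difference.
    """
    out = []
    for x in range(out_len):
        value = sample(signal, inverse_map(x))
        if preserve_volume:
            jacobian = abs(inverse_map(x + 0.5) - inverse_map(x - 0.5))
            value *= jacobian
        out.append(value)
    return out


def dilate_by_two(x):
    """Inverse map for a uniform 2x dilation: output x reads source x / 2."""
    return x / 2.0


spot = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]  # idealised spot, volume 4

naive = warp(spot, dilate_by_two, 13, preserve_volume=False)
invariant = warp(spot, dilate_by_two, 13, preserve_volume=True)

if __name__ == "__main__":
    print("original volume: ", sum(spot))
    print("naive warp:      ", sum(naive))      # inflated by the dilation
    print("volume-invariant:", sum(invariant))  # matches the original
```

In this example the naive warp doubles the integrated intensity of the spot, which would be read as over-expression, whereas the Jacobian-scaled version keeps it at its original value.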

ADVANCED IMAGE NORMALISATION

Raw gels submitted into ProteomeGRID (Figure 4a) must first be calibrated, so that they can be analysed directly. To maintain high throughput this stage must be fully automatic, and so we based our approach on our MIR algorithm [1], which is fully image-based – it requires no recourse to detecting spots (Figure 3b-e). This has been enhanced with the following:

References

[1] Veeser, S., Dunn, M. J., Yang, G. Z., Proteomics, 2001, 1, 856-870.
[2] Orchard, S., Hermjakob, H., Apweiler, R., Proteomics, 2003, 3, 1374-1376.
[3] Dowsey, A. W., Dunn, M. J., Yang, G. Z., Proteomics, 2004, 4, 3800-3812.
[4] Dowsey, A. W., Dunn, M. J., Yang, G. Z., Proteomics, 2003, 3, 1567-1596.

[Figure 1 axis labels: 1st dimension separation (pH); 2nd dimension separation (Mr)]

A more complete understanding of disease may be gained by looking at the proteins present within a diseased cell or tissue. Two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) is the only available method capable of simultaneously separating thousands of proteins (Figure 1). One of the key objectives of the biochemist is to identify the differential expression between control and experimental sample gels, that is, the protein spots that have been inhibited (disappeared), induced (appeared), or have changed in abundance. Such information has enormous biological and clinical potential, e.g. disease identification, drug synthesis and proteome mapping.

Unfortunately, 2-D PAGE exhibits much experimental variation. Geometric deformations in the protein pattern, due to gel inhomogeneities and current leakage, hamper the correct matching of corresponding spots, whilst non-uniform staining and overlapping spots contribute to uncertainties in expression quantification.

Figure 1: By separating the proteins by pH horizontally and molecular mass vertically we can isolate thousands of protein ‘spots’ and quantify their abundance (expression).

Figure 4: (a) Idealised spot, (b) dilated with traditional warping, (c) and then normalised by a volume-invariant warping. (d) Approximating an even subdivision of a gel. (e) Two overlaid gels show regional artefacts of over-expression (magenta) and under-expression (green). (f) Extracted bias field.

[Figure 3 diagram labels: Gels; Gel meta-data (PSI model and ontologies); Advanced image normalisation; Standardised gel images; Image mining for differential expression; Update statistical norms; Spot locations; Robotic spot cutting & digestion to peptide fragments; high throughput ESI-MS/MS peptide sequence data; Peptide fingerprint; Peptide homology lookup (e.g. MS-Fit/Tag); Identified protein; PSI/GPS model, use of IPI identifiers and ontologies e.g. LSOO; Dissemination through web services; 2D-E GRID computation; MS GRID computation; Integrated Proteomics database; ProteomeGRID]
