
Parallel implementations of a 3-d image reconstruction algorithm

L. Pastor (a), A. Sánchez (b), F. Fernández (a), A. Rodríguez (a)

(a) Departamento de Tecnología Fotónica
(b) Departamento de Lenguajes y Sistemas Informáticos
Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain

    ABSTRACT

This paper compares two different parallel implementations of Feldkamp's cone-beam reconstruction method for 3D tomography. The first approach is based on a vector-parallel shared-memory architecture, and the second on a transputer-based distributed-memory architecture. The experimental results have shown the effectiveness of both models for executing this kind of compute-intensive parallel algorithm.

    1. INTRODUCTION

From a computational point of view, 3D image reconstruction is a very demanding task. For example, for reconstructing an object with N³ voxels (volume elements), the cone-beam backprojection operation, the most time-consuming stage of filtered backprojection methods (Feldkamp et al. [4]), requires around 60N⁴ floating-point operations. Other 3D reconstruction techniques need even more operations (Smith [15], Grangeat [5]). In order to achieve acceptable reconstruction times for practical resolutions (in the range between 128³ and 512³), special purpose hardware can be developed.
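As an illustration of the magnitude involved, at N = 128 this estimate already amounts to 60 · 128⁴ ≈ 1.6 × 10¹⁰ floating-point operations; a machine sustaining, say, 10 MFLOPS would therefore need on the order of 1600 seconds for the backprojection alone.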

An alternative approach can be based on exploiting the parallelism inherent in reconstruction algorithms: an in-depth analysis of the problem and its algorithmic solution can help with the definition of the computational task, allowing a maximum degree of parallelism to be achieved while taking advantage of existing parallel machines. In general this last approach yields lower performance, although it has a very good price/performance ratio. Similar solutions have been developed in other image processing domains (Li et al. [13], Webber [17]).

This paper describes the implementation of Feldkamp's cone-beam reconstruction method on two different parallel architectures: a shared-memory vector-parallel multiprocessor (Alliant FX/40) and a hierarchical distributed-memory message-passing multiprocessor (T.Node). The first machine was selected for implementing reconstruction algorithms in the EC BRITE project 'EVA' (Morisseau et al. [14]), developed jointly by INTERCONTROLE (France), LETI-CEA (France), CUALICONTROL/DTF-UPM (Spain), MILANO RICERCHE (Italy), REGIENOV (France) and FAIREY TECRAMICS (United Kingdom).

This paper is organized as follows: section 2 presents the main stages of Feldkamp's reconstruction method. Section 3 offers a brief description of the architectural features of the considered multiprocessors. Section 4 describes the implemented solutions. Section 5 presents and compares the major experimental results for both parallel architectures. Finally, section 6 offers some concluding remarks.

    2. FELDKAMP'S CONE-BEAM RECONSTRUCTION METHOD

Feldkamp's method [4] is a cone-beam geometry extrapolation of two-dimensional fan-beam reconstruction techniques. It is composed of three main stages:

- Projection weighting: the projection data is multiplied by coefficients that depend only on their position within the projection (the coefficients remain constant for all of the acquisitions).

- Filtering: the weighted data is convolved during this stage with a one-dimensional filter, such as Shepp-Logan's (Jain [9]).

- Backprojection: during the method's third stage, the weighted and filtered projection data is backprojected using a cone-beam geometry (Fig. 1).

Figure 1 Cone-beam acquisition (source, object and detector).

Feldkamp's method was the first practical cone-beam reconstruction method available for 3D tomography. Being an extrapolation of fan-beam techniques, it is correct only for the object's middle plane (the plane containing the source's circular trajectory). For small vertical cone aperture angles, the reconstruction errors remain acceptable. The method's most salient feature is its simplicity, and it is remarkably efficient from the computational point of view. A sequential algorithm for Feldkamp's method can be found in Jacquet [8].
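As an illustration of the first two stages, the following C sketch applies a typical Feldkamp-style cosine weighting to one projection and builds a discrete Shepp-Logan convolution kernel. The weighting formula, the geometry parameters (source-to-origin distance D, detector pitches du and dv, filter sampling step tau) and the function names are assumptions made for this sketch; they are not taken from the paper.

```c
#include <math.h>

/* Weight one nu x nv projection. A common Feldkamp-style weighting is
 * w(u,v) = D / sqrt(D^2 + u^2 + v^2), where (u,v) are detector coordinates
 * measured from the detector centre and D is the source-to-origin distance
 * (all assumed here for illustration). */
void weight_projection(float *proj, int nu, int nv,
                       double du, double dv, double D)
{
    for (int iv = 0; iv < nv; iv++) {
        double v = (iv - 0.5 * (nv - 1)) * dv;
        for (int iu = 0; iu < nu; iu++) {
            double u = (iu - 0.5 * (nu - 1)) * du;
            proj[iv * nu + iu] *= (float)(D / sqrt(D * D + u * u + v * v));
        }
    }
}

/* Discrete Shepp-Logan kernel h(n) = -2 / (pi^2 * tau^2 * (4n^2 - 1)),
 * to be convolved row by row with the weighted projection data. */
void shepp_logan_kernel(double *h, int n_taps, double tau)
{
    const double pi = 3.14159265358979323846;
    int half = n_taps / 2;
    for (int i = 0; i < n_taps; i++) {
        int n = i - half;
        h[i] = -2.0 / (pi * pi * tau * tau * (4.0 * (double)n * n - 1.0));
    }
}
```

Since both operations touch each projection exactly once and do not depend on the object volume, they can be folded into acquisition-time preprocessing, which is one of the reasons given below for parallelizing only the backprojection stage.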

In this paper only the backprojection stage has been considered for parallelization, for two reasons:

- The complexity of the cone-beam backprojection is substantially higher than the complexity of the other two stages together (O(N⁴) versus the lower-order cost of weighting and filtering).

- The first two stages can be computed at data acquisition time, together with additional preprocessing operations.


    3. CONSIDERED MULTIPROCESSORS

    3.1. Shared-memory paradigm: ALLIANT FX/40

The available shared-memory multiprocessor is an Alliant FX/40 with four Advanced Computational Elements (ACEs), one Interactive Processor (IP), 256 KB of shared cache and 32 MB of main memory (Alliant [2]). The ACEs are processors with vector capabilities, crossbar-connected to the shared cache, which is in turn connected via a high-speed bus to the shared memory.

In the Alliant FX/40 system, concurrent programs can be produced using the FX/Fortran compiler (Alliant [1]), which has been specially designed to support concurrency and is one of the best vectorizing compilers available. Optimized library functions for vector and matrix operations are also available.

3.2. Distributed-memory paradigm: TELMAT T.Node

The T.Node system is a commercial product of Telmat Informatique (Telmat [16]) which emerged from the development of the Supernode (ESPRIT project P1085). It is a loosely coupled MIMD multiprocessor based on 25 MHz T800 transputers (Inmos [7]), in which the interconnections (called links) among transputers are made via software-controlled switches. This modular and hierarchical architecture is based on reconfigurable nodes (or modules) of transputers, allowing the interconnection of up to 1024 processors.

Each basic node is a reconfigurable network with a maximum of 16 worker transputers and an associated control transputer. The four bidirectional links of each transputer are connected to a 72x72 crossbar switch, configured by a program running on the control transputer. An additional control bus, with a master-slave protocol, enables any transputer to communicate with the control transputer independently of the links.

The available hardware configuration has 3 worker boards (or clusters) with 8 transputers per cluster. Each processor has 2 MB of dynamic RAM. The host machine is a Sun 4 workstation.

The programming language is 3L's Fortran 77 with library functions for specifying interprocessor communication. The 3L programming environment consists of a software toolset for compiling, linking, debugging and running parallel applications on transputers (3L [18]). Since transputer networks are reconfigurable, it is necessary to describe how the transputers are to be interconnected for running a particular program. This is done by means of a configuration file, which includes the number of transputers needed, how their links are connected and the way software processes are mapped onto the physical network.

    4. PARALLELIZATION ASPECTS IN THE BACKPROJECTION STAGE

This section describes the main aspects considered in the parallelization of Feldkamp's algorithm. In particular, we have focused on how the backprojection step can be efficiently executed on the two parallel architectures considered.

    4.1. Code optimizations (for shared-memory architectures)

The strategy followed in this case involves two main lines:

1.- Aspects such as code vectorization and parallelization (in particular, the execution of nested loops in Concurrent Outer (loop), Vector Inner (loop), or COVI, mode), elimination of unnecessary or redundant code, factorization of operations, subroutine expansion, loop collapsing, etc. have been used when advisable to achieve good results. The COVI execution mode for nested loops achieves the computer's maximum parallel and vector performance; a sketch of this loop structure is given after this list.

2.- An important problem found in optimizing high-resolution reconstructions was the degradation of the system's virtual memory performance when large data volumes were used. The evolution of the reconstruction times as the problem size increased clearly showed that a monolithic data structure was not feasible. This problem was avoided by limiting the amount of memory used by the application so as to match the computer's main memory size.

Feldkamp's method uses two data volumes: the acquisition or projection volume, and the object volume. The first is accessed projection after projection, so it is not necessary to keep already-used projections in memory. The object volume, on the other hand, is traversed for each projection. This volume was therefore decomposed into horizontal slices to limit the memory requirements of high-resolution reconstructions.
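The following C sketch illustrates the concurrent-outer/vector-inner idea on the loop nest that dominates the computation: the backprojection of one weighted and filtered projection into one horizontal object slice. OpenMP pragmas are used here as a present-day stand-in for the FX/Fortran COVI directives, and the geometry (circular source orbit of radius D, detector plane through the origin, nearest-neighbour sampling) is an assumption made only for illustration.

```c
#include <math.h>

/* Accumulate one weighted, filtered projection (angle theta) into one
 * horizontal slice of the object volume. The outer loop is run concurrently
 * across processors, the inner loop is left to the vector units
 * (OpenMP "parallel for" + "simd" as a modern analogue of COVI). */
void backproject_slice(float *slice, int nx, int ny, double z,
                       double voxel, const float *proj, int nu, int nv,
                       double du, double dv, double D, double theta)
{
    double c = cos(theta), s = sin(theta);

    #pragma omp parallel for            /* concurrent outer loop */
    for (int iy = 0; iy < ny; iy++) {
        double y = (iy - 0.5 * (ny - 1)) * voxel;
        #pragma omp simd                /* vector inner loop     */
        for (int ix = 0; ix < nx; ix++) {
            double x = (ix - 0.5 * (nx - 1)) * voxel;
            double t = x * c + y * s;        /* detector-parallel coord. */
            double l = -x * s + y * c;       /* along the ray direction  */
            double m = D / (D - l);          /* magnification factor     */
            double u = t * m, v = z * m;     /* detector coordinates     */
            /* nearest-neighbour lookup kept for brevity; bilinear
             * interpolation would normally be used here */
            int iu = (int)floor(u / du + 0.5 * nu);
            int iv = (int)floor(v / dv + 0.5 * nv);
            if (iu >= 0 && iu < nu && iv >= 0 && iv < nv)
                slice[iy * nx + ix] += (float)(m * m * proj[iv * nu + iu]);
        }
    }
}
```

Processing the object one horizontal slice at a time, as described in point 2 above, keeps the working set within the machine's main memory without changing this loop structure.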

4.2. Mapping strategies (for distributed-memory architectures)

The cone-beam backprojection can be computed by accumulating the contributions of the projections to each of the points of the n considered slices S = (s_1, s_2, ..., s_n) of the object to be reconstructed, taking into account the superposition effect of the m different projections P = (p_1, p_2, ..., p_m). The densities of the corresponding slices, D = (d_1, d_2, ..., d_n), can be computed independently of one another. Denoting by b(p_j, s_i) the backprojection of projection p_j onto slice s_i, this stage can be summarized by the following expression:

d_i = Σ_{j=1}^{m} b(p_j, s_i),    i = 1, ..., n

which can be transformed into the local recurrent form:

d(j, i) = d(j-1, i) + b(p(j, i-1), s_i)
p(j, i) = p(j, i-1)

with initial conditions d(0, i) = 0 and p(j, 0) = p_j, and final result d_i = d(m, i).

Figure 2 (a) Backprojection dependence graph (m = 6, n = 3); (b) a corresponding signal flow graph (SFG).

The set {(j, i) | 1 ≤ j ≤ m, 1 ≤ i ≤ n} defines the index space of this dependence graph (DG), whose mapping onto the


transputer network can be systematically derived using different time-allocation functions (Kung [11]) which fulfil the causality constraints. In our problem we have taken into consideration the communication and memory requirements and the dimensions of the index space. The following allocation function has been selected:

    allocation(j, i) = i

which is equivalent to a projection of the DG along the j-axis direction (figure 2.b). For the timing function, two permissible options can be selected:

timing1(j, i) = j        (with broadcasting)
timing2(j, i) = j + i    (without broadcasting)
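To make the schedule concrete, the short C program below (written only for illustration; none of its names come from the paper) prints which projection each backprojector handles at every time step under timing2: projection p_j reaches processor i at step j + i, so the projections ripple down the pipeline one stage per step and no broadcast link is required.

```c
#include <stdio.h>

int main(void)
{
    const int m = 6;   /* projections, as in figure 2 */
    const int n = 3;   /* slices / backprojectors     */

    /* allocation(j, i) = i, timing2(j, i) = j + i */
    for (int t = 2; t <= m + n; t++) {      /* time steps       */
        printf("step %2d:", t);
        for (int i = 1; i <= n; i++) {      /* processor i      */
            int j = t - i;                  /* projection index */
            if (j >= 1 && j <= m)
                printf("  b%d <- p%d", i, j);
        }
        printf("\n");
    }
    return 0;
}
```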

This systolic-like approach allows us to consider different parallelization strategies for the backprojection stage in Feldkamp's algorithm. The chosen

    distributed solution is detailed in the following section.

4.3. Implementation on distributed-memory architectures

Figure 3 Selected topology and software processes.

We have considered different strategies for exploiting parallelism (pipeline, geometric, farm) (de Carlini et al. [3], Harp [6], Jane et al. [10]). Due to the

characteristics of the problem being solved, and taking into account the limitations of our hardware configuration (especially the number of processors and the size of the available memory in each transputer), we have adopted a pipeline topology as shown in figure 3.

In this topology, three types of processes can be distinguished: a process that sequentially performs the weighting and filtering stages if needed (main); a set of identical processes, one replicated in each worker transputer, which perform the backprojection over independent slices of the target volume (backprojectors); and, finally, a process which sends projections from main to the first backprojector in the pipeline and collects slices of the reconstructed volume in the opposite direction (recollector). To summarize, the operations carried out by each process type are expressed in the following pseudocode:

Process main
(1) Initialization step
(2) FOR all backprojectors b_i DO
    (2.1) Send identification and slice limits s_i to backprojector b_i
          through the recollector process
(3) FOR all projections p_j of the target volume DO
    (3.1) Read projection p_j
    (3.2) Weight projection p_j {if necessary}
    (3.3) Filter projection p_j {if necessary}
    (3.4) Send filtered projection p_j to the first backprojector of
          the pipeline through the recollector process
(4) FOR all slices s_i of the reconstructed target volume DO
    (4.1) Receive the normalized slice s_i corresponding to
          backprojector b_i through the recollector process
    (4.2) Write slice s_i onto disk

Process backprojector (b_i)
(1) Receive the identification and limits of its volume slice s_i from
    backprojector b_{i-1} (or directly from the recollector for b_1)
(2) Send the identifications and limits of the volume slices s_l (l > i)
    to backprojector b_{i+1}
(3) FOR all projections p_j of the target volume DO
    (3.1) Receive filtered projection p_j from backprojector b_{i-1}
          (or directly from the recollector for b_1)
    (3.2) FOR all lines l_k of its volume slice DO
        (3.2.1) Perform the backprojection operations over l_k using
                projection p_j
    (3.3) Send filtered projection p_j to backprojector b_{i+1}
(4) Normalize the reconstructed volume slice s_i
(5) Send the reconstructed volume slice s_i to backprojector b_{i-1}
    (to be finally received by process main)
(6) FOR all reconstructed volume slices s_l (l > i) DO
    (6.1) Receive slice s_l from backprojector b_{i+1}
    (6.2) Send slice s_l to backprojector b_{i-1}

Process recollector
(1) Receive the identifications and limits of the volume slices from main
(2) Send the identifications and limits of the volume slices to
    backprojector b_1
(3) FOR all filtered projections p_j DO
    (3.1) Receive p_j from main
    (3.2) Send p_j to backprojector b_1
(4) FOR all volume slices s_i of the reconstructed target volume DO
    (4.1) Receive s_i from backprojector b_1
    (4.2) Send s_i to main
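The pseudocode above maps naturally onto any message-passing layer. The following sketch reproduces the same pattern (projections flowing down the pipeline, reconstructed slices flowing back) using MPI in C instead of the 3L Parallel Fortran channel library actually used on the T.Node; the buffer sizes, tags and the folding of main and recollector into a single rank 0 are assumptions made for brevity, and the slice-limit messages and normalization step are omitted.

```c
#include <mpi.h>
#include <stdlib.h>

#define NU    128          /* detector columns (assumed)        */
#define NV    128          /* detector rows (assumed)           */
#define NPROJ 128          /* number of projections (assumed)   */
#define SLICE (128 * 128)  /* voxels per object slice (assumed) */

/* Placeholder for the per-slice cone-beam backprojection kernel. */
static void backproject(float *slice, const float *proj, int j)
{
    (void)slice; (void)proj; (void)j;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with at least 2 ranks */

    float *proj  = malloc(NU * NV * sizeof *proj);
    float *slice = calloc(SLICE, sizeof *slice);

    if (rank == 0) {                        /* main + recollector          */
        for (int j = 0; j < NPROJ; j++) {
            /* read, weight and filter projection j here, then push it
             * into the pipeline (the first backprojector is rank 1)      */
            MPI_Send(proj, NU * NV, MPI_FLOAT, 1, j, MPI_COMM_WORLD);
        }
        for (int i = 1; i < size; i++) {    /* collect the slices back     */
            MPI_Recv(slice, SLICE, MPI_FLOAT, 1, NPROJ + i,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* write slice i to disk here                                  */
        }
    } else {                                /* backprojector b_rank        */
        for (int j = 0; j < NPROJ; j++) {
            MPI_Recv(proj, NU * NV, MPI_FLOAT, rank - 1, j,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            backproject(slice, proj, j);    /* accumulate into own slice   */
            if (rank + 1 < size)            /* forward the projection      */
                MPI_Send(proj, NU * NV, MPI_FLOAT, rank + 1, j, MPI_COMM_WORLD);
        }
        /* send own slice back toward rank 0, then relay the slices coming
         * from the backprojectors further down the pipeline               */
        MPI_Send(slice, SLICE, MPI_FLOAT, rank - 1, NPROJ + rank, MPI_COMM_WORLD);
        for (int l = rank + 1; l < size; l++) {
            MPI_Recv(slice, SLICE, MPI_FLOAT, rank + 1, NPROJ + l,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(slice, SLICE, MPI_FLOAT, rank - 1, NPROJ + l, MPI_COMM_WORLD);
        }
    }

    free(proj); free(slice);
    MPI_Finalize();
    return 0;
}
```

On the T.Node the corresponding roles were implemented as 3L Fortran tasks whose link connections and placement on the worker transputers were declared in the configuration file described in section 3.2.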

    5. EXPERIMENTAL RESULTS

This section presents the experimental results achieved. The tests performed include volume reconstructions with resolutions of 32³, 64³, 128³, 160³ and 180³ voxels, using the two parallel implementations of Feldkamp's reconstruction method described in the previous sections.

Figure 4 refers to the Alliant FX/40 and shows processing time as a function of the number of processors. Each curve represents a different target volume resolution. It is interesting to point out that doubling the number of processors cuts the processing time by a factor of almost two, with the exception of the 32³ case.

The following three figures correspond to experiments performed on the Telmat T.Node. Figure 5 represents the same tests using different numbers of worker processors (1, 2, 4, 8, 16 and 23, respectively). Each processor performs the cone-beam backprojection on its own assigned volume slice, in parallel with the remaining processors.

Figure 4 Processing time on the Alliant FX/40 as a function of the number of processors, for various image resolutions.

Figure 5, like figure 4, shows large reductions in processing time when the number of available processors is increased. In this case the reductions differ somewhat more from the maxima achievable, due to inter-transputer communication times and to the presence of inherently sequential portions of the algorithm.

Figure 5 Processing time on the Telmat T.Node as a function of the number of processors, for various image resolutions.

Speedup curves for the 64³ case are displayed in figure 6: the curves represented are the ideal (linear) speedup, the experimental speedup obtained from measured execution times, and an adjusted speedup line obtained by a least-squares fit to the experimental results. This last line gives a 72% efficiency (measured as the ratio of speedup to the number of processors).
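For example, on 16 processors an efficiency of 72% corresponds to an effective speedup of roughly 0.72 × 16 ≈ 11.5.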

Figure 6 Speedup for an image with a resolution of 64³ voxels (linear, experimental and adjusted speedup).

Figure 7 compares communication and backprojection times for reconstructions of different resolutions, using the maximum (best) number of worker transputers (23 backprojectors). Communication time does not depend significantly on the number of processors, but rather on the reconstructed volume resolution (in general, the number of projections is much greater than the number of volume slices). This observation, together with the growth of the backprojection time as the target volume's resolution increases, explains why it is advantageous to use a large number of transputers.

Figure 7 Communication vs. backprojection time in the Telmat T.Node for different problem sizes.

Figure 8 Comparative processing times of the Alliant FX/40 and Telmat T.Node versions, as a function of problem size.

Finally, figure 8 compares the experimental backprojection times measured with the Alliant FX/40 and the Telmat T.Node. Each curve has been


    obtained using the maximum number of processors available for performing the

    backprojection stage (in our case, 4 and 23 for the shared-memory and

    distributed-memory multiprocessors, respectively).

    6. CONCLUDING REMARKS

In this paper, two parallel implementations of Feldkamp's method have been presented, each developed for a very different multiprocessor architecture. The experimental results have shown that the backprojection, the most computationally intensive stage, can be effectively parallelized: large speedups have been achieved on both the shared-memory and the distributed-memory multiprocessor as the number of processors used is increased.

Comparing both implementations, it can be noted that for all of the tests performed the response time of the shared-memory computer is better. This can be attributed to the additional speed improvements achieved by using the Alliant's vector capabilities. However, the differences between the two machines decrease slightly in percentage terms as the resolution increases. (It is interesting to point out that the overall memory available to the user was similar in both machines.) An aspect in favour of the transputer-based solution is that further machine upgrades can be performed at a smaller cost.

Regarding the T.Node version, the use of a pipeline topology has been shown to yield good performance, in particular for large resolutions. Moreover, the processing performance can be increased by adding processors to the network without modifying the processes. An important point in this sense, confirmed by the experimental results, is that the total communication time does not depend significantly on the number of transputers used. Similar parallelization strategies are presently being applied to other reconstruction techniques, such as Grangeat's [5].

    ACKNOWLEDGEMENTS

This work has been partly supported by the Spanish Ministry of Education and Science (CICYT) under grant ROB91-0489 and by the European Community BRITE Project EVA no. P-2051-4, contract no. R/1B-0285-C.


    REFERENCES

[1] Alliant Computer Systems Corporation, FX/Fortran Language Manual, Massachusetts, USA, 1988.

[2] Alliant Computer Systems Corporation, FX/Series Architecture Manual, Massachusetts, USA, 1988.

[3] de Carlini, U. and Villano, U. Transputers and Parallel Architectures: Message-passing Distributed Systems, Ellis Horwood, Chichester, England, 1991.

[4] Feldkamp, L.A., Davis, L.C. and Kress, J.W. 'Practical cone-beam algorithm', Journal of the Optical Society of America A, Vol. 1, No. 6, pp. 612-619, 1984.

[5] Grangeat, P. 'Mathematical Framework of the Cone Beam 3D Reconstruction via the First Derivative of the Radon Transform', in Mathematical Methods in Tomography, ed. Herman, G.T., Louis, A.K. and Natterer, F., Springer-Verlag, Heidelberg, Germany, 1991.

[6] Harp, G. (Ed.). Transputer Applications, Pitman, London, England, 1989.

[7] INMOS Ltd., The Transputer Databook, Redwood Burn Ltd., Trowbridge, England, 1989.

[8] Jacquet, I. Reconstruction d'images 3D par l'algorithme éventail généralisé, Thèse C.N.A.M., Grenoble, France, 1988.

[9] Jain, A.K. Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, USA, 1989.

[10] Jane, M.R., Fawcett, R.J. and Mawby, T.P. (Eds.). Transputer Applications - Progress & Prospects, Proc. of the Closing Symp. of the SERC/DTI Initiative in the Engineering Applications of Transputers, Reading, IOS Press, Amsterdam, The Netherlands, 1992.

[11] Kung, S.Y. VLSI Array Processors, Prentice-Hall, Englewood Cliffs, USA, 1988.

[12] Lewis, T.G. and El-Rewini, H. Introduction to Parallel Computing, Prentice-Hall, Englewood Cliffs, USA, 1992.

[13] Li, J. and Miguet, S. 'Parallel Volume Rendering of Medical Images', in Parallel Computing: From Theory to Practice (Ed. Joosen, W. and Milgrom, E.), pp. 332-343, Proceedings of the European Workshop on Parallel Computing, Barcelona, Spain, IOS Press, Amsterdam, The Netherlands, 1992.

[14] Morisseau, P. et al. 'X-ray voludensitometry: Application to the testing of technical ceramics', Proceedings of the 13th World Conference on NDT, São Paulo, Brazil, 1991.

[15] Smith, B.D. 'Cone-beam tomography: recent advances and a tutorial', Optical Engineering, Vol. 29, No. 5, pp. 524-534, 1990.

[16] Telmat Informatique, T.Node User Manual, 1990.

[17] Webber, H.C. (Ed.). Image Processing and Transputers, IOS Press, Amsterdam, The Netherlands, 1992.

[18] 3L Ltd., Parallel Fortran User Guide, Livingston, Scotland, 1988.

    Transactions on Information and Communications Technologies vol 3, © 1993 WIT Press, www.witpress.com, ISSN 1743-3517