
FAST RADIAL BASIS FUNCTION INTERPOLATION VIA PRECONDITIONED KRYLOV ITERATION

NAIL A. GUMEROV AND RAMANI DURAISWAMI

UNIVERSITY OF MARYLAND, COLLEGE PARK

Abstract. We consider a preconditioned Krylov subspace iterative algorithm presented by Faul et al. (IMA Journal of Numerical Analysis (2005) 25, 1–24) for computing the coefficients of a radial basis function interpolant over $N$ data points. This preconditioned Krylov iteration has been demonstrated to be extremely robust to the distribution of the points, and the iteration is rapidly convergent. However, the iterative method has several steps whose computational and memory costs scale as $O(N^2)$, both in the preliminary computations that build the preconditioner and in the matrix-vector product involved in each step of the iteration. We effectively accelerate the iterative method to achieve an overall cost of $O(N \log N)$. The matrix-vector product is accelerated via the use of the fast multipole method. The preconditioner requires the computation of a set of closest points to each point; we develop an $O(N \log N)$ algorithm for this step as well. Results are presented for multiquadric interpolation in $\mathbb{R}^2$ and biharmonic interpolation in $\mathbb{R}^3$. A novel FMM algorithm for the evaluation of sums involving multiquadric functions in $\mathbb{R}^2$ is presented as well.

Key words. Radial Basis Function Interpolation, Preconditioned Conjugate Gradient, Cardinal Function Preconditioner, Computational Geometry, Fast Multipole Method

AMS subject classification. 65D07, 41A15, 41A58

1. Introduction. Interpolation of scattered data of various types using radial basis functions has been demonstrated to be useful in different application domains [23, 7, 8, 29]. Much of this work has been focused on the use of polyharmonic and multiquadric radial basis functions $\phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that fit data as

$$s(x) = \sum_{j=1}^{N} \lambda_j \phi(|x - x_j|) + P(x), \qquad x \in \mathbb{R}^d, \qquad (1.1)$$

where several classes of radial basis functions may be chosen for $\phi$. Popular choices include

$$\phi(r) = \begin{cases} r & \text{biharmonic } (\mathbb{R}^3), \\ r^{2n+1} & \text{polyharmonic } (\mathbb{R}^3), \\ (r^2 + c^2)^{1/2} & \text{multiquadric } (\mathbb{R}^d). \end{cases}$$

Here the $\lambda_j$ are coefficients, while $P(x)$ is a global polynomial function of total degree at most $K-1$. The coefficients $\{\lambda\}$ and the polynomial function $P(x)$ are chosen to satisfy the $N$ fitting conditions

$$s(x_j) = f(x_j), \qquad j = 1, \ldots, N,$$

and the constraints

$$\sum_{j=1}^{N} \lambda_j Q(x_j) = 0$$

for all polynomials $Q$ of degree at most $K-1$.

Let $\{P_1, P_2, \ldots, P_K\}$ be a basis for the polynomials of total degree $K-1$. Then the corresponding system of equations that needs to be solved to find the coefficients


$\lambda$ and the polynomial $P$ is

$$\begin{pmatrix} \Phi & P^T \\ P & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ b \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix},$$

where

$$\Phi_{ij} = \phi(|x_i - x_j|), \quad i, j = 1, \ldots, N; \qquad P_{ij} = P_j(x_i), \quad i = 1, \ldots, N, \; j = N+1, \ldots, N+K.$$

Usually the degree $K$ of the polynomial is quite small, and quite often it is 1, in which case the polynomial $P(x)$ is just a constant; this is the case that will be considered in the sequel. Micchelli [24] has shown that this system of equations is nonsingular.
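For concreteness, the following is a minimal dense-solve sketch of this system for the biharmonic kernel $\phi(r) = r$ in $\mathbb{R}^3$ with $K = 1$ (a constant polynomial term, the case considered in the sequel). It illustrates the $O(N^3)$ direct approach that the rest of the paper is designed to avoid; the function names and test data are ours, not from the paper.

```python
import numpy as np

def fit_rbf_direct(x, f):
    """Direct O(N^3) solve of the bordered system for phi(r) = r (biharmonic
    RBF in R^3) with a constant polynomial term (K = 1)."""
    N = x.shape[0]
    Phi = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)  # Phi_ij = |x_i - x_j|
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = Phi
    A[:N, N] = 1.0                    # constant (polynomial) column
    A[N, :N] = 1.0                    # zero-sum constraint row
    sol = np.linalg.solve(A, np.concatenate([f, [0.0]]))
    return sol[:N], sol[N]            # coefficients lambda and constant b

def rbf_eval(x, lam, b, y):
    """Evaluate s(y) = sum_j lam_j |y - x_j| + b at the points y."""
    return np.linalg.norm(y[:, None, :] - x[None, :, :], axis=2) @ lam + b

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, (200, 3))
f = np.sin(x[:, 0]) + x[:, 1] * x[:, 2]
lam, b = fit_rbf_direct(x, f)
print(np.abs(rbf_eval(x, lam, b, x) - f).max())  # ~ machine precision: fit holds
```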

For large $N$, as is common in many applications, the computational time and memory required for the solution of the fitting equations are a significant bottleneck. Direct solution of these equations requires $O(N^3)$ time and $O(N^2)$ memory. Iterative approaches to solving this set of symmetric equations require evaluations of dense matrix-vector products $\Phi\lambda$, which have an $O(N^2)$ computational cost and require $O(N^2)$ memory. Furthermore, for rapid convergence of the iteration, the system must be preconditioned appropriately.

Considerable progress in the iterative solution of multiquadric and polyharmonic RBF interpolation was made with the development of preconditioners using cardinal functions based on a subset of the interpolating points. This approach was first proposed by Beatson and Powell [1]. Various versions of the idea have since been tried, including those based on choosing a set of data near each point, a mix of near and far points, and modified approximate cardinal functions that have a specified decay behavior [3]. An alternative group of methods that attempt to speed up the algorithm via domain decomposition has also been developed [4].

Recently, an algorithm was published by Faul et al. [13] which uses the cardinal function idea, and in which for each point a carefully selected set of $q$ points is used to construct the preconditioner. In informal tests we found that this algorithm achieves robust performance and is able to interpolate large problems in a few iterations where other cardinal-function-based strategies sometimes took a large number of steps. In subsequent discussions we refer to this algorithm as FGP05. The algorithm reduces the number of iterations required; however, its cost remains $O(N^2)$, and it requires $O(N^2)$ memory. The purpose of this paper is the acceleration of this algorithm.

Similarly to many other iterative techniques, the FGP05 algorithm can be accelerated by the use of the Fast Multipole Method (FMM) for the matrix-vector product required at each step, making the cost of each iteration $O(N \log N)$ or so [14]. While the possibility of using the FMM was mentioned in [13], results were not reported. Fast multipole methods have been developed for various radial basis kernels, including the multiquadric in $\mathbb{R}^d$ [11], the biharmonic and polyharmonic kernels in $\mathbb{R}^3$ [19], the polyharmonic kernels in $\mathbb{R}^4$ [6], and Gaussian kernels [17, 26]. Use of the FMM can thus reduce the cost of the matrix-vector multiply required in each iteration of the FGP05 algorithm to $O(N \log N)$.

However, there is a further barrier to achieving a truly fast algorithm using the FGP05 algorithm, as there is a precomputation stage that requires the construction of approximate cardinal function interpolants centered at each data point and $q$ of its carefully selected neighbors ($q \ll N$). These interpolants are used to provide a set of directions in the Krylov space for the iteration. The procedure for these preliminary


calculations reported in [13] has a formal complexity of $O(N^2)$. Progress towards reducing the cost of this step to $O(N \log N)$ was reported by Goodsell in two dimensions [18]. However, his algorithm relied on a Dirichlet tessellation of the plane and does not extend to higher dimensions.

In this paper we present a modification to this preliminary setup stage, use the FMM for the matrix-vector product, and reduce the complexity of the overall algorithm. Section 2 provides some preliminary information on preconditioning with cardinal functions, details of the Faul et al. algorithm, and some details of the Fast Multipole Method that are necessary for the algorithms developed in this paper. Section 3 describes the two improvements that let us achieve improved performance in our modified algorithm: finding the $q$ closest points to each point of the set in $O(N \log N)$ time, and the use of the fast multipole method. Since the latter method is well discussed in the literature, we only provide some details that are unique to our implementation; in particular, we present novel FMM algorithms for the multiquadric (and the inverse multiquadric) in $\mathbb{R}^2$. Section 4 provides results of the use of the algorithm. Section 5 concludes the paper.

2. Background.

2.1. Preconditioning using approximate cardinal functions. A long sequence of papers [1, 3, 4] uses approximate cardinal functions to create preconditioners for RBF fitting. The basic iterative method consists of updating at each iteration step $k$ the approximation $s^{(k)}$ to the function $f$,

$$s^{(k)}(x) = \sum_{j=1}^{N} \lambda_j^{(k)} \phi(|x - x_j|) + a^{(k)}, \qquad x \in \mathbb{R}^d, \qquad (2.1)$$

by specifying new coefficients $\lambda_j^{(k)}$, $j = 1, \ldots, N$, and the constant $a^{(k)}$. The residual at step $k$ is obtained at the data points $x_i$ as

$$r_i^{(k)} = f(x_i) - s^{(k)}(x_i), \qquad i = 1, \ldots, N.$$

To establish notation, and following FGP05, we denote by $s^*$ the function that arises at the end of the iteration process.

The cardinal (or Lagrange) functions corresponding to any basis play a major role in interpolation theory. The cardinal function $z_l$ corresponding to a particular data point $l$ satisfies the Kronecker conditions

$$z_l(x_j) = \delta_{lj}, \qquad j, l = 1, \ldots, N, \qquad (2.2)$$

i.e., $z_l$ takes the value unity at the $l$th node and zero at all other nodes. It can itself be expressed in terms of the radial basis functions:

$$z_l(x) = \sum_{j=1}^{N} \zeta_{lj} \phi(|x - x_j|) + a_l, \qquad (2.3)$$

where the $\zeta_{lj}$ are the interpolation coefficients and $a_l$ is the corresponding constant. If we now form a matrix of all the interpolation coefficients, $T_{ji} = \zeta_{ij}$, then we have

$$\begin{pmatrix} \Phi & \mathbf{1}^T \\ \mathbf{1} & 0 \end{pmatrix} \begin{pmatrix} T \\ a \end{pmatrix} = \begin{pmatrix} I \\ 0 \end{pmatrix}, \qquad (2.4)$$


where $I$ is the identity matrix, $\mathbf{1}$ is a row vector of size $N$ containing ones, and $a$ is a row vector containing the coefficients $a_l$. Thus the matrix of interpolation coefficients corresponding to the cardinal functions is the inverse of the matrix $\Phi$, and if these coefficients were known, it would be easy to solve the system. On the other hand, computing each of these sets of coefficients for a single cardinal function $z_l$ (corresponding to a column of $T$) by a direct algorithm is an $O(N^3)$ procedure, and there are $N$ of these functions.

The approach pursued by Beatson and Powell [1] was to compute approximations to these cardinal functions, so that a column $\zeta_{lj}$ is approximated by a sparse set of coefficients, say $q$ in number, corresponding to the cardinality condition being satisfied at the point $l$ and $q-1$ other chosen points. The matrix $T$ corresponding to the approximate cardinal function coefficients is then used as a preconditioner in an appropriate Krylov subspace algorithm. In this case the complexity of the preconditioning step is $O(Nq^3)$; for $q \ll N$ and independent of $N$, this strategy, provided it converges rapidly, can lead to a fast algorithm.

Early versions of these algorithms were applied to relatively modest-sized problems, and the points at which the cardinality conditions were satisfied were chosen as neighbors of the point corresponding to the row index. This strategy was observed not to work for larger sets, and various modifications were tried. The algorithm of Faul et al. [13] presents a variation that works on all tried data sets. It is described in the next section.

Remark 2.1. The FMM applies to matrices whose entries can be characterized as radial basis functions. For smoothly varying functions, the approximate cardinal functions are likely to be good approximations to the true cardinal functions. As such, the idea of using approximate cardinal functions to construct preconditioners may be of practical use in a large number of problems where the FMM can be used to accelerate matrix-vector products, by using the radial basis function itself or, in the singular case, some appropriately mollified version of the FMM radial basis function.

2.2. Algorithm of Faul et al.

2.2.1. Choice of points for fitting cardinal functions. The FGP05 algorithm implements a Krylov method for the solution of the fitting equations (1.1). In this algorithm the point sets on which the approximate cardinal functions are constructed are chosen in a special way, so that a rapidly converging preconditioned conjugate gradient algorithm is achieved. This algorithm was extensively tested in [13], and its convergence properties established theoretically. We also tested our implementation of it in practice and found it to exhibit remarkably fast and robust convergence.

Each row of the matrix has associated with it a particular set of $q$ or fewer points. The $(N+1) \times (N+1)$ system of equations (2.4) can be expressed as an equivalent $(N-1) \times (N-1)$ system by eliminating $\lambda_N$ using

$$\lambda_N = -\sum_{j=1}^{N-1} \lambda_j.$$

To build a preconditioner we need to consider $N-1$ sets of approximate RBF interpolants and associated points, $\mathcal{L}_l$, $l = 1, \ldots, N-1$, which we call L-sets. Each set contains the point corresponding to the row index we are working with, together with other points. To choose the first of these sets, the pair of points closest to each other (over all inter-point distances) is found, and one of these two points is marked as the first point; $q-2$ further closest neighbors of the marked point are chosen, and the marked point is then removed from consideration. Among the remaining $N-1$ points, again the closest pair is found, one of its points is marked, and $q-2$ further closest neighbors of the marked point are chosen. At each step a marked point is selected from the closest pair, a further set of closest points is chosen, and a set for the marked point is built. In this way the algorithm proceeds to build sets of $q$ points, and the procedure continues until only $q$ points remain, which form the $(N-q+1)$th set. A further $q-2$ sets of points are constructed from these points by taking $q-1, q-2, \ldots, 2$ points from the remaining points and, in each case, setting one of them as the marked point and removing it. In this way a total of $N-1$ groups of points is established.

The procedure described above, though indicated as preferred, is not what is ultimately implemented in FGP05; rather, an approximate version is used. That procedure is nonetheless $O(N^2)$, as it involves the computation of pairwise distances between points. A further crucial aspect of the FGP05 procedure is that at each step the algorithm requires the deletion of the point under consideration, making the construction of the list of closest points a dynamic process.

Now, using these point sets, radial basis function fits of the Lagrange functions over these point sets, of the form

$$z_l(x) = \sum_{j \in \mathcal{L}_l} \zeta_{lj} \phi(|x - x_j|) + \text{const.}, \qquad z_l(x_j) = \delta_{lj}, \qquad l = 1, \ldots, N-1,$$

are constructed. The coefficients $\zeta_{lj}$ of the Lagrange function fits are used to define the search directions in the conjugate gradient algorithm described below. In this algorithm the function $\hat{z}_l$ ($z_l$ with the constant part dropped) is used:

$$\hat{z}_l(x) = \sum_{j \in \mathcal{L}_l} \zeta_{lj} \phi(|x - x_j|).$$

By construction,

$$\sum_{j \in \mathcal{L}_l} \zeta_{lj} = 0.$$
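Each such local fit is a small $(q+1) \times (q+1)$ bordered system of the same form as (2.4). A sketch follows (biharmonic kernel, 0-based indexing, with the set center stored first in the index list; these conventions are ours):

```python
import numpy as np

def local_cardinal_coeffs(x, L):
    """Coefficients zeta of the approximate cardinal function on one L-set
    (biharmonic kernel): z(x) = sum_{j in L} zeta_j |x - x_j| + const, with
    z = 1 at the set center L[0], z = 0 at the other points, sum(zeta) = 0."""
    pts = x[L]
    q = len(L)
    Phi = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    A = np.zeros((q + 1, q + 1))
    A[:q, :q] = Phi
    A[:q, q] = 1.0                 # constant column
    A[q, :q] = 1.0                 # zero-sum constraint row
    rhs = np.zeros(q + 1)
    rhs[0] = 1.0                   # Kronecker condition at the set center
    sol = np.linalg.solve(A, rhs)
    return sol[:q]                 # the constant sol[q] is dropped for z-hat
```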

2.2.2. Semi-norm and inner product. Given functions $s$ and $t$ at the data interpolation locations $\{x\}$, interpolated using radial basis functions as

$$s(x) = \sum_{j=1}^{N} \lambda_j \phi(|x - x_j|) + \alpha, \qquad t(x) = \sum_{i=1}^{N} \mu_i \phi(|x - x_i|) + \beta,$$

FGP05 define the semi-norm

$$\|s\|_\phi = \left(-\lambda^T \Phi \lambda\right)^{1/2}, \qquad \|t\|_\phi = \left(-\mu^T \Phi \mu\right)^{1/2},$$

and the inner product between two functions as

$$\langle s, t \rangle_\phi = \frac{1}{2}\left(\|s+t\|_\phi^2 - \|s\|_\phi^2 - \|t\|_\phi^2\right) = -\frac{1}{2}\left((\lambda+\mu)^T \Phi (\lambda+\mu) - \lambda^T \Phi \lambda - \mu^T \Phi \mu\right)$$

$$= -\frac{1}{2}\left(\lambda^T \Phi \mu + \mu^T \Phi \lambda\right) = -\sum_{j=1}^{N} \sum_{i=1}^{N} \lambda_i \phi(|x_i - x_j|) \mu_j = -\sum_{i=1}^{N} \lambda_i t(x_i) = -\sum_{j=1}^{N} \mu_j s(x_j)$$

(the last two equalities use the side conditions $\sum_j \lambda_j = \sum_i \mu_i = 0$, so the constants $\alpha$ and $\beta$ drop out). If the function values at the interpolation points are known, these inner products can be computed without any matrix-vector products, i.e., in $O(N)$ operations.
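In code this is a single dot product. A sketch under the zero-sum side conditions (function names ours):

```python
import numpy as np

def inner_product_phi(lam, t_at_x):
    """<s, t>_phi = -sum_i lam_i t(x_i): O(N) given the nodal values of t.
    Valid when sum(lam) = 0, so the constant part of t drops out."""
    return -float(np.dot(lam, t_at_x))

def norm_phi_sq(lam, s_at_x):
    """||s||_phi^2 = -lam^T Phi lam = -sum_j lam_j s(x_j) under sum(lam) = 0."""
    return -float(np.dot(lam, s_at_x))
```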


2.2.3. Krylov subspace. The Krylov subspace considered is the one induced by the operator $\Xi$ acting on functions $s$ that are expressible via an RBF expansion. This operator is defined using the $N-1$ approximations $\hat{z}_l$ to the cardinal functions centered at the $N-1$ points, and is given by

$$(\Xi s)(x) = \sum_{l=1}^{N-1} \frac{\langle \hat{z}_l, s \rangle_\phi}{\|\hat{z}_l\|_\phi^2}\, \hat{z}_l(x).$$

2.2.4. The preconditioned conjugate gradient iteration. Let $s^*$ be the function that is sought as the solution. Define an initial guess $s^{(1)}$ by

$$\lambda_j^{(1)} = 0, \quad j = 1, \ldots, N; \qquad \alpha_1 = \frac{1}{2}\left(\min(f) + \max(f)\right),$$

which leads to the initial residual

$$r_i^{(k)} = f_i - s^{(k)}(x_i), \qquad r = \{r_i\}. \qquad (2.5)$$

Two functions expressed as RBF sums are used in the iteration:

$$t^{(k)}(x) = \sum_{j=1}^{N} \tau_j^{(k)} \phi(|x - x_j|), \qquad \sum_{j=1}^{N} \tau_j^{(k)} = 0,$$

$$d^{(k)}(x) = \sum_{j=1}^{N} \delta_j^{(k)} \phi(|x - x_j|), \qquad \sum_{j=1}^{N} \delta_j^{(k)} = 0.$$

The first function, $t^{(k)}$, is constructed using the RBF approximations to the cardinal functions on the specially chosen FGP05 point sets and the values of the current residual at these points:

$$\mu_l^{(k)} = \frac{1}{\zeta_{ll}} \sum_{i \in \mathcal{L}_l} \zeta_{li}\, r_i^{(k)}, \qquad \tau_j^{(k)} = \sum_{l:\, j \in \mathcal{L}_l} \mu_l^{(k)} \zeta_{lj}.$$

It is shown that

$$t^{(k)} = \Xi\left(s^* - s^{(k)}\right),$$

while the second function, $d^{(k)}$, satisfies

$$d^{(1)} = t^{(1)}, \qquad d^{(k)} = t^{(k)} - \frac{\langle t^{(k)}, d^{(k-1)} \rangle_\phi}{\langle d^{(k-1)}, d^{(k-1)} \rangle_\phi}\, d^{(k-1)}.$$

Because of this definition, the conjugacy criterion holds:

$$\langle d^{(k)}, d^{(k-1)} \rangle_\phi = \langle t^{(k)}, d^{(k-1)} \rangle_\phi - \frac{\langle t^{(k)}, d^{(k-1)} \rangle_\phi}{\langle d^{(k-1)}, d^{(k-1)} \rangle_\phi}\, \langle d^{(k-1)}, d^{(k-1)} \rangle_\phi = 0.$$

The solution is advanced by a step length $\gamma_k$ along the conjugate direction $d^{(k)}$:

$$\lambda_j^{(k+1)} = \lambda_j^{(k)} + \gamma_k\, \delta_j^{(k)}, \qquad \gamma_k = \frac{\sum_{i=1}^{N} \delta_i^{(k)} r_i^{(k)}}{\sum_{i=1}^{N} \delta_i^{(k)} d^{(k)}(x_i)},$$


while the constant is updated to minimize the maximum element in the new residual vector. The choice of $\gamma_k$ can be shown to minimize $\|s^* - s^{(k+1)}\|_\phi$ along $d^{(k)}$.

Crucial to the implementation is the construction of the approximate cardinal functions. It is indicated in [13] that a Fortran implementation is available from Prof. Powell. Our Matlab implementation of the FGP05 algorithm is available online at http://www.umiacs.umd.edu/~ramani/fmm/FGP_RBF.

2.3. The fast multipole method for accelerating matrix-vector products. There are several papers and books that introduce the fast multipole method in detail [14, 15, 2], and our purpose is not to reproduce that literature here. However, certain details of the geometric data structures in the fast multipole method will be important in the initial computations necessary for the preconditioner. Our implementation of the FMM for polyharmonic radial basis functions is described in [19], where we also provide the operational and memory complexity.

The algorithm consists of two main parts: the preset step, which includes building the FMM data structure (building and storage of the neighbor lists, etc.) and the precomputation and storage of all translation data, and the run step, in which the matrix-vector products themselves are evaluated. The data structure is generated using the bit-interleaving technique described in [21], which enables spatial ordering, sorting, and bookmarking. While the FMM algorithm is designed for two independent data sets ($N$ arbitrarily located sources and $M$ arbitrary evaluation points), for the RBF fitting we have the same source and evaluation sets of length $N$. For a problem of size $N$, the cost of building the data structure based on spatial ordering is $O(N \log N)$, where the asymptotic constant is much smaller than the constants in the $O(N)$ asymptotics of the main algorithm. The number of levels can be set arbitrarily by the user or found automatically, based on the clustering parameter (the maximum number of sources in the smallest box), to optimize computations for problems of different size.

Of particular use in developing the $O(N \log N)$ algorithm for the preliminary steps are the octree data structures that are employed in the FMM. A good introduction to these is provided in Greengard's thesis [14]. Fast algorithms for computing neighbor lists, assigning points to boxes, and other operations necessary in the FMM using bit-interleaving and de-interleaving operations are provided in [21].

3. Fast algorithms. The original form of the FGP05 algorithm has a complexity of $O(N^2)$ for two reasons: first, the step of setting up the data structure requires $O(N^2)$ operations, due to the pairwise distance search used to form $O(N)$ subsets of size $O(q)$ for the approximation of the cardinal functions centered at each point; second, the multiplication of the $N \times N$ matrix by a length-$N$ vector also requires $O(N^2)$ operations. As mentioned before, the second step can be done using the FMM in $O(N \log N)$ operations. In the subsections below we describe efficient $O(N \log N)$ algorithms for both steps, which bring the FGP iterative method to $O(N \log N)$ complexity.

3.1. Data structure for the FGP05 algorithm. While determining the closest points in a data set is a relatively well studied problem in computational geometry, no standard computational geometry algorithm solves the particular problem required in the FGP05 algorithm. As mentioned earlier, a reason for the dramatic performance of the algorithm discussed in [13] is the careful selection of the points over which each cardinal function is constructed. As such a selection might be useful in the construction of preconditioners in other radial basis function applications with the FMM, we provide details of this algorithm below.


3.1.1. Problem statement. While the selection of the point sets has been described above in words, we provide a more precise mathematical statement here, mainly in the interest of establishing notation.

Given a set $X$ of $N$ points in $d$-dimensional Euclidean space, $x_i \in \mathbb{R}^d$, $i = 1, \ldots, N$, and $q < N$, build $N-1$ sets $\mathcal{L}_j \subset X$, $j = 1, \ldots, N-1$, satisfying the following conditions:

• For $j = 1, \ldots, N-q+1$ the sets $\mathcal{L}_j$ consist of $q$ points $x_{i_{j1}}, \ldots, x_{i_{jq}} \in X$, while for $j = N-q+2, \ldots, N-1$ the sets $\mathcal{L}_j$ consist of $N-j+1$ points $x_{i_{j1}}, \ldots, x_{i_{j,N-j+1}} \in X$. The number of entries in a set is called its "power", so $\mathrm{Pow}(\mathcal{L}_j) = q$ for $j = 1, \ldots, N-q+1$ and $\mathrm{Pow}(\mathcal{L}_j) = N-j+1$ for $j = N-q+2, \ldots, N-1$.

• Each set $\mathcal{L}_j$ is associated with a point $x_{i_{j1}} \in \mathcal{L}_j$, called the set center. All set centers are different: $i_{j1} \neq i_{k1}$ for $k \neq j$. Further, we introduce the sets of set centers $C_j = \{x_{i_{11}}, \ldots, x_{i_{j1}}\}$, $j = 1, \ldots, N-1$, $\mathrm{Pow}(C_j) = j$.

• The set $\mathcal{L}_1$ consists of points $x_{i_{11}}, \ldots, x_{i_{1q}} \in X$ such that

$$|x_{i_{11}} - x_{i_{12}}| = \min_{x_i, x_k \in X,\; i \neq k} |x_i - x_k|, \qquad (3.1)$$

$$|x_{i_{11}} - x_{i_{1l}}| \leq \min_{x_k \in X \setminus \mathcal{L}_1} |x_{i_{11}} - x_k| \quad \text{for } l = 3, \ldots, q. \qquad (3.2)$$

In other words, the distance between the center of this set and the second element is minimal among all pairwise distances in $X$. Besides the set center, the set consists of the $q-1$ points closest to the set center.

• The set $\mathcal{L}_j$, $j = 1, \ldots, N-1$, consists of points $x_{i_{j1}}, \ldots, x_{i_{jp}} \in X \setminus C_{j-1}$, $p = \mathrm{Pow}(\mathcal{L}_j)$, such that

$$|x_{i_{j1}} - x_{i_{j2}}| = \min_{x_i, x_k \in X \setminus C_{j-1},\; i \neq k} |x_i - x_k|, \qquad (3.3)$$

$$|x_{i_{j1}} - x_{i_{jl}}| \leq \min_{x_k \in X \setminus (\mathcal{L}_j \cup C_{j-1})} |x_{i_{j1}} - x_k| \quad \text{for } l = 3, \ldots, p. \qquad (3.4)$$

In other words, the distance between the center of this set and the second element is minimal among all pairwise distances in $X$ excluding the centers of the sets $\mathcal{L}_1, \ldots, \mathcal{L}_{j-1}$. Besides the set center, the set consists of the $q-1$ points closest to the set center for $j = 1, \ldots, N-q+1, and of the $N-j$ points closest to the set center for $j = N-q+2, \ldots, N-1$.

Remark 3.1. If we construct the complete inter-point distance matrix (requiring the computation of $N(N-1)/2$ distances), this problem can be solved simply. However, this requires $O(N^2)$ time and memory.

3.1.2. General algorithm. There exist fast closest-neighbor search algorithms in $d$-dimensional space with complexity $O(\log N)$ per query. These algorithms are based on hierarchical space subdivision using k-d trees or similar ideas [25]. The cost of building such a tree data structure is $O(N \log N)$. Furthermore, an initial spatial ordering of the data requires $O(N \log N)$ operations. However, these structures do not enable easy deletion of points.

Deletion is enabled in the heap data structure, which also has an $O(N \log N)$ construction cost [10]. Organization of the data using heaps enables $O(\log N)$ insertion/deletion operations for a single entry. However, even with these preliminary steps, the design of the actual algorithm requires several additional data structures to take care of the dynamic nature of the data (the deletion of a set center at each step). These additional structures account for the modifications of the initial set $X$ due to deletion operations and enable $O(\log N)$ complexity for any operation on the modified data set. Such performance is needed because these operations must be repeated $O(N)$ times as the cardinal point sets are built for the $N-1$ centers.

We assume the initial set $X$ consists of points randomly distributed over the computational domain, which can be enclosed by a $d$-dimensional cube of side $D$. In practice this set may not be random, as the data may come imbued with a certain order because of the way they were generated. Such an order may not be desirable for the performance of the preconditioner; this is why the authors of [13] suggested an initial random permutation of all input data. We follow this approach and, to avoid additional complexity in notation, assume that such a random permutation has been performed as a first step, so that the points of $X$ are already arranged in random order.

Let $X'$ be a spatially ordered set of the points of $X$. Spatial ordering in $d$ dimensions, when $d$ is not large (say $d = 2$ or $d = 3$, which is the case for most applications of the FMM), can be performed efficiently using the bit-interleaving technique. We use the same technique to generate the FMM quadtree or octree data structure; a detailed description of bit interleaving can be found in [22, 21]. The ordered set $X'$ can be characterized by a permutation index $\Pi$: if $x_i$ is the $i$th element of $X$, then the $j$th element of $X'$, where $j = \Pi(i)$, corresponds to the same point. Thus, as $\Pi$ is available, there is no need for additional memory to store $X'$. The spatially ordered list is needed to provide the following operations with complexity $O(\log N)$ for any point:

1. deletion, and
2. q-closest-point search in the subset of points remaining after the deletion of the previously considered set centers.

These procedures are described separately below.
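As an illustration of the bit-interleaving idea, here is a minimal 2D Morton-key sketch (our own toy version; the production data structure of [21] additionally handles level bookmarking and neighbor queries):

```python
def morton2(ix, iy, bits=16):
    """Interleave the bits of non-negative grid coordinates (ix, iy) into a
    single Morton key; sorting by key orders points along the Z-curve, so
    each quadtree box at each level is a contiguous run of the sorted list."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b) | ((iy >> b) & 1) << (2 * b + 1)
    return key

# e.g. for points scaled to [0, 1): key = morton2(int(px * 2**16), int(py * 2**16));
# the box containing a point at level l is given by the top 2*l bits of its key.
```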

The second virtual set we introduce is $H$, which consists of the elements of $X$ partially ordered according to some ranking criterion $R$ using the heap data structure [10]. This set can also be characterized completely by a permutation index $H$. We operate only with the first $k$ elements of $H$, which thus form a subset $H_k \subset H$. The top ($j = 1$) element of $H_k$ has the highest rank. Elements below it are arranged in a binary tree, with each leaf outranked by its parent (the "heap property"). The heap structure is needed to provide the following operations with complexity $O(\log N)$:

1. find the highest-ranked element;
2. delete the highest-ranked element;
3. change the rank of an element.

After performing any of these operations, the heap structure (characterized by the permutation index $H$) can be recovered at a cost of $O(\log N)$ operations. The ranking we introduce applies only to elements of $H_k$. The rank $R_i$ of element $x_i \in H_k$ is the nearness of that point to its closest neighbor, so the closest neighbor of the top element of $H_k$ is closer (or at the same distance) than the closest neighbors of the other points of $H_k$.

We need to introduce two more lists, or indices. The first list can be denoted $C$, where $C(i)$ is the index of the closest neighbor of $x_i$, if both these points are in $H_k$. The second list can be denoted $C^*$, where

$$C(C^*(j, i)) = i \quad \text{for } j = 1, \ldots, M_i^k. \qquad (3.5)$$

In some sense this mapping is the inverse of $C(i)$. However, as the closeness relationship is not unique, a point $x_i$ can be the closest neighbor of several points. Since this mapping is not one-to-one, $C^*(:, i)$ provides the indices of all $M_i^k$ points to which $x_i$ is the closest neighbor. To bound the size of this array (and the memory and operations), we need to bound the quantity $M_i^k$. There exists a strong bound $M_i^k \leq K(d)$, which shows that for a given distribution of points in $\mathbb{R}^d$ the total number of entries in the lists is $O(N)$. We present this result in the form of the following theorem.

Theorem 3.2. Let $X$ be a set of $N$ points in $\mathbb{R}^d$ ($d \geq 1$), with elements $x_i \in \mathbb{R}^d$, $i = 1, \ldots, N$. Let $C$ be the list of closest neighbors of the points $i = 1, \ldots, N$, and let $M_i$ be the number of entries of point $i$ in this list. Then $M_i \leq K(d)$, where $K(d)$ is the kissing number for dimensionality $d$.¹

¹The kissing number is defined as the maximum number of unit spheres that can surround a given unit sphere without intersection (touching at boundary points excluded) [30].

Proof. Without any loss of generality we can assign index $i = 1$ to a given point and indices $2, \ldots, M+1$ to the points for which this point is the closest neighbor. Denote $D_j = |x_j - x_1|$ and $D_{\max} = \max_{j=2,\ldots,M+1} D_j$. Consider then a sphere of radius $D_{\max}$ centered at $x_1$. To each point $x_j$ we associate the point $y_j$ on the sphere surface: $y_j = x_1 + (D_{\max}/D_j)(x_j - x_1)$, $j = 2, \ldots, M+1$. Let us show that $|y_j - y_k| \geq D_{\max}$ for $j, k = 2, \ldots, M+1$, $j \neq k$. Indeed, since for these $j$ and $k$ we have $|x_j - x_k| \geq \max(D_j, D_k)$ (otherwise $x_1$ is not the closest point to $x_j$ or $x_k$), we obtain

$$2(x_j - x_1) \cdot (x_k - x_1) = |x_j - x_1|^2 + |x_k - x_1|^2 - |x_j - x_k|^2 \leq D_j^2 + D_k^2 - \left[\max(D_j, D_k)\right]^2 \leq D_j D_k.$$

Therefore,

$$|y_j - y_k|^2 = |y_j - x_1|^2 + |y_k - x_1|^2 - 2(y_j - x_1) \cdot (y_k - x_1) = 2D_{\max}^2 - \frac{D_{\max}^2}{D_j D_k}\, 2(x_j - x_1) \cdot (x_k - x_1) \geq 2D_{\max}^2 - D_{\max}^2 = D_{\max}^2.$$

Now we have $M$ points on the sphere surface whose pairwise distances are not less than the radius of this sphere. Introducing spheres of radius $R = D_{\max}/2$ centered at $x_1$ and $y_j$, $j = 2, \ldots, M+1$, we see that this is equivalent to placing $M$ spheres of radius $R$ around a sphere of the same radius centered at $x_1$. It is well known that this number cannot exceed the kissing number.

Remark 3.3. The exact kissing number is known for dimensions $d \leq 9$ and $d = 24$, while for other dimensions bounds for $K(d)$ are established. In particular, $K(1) = 2$, $K(2) = 6$, $K(3) = 12$.

The procedure of building sets satisfying the above requirements can then be reduced to q-closest-neighbor searches and deletions of the found set centers from the data structures.

Initialization procedure.
• Build a hierarchical space subdivision using $2^d$-trees, and initialize the data structures for the neighbor search and deletion procedures.
• Set $k = N$; $M_i^k = 0$, $i = 1, \ldots, N$.
• For $j = 1 : N$
  1. Get the closest neighbor of point $j$ (let this be $i$). This is provided by the q-neighbor search procedure with $q = 1$.
  2. Set $C(j) = i$, $d(j) = |x_j - x_i|$, $M_i^k = M_i^k + 1$, $C^*(M_i^k, i) = j$.
• Based on the ranking $R = -d$, build the heap permutation index $H$.

Main algorithm.
• For $i = 1 : N-1$
  1. Find the size of set $\mathcal{L}_i$: $q' = \min(q, N - i + 1)$.

  2. Find the index of the set center: $j = H(1)$; this is the top-ranked point in the heap. Find the index of the closest neighbor of $j$: $m = C(j)$.
  3. Find all the entries of $\mathcal{L}_i$. The first entry is $x_j$, and the rest are its $q' - 1$ closest neighbors, provided by the q-neighbor search procedure with $q = q' - 1$.
  4. Delete point $j$ from the hierarchical data structure for neighbor search. Delete point $j$ from the heap. The latter procedure sets the heap length to $k = k - 1$ and updates the heap permutation index $H$.
  5. Modify $M_m^k$ and $C^*(M_m^k, m)$. As $m$ is the closest neighbor of $j$, point $j$ is listed in $C^*(:, m)$; so set $M_m^k = M_m^k - 1$ and squeeze $C^*(:, m)$ to exclude $j$ (this changes the indexing in $C^*(:, m)$).
  6. Update the lists $C$ and $C^*$ and the heap permutation index $H$ as follows.
  7. For $n = 1 : M_j^k$
     (a) Get the closest neighbor, $s$, of point $p = C^*(n, j)$. This is provided by the q-neighbor search procedure with $q = 1$.
     (b) Set $C(p) = s$, $d(p) = |x_p - x_s|$, $M_s^k = M_s^k + 1$, $C^*(M_s^k, s) = p$.
     (c) Update the heap permutation index for the new rank of point $p$, $R(p) = -d(p)$.
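For concreteness, here is a brute-force reference implementation of the set construction just described; full distance matrices stand in for the tree and heap machinery, so this sketch is $O(N^3)$ and is meant only to pin down the output that the fast $O(N \log N)$ version must reproduce:

```python
import numpy as np

def build_L_sets(x, q):
    """Brute-force reference construction of the L-sets (center stored first).
    Not the fast algorithm: it only defines the result to be reproduced."""
    N = len(x)
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    alive = list(range(N))
    L_sets = []
    for j in range(N - 1):
        pts = np.array(alive)
        sub = D[np.ix_(pts, pts)]
        np.fill_diagonal(sub, np.inf)
        # the set center is one point of the closest remaining pair ...
        row = np.unravel_index(np.argmin(sub), sub.shape)[0]
        center = int(pts[row])
        p = min(q, len(alive))                  # Pow(L_j)
        near = np.argsort(sub[row])             # ... the rest are its nearest neighbors
        L_sets.append([center] + [int(pts[m]) for m in near[:p - 1]])
        alive.remove(center)                    # a chosen center is never reused
    return L_sets
```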

3.1.3. Data structure for the deletion operation and q-closest neighbor search using $2^d$-trees. For fast neighbor search we employ $2^d$-tree data structures, which for $d = 2$ and $d = 3$ are known as the quadtree and octree, respectively [27]. In the $2^d$-tree we have one box at level 0 (a cube enclosing all points of $X$), $2^d$ cubical boxes at level 1 obtained by halving the initial (parent) box along each dimension, and so on up to level $l_{\max}$, where we have $2^{l_{\max} d}$ boxes. As the data are sorted using the bit-interleaving technique, they are easily boxed due to the hierarchical numbering system [21], which also provides a fast way to obtain the parents and neighbors of all boxes. Of course, not all boxes may be occupied, and "empty" boxes are not listed in the data structure. The selection of $l_{\max}$ is dictated by the parameter $q$; one possibility is to define it as the level at which the finest box contains no more than $q+1$ points. For every box we maintain a list of the points residing in it. This list is dynamic and is provided by an additional data structure described below. The same holds for the list of non-empty boxes neighboring a given box. For the neighbor search procedure we use so-called 2-neighborhoods [22], which are cubic domains consisting of at most $5^d$ boxes, where $5^d - 1$ neighbor boxes surround the central (given) box.

The selection of 2-neighborhoods is motivated by the space dimensions $d = 2$ and $d = 3$, which are the focus of the present paper. Assume that there are $q+1$ points located in some box; then for $d = 2, 3$ we can guarantee that, for any point of that box, its $q$ closest (Euclidean) neighbors are located inside the 2-neighborhood. Indeed, the maximum distance between such a point and its $q$th closest neighbor does not exceed $D d^{1/2}$, where $D$ is the box size. At the same time, the minimum distance from this point to the boundary of the 2-neighborhood is $2D$. As $d^{1/2} \leq 2$ (which also means that the 2-neighborhood can be used for $d = 4$), we have such a guarantee and can use the search algorithm described below.

A complication is that the general algorithm above needs an efficient deletion procedure, so that the search required to form the $i$th set $\mathcal{L}_i$ is performed only on the set $X \setminus C_{i-1}$. We resolve this issue by introducing linked lists [10] for all points and for the boxes in the neighborhood of each box. The point linked list, $L_p(2, N)$, is an array where $L_p(1, i)$ is the index of the point of the ordered set $X'$ preceding point $i$ and $L_p(2, i)$ is the index of the point following $i$. This list is initialized by setting $L_p(1, i) = i - 1$ and $L_p(2, i) = i + 1$ and is updated each time some point $i$ is deleted. Note that all points residing in box $j$ can be characterized by the indices of the first and last points in that box (bookmarks), as the point linked list is available.
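A sketch of this point linked list in array form follows (0-based indices and the sentinel values are our choices):

```python
import numpy as np

def init_point_links(N):
    """Point linked list in array form: Lp[0, i] is the predecessor of point i
    in the spatially ordered set and Lp[1, i] its successor (-1 / N = none)."""
    Lp = np.empty((2, N), dtype=np.int64)
    Lp[0] = np.arange(N) - 1
    Lp[1] = np.arange(N) + 1
    return Lp

def delete_point(Lp, i):
    """Unlink point i in O(1) by reconnecting its two neighbors."""
    a, b = Lp[0, i], Lp[1, i]
    if a >= 0:
        Lp[1, a] = b
    if b < Lp.shape[1]:
        Lp[0, b] = a
```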

The data structure for handling the neighborhoods is the following. For each box $i$ we have a list of the boxes in its 2-neighborhood, $N_{neigh}(:, i)$, where the first index takes values from 1 to $5^d$. The boxes in this neighborhood are linked in a similar way as the points of the original set. This is provided by the linked lists $L_n(:, i)$, where $L_n(1, i)$ is the relative index of the first element, $L_n(2, i)$ is the link from the first element to the second, and so on. This array is initialized as $L_n(m, i) = m$, $m = 1, \ldots, N_n(i)$, $i = 1, \ldots, N_{b,tot}$, where $N_n(i)$ is the number of boxes in the neighborhood of box $i$ and $N_{b,tot}$ is the total number of boxes. For convenience we also maintain the arrays $N_p(i)$, the number of points in box $i$; $B(j)$, the index of the box containing point $j$; and $N_{pn}(i)$, the total number of points in the neighborhood of box $i$.

With these arrays, the deletion of point $j$ (index in $X'$) and the q-closest neighbor search for point $j$ (index in $X'$) can be described by the following algorithms.

Deletion procedure.
• Set $i = B(j)$; $n_p = N_p(i) - 1$; exit if $n_p < 0$ (nothing to delete); else:
• Update the linked list $L_p$: with $a = L_p(1, j)$ and $b = L_p(2, j)$, set $L_p(2, a) = b$ and $L_p(1, b) = a$ (this should be modified for the first and last points; we also update the bookmarks if the deleted point is the first or last point in its box); set $N_p(i) = n_p$.
• Set $l = l_{\max}$ and, while $l \geq 1$ and $n_p = 0$, perform the following loop (this procedure excludes all boxes that become empty due to the point deletion):
  1. For $m = 1 : N_n(i)$ (go through all boxes in the neighborhood of box $i$):
     (a) $p = N_{neigh}(L_n(m, i), i)$, $p \neq i$; set $N_{pn}(p) = N_{pn}(p) - 1$ and update the linked list $L_n(:, p)$.
  2. Set $l = l - 1$; $i = \mathrm{Parent}(i)$; $n_p = N_p(i) - 1$; $N_p(i) = n_p$ (also check/update the bookmarks for box $i$ if $n_p > 0$).

For the sake of clarity we omit the details of our procedure for updating $L_n(:, p)$, which are rather obvious, as one should reconnect the broken links caused by the exclusion of box $i$. Note that the major complexity of this algorithm is due to the loop over all boxes that have a given box $i$ in their neighborhood; this requires at most $(5^d - 1)^2$ simple operations (repairing links). It may also appear that this operation is needed for all parent boxes. If $l_{\max} \sim \log N$, the total cost will be $O(\log N)$, with the remark that updates of the neighborhood data structure are needed only when the deletion of a point results in an empty box, which then should be deleted; this situation occurs relatively rarely. Otherwise the deletion of a point is computationally cheap, as it requires just the repair of a broken link (2 operations) and a check/update of the bookmarks (2 operations).

q-closest neighbor search.
• Set $i = B(j)$, $l = l_{\max}$, $k = N_p(i)$.
• While $k < q + 1$ and $l \geq 1$, perform the following loop (this determines the level at which the box containing point $j$ has at least $q + 1$ entries):
  1. $l = l - 1$; $i = \mathrm{Parent}(i)$; $k = N_p(i)$.
• Determine $D_{\max}$ and $D_{\min}$, the maximum and minimum distances from $j$ to the boundary of box $i$.
• Determine all distances from point $j$ to the points in box $i$ other than $j$. Put them into the distance array $dist$ and sort it in ascending order.
• If $dist(q) > D_{\min}$, perform the following procedure for the boxes $p \neq i$ from the 2-neighborhood of $i$:
  1. Determine $D'_{\min}$, the minimum distance from $j$ to the boundary of box $p$.
  2. If $D'_{\min} < D_{\max}$, determine all distances from point $j$ to the points in box $p$, merge them into the array $dist$, and sort $dist$ in ascending order.

An informal description of the algorithm is the following. First, we try to find a box big enough to contain $q$ neighbors of point $j$; for this purpose we pass to the parent of the current box whenever it does not contain a sufficient number of points. This procedure is finite (at most $\sim l_{\max} \lesssim \log N$ steps). Once such a box is found, we introduce the two threshold distances $D_{\max}$ and $D_{\min}$ defined above. It may happen that the $q$ points of box $i$ closest to $j$ are all closer than $D_{\min}$; in this case there is no need to check points from the neighborhood, since they are farther away than the $q$ points that already solve the problem. On the other hand, we know that there are at least $q$ neighbors located at distances smaller than $D_{\max}$, hence there is no need to check boxes of the neighborhood with $D'_{\min} > D_{\max}$. It is remarkable how substantially this constraint reduces the number of boxes to check: for $d = 2$ we check at most 11 neighbor boxes instead of 24, and for $d = 3$ we have 55 instead of 124 ($5^d - 1$). The numbers in practice are even smaller, as not all points are located in the worst position, and the statistical means can be substantially lower.

The above algorithm does not have guaranteed $O(\log N)$ complexity for a single point search and an arbitrary distribution. For very non-uniform distributions additional research is needed, as the cost of the search can vary substantially from point to point, and it is not clear theoretically what the overall complexity of the above adaptive scheme is.² However, guarantees can be provided for uniform and some other typical distributions. Indeed, in the case of a uniform distribution we can assume that all boxes in the neighborhood have approximately the same number of points, $\sim q$. Therefore the search is performed on a set of size $O(5^d q)$, which does not depend on $N$, and so the $\log N$ complexity arises only from the first step of the procedure (in practice, for larger $q$, this step is cheaper than the computation of the distances due to its small asymptotic constant).

²This remark applies equally to the FMM, where for carefully contrived distributions the $O(N \log N)$ complexity may not be achieved.

3.2. Fast multipole acceleration of the matrix vector product.

3.2.1. Polyharmonic kernels in 3D. An approach for the efficient computation of sums with the biharmonic kernel in $\mathbb{R}^3$, $\phi(r) = r$, using the FMM is described in detail in [19], where we also show how the method extends to polyharmonic kernels in $\mathbb{R}^3$, $\phi(r) = r^{2n+1}$. The major result there can be formulated as follows: the FMM for the kernels $\phi(r) = r^{2n+1}$, $n = 0, 1, \ldots$, can be efficiently reduced to the application of $n+2$ FMMs for the Laplace equation in $\mathbb{R}^3$ (with kernel $\phi(r) = r^{-1}$), for which the FMM is well developed and studied. In practice the cost for the polyharmonic kernel is even smaller, as the same data structure is used, direct summation in the neighborhood is needed only once, and only a slight modification of the translation operators is required.

The efficiency of the FMM is determined by the length of the function representation, which for the polyharmonic kernel $\phi(r) = r^{2n+1}$ can be taken as $(n+2)p^2$, where $p$ is the truncation number for the equivalent Laplace equation determined by the prescribed accuracy of computations, and by the cost of a single translation, which for the fastest practical methods available is $O(p^3)$ (with different asymptotic constants [9, 20]); $O(p^2 \log p)$ methods with larger asymptotic constants are also available, e.g. [16, 12]. In the tests of the present paper we employed our version for summations with the biharmonic kernel [19], where we used a rotational-coaxial-translation decomposition of the translation operators, which results in $O(p^3)$ complexity for a single translation.

3.2.2. A novel FMM for the multiquadric kernel in $\mathbb{R}^2$. A fast multipole method for the multiquadric kernel, $\phi(r) = (r^2 + c^2)^{1/2}$, in $\mathbb{R}^n$ was presented in [11]. While that algorithm can be used for the multiquadric function in $\mathbb{R}^2$, we present a new FMM-based approach to evaluate sums with the multiquadric kernel.

First, we note that the representation of general functions via expansion coefficients or samples in two dimensions requires double sums, i.e., an $O(p^2)$ function representation length for a truncation number $p$ (e.g., via truncated Taylor series expansions). In contrast, functions constrained to satisfy some partial differential equation, such as harmonic functions, can be expressed with just $O(p)$ coefficients in $\mathbb{R}^2$. General translation operators for functions with an $O(p^2)$ representation can then be represented via a $p^2 \times p^2$ truncated matrix, which results in a computational cost of $O(p^4)$ per translation. For large $p$, methods that exploit the convolutional structure of the translation can use the Fast Fourier Transform to reduce the asymptotic complexity to $O(p^2 \log p)$. However, these methods are in practice not useful for typical problem sizes, and further suffer an inefficiency as they require zero padding (see [16] for a discussion of this issue). The proposed method also uses an $O(p^2)$ representation; however, the translation cost can be reduced to $O(p^3)$, which is faster than the direct translation for $p \geq 3$.

The essence of the method is to extend the space dimensionality from 2 to 3 and to separate the source and receiver points, even though in $\mathbb{R}^2$ these points can be the same. Indeed, let $x_i$ and $y_j$ be a source and a receiver in two dimensions. Their interaction is described by

$$\phi(|y_j - x_i|) = \left(|y_j - x_i|^2 + c^2\right)^{1/2} = |y'_j - x'_i|; \qquad y'_j = y_j + c\, i_z, \quad x'_i = x_i + 0\, i_z,$$

where $y'_j$ and $x'_i$ are points in $\mathbb{R}^3$ and $i_z$ is the unit vector orthogonal to the original plane on which the $x_i$ and $y_j$ are located. This transform leaves the sources on the original plane, while it shifts all evaluation points by a constant displacement $c$, so the $y'_j$ lie on a plane parallel to the original one (the case $c = 0$ is a particular, but not special, case). Therefore the problem of computing sums with the multiquadric kernel in $\mathbb{R}^2$ is equivalent to computing sums with the biharmonic kernel in $\mathbb{R}^3$ with sources and receivers at special locations (placed on parallel planes).

Remark 3.4. The use of an increased dimensionality should not frighten us; in fact, it increases neither the memory nor the computational cost. Indeed, in the FMM that we use, all "empty" boxes are skipped.

Remark 3.5. As the sets of sources and receivers are separated by a distance of at least $c$, there is no sense in subdividing space to levels where the box size is less than $c/2$ (if the division goes deeper, there will be no sources in the neighborhoods of the receivers, and the computational cost will only increase due to the additional translation operations).


Remark 3.6. This method can also be successfully applied to the computation of sums with the inverse multiquadric kernel $\phi(r) = (r^2 + c^2)^{-1/2}$ in $\mathbb{R}^2$. Sums involving this kernel can be reduced to a summation with the Laplacian kernel $1/r$ in $\mathbb{R}^3$.

4. Numerical tests. To check the accuracy and performance of the fast algorithms and compare them with [13], we implemented the described algorithms and the FGP05 algorithm in Fortran 90 in double-precision arithmetic and performed tests on an Intel Xeon 3.2 GHz processor with 3.5 GB RAM. Performance tests were conducted using two basic random point distributions for $\mathbb{R}^2$ (with the multiquadric kernel) and $\mathbb{R}^3$ (with the biharmonic kernel). The parameter $c$ for the multiquadric kernel was varied from 0 to a value based on the characteristic distance between neighboring points. The first distribution coincides with test case A of [13]: points uniformly distributed inside a $d$-dimensional unit sphere; we refer to this case as "A". The second distribution is substantially non-uniform: the points were uniformly distributed over the surface of a $d$-dimensional unit sphere, and so lie on a manifold of dimension lower than $d$. Such distributions are typical for the interpolation of curves and surfaces; we refer to this case as "E". In both cases the entries of the right-hand-side vector were generated as random numbers uniformly distributed on $[-1, 1]$. We also performed several illustrative computations for standard problems appearing in 2D and 3D computer graphics.

4.1. Preconditioner tests. The new elements in the present algorithm relate both to the construction of the preconditioner (the approximate cardinal functions) and to the approximate FMM-based matrix-vector multiplication. We first studied the effect of the new preconditioner in isolation, in cases where the matrix-vector product was performed in the straightforward way.

The first test was to ensure that our implementation of the original FGP05 algorithm is correct; we found that we were able to produce results consistent with those reported in [13]. We next modified the algorithm to use the L-sets produced by our algorithm for selecting the cardinal function interpolation points. The difference between our algorithm and that of [13] is that we impose the selection of the closest points exactly, whereas it is done approximately there. Table 4.1 shows the number of iterations required to converge to an accuracy $\epsilon = 10^{-10}$ for test case A for different $N$ and $q$. For each case we did several tests to find averages; the variation from test to test was rather small (usually within 1 iteration), so we report the values for a single random realization. The stability of the number of iterations across realizations was also pointed out in [13].

The results for the case $d = 2$, $c = 0$ are very close to those reported in [13] (we also checked the $q = 20$ and $q = 40$ reported in [13], with similar results, but do not show them here to conserve space). As one can see from the table, the numbers of iterations for the preconditioners produced by the two algorithms are almost the same.

The time required to build the L-sets and compute the approximate cardinal functions is almost insensitive to the parameter $c$ of the multiquadric kernel, while it does depend on the point distribution. Table 4.2 shows the CPU time (in seconds) required to build the preconditioner using the original and the present algorithms for four cases (A and E, $d = 2$ and $d = 3$) with $q = 30$ and different $N$. These results are also presented in Figure 4.1. As before, we report the timing for a single random realization; several entries of the table were recomputed several times to ensure that the variations in timing are small (within a couple of percent).


Table 4.1
Number of iterations required for test problem A of [13], implemented using the cardinal point set algorithm of [13] (FGP05) and our algorithm (Present).

Problem            N        q=10             q=30             q=50
                            Present  FGP05   Present  FGP05   Present  FGP05
d=2, c=0           200      15       16      8        8       7        7
d=2, c=0           500      18       19      9        9       8        7
d=2, c=0           1000     18       20      10       9       8        8
d=2, c=0           2000     21       22      10       10      9        9
d=2, c=0           5000     23       26      11       11      10       10
d=2, c=0           10000    25       27      13       12      11       10
d=2, c=N^(-1/2)    200      20       20      8        8       7        7
d=2, c=N^(-1/2)    500      23       24      11       10      9        8
d=2, c=N^(-1/2)    1000     25       26      11       11      9        9
d=2, c=N^(-1/2)    2000     27       28      11       12      9        10
d=2, c=N^(-1/2)    5000     31       33      12       12      11       10
d=2, c=N^(-1/2)    10000    35       36      13       13      11       11
d=3, c=0           200      22       23      11       12      8        9
d=3, c=0           500      29       30      14       14      10       11
d=3, c=0           1000     36       39      17       16      12       11
d=3, c=0           2000     41       44      19       19      13       13
d=3, c=0           5000     57       58      23       23      15       15
d=3, c=0           10000    68       74      26       26      17       17

Table 4.2
CPU time (s) required for the initial computations (construction of the preconditioner, q = 30) in the FGP05 algorithm and the present algorithm. Values in parentheses are estimates.

N                  300   1000  3000  10000  30000  100000  300000   1000000
(A) d=2  Present   0.2   0.7   2.1   7.2    22     81      256      854
(A) d=2  FGP05     0.1   0.5   2.2   17     157    2445    (19900)  (221100)
(A) d=3  Present   0.2   0.8   3.0   12     43     184     623      2130
(A) d=3  FGP05     0.1   0.5   2.1   15     139    2211    (19900)  (221100)
(E) d=2  Present   0.2   0.6   1.8   6      18     62      188      623
(E) d=2  FGP05     0.1   0.5   2.4   18     177    2816    (19900)  (221100)
(E) d=3  Present   0.2   0.8   2.7   9.2    32     110     330      1125
(E) d=3  FGP05     0.1   0.6   2.4   17     155    2451    (19900)  (221100)

It is clearly seen that for $N < N_b$, where $N_b$ is of the order of thousands, the original algorithm performs better, as we incur additional overhead due to the more complex data structure needed in our algorithm. However, for $N > N_b$ the execution time of the original algorithm scales as $O(N^2)$, while the present algorithm scales almost linearly with $N$. The scaling constant and the value of $N$ at which the present algorithm starts to perform linearly depend on the problem dimensionality and the point distribution; this problem dependence is stronger for the present algorithm than for the original one. We note also that, due to the large time required by the original algorithm for large $N$, we did not compute its values for $N > 100000$ and instead give estimated times, based on the fit line shown in the figure, in parentheses. It is seen that for large $N$ the ratio of the times for the two algorithms can reach several orders of magnitude.


[Figure 4.1 here: log-log plot of the time to build the preconditioner (s) versus the number of points N, for problems A and E, d = 2 and 3.]

Fig. 4.1. Dependence of the CPU time for construction of the preconditioner (q = 30) for the FGP05 algorithm (light markers) and the present algorithm (dark markers) for 4 different problems. The solid line shows a linear function (y = bx) and the dashed line a quadratic function (y = ax^2) in the log-log coordinates.

Table 4.1 shows that the number of iterations decreases as the size $q$ of the L-sets increases. Thus it may be preferable to use a larger $q$ to reduce the iteration time. On the other hand, the time required to build the L-sets and compute the cardinal function approximations increases with $q$, which is the limiting factor on using larger $q$. This time depends on $N$ and the particular distribution. Theoretically, the time to build the L-sets for a fixed distribution and $N$ should grow almost linearly with $q$ (asymptotically as $q \log q$; the logarithmic factor appears due to the sorting procedure used for the determination of the closest neighbors). The time to solve a system of $q$ equations to build the approximate cardinal function grows as $\sim q^3$, as we used a standard LAPACK symmetric solver routine for this purpose. Figure 4.2 presents actual computation times for problems A and E with $d = 2$ and $3$ at $N = 100000$. It is seen that the time depends on the problem dimensionality, the largest times being obtained for $d = 3$. We note also that the effective dimensionality of problem E for large $N$ is $d-1$ rather than $d$, which explains the smaller times observed for problem E compared to problem A.

4.2. Tests of the algorithm with the FMM. Speeding up the matrix-vector product using approximate methods, such as the FMM, raises several issues concerning the accuracy, convergence, and speed of the algorithm. The accuracy of the FMM is controlled by the truncation number p. For multiquadric functions the error of the matrix-vector product in the L∞ norm, ε_N, decays exponentially with growing p, ε_N ∼ N η^{-p}, assuming that all terms contribute to the error in the worst way (no error cancellation). Actual errors are usually orders of magnitude lower than the theoretical bounds, and depend on the problem dimensionality and the particular point distribution.



[Figure: time to build the preconditioner (s) versus the power of the L-sets q (q = 10 to 90), for problems A and E with d = 2 and d = 3, at N = 100000.]

Fig. 4.2. Dependence of the time required to construct the preconditioner for problems A and E on the size q of the L-sets. The number of points in all cases is the same, N = 100000.

Concerning convergence of the iterative algorithm, we found empirically that for a given prescribed accuracy there exist numbers p∗(N) such that for p > p∗(N) the number of iterations needed to converge is the same as when direct matrix-vector multiplication is used; here p∗ depends on the problem. Moreover, the norms of the residuals in this process do not depend on p. Despite this, because of the conditioning of the matrix, the iteration may converge to a solution that is not exact. The error of this solution can be evaluated for smaller N, since we can check it using one straightforward matrix-vector multiplication to evaluate the residual, and the results can be extrapolated to larger N, since we know qualitatively the dependence of the matrix-vector product error on N and p. Of course, the error criterion used to stop the iterative process should be selected consistently with this error, and this is an item of current research (see, e.g., [28]). This whole issue is also application dependent, since often the data to be fit are noisy [8], and the fitting equations may themselves be regularized.
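As a minimal sketch of how a truncation number might be chosen from the worst-case model ε_N ∼ N η^{-p}, with an illustrative decay rate η that is an assumption rather than a value from this paper:

```python
import math

def truncation_number(N, eps, eta=2.0):
    """Smallest integer p with N * eta**(-p) <= eps.  The decay rate
    eta > 1 depends on the FMM geometry; eta = 2.0 is an assumption
    used only for illustration."""
    return max(1, math.ceil(math.log(N / eps) / math.log(eta)))

# p grows like log N at fixed eps, e.g.
# truncation_number(10**3, 1e-3) == 20, truncation_number(10**6, 1e-3) == 30
```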

Note that tuning (optimizing) the overall algorithm is not a simple procedure. Indeed, for some prescribed error, the truncation number should be a function of N (e.g., p ∼ log CN). Based on p, the optimal level of space subdivision, lmax, for the FMM should be selected. The complexity of the FMM at the optimal level depends on the complexity of the translation procedures. For example, for the rotational-coaxial translation decomposition, in some range of p for three-dimensional problems, this can be O(p^{3/2}N) = O(N log^{3/2} N).



N →                                1000   3000   10000   30000   100000   300000    1000000
(A) d = 2, c = 0:        Present   1.1    3.0    13      43      163      618       2512
                         FGP05     0.7    4.1    42      424     5477     (49320)   (547700)
(A) d = 2, c = N^{-1/2}: Present   1.1    3.7    15      50      217      740       3319
                         FGP05     0.8    4.6    47      430     6463     (49320)   (547700)
(A) d = 3, c = 0:        Present   1.9    5.5    36      126     736      3460      18660
                         FGP05     0.7    5.5    61      634     5248     (49320)   (547700)
(E) d = 3, c = 0:        Present   1.2    3.6    15      69      324      1152      5332
                         FGP05     0.8    4.1    38      384     5486     (49320)   (547700)

Table 4.3. Total CPU time (s) required by the present and FGP05 algorithms.

Furthermore, one should select an appropriate q for the preconditioner. As the time to build the preconditioner grows with q while the number of iterations decays with q, one can expect a minimum of the cost. We found that there can be several local minima (e.g., we found four for N = 100000, d = 2, c = 0, problem A in the range 15 ≤ q ≤ 30). The dependence of the number of iterations on N is also important. An accurate treatment of these issues is outside the scope of this paper, and further fundamental studies of the effect of FMM-induced matrix perturbations on the accuracy of the overall solution are required; we therefore limit ourselves to presenting several results of the algorithm's performance for some test cases. Further, Faul et al. [13] indicate that their intuition is that the choice of q should be independent of the problem size. A sketch of the cost trade-off in q appears below.
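The sketch below makes the trade-off concrete; the three cost callables are hypothetical stand-ins for measured timings, not models from the paper.

```python
def best_q(candidates, build_time, n_iterations, iteration_time):
    """Scan candidate q values and return the one minimizing
    total cost = (preconditioner build time) + (iterations) * (time per
    iteration).  A full scan is used because, as observed in the text,
    the cost surface can have several local minima."""
    costs = {q: build_time(q) + n_iterations(q) * iteration_time(q)
             for q in candidates}
    return min(costs, key=costs.get)

# Illustrative use with made-up cost models:
q_opt = best_q(range(10, 101, 5),
               build_time=lambda q: 1e-4 * (q**3 + 30 * q),   # ~q^3 solves
               n_iterations=lambda q: 200 / q + 8,            # decays with q
               iteration_time=lambda q: 2.0)                  # matvec-dominated
```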

Table 4.3 shows the CPU times required for computation of some test problems of different sizes. Here the CPU time includes both the time to compute the preconditioner and the time to run the iterative process. In all cases the prescribed error was ε = 10^{-3}, and p was increased with increasing N. For example, for test case A, d = 2, c = N^{-1/2},

we changed p from 16 to 28 as N changed from 1000 to 1000000. The parameter q was 30 in all cases (the number suggested by [13]), which provided an acceptably small number of iterations, except in case A, d = 3, c = 0, where a substantial growth in the number of iterations was observed for N > 100000. For these cases we set q = 100, which provided a substantially smaller number of iterations and overall savings in the computational time.

Results presented in the table are also plotted in Figure 4.3. The table shows that for all examples the cross-over point between the FGP05 and the present algorithm did not exceed N = 3000. One can also see that for larger N the CPU time for FGP05 in the computed range can be fitted by an O(N^2) dependence (in fact, it should be larger than this estimate, as the number of iterations does grow slightly with N). We used this fit to estimate the CPU time for the FGP05 algorithm for N > 100000, which in fact slightly underestimates the time; these estimates are shown in Table 4.3 in parentheses. For the present algorithm the data can be fitted by an O(N^α) dependence (one can also use fits of the type O(N log^β N), which should hold if the number of iterations grows with N as some power of log N). The constant α for our computations varies from α = 1.15 to α = 1.4, and, as we mentioned, there are two basic reasons for this: growth of the truncation number p and of the number of iterations. Some computational overhead for larger N also contributes to the total CPU time reported. As



we mentioned above, the optimization problem can be solved more accurately than by the heuristics we used for the illustration cases. In any case, one can see that substantial speed-ups are achieved using the present method, which makes computations for million-point data sets practical (larger problems can be computed using parallel computation and larger memory).
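The exponents quoted above can be reproduced from the tabulated timings with an ordinary log-log least-squares fit; the helper below is ours, applied here to one column of Table 4.3.

```python
import numpy as np

def fit_exponent(N, t):
    """Slope alpha of the least-squares fit log t = alpha*log N + log C,
    i.e. the exponent in t ~ C * N**alpha."""
    alpha, logC = np.polyfit(np.log(N), np.log(t), 1)
    return alpha, np.exp(logC)

# Present algorithm, case (E) d = 3, c = 0 from Table 4.3:
N = np.array([1000, 3000, 10000, 30000, 100000, 300000, 1000000])
t = np.array([1.2, 3.6, 15.0, 69.0, 324.0, 1152.0, 5332.0])
alpha, C = fit_exponent(N, t)   # alpha falls in the reported 1.15-1.4 range
```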

[Figure: total CPU time (s) versus number of points N, log-log scale, for cases A, d=2, c=0; A, d=2, c=c(N); A, d=3, c=0; E, d=3, c=0; FGP'05 and present (preconditioner + matrix-vector product) timings, with reference lines y = bx, y = ax^2, y = cx^1.4, and y = dx^1.15.]

Fig. 4.3. The total CPU time required for the solution of different problems. The empty markers show computations using the algorithm FGP05, while the filled markers show computations using the present algorithm. The straight lines with different slopes are plotted for convenience of data analysis.

Performance of the algorithms for higher-accuracy computations is shown in Table 4.4. Here, for the FMM computations, we set p = 31 for all N, and the iterative process for both algorithms was terminated when the residual error reached ε = 10^{-8}. Other parameters were the same as for the cases reported in Table 4.3. For higher accuracy, more iterations are needed, and the times reported in Table 4.4 are larger than those in Table 4.3. Further, the FMM-based algorithm becomes slower due to the increase of p. This shifts the cross-over point between the algorithms towards higher values. Nonetheless, the cross-over point for the two-dimensional case A is about N = 3000, while for the three-dimensional case E it is somewhere between 3000 and 10000, so the present algorithm is more efficient in terms of time than FGP05 for problems of size N > 10000.

The effect of the parameter c in the multiquadric RBF on the convergence of the iterative processes using the present method (fast preconditioner + FMM with p = 30) and the original FGP05 algorithm is shown in Figure 4.4. This parameter has the dimension of length: for large values of c it forces the function toward a constant (and thus insensitive to its neighbors), while for small values of c it acts as a regularization parameter for derivative evaluation.



N →                                1000   3000   10000   30000   100000   300000    1000000
(A) d = 2, c = N^{-1/2}: Present   2.3    6.4    28      100     390      1443      6103
                         FGP05     0.9    6.2    64      665     8696     (76500)   (850000)
(E) d = 3, c = 0:        Present   8.5    13     38      161     841      3002      9460
                         FGP05     0.9    5.7    59      575     8087     (76500)   (850000)

Table 4.4. Total CPU time (s) required by the present and FGP05 algorithms for higher-accuracy computations.

It is seen that, while there are some differences in the norm of the residual after a given number of iterations, both methods perform approximately the same in terms of the dependence of the number of iterations on c (while the present algorithm was faster). As c increases with q fixed, the number of iterations grows. This growth is not substantial when c is of the order of the characteristic distance between the data points, but it is significant for larger c. Moreover, we found that if c is larger than some critical value, the process appears not to converge (at least within a hundred iterations or so). This critical c depends on q, and one recipe for computing interpolants with larger c is to use a larger q, which is also seen from Table 4.1. The same behavior is observed for the original FGP05 algorithm. A detailed study of the optimization of the parameters c, q, and p that provide the best performance of the present algorithm would be interesting.

Finally, Figure 4.5 provides illustrations of the solution of some problems from 3D and 2D computer graphics using multiquadric RBFs and the present algorithm. In the first example, the bunny model (a 34834-point set) was extended to 104502 points (by adding points along the normals, inside and outside the surface); after the interpolation coefficients were determined, the interpolant was evaluated at ∼ 8000000 points (a regular 201 × 201 × 201 spatial grid), from which isosurfaces were extracted using standard routines. Here we used the same settings as for problem A with d = 3 and c = 0. The computation of the preconditioner took 310 s and the iteration process 283 s. Evaluation was performed using the FMM (which required one matrix-vector multiplication) and took 395 s.
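For orientation, the final evaluation step computes s(x) = Σ_j λ_j φ(|x − x_j|) (plus the low-degree polynomial part) at the grid points. A naive, chunked direct summation is sketched below; it costs O(NM) for M evaluation points and is precisely the step that the single FMM pass replaces. The function is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def evaluate_interpolant(grid_points, centers, lam, phi=lambda r: r, chunk=4096):
    """Direct sum s(x) = sum_j lam_j * phi(|x - x_j|) over all grid points,
    processed in chunks to bound memory.  The polynomial part P(x), if
    present, must be added separately."""
    out = np.empty(len(grid_points))
    for a in range(0, len(grid_points), chunk):
        block = grid_points[a:a + chunk]                       # (m, d)
        r = np.linalg.norm(block[:, None, :] - centers[None, :, :], axis=-1)
        out[a:a + chunk] = phi(r) @ lam                        # (m, N) @ (N,)
    return out
```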

In the second example, 86% of the 375050 pixels in a 2D color image were removed randomly. The remaining (N = 51204) pixels were used to interpolate the data back to the grid using the 2D multiquadric kernel with c = N^{-1/2}. The results of the recovery are presented in the figure. Here we used the same settings as for problem A with d = 2 and c = N^{-1/2}. The preconditioner needs to be built only once, as it is suitable for all three RGB channels; this took about 30 s. The iteration process took 75 s per channel, and the final matrix-vector multiplication took 19 s. Note that the three-digit accuracy that we imposed on the solution is sufficient to determine the RGB intensities, as they vary between 0 and 255 for each channel.
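An end-to-end version of this experiment can be sketched with SciPy's RBFInterpolator as a dense stand-in for our fast solver (SciPy's 'multiquadric' kernel is φ(εr) = −sqrt(1 + (εr)^2), equivalent to ours up to scaling with ε = 1/c). The dense solve does not scale like the present O(N log N) method; the sketch fixes only the formulation, and the function name is ours.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator  # dense stand-in, not FMM-based

def inpaint(image, known_mask):
    """Recover an H x W x 3 image from its known pixels with a 2D
    multiquadric RBF, shape parameter c = N**-0.5; one interpolant
    serves all three RGB channels, mirroring the shared preconditioner."""
    H, W, _ = image.shape
    yy, xx = np.nonzero(known_mask)
    pts = np.column_stack([yy, xx]) / max(H, W)     # points in ~unit square
    vals = image[yy, xx].astype(float)              # (N, 3) channel values
    c = len(pts) ** -0.5
    rbf = RBFInterpolator(pts, vals, kernel='multiquadric', epsilon=1.0 / c)
    gy, gx = np.mgrid[0:H, 0:W]
    grid = np.column_stack([gy.ravel(), gx.ravel()]) / max(H, W)
    return rbf(grid).reshape(H, W, 3).clip(0, 255)
```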

5. Discussion and conclusion. For many dense linear systems of size N, an algorithm based on the fast multipole method (FMM) provides a matrix-vector product with time and memory complexity O(N log N). This reduces the cost of the matrix-vector product step in the iterative solution of linear systems, and for an iteration requiring k steps a complexity of O(kN log N) can be expected. To bound k, appropriate preconditioning strategies must be used.



[Figure: norm of the residual versus iteration number for case A, d = 2, N = 10000, q = 30, with c = 0.01, 0.03, and 0.05, for the FGP'05 and present algorithms.]

Fig. 4.4. Convergence of the iterative processes using the present algorithm (dark markers) and the original FGP05 algorithm (light markers) for different values of the parameter c in the multiquadric RBF.

However, many conventional preconditioning strategies rely on sparsity in the matrix, and applying them to these dense matrices requires computations that may have a formal time or memory complexity of O(N^2), which negates the advantage of the FMM. With our construction, all steps in the FGP05 algorithm have been accelerated to have a similar formal complexity, and in practice the iteration converges in relatively few steps. The success of this preconditioner based on the geometry of point sets, for a matrix whose product can be accelerated using the FMM, should not be surprising, as the FMM itself is based on the use of computational geometry to partition the matrix-vector product. A schematic of the resulting preconditioned iteration is sketched below.
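In the sketch, a black-box (e.g. FMM-backed) matrix-vector product and the sparse approximate-cardinal-function preconditioner are wired into a standard Krylov solver. The names follow our earlier snippets and are illustrative; the paper's iteration follows the specific Krylov recursion of Faul et al. [13] rather than stock GMRES.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, gmres

def solve_rbf_system(matvec, lsets, zetas, f, tol=1e-3):
    """matvec(x): black-box product Phi @ x (an FMM pass in the fast
    version).  (lsets, zetas): L-set indices and approximate cardinal
    coefficients, assembled into a sparse approximate inverse M."""
    N, q = lsets.shape
    rows = np.repeat(np.arange(N), q)
    M = csr_matrix((zetas.ravel(), (rows, lsets.ravel())), shape=(N, N))
    A = LinearOperator((N, N), matvec=matvec)
    Minv = LinearOperator((N, N), matvec=lambda r: M @ r)
    lam, info = gmres(A, f, M=Minv, rtol=tol)   # rtol: SciPy >= 1.12 keyword
    return lam, info
```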

The idea of constructing preconditioned Krylov algorithms based on the cardinal functions may have wider applicability. For any regular FMM kernel function, as the discussion in Section 2.1 shows, we can construct an approximate inverse based on cardinal functions expressed in terms of the kernel function. This idea bears further exploration.

6. Acknowledgements. We would like to thank Prof. David Mount for providing us with some pointers to material on computational geometry.

[Figure: top, the bunny model (original unstructured data and the RBF interpolation evaluated on a regular spatial grid); bottom, the original image, the image after 86% random data loss, and the RBF reconstruction.]

Fig. 4.5. Illustrations of use of the RBF for 3D and 2D computer graphics problems.

REFERENCES

[1] R.K. Beatson and M.J.D. Powell, “An iterative method for thin plate spline interpolation that employs approximations to Lagrange functions,” Numerical Analysis 1993 (D. F. Griffiths & G. A. Watson, eds.), London: Longmans, pp. 17—39, 1994.

[2] R. K. Beatson and L. Greengard. “A short course on fast multipole methods.” In M. Ainsworth,J. Levesley, W.A. Light, and M. Marletta, editors, Wavelets, Multilevel Methods and El-liptic PDEs, pages 1—37. Oxford University Press, 1997.

[3] R. K. Beatson, J. B. Cherrie, and C. T. Mouat, “Fast fitting of radial basis functions: methodsbased on preconditioned GMRES iteration,” Advances in Comput. Math., 11, 1999, 253—270.

[4] R. K. Beatson, W. A. Light, S. Billings “Fast solution of the radial basis function interpolationequations: domain decomposition methods,” SIAM Journal on Scientific Computing, Vol.22, pp. 1717-1740, 2000

[5] R.K. Beatson, J.B. Cherrie, and D.L. Ragozin, “Polyharmonic splines in Rd: tools for fast evaluation,” in: A. Cohen, C. Rabut, and L. Schumaker (eds.), Curve and Surface Fitting: Saint-Malo 1999, Vanderbilt University Press, Nashville, TN, 2000.

[6] R. K. Beatson, J. B. Cherrie, and D. L. Ragozin, “Fast evaluation of radial basis functions:Methods for four-dimensional polyharmonic splines”, SIAM J. Math. Anal., vol. 32, no. 6,pp. 1272-1310, 2001.

[7] J. Bloomenthal, C. Bajaj, J. Blinn, M.P.Cani-Gascuel, A. Rockwood, B. Wyvill, and G. Wyvill,Introduction to Implicit Surfaces, Morgan Kaufmann Publishers, San Francisco, CA, 1997.

[8] J.C. Carr, R.K. Beatson, J.B. Cherrie, T.J. Mitchell, W.R. Fright, B.C. McCallum, and T.R.Evans, “Reconstruction and representation of 3D objects with radial basis functions,” ACMSIGGRAPH 2001, Los Angeles, CA, 2001, 67-76.

[9] H. Cheng, L. Greengard and V. Rokhlin, “A fast adaptive multipole algorithm in three dimen-sions,” J. Comput. Phys. vol. 155, pp. 468—498, 1999.

[10] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, MITPress, 2004.

[11] J. B. Cherrie, R. K. Beatson, and G. N. Newsam, “Fast evaluation of radial basis functions:Methods for generalized multiquadrics in Rn”, SIAM J. Sci. Comput. , vol. 23, no. 5, pp.1549-1571, 2002.

[12] W. D. Elliott and J. A. Board. Fast Fourier transform accelerated fast multipole algorithm.SIAM Journal on Scientific Computing, 17(2):398—415, 1996.

[13] A. C. Faul, G. Goodsell, M. J. D. Powell, “A Krylov subspace algorithm for multiquadric interpolation in many dimensions,” IMA Journal of Numerical Analysis (2005) 25, 1—24.

[14] L. Greengard and V. Rokhlin, “A fast algorithm for particle simulations,” J. Comput. Phys., 73, 325-348, 1987.

[15] L. Greengard, The Rapid Evaluation of Potential Fields in Particle Systems, MIT Press, Cambridge, MA, 1988.

[16] L. Greengard and V. Rokhlin, “On the efficient implementation of the fast multipole algorithm,” Tech. Rep. RR-602, Department of Computer Science, Yale University, 1988.

[17] L. Greengard and J. Strain, “The fast Gauss transform,” SIAM J. Sci. Statist. Comput., 12, 79-94, 1991.

[18] G. Goodsell, “On finding p-th nearest neighbours of scattered points in two dimensions for small p,” Computer Aided Geometric Design, vol. 17, pp. 387-392, 2000.

[19] N.A. Gumerov and R. Duraiswami, “Fast multipole method for the biharmonic equation in three dimensions,” Journal of Computational Physics, vol. 215, no. 1, pp. 363-383, 2006.

[20] N.A. Gumerov and R. Duraiswami, “Comparison of the efficiency of translation operators usedin the fast multipole method for the 3D Laplace equation,” University of Maryland De-partment of Computer Science Technical Report, CS-TR-4701; UMIACS Technical Report,UMIACS-TR—2005-09, 2005.

[21] N.A. Gumerov and R. Duraiswami, Fast Multipole Methods for the Helmholtz Equation inThree Dimensions, Elsevier, Oxford, UK, 2005.

[22] N.A. Gumerov, Ramani Duraiswami, and Eugene A. Borovikov, “Data structures, optimalchoice of parameters, and complexity results for generalized multilevel fast multipole meth-ods in d dimensions,” University of Maryland Department of Computer Science TechnicalReport, CS-TR-4458, 2003.

[23] R. L. Hardy, “Theory and applications of the multiquadric-biharmonic method,” Comput.Math. Appl., 19, 163—208, 1990.

[24] C. A. Micchelli, “Interpolation of scattered data: distance matrices and conditionally positivedefinite functions.” Constr. Approx., 2, 11—22, 1986.

[25] S. Maneewongvatana and D. M. Mount, “Analysis of Approximate Nearest Neighbor Searchingwith Clustered Point Sets, ” in Data Structures, Near Neighbor Searches, and Methodol-ogy: Fifth and Sixth DIMACS Implementation Challenges, eds. M. H. Goldwasser, D. S.Johnson, C. C. McGeoch, in the DIMACS Series in Discr. Math. and Theoret. Comp. Sci.,Vol. 59, AMS, 105-123, 2002.

[26] V.C. Raykar, C. Yang, R. Duraiswami, and N.A. Gumerov. “Fast computation of sums ofGaussians in high dimensions,” Technical Report CS-TR-4767, Department of ComputerScience, University of Maryland, College Park, 2005.

[27] H. Samet, The Design and Analysis of Spatial Data Structures, Addison—Wesley, New York,1989.

[28] V. Simoncini and D.B. Szyld, “Theory of inexact Krylov subspace methods and applicationsto scientific computing,” SIAM J. Sci. Comput., Vol. 25, pp. 454—477, (2003)

[29] G. Turk and J.F. O’Brien, Modelling with implicit surfaces that interpolate, ACM Trans. onGraphics, 21(4), 2002, 855-873.

[30] E.W. Weisstein. “Kissing Number.” From MathWorld—A Wolfram Web Resource.http://mathworld.wolfram.com/KissingNumber.html
