Intrinsic Regression Models for Medial Representation of Subcortical Structures*

Xiaoyan Shi, Hongtu Zhu, Joseph G. Ibrahim, Faming Liang, Jeffrey Lieberman, and Martin Styner

*Address for correspondence and reprints: Hongtu Zhu, Ph.D., [email protected]. Department of Biostatistics, Gillings School of Global Public Health, 3109 McGavran-Greenberg Hall, Campus Box 7420, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7420. H. Zhu is Professor of Biostatistics (E-mail: [email protected]), J. G. Ibrahim is Alumni Distinguished Professor of Biostatistics (E-mail: [email protected]), and X. Shi was a Ph.D. student (E-mail: [email protected]), Department of Biostatistics and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, NC 27599-7420. F. Liang is Professor of Statistics (E-mail: [email protected]), Department of Statistics, Texas A&M University, College Station, TX 77843-3143. Jeffrey Lieberman is Lawrence C. Kolb Professor of Psychiatry (E-mail: [email protected]), Department of Psychiatry, Columbia University Medical Center, 1051 Riverside Drive, New York, New York 10032, U.S.A. M. Styner is Assistant Professor (E-mail: [email protected]), Department of Computer Science and Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599. This work was supported in part by NIH grants UL1-RR025747-01, R21AG033387, P01CA142538-01, MH086633, GM 70335, and CA 74015 to Drs. Zhu and Ibrahim, DMS-1007457 and DMS-1106494 to Dr. Liang, and Lilly Research Laboratories, the UNC NDRC HD 03110, Eli Lilly grant F1D-MC-X252, and NIH Roadmap Grant U54 EB005149-01, NAMIC to Dr. Styner. We thank the Editor, an associate editor, and two referees for helpful suggestions, which have improved the present form of this article. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF or the NIH.


Abstract

The aim of this paper is to develop a semiparametric model for describing the variability of the medial representation of subcortical structures, which belongs to a Riemannian manifold, and establishing its association with covariates of interest, such as diagnostic status, age, and gender. We develop a two-stage estimation procedure to calculate the parameter estimates. The first stage is to calculate an intrinsic least squares estimator of the parameter vector using the annealing evolutionary stochastic approximation Monte Carlo algorithm, and the second stage is to construct a set of estimating equations to obtain a more efficient estimate with the intrinsic least squares estimate as the starting point. We use Wald statistics to test linear hypotheses of unknown parameters and establish their limiting distributions. Simulation studies are used to evaluate the accuracy of our parameter estimates and the finite sample performance of the Wald statistics. We apply our methods to the detection of the difference in the morphological changes of the left and right hippocampi between schizophrenia patients and healthy controls using medial shape description.

Keywords: Intrinsic least squares estimator; Medial representation; Semiparametric model; Wald statistic.

1 Introduction

The medial representation of subcortical structures provides a useful framework for describing shape variability in local thickness, bending, and widening of subcortical structures (Fletcher et al., 2004). In the medial representation framework, a geometric object is represented as a set of connected continuous medial primitives, called medial atoms. See Figure 1 for a hippocampus example. For 3-dimensional objects, these medial atoms are formed by the centers of the inscribed spheres and by the associated spokes from the sphere centers to the two respective tangent points on the object boundary. Specifically, a medial atom m = (O^T, r, s0^T, s1^T)^T is formed by a position O, the center of the inscribed sphere; a radius r, the common spoke length; and (s0, s1), the two unit spoke directions (Pizer et al., 2003; Styner et al., 2004). A medial atom can be regarded as a point on a Riemannian manifold, M(1) = R^3 × R^+ × S^2 × S^2, where S^2 is the sphere in R^3 with radius one. A medial representation model consisting of K medial atoms can be described as the direct product of K copies of M(1), i.e., M(1)^K = ∏_{i=1}^K M(1). The existing statistical analytical methods for the medial representation include principal geodesic analysis, the estimation of extrinsic and intrinsic means, and a permutation test for comparing medial representation data from two groups (Fletcher et al., 2004). The scientific interests of some neuroimaging studies, however, typically focus on establishing the association between subcortical structure and a set of covariates, particularly diagnostic status, age, and gender, thus requiring a regression modeling framework for medial representation.

There are several challenging issues, including multiple directions on S^2 and the complex correlation structure among different components of M(1), in developing medial representation regression models with a set of covariates. Although there is a sparse literature on regression modeling of a single directional response and a set of covariates of interest (Mardia and Jupp, 1983; Jupp and Mardia, 1989), these regression models of directional data are based on particular parametric distributions, such as the von Mises-Fisher distribution (Mardia, 1975; Mardia and Jupp, 1983; Presnell et al., 1998). For instance, existing circular regression models assume that the angular response follows the von Mises-Fisher distribution with either the angular mean ηi or the concentration parameter κi being associated with the covariates xi (Gould, 1969; Johnson and Wehrly, 1978; Fisher and Lee, 1992). However, it remains unknown whether it is appropriate to directly apply these parametric models for a single directional measure to simultaneously characterize the two spoke directions at each atom, which are correlated. Moreover, the two spoke directions may be correlated with other components of each atom, and this provides further challenges in developing a parametric model that simultaneously models all components of each atom of the medial representation.

Figure 1: (a) A medial representation model m = (O^T, r, s0^T, s1^T)^T at an atom, where O is the center of the inscribed sphere, r is the common spoke length, and (s0, s1) are the two unit spoke directions; (b) a skeleton of a hippocampus with 24 medial atoms; (c) the smoothed surface of the hippocampus.

The rest of this paper is organized as follows. In Section 2, we formulate the semiparametric regression model and introduce the two-stage estimation procedure for estimating the regression coefficients. We then establish asymptotic properties of our estimates and develop Wald statistics to carry out hypothesis testing. Simulation studies in Section 3 are used to assess the finite sample performance of the parameter estimates and Wald test statistics. In Section 4, we illustrate the application of our statistical methods to the detection of the difference in morphological changes of the hippocampi between schizophrenia patients and healthy controls in a neuroimaging study of schizophrenia.

2 Theory

2.1 Inverse link functions

Suppose we have an exogenous q × 1 covariate vector xi and a medial representation for a particular subcortical structure, denoted by Mi = {mi(d) : d ∈ D}, for the i-th subject, where d represents an atom of the medial representation. For notational simplicity, we temporarily drop atom d from our notation. We formally introduce a semiparametric regression model for medial representation responses and covariates of interest from n subjects. The regression model involves modeling a conditional mean of a medial representation response mi at an atom given xi, denoted by µi(β) = µ(xi, β), where β is a p × 1 vector of regression coefficients in B ⊂ R^p. Thus, µ(·, ·) is a map from R^q × R^p to M(1), and µi(β) = (µoi(β)^T, µri(β), µ0i(β)^T, µ1i(β)^T)^T, which is a 10 × 1 vector, where µoi(β), µri(β), µ0i(β), and µ1i(β) are the 'conditional means' of the location Oi, the radius ri, and the two spoke directions s0i and s1i, respectively, given xi, for the i-th subject. Note that for spoke directions, we borrow the term conditional mean for random variables in Euclidean space.

We need to formalize the notion of conditional mean explicitly. For the location component of a medial representation, we may set µoi(β) = (g1(xi, β1), g2(xi, β2), g3(xi, β3))^T, where gk(·, ·) is a known inverse link function and βk is a pk × 1 coefficient vector for k = 1, 2, 3. There are many different ways of specifying gk(xi, βk). The simplest one is the linear inverse link function gk(xi, βk) = xi^T βk. We may also represent gk(xi, βk) as a linear combination of basis functions {ψj(xi) : j = 1, . . . , J}, such as B-splines, that is, gk(xi, βk) = ∑_{j=1}^J ψj(xi) βkj, in which βkj is the j-th component of βk. In this way, we can approximate a nonlinear function of xi using the linear combination of basis functions. For the radius component, we may use µri(β) = g4(xi, β4), where β4 is a p4 × 1 coefficient vector for a medial representation radius. Since a radius is always positive, a natural inverse link function is g4(xi, β4) = exp(xi^T β4), among other possible choices.
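As a minimal illustration, the location and radius inverse link functions above can be evaluated directly; the covariate values and coefficients below are hypothetical.

```python
import numpy as np

def mu_location(x, B_loc):
    # Linear inverse link for the 3-D location: each coordinate is x^T beta_k.
    # B_loc is a q x 3 matrix whose k-th column is beta_k.
    return x @ B_loc

def mu_radius(x, beta4):
    # Exponential inverse link keeps the radius strictly positive.
    return np.exp(x @ beta4)

x = np.array([1.0, 0.5, -0.2])       # hypothetical covariate vector (q = 3)
B_loc = np.eye(3)                    # hypothetical beta_1, beta_2, beta_3
beta4 = np.array([0.1, 0.2, 0.0])    # hypothetical beta_4

print(mu_location(x, B_loc))         # location mean in R^3
print(mu_radius(x, beta4))           # radius mean, always positive
```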

As the two spoke directions at each atom of a medial representation are spherical responses, we develop a link function µ0i(β) ∈ S^2 for the first spoke direction at a specific atom for notational simplicity. Let xi,d be a qd × 1 vector of all the discrete covariates, let xi,c be a qc × 1 vector of all the continuous covariates and their potential interactions with xi,d, let β5d and β5c be the regression parameters corresponding to xi,d and xi,c, respectively, and let β5 contain all unknown parameters in β5d and β5c. From now on, all covariates have been centered to have mean zero. We assume that all first spoke directions associated with the same discrete covariate vector xi,d are concentrated around a center on the sphere given by

g5(xi,d, β5d) = (sin(θ(xi,d)) cos(φ(xi,d)), sin(θ(xi,d)) sin(φ(xi,d)), cos(θ(xi,d)))^T,   (1)

where θ(xi,d) and φ(xi,d) are, respectively, the colatitude and the longitude, and β5d includes all unknown parameters θ(xi,d) and φ(xi,d) for different xi,d.
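Equation (1) is the usual spherical-coordinate parameterization of a unit vector; a small sketch (the angle values are hypothetical):

```python
import numpy as np

def g5(theta, phi):
    # Map colatitude theta and longitude phi to a point on the unit sphere S^2,
    # as in equation (1).
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

center = g5(theta=np.pi / 4, phi=np.pi / 3)   # hypothetical angles
print(center, np.linalg.norm(center))          # always a unit vector
```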

We then describe the stereographic projection of µ0i(β) onto the plane with base point g5(xi,d, β5d), denoted by Tst;g5(xi,d,β5d)(µ0i(β)) (Downs, 2003). A graphic illustration of the stereographic projection T^{-1}st;(0,0,1)(u, v, −1) is given in Figure 2 (a). The stereographic projection Tst;g5(xi,d,β5d)(µ0i(β)) is defined as the point of intersection of the plane passing through g5(xi,d, β5d) with normal vector g5(xi,d, β5d), which is given by g5(xi,d, β5d)^T {(u, v, w)^T − g5(xi,d, β5d)} = 0 for (u, v, w) ∈ R^3, and the line passing through −g5(xi,d, β5d) and µ0i(β): µ0i(β) − t{g5(xi,d, β5d) + µ0i(β)} for t ∈ (−∞, ∞). With some calculation, it can be shown that Tst;g5(xi,d,β5d)(µ0i(β)) is given by

Tst;g5(xi,d,β5d)(µ0i(β)) = [2µ0i(β) − g5(xi,d, β5d){µ0i(β)^T g5(xi,d, β5d) − 1}] / [1 + µ0i(β)^T g5(xi,d, β5d)].

Let R be a rotation matrix in SO(3), i.e., R^T = R^{-1} and det(R) = 1, where det(R) denotes the determinant of R and SO(3) is the set of 3 × 3 rotation matrices. By applying the rotation matrix R to both g5(xi,d, β5d) and µ0i(β), we have

Tst;Rg5(xi,d,β5d)(Rµ0i(β)) = R Tst;g5(xi,d,β5d)(µ0i(β)).   (2)

We consider a specific rotation matrix for rotating s1 = (s1,u, s1,v, s1,w)^T ∈ S^2 to s2 = (s2,u, s2,v, s2,w)^T ∈ S^2, denoted by Rs1,s2, such that Rs1,s2 s1 = s2. We need to calculate η = arccos(s1^T s2) = arccos(s1,us2,u + s1,vs2,v + s1,ws2,w) and s3 = s1 × s2 / ‖s1 × s2‖ = (s3,u, s3,v, s3,w)^T, where s1 × s2 = (s1,vs2,w − s1,ws2,v, s1,ws2,u − s1,us2,w, s1,us2,v − s1,vs2,u)^T and ‖·‖ is the Euclidean norm of a vector. Then, Rs1,s2 is given by

[ s3,u^2 cη + cos(η)          s3,us3,v cη − s3,w sin(η)   s3,us3,w cη + s3,v sin(η) ]
[ s3,us3,v cη + s3,w sin(η)   s3,v^2 cη + cos(η)          s3,vs3,w cη − s3,u sin(η) ]
[ s3,us3,w cη − s3,v sin(η)   s3,vs3,w cη + s3,u sin(η)   s3,w^2 cη + cos(η) ]   (3)

where cη = 1 − cos(η).
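Equation (3) is the Rodrigues rotation formula about the axis s3 = s1 × s2 / ‖s1 × s2‖ through the angle η; a sketch (the two unit vectors are hypothetical, and the identical/antipodal cases are not handled):

```python
import numpy as np

def rotation_between(s1, s2):
    # Rodrigues rotation R_{s1,s2} from equation (3): rotates unit vector s1
    # onto unit vector s2 about the axis s3 = s1 x s2 / ||s1 x s2||.
    eta = np.arccos(np.clip(s1 @ s2, -1.0, 1.0))
    axis = np.cross(s1, s2)
    s3 = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -s3[2], s3[1]],
                  [s3[2], 0.0, -s3[0]],
                  [-s3[1], s3[0], 0.0]])        # cross-product matrix of s3
    c_eta = 1.0 - np.cos(eta)                   # c_eta in the text
    return np.cos(eta) * np.eye(3) + np.sin(eta) * K + c_eta * np.outer(s3, s3)

s1 = np.array([1.0, 0.0, 0.0])                  # hypothetical spoke direction
s2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)   # hypothetical target direction
R = rotation_between(s1, s2)
print(R @ s1)                                   # equals s2 up to rounding
```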

The inverse link function µ0i(β) is explicitly given as follows. By letting R = Rg5(xi,d,β5d),(0,0,−1)^T in (2), in which (0, 0, −1)^T is the south pole of S^2, we have

Tst;(0,0,−1)^T(Rg5(xi,d,β5d),(0,0,−1)^T µ0i(β)) = Rg5(xi,d,β5d),(0,0,−1)^T Tst;g5(xi,d,β5d)(µ0i(β)).   (4)

We assume that

Tst;(0,0,−1)^T(Rg5(xi,d,β5d),(0,0,−1)^T µ0i(β)) = (xi,c^T β5c, −1)^T,   (5)

where β5c is a qc × 2 matrix. Let T^{-1}st;(0,0,−1)^T be the inverse map of the stereographic projection, mapping from the plane with base point (0, 0, −1) back to S^2, such that

T^{-1}st;(0,0,−1)^T((u, v, −1)) = (4u/(u^2 + v^2 + 4), 4v/(u^2 + v^2 + 4), (u^2 + v^2 − 4)/(u^2 + v^2 + 4)).

See Figure 2 (a) for details. Since Rg5(xi,d,β5d),(0,0,−1)^T ∈ SO(3), the inverse link function µ0i(β) is given by

µ0i(β) = R(0,0,−1)^T,g5(xi,d,β5d) T^{-1}st;(0,0,−1)^T((xi,c^T β5c, −1)^T).   (6)
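The forward projection onto the plane through the south pole and the inverse map above can be checked numerically; a sketch assuming the base point (0, 0, −1) (points near the north pole, where the projection blows up, are avoided):

```python
import numpy as np

SOUTH = np.array([0.0, 0.0, -1.0])

def stereo_south(m):
    # Project a unit vector m onto the plane w = -1 through the south pole,
    # along the line through the north pole and m (the formula before eq. (2),
    # specialized to base point SOUTH).
    d = 1.0 + m @ SOUTH
    return (2.0 * m - SOUTH * (m @ SOUTH - 1.0)) / d

def stereo_south_inv(u, v):
    # Inverse stereographic projection from the plane point (u, v, -1) back to S^2.
    s = u * u + v * v + 4.0
    return np.array([4.0 * u / s, 4.0 * v / s, (u * u + v * v - 4.0) / s])

m = np.array([0.6, 0.0, -0.8])        # hypothetical point on S^2
u, v, w = stereo_south(m)
print(w)                               # the projection lives on the plane w = -1
print(stereo_south_inv(u, v))          # recovers m
```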

When β5c = 0, indicating no continuous covariate effect, µ0i(β) reduces to g5(xi,d, β5d). Similarly, for the second spoke direction, we introduce β6d and β6c as the regression parameters corresponding to xi,d and xi,c, respectively, and then we define g6(xi,d, β6d) and µ1i(β), respectively, as the center associated with the same discrete covariate vector xi,d and the inverse link function, by following (1) and (6). We have discussed various inverse link functions for µ(xi, β), but these link functions can be misspecified for a given data set. To avoid such misspecification, we may estimate these inverse link functions nonparametrically; this is a topic for future research.

2.2 Intrinsic regression model

Now, we introduce a definition of a residual to ensure that µi(β) is the proper conditional mean of mi given xi. For instance, in a classical linear model, the response is the sum of the regression function and the residual, and the conditional mean of the response equals the regression function. Given two points mi and µi(β) on the manifold, we need to define the residual or difference between them. At µi(β), we have the tangent space of M(1), denoted by Tµi(β)M(1), which is a Euclidean space representing a first-order approximation of the manifold M(1) near µi(β). We calculate the projection of mi onto Tµi(β)M(1), denoted by Lµi(β)(mi), as follows:

Lµi(β)(mi) = (Oi − µoi(β), log(ri/µri(β)), Lµ0i(β)(s0i)^T, Lµ1i(β)(s1i)^T)^T,   (7)

where Lµki(β)(ski) = arccos(µki(β)^T ski) s̃ki/‖s̃ki‖, in which s̃ki = ski − {µki(β)^T ski} µki(β) for k = 0, 1. Thus, Lµi(β)(mi) can be regarded as the residual or difference between mi and µi(β) in Tµi(β)M(1). Geometrically, Lµi(β)(mi) is associated with the Riemannian Exponential and Logarithm maps on M(1).

We introduce the Riemannian Exponential and Logarithm maps on M(1). Let the tangent vector θ = (θo, θr, θs0, θs1)^T ∈ TmM(1), where θo ∈ R^3 is the location tangent component, θr ∈ R is the radius tangent component, and θs0 and θs1 ∈ R^3 are the two directional tangent components. Let γm(t; θ) be the geodesic on M(1) passing through γm(0; θ) = m ∈ M(1) in the direction of the tangent vector θ ∈ TmM(1). The Riemannian Exponential map, denoted by Expm(·), maps the tangent vector θ at m to a point m1 ∈ M(1), and Expm(θ) = γm(1; θ). The Riemannian Logarithm map, denoted by Lm(m1), maps m1 ∈ M(1) onto the tangent vector θ = Lm(m1) ∈ TmM(1). The Riemannian Exponential map and Logarithm map are inverses of each other, that is, Expm(Lm(m1)) = m1.

Because a medial representation is the product space of several spaces, the Riemannian Exponential/Logarithm map for M(1) is the product of the Riemannian Exponential/Logarithm maps for each space. Let m = (O^T, r, s0^T, s1^T)^T and m1 = (O1^T, r1, s0,1^T, s1,1^T)^T be two points in M(1) and θ ∈ TmM(1). We give the explicit form of the Exponential and Logarithm maps for each space of interest. For the space of locations, Expo(θo) = O + θo and Lo(O1) = O1 − O. For the space of radiuses, Expr(θr) = r exp(θr) and Lr(r1) = log(r1/r). For the space S^2, Exps0(θs0) = cos(‖θs0‖2) s0 + sin(‖θs0‖2) θs0/‖θs0‖2. Let s̃0,1 = s0,1 − (s0^T s0,1) s0 ≠ 0. If s0 and s0,1 are not antipodal (s0 ≠ −s0,1), we can get Ls0(s0,1) = arccos(s0^T s0,1) s̃0,1/‖s̃0,1‖2. Thus, for the space M(1), the Riemannian Exponential and Logarithm maps are, respectively, given by

Expm(θ) = (O^T + θo^T, r exp(θr), Exps0(θs0)^T, Exps1(θs1)^T)^T,   (8)
Lm(m1) = (O1^T − O^T, log(r1/r), Ls0(s0,1)^T, Ls1(s1,1)^T)^T.   (9)
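The sphere component of the maps in (8) and (9) can be sketched and checked for the round-trip identity Exps0(Ls0(s0,1)) = s0,1; the two unit vectors below are hypothetical, and the antipodal case is not handled:

```python
import numpy as np

def sphere_exp(s0, theta):
    # Riemannian Exponential map on S^2 at base point s0 (theta tangent at s0).
    t = np.linalg.norm(theta)
    if t == 0.0:
        return s0.copy()
    return np.cos(t) * s0 + np.sin(t) * theta / t

def sphere_log(s0, s1):
    # Riemannian Logarithm map on S^2: tangent vector at s0 pointing toward s1.
    c = np.clip(s0 @ s1, -1.0, 1.0)
    proj = s1 - c * s0                 # s1 with its s0-component removed
    n = np.linalg.norm(proj)
    if n == 0.0:
        return np.zeros(3)             # s1 == s0 (antipodal case undefined)
    return np.arccos(c) * proj / n

s0 = np.array([0.0, 0.0, 1.0])
s1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)   # hypothetical second direction
theta = sphere_log(s0, s1)
print(np.linalg.norm(theta))                     # geodesic distance arccos(s0^T s1)
print(sphere_exp(s0, theta))                     # recovers s1
```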

Although the Lµi(β)(mi) ∈ Tµi(β)M(1) are in different tangent spaces, we can use parallel transport to translate them to the same tangent space at an overall base point, denoted by B(β). We choose B(β) = (0, 0, 0, 1, g5(β5d)^T, g6(β6d)^T)^T, where g5(β5d) and g6(β6d) are the mean directions of g5(xi,d, β5d) and g6(xi,d, β6d) for all possible xi,d, respectively. We use parallel transport formulated by a rotation matrix,

R(µi(β) ⇒ B(β)) = diag{I3, 1, Rµ0i(β),g5(β5d), Rµ1i(β),g6(β6d)},   (10)

to translate Lµi(β)(mi) ∈ Tµi(β)M(1) into {R(µi(β) ⇒ B(β))} Lµi(β)(mi) ∈ TB(β)M(1). An illustration of the parallel transport is given in Figure 2 (b). Finally, we define the rotated residual of mi with respect to µi(β) as

Ei(β) = {R(µi(β) ⇒ B(β))} Lµi(β)(mi) for i = 1, . . . , n.   (11)

The Ei(β) are uniquely defined in the same tangent space TB(β)M(1), which is a Euclidean space.
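Putting (7), (10), and (11) together, the rotated residual for one subject can be sketched as follows (tangent vectors are kept in embedded R^3 coordinates per sphere, so the residual is a 10-vector); all numeric inputs, i.e. the observed atom, its fitted mean, and the base-point directions, are hypothetical:

```python
import numpy as np

def rot_between(a, b):
    # Rodrigues rotation taking unit vector a to unit vector b (a != -b assumed).
    axis = np.cross(a, b)
    n = np.linalg.norm(axis)
    if n == 0.0:
        return np.eye(3)
    k = axis / n
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    eta = np.arccos(np.clip(a @ b, -1.0, 1.0))
    return np.cos(eta) * np.eye(3) + np.sin(eta) * K + (1 - np.cos(eta)) * np.outer(k, k)

def sphere_log(s0, s1):
    # Logarithm map on S^2 at s0, as used in equation (7).
    c = np.clip(s0 @ s1, -1.0, 1.0)
    proj = s1 - c * s0
    n = np.linalg.norm(proj)
    return np.zeros(3) if n == 0.0 else np.arccos(c) * proj / n

def rotated_residual(O, r, s0, s1, muO, mur, mu0, mu1, g5bar, g6bar):
    # Tangent-space residual (7), then parallel transport (10)-(11) to the
    # common base point B(beta).
    log_res = np.concatenate([O - muO, [np.log(r / mur)],
                              sphere_log(mu0, s0), sphere_log(mu1, s1)])
    R = np.zeros((10, 10))
    R[:3, :3] = np.eye(3)                    # location block
    R[3, 3] = 1.0                            # radius block
    R[4:7, 4:7] = rot_between(mu0, g5bar)    # first spoke block
    R[7:10, 7:10] = rot_between(mu1, g6bar)  # second spoke block
    return R @ log_res

# Hypothetical observed atom and fitted conditional mean.
e = rotated_residual(O=np.array([1.0, 2.0, 3.0]), r=1.5,
                     s0=np.array([0.0, 0.0, 1.0]), s1=np.array([1.0, 0.0, 0.0]),
                     muO=np.array([1.1, 1.9, 3.0]), mur=1.4,
                     mu0=np.array([0.0, 1.0, 0.0]), mu1=np.array([1.0, 0.0, 0.0]),
                     g5bar=np.array([0.0, 0.0, 1.0]), g6bar=np.array([1.0, 0.0, 0.0]))
print(e.shape)   # a 10-vector in the common tangent space
```

The rotation blocks preserve the lengths of the spoke residuals, so the geodesic distances arccos(µki^T ski) are unchanged by the transport.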

The intrinsic regression model for medial representations M(1) at an atom is then defined by

E{Ei(β) | xi} = E[{R(µi(β) ⇒ B(β))} Lµi(β)(mi) | xi] = 0   (12)

for i = 1, . . . , n, where the expectation is taken with respect to the conditional distribution of Ei(β) given xi (Le, 2001). In model (12), the nonparametric component is the distribution of mi given xi, which is left unspecified, while the parametric component is the mean function µi(β), which is assumed to be known. Moreover, our model (12) does not assume a homogeneous variance across all atoms and subjects. This is also desirable for real applications, because between-subject and between-atom variabilities can be substantial.

At atom d, let Ei(β, d) be {R(µi(β, d) ⇒ B(β, d))} Lµi(β,d)(mi(d)), where µi(β, d) is the conditional mean of mi(d) given xi. Model (12) leads to an intrinsic regression model for M(1)^K given by

E{Ei(β, d) | xi} = 0   (13)

for all d ∈ D and i = 1, . . . , n. As a comparison, consider a multivariate regression model Yi = Xiβ + εi with E(εi | xi) = E(Yi − Xiβ | xi) = 0, where Yi is a py × 1 vector and Xi is a py × p design matrix depending on xi. It is clear that Ei(β, d) is closely related to εi = Yi − Xiβ in the multivariate regression model, and thus the intrinsic regression model (13) for M(1)^K can be regarded as a generalization of a standard multivariate regression.

The key advantage of translating tangent vectors on different tangent spaces to the same tangent space is that we can directly apply most multivariate analysis techniques in Euclidean space to the analysis of Ei(β) (Anderson, 2003). By using parallel transport to obtain Ei(β), we can explicitly account for the correlation structure among the components of Ei(β) and then construct a set of estimating equations to calculate a more efficient parameter estimate; see the next section for details.

2.3 Two-stage estimation procedure

We propose a two-stage estimation procedure for computing parameter estimates for the semiparametric medial representation regression model (12) as follows.

Stage 1 is to calculate an intrinsic least squares estimate of the parameter β, denoted by β̂I, by minimizing the sum of squared geodesic distances,

β̂I = argmin_β Dn(β) = argmin_β ∑_{i=1}^n Dn,i(β) = argmin_β ∑_{i=1}^n dist{mi, µi(β)}^2,   (14)

where Dn,i(β) = dist{mi, µi(β)}^2 and dist{mi, µi(β)} is the shortest distance between mi and µi(β) on M(1). Since Dn(β) can be written as the sum of four terms,

Dn^(1)(β) = ∑_{i=1}^n {Oi − µoi(β)}^T {Oi − µoi(β)},
Dn^(2)(β) = ∑_{i=1}^n [log(ri) − log{µri(β)}]^2,
Dn^(3)(β) = ∑_{i=1}^n [arccos{s0i^T µ0i(β)}]^2,
Dn^(4)(β) = ∑_{i=1}^n [arccos{s1i^T µ1i(β)}]^2,

we can minimize Dn^(k)(β) for k = 1, 2, 3, 4 independently when they do not share any common parameters.
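A sketch of the decomposed objective in (14) for a single subject; the observed atom and candidate mean below are hypothetical:

```python
import numpy as np

def atom_distance_sq(O, r, s0, s1, muO, mur, mu0, mu1):
    # The four terms of D_{n,i}(beta): squared geodesic distance on
    # M(1) = R^3 x R^+ x S^2 x S^2 between an observed atom and its mean.
    d_loc = np.sum((O - muO) ** 2)                        # Euclidean location part
    d_rad = (np.log(r) - np.log(mur)) ** 2                # log-scale radius part
    d_s0 = np.arccos(np.clip(s0 @ mu0, -1.0, 1.0)) ** 2   # great-circle arc, spoke 0
    d_s1 = np.arccos(np.clip(s1 @ mu1, -1.0, 1.0)) ** 2   # great-circle arc, spoke 1
    return d_loc + d_rad + d_s0 + d_s1

d = atom_distance_sq(O=np.array([0.0, 0.0, 0.0]), r=2.0,
                     s0=np.array([0.0, 0.0, 1.0]), s1=np.array([1.0, 0.0, 0.0]),
                     muO=np.array([1.0, 0.0, 0.0]), mur=1.0,
                     mu0=np.array([0.0, 0.0, 1.0]), mu1=np.array([0.0, 1.0, 0.0]))
print(d)   # 1 + log(2)^2 + 0 + (pi/2)^2
```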

Computationally, we develop an annealing evolutionary stochastic approximation Monte Carlo algorithm (Liang, 2011) for obtaining β̂I, whose details can be found in the supplementary report. In our experience, traditional optimization methods, including the quasi-Newton method, do not perform well for optimizing Dn(β) and depend strongly on the starting value of β. When µi(β) takes a relatively complicated form, Dn(β) is generally not concave and can have multiple local modes. For instance, since µ1i(β) is a nonlinear function of β and Dn^(4)(β) may not be a concave function of β over B, our prior experience has shown that the quasi-Newton method for optimizing Dn^(4)(β) can easily converge to local minima.

The estimate β̂I is closely associated with the intrinsic mean (Bhattacharya and Patrangenaru, 2005) and does not involve the concept of parallel transport. If we replace [arccos(s)]^2 by 1 − s in Dn^(3)(β) and Dn^(4)(β), then our fitting procedure in Stage 1 is effectively maximum likelihood estimation for a model with Fisher-distributed errors on the sphere, and thus β̂I is an extrinsic estimate. It will be shown in Theorem 1 below that β̂I is a consistent estimate, but β̂I is not efficient, since it does not account for the correlation among the different components of medial representations.

Stage 2 is to calculate a more efficient estimator of β, denoted by β̂E, which is a solution of

∑_{i=1}^n ĥE(xi) V̂^{-1} Ei(β) = 0,   (15)

where ĥE(xi) = ∂βµi(β̂I) {R(µi(β̂I) ⇒ B(β̂I))}^{-1} = ∂βµi(β̂I) {R(B(β̂I) ⇒ µi(β̂I))}, V(β) = ∑_{i=1}^n Ei(β) Ei(β)^T / n, and V̂ = V(β̂I).

Equation (15) in Stage 2 is invariant to the rotation matrix R(B(β) ⇒ P0), where P0 = (0, 0, 0, 1, 0, 0, 1, 0, 0, 1)^T represents the center at the origin (0, 0, 0)^T, the unit radius r = 1, and the two spoke directions pointing towards the north pole (0, 0, 1)^T. Specifically, we can use the rotation matrix R(B(β) ⇒ P0) to rotate Ei(β) to {R(B(β) ⇒ P0)} Ei(β) for all i. Correspondingly, ĥE(xi) and V^{-1} are, respectively, changed to ĥE(xi) {R(B(β) ⇒ P0)}^T and {R(B(β) ⇒ P0)} V^{-1} {R(B(β) ⇒ P0)}^T. Thus, after applying the rotation R(B(β) ⇒ P0), we can show that ĥE(xi) V^{-1} Ei(β) equals

ĥE(xi) {R(B(β) ⇒ P0)}^T {R(B(β) ⇒ P0)} V^{-1} {R(B(β) ⇒ P0)}^T {R(B(β) ⇒ P0)} Ei(β),

which is independent of R(B(β) ⇒ P0).

Model (12) is a conditional mean model (Chamberlain, 1987; Newey, 1993). The conditional mean model implies that E{h(xi) Ei(β)} = E[h(xi) E{Ei(β) | xi}] = 0 for any vector function h(·), which may depend on β. After some algebraic calculations, it can be shown that calculating β̂I is equivalent to solving ∂βDn(β) = −2 ∑_{i=1}^n ∂βµi(β) R(B(β) ⇒ µi(β)) Ei(β) = 0, that is, hI(xi) = ∂βµi(β) R(B(β) ⇒ µi(β)). However, it has been shown (Chamberlain, 1987; Newey, 1993) that the optimal function has the form hopt(xi, β) = E{∂βEi(β) | xi} var{Ei(β) | xi}^{-1}, which achieves the semiparametric efficiency bound for β. Therefore, hI(xi) is not an optimal function, and thus the intrinsic least squares estimate in Stage 1 is not an efficient estimator.

Since E{∂βEi(β) | xi} and var{Ei(β) | xi} for each β do not have a simple form, we must estimate them nonparametrically, which leads to a nonparametric estimate of hopt(x, β), denoted by ĥopt(x, β). Although we may solve the estimating equations Fn(β) = ∑_{i=1}^n ĥopt(xi, β) Ei(β) = 0 to calculate the efficient estimator of β, it can be computationally challenging to solve Fn(β) = 0, since nonparametrically estimating the 8 × p matrix E{∂βEi(β) | xi} and the 8 × 8 inverse matrix of var{Ei(β) | xi} can be very unstable for a relatively small sample size. Thus, we replace var{Ei(β) | xi} by var{Ei(β)} and approximate E{∂βEi(β) | xi} by ∂βµi(β) R(B(β) ⇒ µi(β)). Moreover, in order to avoid calculating ∂βµi(β) R(B(β) ⇒ µi(β)) and var{Ei(β)} during each numerical iteration, we calculate them at β̂I and then construct the estimating equations ∑_{i=1}^n ĥE(xi) V̂^{-1} Ei(β) = 0 for calculating β̂E. The two-stage estimation procedure leads to substantial computational efficiency, since solving the complex estimating equations (15) is relatively easy when starting from β̂I. An alternative way is to directly minimize ‖∑_{i=1}^n ∂βµi(β) R(B(β) ⇒ µi(β)) V(β)^{-1} Ei(β)‖^2, which is much more complex than Dn(β) and thus is computationally difficult.

As a comparison between β̂E and β̂I, we consider a multivariate nonlinear regression model Yi = F(xi, β) + εi with E(εi | xi) = E{Yi − F(xi, β) | xi} = 0 and var(εi | xi) = Σ, where F(xi, β) is a vector of nonlinear functions of xi and β. In this case, Ei(β) = εi = Yi − F(xi, β), β̂I = argmin_β ∑_{i=1}^n {Yi − F(xi, β)}^T {Yi − F(xi, β)}, and ĥE(xi) = ∂βF(xi, β̂I). Then, Σ can be estimated by V̂ = ∑_{i=1}^n {Yi − F(xi, β̂I)}{Yi − F(xi, β̂I)}^T / n. Equation (15) reduces to ∑_{i=1}^n ĥE(xi) V̂^{-1} {Yi − F(xi, β)} = 0, whose solution is just β̂E. Under mild conditions, it can be shown that, compared with β̂I, β̂E is a more efficient estimator of β, and its asymptotic covariance is given by {∑_{i=1}^n ĥE(xi) V̂^{-1} ĥE(xi)^T}^{-1}. In the context of highly concentrated spoke data, our intrinsic regression model reduces to the multivariate nonlinear regression model and, similar to the multivariate nonlinear regression model, the two-stage approach can increase statistical efficiency in estimating β.
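In the Euclidean special case F(xi, β) = Xiβ, both stages have closed forms: Stage 1 reduces to ordinary least squares and Stage 2 to feasible generalized least squares with V̂ built from the Stage-1 residuals. A sketch on simulated data with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, py, p = 200, 3, 2
beta_true = np.array([1.0, -0.5])
X = rng.normal(size=(n, py, p))                    # subject-specific design matrices
Sigma = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.6], [0.3, 0.6, 1.0]])
eps = rng.multivariate_normal(np.zeros(py), Sigma, size=n)
Y = np.einsum('ijk,k->ij', X, beta_true) + eps

# Stage 1: intrinsic least squares reduces to pooled OLS here.
A1 = sum(X[i].T @ X[i] for i in range(n))
b1 = sum(X[i].T @ Y[i] for i in range(n))
beta_I = np.linalg.solve(A1, b1)

# Stage 2: estimate V from Stage-1 residuals, then solve the weighted
# estimating equations sum_i X_i^T V^{-1} (Y_i - X_i beta) = 0.
res = Y - np.einsum('ijk,k->ij', X, beta_I)
V_hat = res.T @ res / n
Vinv = np.linalg.inv(V_hat)
A2 = sum(X[i].T @ Vinv @ X[i] for i in range(n))
b2 = sum(X[i].T @ Vinv @ Y[i] for i in range(n))
beta_E = np.linalg.solve(A2, b2)
print(beta_I, beta_E)                               # both near beta_true
```

Reweighting by V̂^{-1} is what buys the efficiency gain of β̂E over β̂I when the error components are correlated.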

2.4 Asymptotic properties

We establish consistency and asymptotic normality of β̂I and β̂E. The following assumptions are needed to facilitate the technical details, although they are not the weakest possible conditions.

Assumption A1. The data {zi = (xi, mi) : i = 1, · · · , n} form an independent and identically distributed sequence.

Assumption A2. β* is an interior point of the compact set B ⊂ R^p and is the unique solution of the model equation E{hE(x)E(β)} = 0, where hE(x) = ∂βµi(β*) {R(B(β*) ⇒ µi(β*))} V(β*)^{-1}. Moreover, β* is an isolated point of the set of all minimizers of the map D(β) = E[dist{m, µ(x, β)}^2] on B, denoted by IB.

Assumption A3. In an open neighborhood of β*, µ(x, β) has a second-order continuous derivative with respect to β, and ‖Lµ(β)(m)‖, ‖∂µLµ(β)(m)‖, ‖∂βµ(x, β)‖, and ‖∂β^2 µ(x, β)‖ are bounded by some integrable function G(z) with E{G(z)^2} < ∞.

as n → ∞, where Ip is a p × p identity matrix and → denotes convergence in distribution.

(c) Under Assumptions A1-A4, we have

[∑_{i=1}^n {ĥE(xi) V̂^{-1} Ei(β̂E)}^{⊗2}]^{-1/2} {∑_{i=1}^n ĥE(xi) V̂^{-1} ∂βEi(β̂E)^T} (β̂E − β*) → N(0, Ip)   (17)

as n → ∞.

Theorem 1 has several important applications. Theorem 1 (a) establishes the consistency of β̂E and β̂I. According to Theorem 1 (b) and (c), we can consistently estimate the covariance matrices of β̂E and β̂I. For instance, the covariance matrix of β̂E, denoted by Σ̂E, can be approximated by

{∑_{i=1}^n ĥE(xi) V̂^{-1} ∂βEi(β̂E)^T}^{-1} [∑_{i=1}^n {ĥE(xi) V̂^{-1} Ei(β̂E)}^{⊗2}] {∑_{i=1}^n ĥE(xi) V̂^{-1} ∂βEi(β̂E)^T}^{-T}.   (18)

Moreover, we can use Theorem 1 (c) to construct confidence cones for β̂E and its functions. Since Theorem 1 only establishes the asymptotic properties of β̂E when the sample size is large, these properties may be inadequate to characterize the finite sample behavior of β̂E for relatively small samples. In the case of small samples, we may have to resort to higher-order approximations, such as saddlepoint approximations and bootstrap methods (Butler, 2007; Davison and Hinkley, 1997).

Our choices of which hypotheses to test are motivated by scientific questions, which involve a comparison of medial representation components across diagnostic groups. These questions can usually be formulated as testing linear hypotheses of β as follows:

H0 : Aβ = b0 vs. H1 : Aβ ≠ b0,   (19)

where A is an r × p matrix of full row rank and b0 is an r × 1 specified vector. We test the null hypothesis H0 : Aβ = b0 using a Wald test statistic Wn defined by

Wn = (Aβ̂E − b0)^T (A Σ̂E A^T)^{-1} (Aβ̂E − b0).   (20)

We are led to the following theorem.

  • Theorem 2. If the assumptions A1-A4 are true, then the statistic Wn is asymptotically distributed

    as χ2(r), a chi-square distribution with r degrees of freedom, under the null hypothesis H0.

    An asymptotically valid test can be obtained by comparing sample values of the test statistic

    with the critical value of a χ2(r) distribution at a pre-specified significance level α. However, for a

    small sample size n, we observed relatively low precision of the chi-square approximation. Instead,

we calibrate Wn with the critical value F_{r,n−r}^{1−α} r(n−1)/(n−r), which leads to slightly higher precision, where F_{r,n−r}^{1−α} is the upper α-percentile of the F_{r,n−r} distribution. That is, we reject H0 if Wn ≥ F_{r,n−r}^{1−α} r(n−1)/(n−r), and do not reject H0 otherwise. The F approximation outperforms the chi-square approximation because it explicitly accounts for sampling uncertainty in estimating the covariance matrix of Aβ̂E.
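The rejection rule above can be written out directly. The helper below is a hypothetical illustration (the function name and inputs are ours): it computes Wn as in (20) and reports the p-value implied by the rescaled-F calibration, which is equivalent to comparing Wn against F_{r,n−r}^{1−α} r(n−1)/(n−r).

```python
import numpy as np
from scipy import stats

def wald_test(beta_hat, Sigma_hat, A, b0, n):
    """Wald statistic (20) with the rescaled-F calibration described above:
    reject H0 when Wn >= F_{r,n-r}^{1-alpha} * r(n-1)/(n-r)."""
    r = A.shape[0]
    diff = A @ beta_hat - b0
    Wn = float(diff @ np.linalg.inv(A @ Sigma_hat @ A.T) @ diff)
    # p-value under the F calibration: rescale Wn, then use the F survival function
    pval = stats.f.sf(Wn * (n - r) / (r * (n - 1)), r, n - r)
    return Wn, pval
```

The plain chi-square version would replace the last step with `stats.chi2.sf(Wn, r)`.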

    3 Simulation studies and real data

    3.1 Double directional data with covariates

    We generated double directional responses as follows:

$$R_{\mu_{0i}(\beta),(0,0,-1)^T}L_{\mu_{0i}(\beta)}(s_{0i}) = E_{0i}, \qquad R_{\mu_{1i}(\beta),(0,0,-1)^T}L_{\mu_{1i}(\beta)}(s_{1i}) = E_{1i},$$

where μ0i(β) and μ1i(β) were set according to (6), in which the xi,d's were fixed at 1 and the xi,c's were independently simulated from a N(0, 1) distribution. Both μ0i(β) and μ1i(β) were, respectively, centered around g5(xi,d, β5d) = (u0, v0, w0)^T and g6(xi,d, β6d) = (u1, v1, w1)^T according to (1) such that

$$\frac{u_0}{1-w_0} = \beta_{5d,1} = 1.2, \quad \frac{v_0}{1-w_0} = \beta_{5d,2} = 1.2, \quad \frac{u_1}{1-w_1} = \beta_{6d,1} = 0.8, \quad \frac{v_1}{1-w_1} = \beta_{6d,2} = 0.8.$$

In addition, we imposed the two constraints

$$\beta_{5c} = (\beta_{5c,1}, \beta_{5c,2})^T = \beta_{6c} = (\beta_{6c,1}, \beta_{6c,2})^T = (1, 1)^T.$$

We generated the errors E0i and E1i in T_{(0,0,−1)}(S²) from a 4-dimensional normal distribution,

N(0, 0.5Σ), with Σ specified as

$$\Sigma = \begin{pmatrix} \Sigma_0 & \Sigma_{01} \\ \Sigma_{01} & \Sigma_1 \end{pmatrix}, \qquad \Sigma_0 = \Sigma_1 = \begin{pmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{pmatrix}, \qquad \Sigma_{01} = \rho_2\begin{pmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{pmatrix}.$$

Subsequently, we rotated E0i onto the tangent space T_{μ0i(β)}(S²) and E1i onto the tangent space T_{μ1i(β)}(S²), and then we used the Exp map defined in the supplementary report to obtain the

    responses s0i and s1i. We set n = 40, 80, and 120, ρ1 = ρ2 = 0.5, and then we simulated

    2000 datasets for each case to compare the biases and the root-mean-square error of the two

    estimates: β̂I and β̂E. As seen in Table 1, β̂E has smaller root-mean-square error than β̂I for

    every component of β, but some components of β̂E can be more biased.

    We also calculated the mean of the estimated standard error estimates and the relative

    efficiencies for all the components in β̂E and evaluated the finite sample performance of the

    Wald statistic Wn for hypothesis testing. The results are quite similar to those from the single

directional case in the supplementary file, so we do not present them here to save space.
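The simulation mechanism above pushes tangent-space errors onto S² through the Exp map. A minimal numerical sketch of that step follows, working directly at the base point (0, 0, 1) and omitting the rotation step; the function name is ours and this is an illustration, not the paper's full simulation design.

```python
import numpy as np

def exp_map_sphere(mu, v):
    """Riemannian exponential map on S^2: push a tangent vector v at mu
    (assumed orthogonal to mu) onto the sphere."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return mu.copy()
    return np.cos(nv) * mu + np.sin(nv) * (v / nv)

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0, 1.0])           # base point on S^2
e = rng.normal(scale=0.5, size=2)        # 2-d error in the tangent plane
v = np.array([e[0], e[1], 0.0])          # embedded in T_mu(S^2)
s = exp_map_sphere(mu, v)                # simulated directional response on S^2
```

Because mu and v/‖v‖ are orthonormal, the result always has unit norm, i.e. it lies on the sphere.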

    3.2 Schizophrenia study of the hippocampus

    We consider a neuroimaging dataset about the medial representation shape of the hippocampus

structure in the left and right brain hemispheres in schizophrenia patients and healthy controls,

    collected at 14 academic medical centers in North America and western Europe. The hippocam-

    pus, a gray matter structure in the limbic system, is involved in processes of motivation and

    emotions, and plays a central role in the formation of memory.

    In this study, 238 first-episode schizophrenia patients (53 female, 185 male; mean/standard

    deviation age, female 25.1/5.69 years; male 23.6/4.55 years) were enrolled who met the fol-

    lowing criteria: age 16 to 40 years; onset of psychiatric symptoms before age 35; diagnosis of

    schizophrenia, schizophreniform, or schizoaffective disorder according to DSM-IV criteria; and

various treatment and substance dependence conditions. 56 healthy control subjects (18 female, 38 male; mean/standard deviation age, female 24.8/3.30 years; male 25.3/4.21 years) were also enrolled. Neurocognitive and magnetic resonance imaging (MRI) assessments were performed at the first visit.

Table 1: Bias (×10−3) and MS (×10−2) of β̂I and β̂E for the double directional case. Bias denotes the bias of the mean of the estimates; MS denotes the root-mean-square error. For each parameter, the first row is for β̂I and the second is for β̂E. The constraints β5c,1 = β6c,1 and β5c,2 = β6c,2 are imposed.

                                n = 40          n = 80          n = 120
                              Bias    MS      Bias    MS      Bias    MS
    β5d,1 = 1.2         β̂I   3.15  13.26     4.35  10.04     4.22   7.75
                        β̂E   3.40  13.10     4.36   9.82     3.98   7.60
    β5c,1 = β6c,1 = 1   β̂I   9.29  19.19     1.74  12.76     7.43  10.31
                        β̂E   8.93  18.02     0.89  12.09     7.27   9.81
    β5d,2 = 1.2         β̂I   9.44  13.69     2.05  10.19     0.86   7.80
                        β̂E   9.81  13.29     0.88   9.59     0.43   7.69
    β5c,2 = β6c,2 = 1   β̂I   6.90  18.55     5.00  13.08     0.64  10.53
                        β̂E   6.74  17.50     5.67  12.44     0.62   9.99
    β6d,1 = 0.8         β̂I   5.18  16.85     3.23   9.74     2.49   7.93
                        β̂E   5.69  12.91     3.10   9.65     2.69   7.76
    β6d,2 = 0.8         β̂I   2.34  14.84     1.31   9.78     0.86   8.47
                        β̂E   1.32  13.06     0.98   9.71     0.91   8.07

    The brain MRI data were first aligned to the Montreal Neurological Institute (MNI) space.

    Hippocampi were segmented in the MNI space and then their medial representations were recon-

    structed from those binary segmentations (Styner et al., 2004). Subsequently, these hippocampus

    medial representations were realigned by using a rigid body variation of the standard Procrustes

    method. The resulting alignment leads to a shape representation that is invariant to translation

    and rotation, but not to scale. Scaling information is retained for studying changes in overall

    size or volume.
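A rigid-body alignment of this kind (rotation plus translation, leaving scale untouched) can be sketched with a standard Kabsch-style least-squares solution. This generic version is our illustration, not the exact Procrustes variant used in the study.

```python
import numpy as np

def rigid_align(X, Y):
    """Align 3-d point set X onto Y by rotation and translation only,
    preserving scale (a Kabsch-style rigid Procrustes fit)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    U, _, Vt = np.linalg.svd((X - mx).T @ (Y - my))
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal proper rotation
    return (X - mx) @ R.T + my                  # rotate, then translate onto Y
```

Applied to a shape that is a pure rotation-plus-translation copy of another, the fit recovers the copy exactly.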

    The aim of our study was to investigate the difference of medial representation shape between

    schizophrenia patients and healthy controls while controlling for other factors, such as gender and

    age. The response of interest was the hippocampus medial representation shape at the 24 medial

    atoms of the left and right brain hemisphere (Figure 1). Covariates of interest were Whole Brain

    Volume (WBV), race including Caucasian, African American and others, age in years, gender,

    and diagnostic status including patient and control.

The covariate vector is xi = (1, genderi, agei, diagi, race1i, race2i, WBVi)^T, where diag is the dummy variable for patients versus healthy controls, and race1 and race2 are, respectively, dummy variables for Caucasians and African Americans versus other races. For the location component of the medial representation, we set μO(x, β) = (x^Tβ1, x^Tβ2, x^Tβ3)^T, where the βk (k = 1, 2, 3) are 7 × 1 coefficient vectors. For the radius component, we set μr(x, β) = exp(x^Tβ4), where β4 is a 7 × 1 coefficient vector. For the directional components, we used μ0(xi, β) as defined in (6), in which xi,d = (genderi, diagi, race1i, race2i)^T, xi,c = (agei, WBVi)^T, β5 = (β5d^T, β5c^T)^T for s0, and β6 = (β6d^T, β6c^T)^T for s1. Therefore, the full coefficient vector is β = (β1^T, β2^T, β3^T, β4^T, β5^T, β6^T)^T.

Then we used the two-stage estimation procedure to obtain estimates of β and conducted hypothesis testing using Wald statistics. Since the primary goal of the study is to investigate the

    difference of medial representation shape between schizophrenia patients and healthy controls,

    we paid special attention to the terms in β associated with diagnostic status.

    First, we examined the overall diagnostic status effect on the whole medial representation

    structure. The p-values of the diagnostic status effects across the atoms of both the left and

    right reference hippocampi are shown in the first row (a) and (b) of Figure 3. The false discovery

    rate approach (Benjamini and Hochberg, 1995) was used to correct for multiple comparisons, and

    the corresponding adjusted p-values are shown in the first row (c) and (d) of Figure 3. There was

    a large significant area in the left hippocampus and also some in the right hippocampus. The

    significance area remains almost the same after correcting for multiple comparisons, but with an

    attenuated significance level.
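The Benjamini-Hochberg correction applied here is the standard step-up false discovery rate procedure. A compact sketch of the adjusted p-value computation (the function name is ours; the output matches the usual definition of BH-adjusted p-values):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)     # p_(k) * m / k
    # enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.clip(adj, 0.0, 1.0)
    out = np.empty_like(adj)
    out[order] = adj                                # restore original order
    return out
```

Rejecting hypotheses with adjusted p-value at most α controls the FDR at level α for independent tests.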

    We also examined each component on the medial representation separately. For the radius

    component of the medial representation, we presented the p-values of the diagnostic status effects

    across the atoms in the second row (a) and (b) of Figure 3 and the adjusted p-values in the second

    row (c) and (d). Before correcting for multiple comparisons, we observed a significant diagnostic

    status difference in the medial representation thickness at the central atoms near the posterior

    side in the left hippocampus and in some areas in the right hippocampus, whereas we did not

    observe much of a significant diagnostic status effect after correcting for multiple comparisons.

    For the location component of the medial representation, we showed the p-values of the

    diagnostic status effects in the third row (a) and (b) of Figure 3 and the corresponding adjusted

    p-values in the third row (c) and (d). We observed significant diagnostic status differences mainly

    located around the anterior and lateral side of the left hippocampus though with clearly reduced

    significance after correcting for multiple comparisons. Similar lateral results have also been

    observed by Narr et al. (2004).

Similarly, for the two spoke directions on the medial representation, the p-values of the diagnostic status effects are shown in the last row (a) and (b) of Figure 3 and the corresponding

    adjusted p-values are shown in the last row (c) and (d). Before correcting for multiple compar-

    isons, there was some significant area around the anterior, posterior, and the medial side of the

    left hippocampus, but not much in the right hippocampus. There was still some significance for

    the diagnostic status effect around the same areas in the left hippocampus after correcting for

    multiple comparisons, but nothing in the right hippocampus. The posterior orientation effect of

    hippocampal differences in schizophrenia has also been shown by Styner et al. (2004) and basi-

    cally constitutes a local bending change in that region. The anterior effect is novel and located

    at the intersection of the hippocampal Cornu Ammonis 1 and Cornu Ammonis 2 regions.

    We also examined the overall age effect on the whole medial representation structure. The

    color-coded p-values of the age effect across the atoms of both the left and right reference hip-

    pocampi are shown in the first row (a) and (b) of Figure 4. The false discovery rate approach was

    used to correct for multiple comparisons, and the corresponding adjusted p-values are shown in

    the first row (c) and (d) of Figure 4. There was a large significant area in the right hippocampus

    and also some in the left hippocampus. The significance area remains almost the same after

    correcting for multiple comparisons, but with an attenuated significance level.

    Additionally, we looked at each component on the medial representation separately. For the

    radius component of the medial representation, the color-coded p-values of the age effect across

    the atoms are shown in the second row (a) and (b) of Figure 4 and the adjusted p-values are

    shown in the second row (c) and (d). Before correcting for multiple comparisons, there was a

    small age effect in the medial representation thickness at the central atoms near the posterior

    side in the left hippocampus and in some areas in the right hippocampus. However, there was

not much of a significant age effect after correcting for multiple comparisons.

    For the location component of the medial representation, the color-coded p-values of the age

    effect are shown in the third row (a) and (b) of Figure 4 and the corresponding adjusted p-values

    are shown in the third row (c) and (d). Significant age effects were mainly located around the

anterior and lateral side of the left hippocampus though with clearly reduced significance after

    correcting for multiple comparisons.

    For the two spoke directions on the medial representation, we showed the color-coded p-values

    of the age effect in the last row (a) and (b) of Figure 4 and the corresponding adjusted p-values

    are in the last row (c) and (d). Even after correcting for multiple comparisons, we observed

    significant areas around the anterior, posterior, and the medial side of the right hippocampus

    and some areas in the left hippocampus.

    Finally, following suggestions from a reviewer, we examined the overall diagnostic status effect

    without accounting for other factors. The p-values of the diagnostic status effects are shown in

    Figure 5. Inspecting Figure 5 reveals a small significant area in the left and right hippocampi

    before and after correcting for multiple comparisons. Comparing with Figure 3, we feel that such

    attenuation in Figure 5 may be caused by omitting other factors such as age that are believed

    to be associated with the variability of the medial representation of subcortical structures.

    4 Discussion

    We have proposed a semiparametric model for describing the association between the medial

    representation of subcortical structures and covariates of interest, such as diagnostic status, age

    and gender. We have developed a two-stage estimation procedure to calculate the parameter

    estimates and used Wald statistics to test linear hypotheses of unknown parameters. We have

    used extensive simulation studies and a real dataset to evaluate the accuracy of our parameter

    estimates and the finite sample performance of the Wald statistics.

    Many issues still merit further research. The two-stage estimation procedure can be easily

modified to simultaneously estimate all parameters across all atoms and impose some structures (e.g., spatial smoothness) on the matrix of regression parameters across all atoms while

accounting for the correlations between different components of different atoms. This generalization requires a good estimate of the covariance matrix of Ei(β) across all atoms. We may

    consider a shrinkage estimator of the covariance matrix of all Ei(β) as a linear combination of

    the identity matrix and the sample covariance matrix V(β) (Ledoit and Wolf, 2004). Moreover,

    for the matrix of regression parameters across all atoms, we may consider its sparse low-rank

    matrix factorization to identify the underlying latent structure among all atoms (Witten, Tibshi-

    rani, and Hastie, 2009; Dryden and Mardia, 1998; Fletcher et al., 2004), which will be a topic of

our future research. It would also be interesting to develop Bayesian models for the joint analysis of medial

    representation data of subcortical structures (Angers and Kim, 2005; Healy and Kim, 1996).
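The linear-shrinkage estimator mentioned in this paragraph — a convex combination of the sample covariance and a scaled identity — is simple to state in code. The shrinkage weight rho is taken as given here; Ledoit and Wolf (2004) derive an optimal data-driven choice, which this sketch does not implement.

```python
import numpy as np

def shrink_cov(S, rho):
    """Linear shrinkage of a sample covariance S toward a scaled identity,
    in the spirit of Ledoit and Wolf (2004); rho in [0, 1] is the weight."""
    p = S.shape[0]
    mu = np.trace(S) / p                    # average eigenvalue of S
    return (1.0 - rho) * S + rho * mu * np.eye(p)
```

Setting rho = 0 returns the sample covariance unchanged; rho = 1 returns the scaled identity, and intermediate values pull extreme eigenvalues toward their mean, improving conditioning.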

    References

    Andrews, D. W. K. (1992), “Generic uniform convergence,” Econometric Theory, 8, 241-257.

    Andrews, D. W. K. (1994), “Empirical Process Methods in Econometrics,” Handbook of Econo-

    metrics, Volume IV. Edited by Engle, R. F. and McFadden, D. L., 2248-2292.

    Andrews, D. W. K. (1999), “Consistent Moment Selection Procedures for Generalized Method

    of Moments Estimation,” Econometrica, 67, 543-564.

    Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.), Wiley

    Series in Probability and Statistics.

    Angers, J. F. and Kim, P. T. (2005), “Multivariate Bayesian Function Estimation,” Ann.

    Statist., 33, 2967-2999.

    Benjamini, Y. and Hochberg, Y. (1995), “Controlling the False Discovery Rate: a Practical

    and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Ser.

    B 57, 289-300.

Bhattacharya, R. N. and Patrangenaru, V. (2005), “Large Sample Theory of Intrinsic and

    Extrinsic Sample Means on Manifolds II,” Ann. Statist, 33, 1225-1259.

    Butler, R. W. (2007). Saddlepoint Approximations with Applications. New York, Cambridge

    University Press.

    Chamberlain, G. (1987), “Asymptotic Efficiency in Estimation with Conditional Moment Re-

    strictions,” J. Economet., 34, 305-334.

    Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. New

    York, Cambridge University Press.

    Downs, T. D. (2003), “Spherical Regression,” Biometrika, 90, 655-668.

Dryden, I. L. and Mardia, K. V. (1998). Statistical Shape Analysis. Wiley, Chichester.

    Fisher, N. I. and Lee, A. J. (1992), “Regression Models for an Angular Response,” Biometrics,

    48, 665-677.

    Fletcher P. T., Lu C., Pizer S. M. and Joshi S. (2004), “Principal Geodesic Analysis for the

    Study of Nonlinear Statistics of Shape,” Medical Imaging, 23, 995-1005.

    Gould, A. L. (1969), “A Regression Technique for Angular Variates,” Biometrics, 25, 683-700.

    Healy, D. M. and Kim, P. T. (1996), “An Empirical Bayes Approach to Directional Data and

Efficient Computation on the Sphere,” Ann. Statist., 24, 232-254.

    Jennrich R. (1969), “Asymptotic Properties of Nonlinear Least Squares Estimators,” Ann. of

Math. Statist., 40, 633-643.

    Johnson, R. A. and Wehrly, T. E. (1978), “Some Angular-linear Distributions and Related

    Regression Models,” J. Am. Statist. Assoc., 73, 602-606.

Jupp, P. E. and Mardia, K. V. (1989), “A Unified View of the Theory of Directional Statistics,

    1975-1988,” International Statistical Review, 57, 261-294.

    Le, H. (2001), “Locating Frechet means with an application to shape spaces,” Adv. Appl. Prob.,

    33, 324-338.

    Ledoit, O. and Wolf, M. (2004), “A Well-conditioned Estimator for Large-dimensional Covari-

    ance Matrices,” Journal of Multivariate Analysis, 88, 365-411.

    Liang, F. (2011), “Annealing Evolutionary Stochastic Approximation Monte Carlo for Global

    Optimization,” Statistics and Computing, 21, 375-393.

    Mardia, K. V. (1975), “Statistics of Directional Data (with Discussion),” J. R. Statist. Soc. B,

    37, 349-393.

    Mardia, K. V. and Jupp, P. E. (1983). Directional Statistics, Academic Press, John Wiley.

    Narr K. L., Thompson P. M., Szeszko P., Robinson D., Jang S., Woods R. P., Kim S., Hayashi

    K. M., Asunction D., Toga A. W. and Bilder R. M. (2004), “Regional Specificity of

Hippocampal Volume Reductions in First-episode Schizophrenia,” NeuroImage, 21, 1563-1575.

    Newey, W. K. (1993), “Efficient Estimation of Models with Conditional Moment Restrictions.”

    In Econometrics, vol. 11 of Handbook of Statistics, 419-454, Amsterdam: North Holland.

    Pizer S. M., Fletcher T., Fridman Y., Fritsch D. S., Gash A. G., Glotzer J. M., Joshi S., Thall

    A., Tracton G., Yushkevich P. and Chaney E. L. (2003). “Deformable M-Reps for 3D

    Medical Image Segmentation,” International Journal of Computer Vision, 55, 85-106.

    Presnell B., Morrison S. P. and Littell R. C. (1998), “Projected Multivariate Linear Models for

    Directional Data,” J. Am. Statist. Assoc., 93, 1068-1077.

Styner, M., Lieberman, J. A., McClure, R. K., Weinberger, D. R., Jones, D. W. and Gerig,

    G. (2005), “Morphometric Analysis of Lateral Ventricles in Schizophrenia and Healthy

    Controls Regarding Genetic and Disease-specific factors,” Proc. Natl. Acad. Sci. USA,

    102, 4872-4877.

    Styner, M., Lieberman, J. A., Pantazis, D. and Gerig, G. (2004). “Boundary and Medial Shape

    Analysis of the Hippocampus in Schizophrenia,” Medical Image Analysis, 8, 197-203.

    van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes.

    Springer-Verlag, New York.

    Witten, D.M., Tibshirani, R. and Hastie, T. (2009), “Penalized Matrix Decomposition, with

    Applications to Sparse Principal Components and Canonical Correlation Analysis,” Bio-

    statistics, 10, 515-534.

Appendix: Proofs of Theorems 1 and 2

    We need the following lemma throughout the proof of Theorems 1 and 2.

Lemma 1. (i) Under Assumption A1, if f(z, β) is a vector of continuous functions in β for any β in a compact set B and any z, then

$$\lim_{\delta \to 0} P\Big( \sup_{\beta, \beta' \in B,\, \|\beta' - \beta\|_2 < \delta} \|f(z, \beta') - f(z, \beta)\|_2 > \epsilon \Big) = 0 \quad \text{for all } \epsilon > 0. \qquad (21)$$

(ii) In addition to the assumptions in (i), if f(z, β) also satisfies supβ∈B ‖f(z, β)‖2 ≤ G1(z) and E{G1(z)}

Proof of Lemma 2. It follows from the triangle inequality that

$$\mathrm{dist}(m, \mu(x, \beta'))^2 \le \mathrm{dist}(m, \mu(x, \beta))^2 + \mathrm{dist}(\mu(x, \beta), \mu(x, \beta'))^2 + 2\,\mathrm{dist}(\mu(x, \beta), \mu(x, \beta'))\,\mathrm{dist}(m, \mu(x, \beta)).$$

Using the Schwarz inequality and the assumptions of Lemma 2, we have

$$D(\beta') \le D(\beta) + E(\beta, \beta') + 2\sqrt{D(\beta)\,E(\beta, \beta')}$$


    Figure 2: Graphic illustration of (a) stereographic projection and (b) parallel transport. In

    panels (a) and (b), N and O denote the north pole (0, 0, 1) and the origin (0, 0, 0), respectively,

and the red dashed lines are the x, y, and z axes. In panel (a), the red point (u, v, −1) is a selected point on the plane z = −1 and the green point T^{−1}_{st;(0,0,−1)^T}((u, v, −1)) is the inverse of the stereographic projection, mapping (u, v, −1) back onto S². In panel (b), the point A is on S², L_A(s) is in T_A(S²), and R_{A,N}L_A(s) ∈ T_N(S²) is the parallel transport of L_A(s) from A to the north pole N.

Figure 3: The color-coded p-value maps of the diagnostic status effects from the schizophrenia study

    of the hippocampus: rows 1, 2, 3, and 4 are for the whole medial representation structure, radius,

    location, and two directions, respectively: at each row, the uncorrected p−value maps for (a) the

    left hippocampus and (b) the right hippocampus; the corrected p−value maps for (c) the left

    hippocampus and (d) the right hippocampus after correcting for multiple comparisons.

Figure 4: The color-coded p-value maps of the age effect from the schizophrenia study of the

hippocampus: rows 1, 2, 3, and 4 are for the whole medial representation structure, radius,

    location, and two directions, respectively: at each row, the uncorrected p−value maps for (a) the

    left hippocampus and (b) the right hippocampus; the corrected p−value maps for (c) the left

    hippocampus and (d) the right hippocampus after correcting for multiple comparisons.

Figure 5: The color-coded p-value maps of the diagnostic status effects without accounting for other

    factors from the schizophrenia study of the hippocampus: rows 1, 2, 3, and 4 are for the whole

    medial representation structure, radius, location, and two directions, respectively: at each row,

    the uncorrected p−value maps for (a) the left hippocampus and (b) the right hippocampus;

    the corrected p−value maps for (c) the left hippocampus and (d) the right hippocampus after

    correcting for multiple comparisons.

Local Polynomial Regression for Symmetric Positive

    Definite Matrices

    Ying Yuan, Hongtu Zhu †, Weili Lin, J. S. Marron

    University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.

    Summary. Local polynomial regression has received extensive attention for the nonparametric

    estimation of regression functions when both the response and the covariate are in Euclidean

    space. However, little has been done when the response is in a Riemannian manifold. We develop

    an intrinsic local polynomial regression estimate for the analysis of symmetric positive definite

    (SPD) matrices as responses that lie in a Riemannian manifold with covariate in Euclidean space.

    The primary motivation and application of the proposed methodology is in computer vision and

    medical imaging. We examine two commonly used metrics, including the trace metric and the Log-

    Euclidean metric on the space of SPD matrices. For each metric, we develop a cross-validation

    bandwidth selection method, derive the asymptotic bias, variance, and normality of the intrinsic

    local constant and local linear estimators, and compare their asymptotic mean square errors. Sim-

    ulation studies are further used to compare the estimators under the two metrics and to examine

    their finite sample performance. We use our method to detect diagnostic differences between

    diffusion tensors along fiber tracts in a study of human immunodeficiency virus.

    1. Introduction

    Symmetric positive-definite (SPD) matrix-valued data occur in a wide variety of important

applications. For instance, in computational anatomy, a SPD deformation tensor (JJ^T)^{1/2} is computed to capture the directional information of shape change encoded in the Jacobian

    matrices J at each location in an image (Grenander and Miller, 2007). In diffusion tensor

imaging (Basser et al., 1994), a 3 × 3 SPD diffusion tensor, which tracks the effective diffusion of water molecules, is estimated at each voxel (a 3-dimensional (3D) pixel) of an imaging space.

    In functional magnetic resonance imaging, a SPD covariance matrix is calculated to delineate

    [email protected]. Address for correspondence and reprints: Hongtu Zhu, Ph.D., Department of

    Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill,

    Chapel Hill, NC 27599-7420.

functional connectivity between different neural assemblies involved in achieving a complex cog-

    nitive task or perceptual process (Fingelkurts et al., 2005). In classical multivariate statistics,

    a common research focus is to model and estimate SPD covariance matrices for multivariate

    measurements, longitudinal data, and time series data among many others (Pourahmadi, 2000;

    Anderson, 2003).

    Despite the popularity of SPD matrix-valued data, only a handful of methods have been

    developed for the statistical analysis of SPD matrices as response variables in a Riemannian

    manifold. In the medical imaging literature (Fletcher and Joshi, 2007; Batchelor et al., 2005;

    Pennec et al., 2006), various image processing methods have recently been developed to segment,

    deform, interpolate, extrapolate and regularize diffusion tensor images (DTIs). Schwartzman

    (2006) proposed several parametric models for analyzing SPD matrices and derived the distri-

    butions of several test statistics for comparing differences between the means of the two (or

    multiple) groups of SPD matrices. Kim and Richards (2010) developed a nonparametric esti-

    mator for the common density function of a random sample of positive definite matrices. Zhu

    et al. (2009) developed a semi-parametric regression model with SPD matrices as responses

    in a Riemannian manifold and the covariates in a Euclidean space. Barmpoutis et al. (2007)

    and Davis et al. (2010) proposed tensor splines and local constant regressions for interpolat-

    ing DTI tensor fields, but they did not address several important issues of analyzing random

    SPD matrices including the asymptotic properties of the nonparametric estimate proposed.

    All these methods for SPD matrices discussed above are based on the trace metric (or affine

    invariant metric) in the SPD space (Lang, 1999; Terras, 1988). Recently, Arsigny et al. (2007)

    proposed a Log-Euclidean metric and showed its excellent theoretical and computational prop-

    erties. Dryden et al. (2009) compared various metrics of the space of SPD matrices and their

    properties.

    To the best of our knowledge, this is the first paper to develop an intrinsic local polynomial

    regression (ILPR) model for estimating an intrinsic conditional expectation of a SPD matrix

response, S, given a covariate vector x from a set of observations (x1, S1), · · · , (xn, Sn), where the xi can be either univariate or multivariate. In practice, x can be the arc-length of a specific

    fiber tract (e.g., right internal capsule tract), the coordinates in the 3D imaging space, or

    demographic variables such as age. Important applications of ILPR include smoothing diffusion

    tensors along fiber tracts and smoothing diffusion and deformation tensor fields. Another

    application is quantifying the change of diffusion and deformation tensors as well as the inter-

    regional functional connectivity matrix across groups and over time.

Relative to the existing literature on the analysis of SPD matrices, we make several important contributions in this paper.

• To account for the curved nature of the SPD space, we propose the ILPR method for estimating the intrinsic conditional expectation of random SPD responses given the covariate. We also derive an approximation of a cross-validation method for bandwidth selection.

• Theoretically, we compare the trace metric and the Log-Euclidean metric and establish the asymptotic properties of the ILPR estimators corresponding to each metric.

• Theoretically and numerically, we examine the effect that the use of different metrics has on statistical inference in the SPD space.

    The rest of the paper is organized as follows. In Section 2, we develop the ILPR method and

    a cross-validated bandwidth method for nonparametric analysis of random SPD matrix-valued

    data. In Section 3, we compare the trace metric and the Log-Euclidean metric and derive their

    ILPR estimators. We investigate the asymptotic properties of the estimators proposed under

    the Log-Euclidean metric and the estimators under the trace metric in Sections 4.1 and 4.2,

    respectively. We examine the finite sample performance of the ILPR estimators via simulation

    studies in Section 5. We analyze a real data set to illustrate a real-world application of the

    proposed ILPR method in Section 6 before offering some concluding remarks in Section 7.

    2. Intrinsic Local Polynomial Regression for SPD Matrices

    In this section, we develop a general framework for using intrinsic local polynomial regression

in the analysis of SPD matrices and will examine two examples in Section 3. Let Sym+(m) and Sym(m) be, respectively, the set of m × m SPD matrices and the set of m × m symmetric matrices with real entries. The space Sym(m) is a Euclidean space with the Frobenius metric (or Euclidean inner product) given by tr(A1A2) for any A1, A2 ∈ Sym(m), whereas Sym+(m) is a Riemannian manifold, which will be detailed below. There is a one-to-one correspondence between Sym(m) and Sym+(m) through the matrix exponential and logarithm. For any matrix A ∈ Sym(m), its matrix exponential is given by exp(A) = Σ_{k=0}^∞ A^k/k! ∈ Sym+(m). Conversely, for any matrix S ∈ Sym+(m), there is a log(S) = A ∈ Sym(m) such that exp(A) = S.
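This exp/log correspondence is easy to verify numerically. The eigendecomposition route below is one standard way to compute matrix exponentials and logarithms of symmetric matrices; the function names are ours.

```python
import numpy as np

def spd_exp(A):
    """Matrix exponential of a symmetric A, mapping Sym(m) into Sym+(m)."""
    w, Q = np.linalg.eigh(A)                  # A = Q diag(w) Q^T
    return Q @ np.diag(np.exp(w)) @ Q.T       # exp(A) = Q diag(e^w) Q^T

def spd_log(S):
    """Matrix logarithm of an SPD S, mapping Sym+(m) back into Sym(m)."""
    w, Q = np.linalg.eigh(S)                  # eigenvalues of S are positive
    return Q @ np.diag(np.log(w)) @ Q.T       # log(S) = Q diag(log w) Q^T
```

Round-tripping any symmetric A through spd_exp and spd_log recovers A, and spd_exp always yields a matrix with strictly positive eigenvalues.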

    Standard nonparametric regression models for responses in the Euclidean space estimate

E(S|X = x). However, for a random S in a curved space, one cannot directly define the conditional expectation of S given X = x with the usual expectation in Euclidean space. We

    are interested in answering the following question.

(Q1) How do we define an intrinsic conditional expectation of S at each x, denoted by D(x),

    in Sym+(m)?

    To appropriately define D(x), we review some basic facts about the geometrical structure of

    Sym+(m) near D(x) (Lang, 1999; Terras, 1988). See Figure 1 for a graphical illustration. We

    first introduce the tangent vector and tangent space at D(x) in Sym+(m). For a small scalar

δ > 0, let C(t) be a differentiable map from (−δ, δ) to Sym+(m) passing through C(0) = D(x). A tangent vector at D(x) is defined as the derivative of the smooth curve C(t) with respect

    to t evaluated at t = 0. The set of all tangent vectors at D(x) forms the tangent space of

Sym+(m) at D(x), denoted by T_{D(x)}Sym+(m), which can be identified with Sym(m). The

tangent space T_{D(x)}Sym+(m) is equipped with an inner product 〈·, ·〉, called a Riemannian metric, which

varies smoothly from point to point. For instance, one may use the Frobenius metric as a

    Riemannian metric. Two additional Riemannian metrics for Sym+(m) will be given in Section

3. For a given Riemannian metric, we can calculate 〈U, V〉 for any U and V in T_{D(x)}Sym+(m), and then we can calculate the length of a smooth curve C(t) : [t0, t1] → Sym+(m), which equals

∫_{t0}^{t1} √〈Ċ(t), Ċ(t)〉 dt,

where Ċ(t) is the derivative of C(t) with respect to t. A geodesic

    is a smooth curve on Sym+(m) whose tangent vector does not change in length or direction as

one moves along the curve. For a U ∈ T_{D(x)}Sym+(m), there is a unique geodesic, denoted by γ_{D(x)}(t;U), whose domain contains [0, 1], such that γ_{D(x)}(0;U) = D(x) and γ̇_{D(x)}(0;U) = U.

    The Riemannian exponential mapping ExpD(x) : TD(x)Sym+(m) → Sym+(m) of the tangent

    vector U is defined as ExpD(x)(U) = γD(x)(1;U). The inverse of the Riemannian exponential

map Log_{D(x)}(·) = Exp_{D(x)}^{−1}(·) is called the Riemannian logarithmic map from Sym+(m) to a

    vector in TD(x)Sym+(m). Finally, the shortest distance between two points D1(x) and D2(x) in

    Sym+(m) is called the geodesic distance between D1(x) and D2(x), denoted as g(D1(x), D2(x)),

    which satisfies

g(D1(x), D2(x))^2 = 〈Log_{D1(x)}(D2(x)), Log_{D1(x)}(D2(x))〉. (1)

We define E_{D(X)} to be Log_{D(X)}(S) in T_{D(X)}Sym+(m). Statistically, E_{D(X)} can be regarded as the residual of S relative to D(X). Let vecs(C) = (c11, c21, c22, · · · , cm1, · · · , cmm)^T be an m(m+1)/2 × 1 vector for any m × m symmetric matrix C = (cij). Thus, the intrinsic conditional expectation of S at X = x is defined as the D(x) ∈ Sym+(m) such that

    E{LogD(X)(S)|X = x} = Om, (2)

where Om is the m×m matrix with all elements zero and the expectation is taken componentwise with respect to the multivariate random vector vecs(Log_{D(x)}(S)). In fact, (2) characterizes

    intrinsic means (Bhattacharya and Patrangenaru, 2005).

LocalSPD

    Fig. 1. Graphical illustration of the geometrical structure of Sym+(m) near D(x).

Suppose that (xi, Si), i = 1, · · · , n, is an independent and identically distributed random sample, where Si ∈ Sym+(m). For notational simplicity, we focus on a univariate covariate throughout the paper. We are interested in using the observed data {(xi, Si) : i = 1, · · · , n} to estimate D(X), defined in (2), at each X = x0. By ignoring the Riemannian metric introduced

    in TD(X)Sym+(m), we can directly minimize a weighted least square criterion based on the

    metric related to the regular Frobenius inner product, which is given by

Ln(D(x0)) = ∑_{i=1}^n Kh(xi − x0) tr(Log_{D(x0)}(Si)^2). (3)

In (3), Kh(u) = K(u/h)h^{−1}, in which h is a positive scalar, and K(·) is a kernel function such as

    the Epanechnikov kernel (Fan and Gijbels, 1996; Wand and Jones, 1995). However, it is unclear

whether the estimate that minimizes Ln(D(x0)) is consistent. Therefore, we

    are interested in solving the second question below.

    (Q2) How do we use the observed data to consistently estimate D(X) in (2) at each X = x0?

    For a specific Riemannian metric, we consider estimating D(X) at X = x0 by minimizing

a weighted intrinsic least square criterion, denoted by Gn(D(x0)) and given by

Gn(D(x0)) = ∑_{i=1}^n Kh(xi − x0)〈Log_{D(x0)}(Si), Log_{D(x0)}(Si)〉 = ∑_{i=1}^n Kh(xi − x0) g(D(x0), Si)^2. (4)

Directly minimizing Gn(D(x0)) with respect to D(x0) leads to a weighted intrinsic mean of

S1, · · · , Sn ∈ Sym+(m) at x0, denoted by D̂I(x0) (Bhattacharya and Patrangenaru, 2005). It will be shown below that D̂I(x0) is a consistent estimate of D(x0).
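The weighted intrinsic mean minimizing (4) generally has no closed form. Under the trace metric developed in Section 3.2, it can be computed by the standard fixed-point iteration D ← Exp_D(∑_i w_i Log_D(S_i)/∑_i w_i). The sketch below is our illustration of that iteration; the function names, starting value, and stopping rule are our choices, not prescriptions from the paper:

```python
import numpy as np

def _sym_logm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def _sym_expm(A):
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

def weighted_intrinsic_mean(Ss, ws, n_iter=50, tol=1e-10):
    """Weighted intrinsic (Frechet) mean of SPD matrices under the trace
    metric, via the fixed-point iteration D <- Exp_D(avg_i w_i Log_D(S_i)).
    Ss: (n, m, m) array of SPD matrices; ws: (n,) nonnegative weights."""
    ws = np.asarray(ws, dtype=float) / np.sum(ws)
    D = np.average(Ss, axis=0, weights=ws)      # Euclidean mean as start value
    for _ in range(n_iter):
        G = np.linalg.cholesky(D)               # D = G G^T
        Gi = np.linalg.inv(G)
        # tangent-space average at D, expressed at the identity
        T = sum(w * _sym_logm(Gi @ S @ Gi.T) for w, S in zip(ws, Ss))
        D_new = G @ _sym_expm(T) @ G.T
        if np.linalg.norm(D_new - D) < tol:
            return D_new
        D = D_new
    return D
```

With kernel weights w_i = Kh(x_i − x0), this computes exactly the weighted intrinsic mean D̂I(x0) described above (under the trace metric).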

Local polynomial regression has received extensive attention for the nonparametric estimation of regression functions when both response and covariate are in Euclidean space (Fan and Gijbels, 1996; Wand and Jones, 1995). However, little has been done on developing local

    polynomial regression when the response is in a Riemannian manifold and the covariates are in

    Euclidean space. Therefore, we are interested in solving a third question below.

    (Q3) How do we define the intrinsic local polynomial regression for estimating D(X) in (2)

    at each X = x0?

    We propose the intrinsic local polynomial regression for estimating D(X) at X = x0 as

follows. Since D(x) is in a curved space, we cannot directly expand D(x) at x0 by using a

Taylor series expansion. Instead, we consider the Riemannian logarithmic map of D(x) at

D(x0) in T_{D(x0)}Sym+(m). Let Im be an m × m identity matrix. Since Log_{D(x0)}(D(x)) for

different x0 are in different tangent spaces, we may transport them from T_{D(x0)}Sym+(m) to

the same tangent space T_{Im}Sym+(m) through a parallel transport

φ_{D(x0)} : T_{D(x0)}Sym+(m) → T_{Im}Sym+(m).

    That is, we have

Y(x) = φ_{D(x0)}(Log_{D(x0)}(D(x))) ∈ T_{Im}Sym+(m) and Log_{D(x0)}(D(x)) = φ_{D(x0)}^{−1}(Y(x)), (5)

where φ_{D(x0)}^{−1}(·) is the inverse map of φ_{D(x0)}(·). Moreover, since Y(x0) = φ_{D(x0)}(Om) = Om and Y(x) are in the same space T_{Im}Sym+(m), we can expand Y(x) at x0 by using a Taylor

    series expansion as follows:

Log_{D(x0)}(D(x)) = φ_{D(x0)}^{−1}(Y(x)) ≈ φ_{D(x0)}^{−1}(∑_{k=1}^{k0} Y^{(k)}(x0)(x − x0)^k), (6)

where k0 is an integer and Y^{(k)}(x) is the k-th derivative of Y(x) with respect to x divided by

    k!. Equivalently, D(x) can be approximated by

D(x) ≈ Exp_{D(x0)}(φ_{D(x0)}^{−1}(∑_{k=1}^{k0} Y^{(k)}(x0)(x − x0)^k)) = D(x, α(x0), k0), (7)

where α(x0) contains all unknown parameters in {D(x0), Y^{(1)}(x0), · · · , Y^{(k0)}(x0)}.

    To estimate α(x0), we substitute the approximation of D(x) in (7) into (4) to obtain

    Gn(α(x0)), which is given by

Gn(α(x0)) = ∑_{i=1}^n Kh(xi − x0) g(Exp_{D(x0)}(φ_{D(x0)}^{−1}(∑_{k=1}^{k0} Y^{(k)}(x0)(xi − x0)^k)), Si)^2. (8)

Subsequently, we calculate an intrinsic weighted least square estimator of α(x0) defined by

α̂I(x0;h) = argmin_{α(x0)} Gn(α(x0)). (9)

    Then we can calculate D(x, α̂I(x0;h), k0), denoted by D̂I(x, h), as an intrinsic local polynomial

    regression estimator (ILPRE) of D(x). When k0 = 0, D(x, α̂I(x0;h), 0) is exactly the intrinsic

    local constant estimator of D(x0) considered in Davis et al. (2010).

We propose using a leave-one-out cross-validation method for bandwidth selection due to

its conceptual simplicity. Let D̂_I^{(−i)}(xi;h) be the estimate of D(xi) obtained by minimizing

Gn(α(xi)) with (xi, Si) deleted, for a given bandwidth h and each i. The cross-validation score is

    defined as follows:

CV(h) = n^{−1} ∑_{i=1}^n g(Si, D̂_I^{(−i)}(xi;h))^2. (10)

The optimal h, denoted by ĥ, can be obtained by minimizing CV(h). However, since computing

D̂_I^{(−i)}(xi;h) for every i can be computationally prohibitive, we suggest using a first-order

approximation of CV(h), whose details will be given below under each specific metric. Although

    it is possible to develop other bandwidth selection methods, such as plug-in and bootstrap

    methods (Rice, 1984; Park and Marron, 1990; Hall et al., 1992; Hardle et al., 1992), we must

    deal with additional computational and theoretical challenges, which will be left for future

    research.

    3. ILPR under Log-Euclidean Metric and Trace Metric

    As discussed in Dryden et al. (2009), various metrics can be defined for tangent vectors on

    TD(x)Sym+(m). To assess the effect of different metrics on ILPREs, we develop ILPR under

    two commonly used metrics, including the Log-Euclidean metric and the trace metric.

    3.1. Log-Euclidean Metric

    In this section, we review some basic facts about the theory of the Log-Euclidean metric,

details of which are given in Arsigny et al. (2007). We attach the subscript 'L'

to the necessary quantities under the Log-Euclidean metric. We use exp(·) and log(·)

    to represent the matrix exponential and the matrix logarithm, respectively, whereas we use

    Exp and Log to represent the Riemannian exponential and logarithm maps, respectively. Let

∂_{D(x)} log.(U) be the differential of the matrix logarithm at D(x) ∈ Sym+(m) acting on an infinitesimal displacement U ∈ T_{D(x)}Sym+(m) (Arsigny et al., 2007). The Log-Euclidean metric

on Sym+(m) is defined as

〈U, V〉 = tr({∂_{D(x)} log.(U)}{∂_{D(x)} log.(V)}), (11)

where U and V are in T_{D(x)}Sym+(m). The geodesic γ_{D(x),L}(t;U) is given by exp(log(D(x)) + t ∂_{D(x)} log.(U)) for any t ∈ R. Let ∂_{log(D(x))} exp.(A) be the differential of the matrix exponential at log(D(x)) ∈ Sym(m) acting on an infinitesimal displacement A ∈ T_{log(D(x))}Sym(m) (Arsigny et al., 2007). The Riemannian exponential and logarithm maps are, respectively, given by

Exp_{D(x),L}(U) = exp(log(D(x)) + ∂_{D(x)} log.(U)), (12)

Log_{D(x),L}(S) = ∂_{log(D(x))} exp.(log(S) − log(D(x))).

    The geodesic distance between D(x) and S is uniquely given by

gL(D(x), S) = √tr[{log(D(x)) − log(S)}^{⊗2}]. (13)
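Numerically, (13) is just the Frobenius norm of the difference of the matrix logarithms; a small sketch (the helper names are ours):

```python
import numpy as np

def sym_logm(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def g_L(D, S):
    """Log-Euclidean geodesic distance (13) between SPD matrices D and S."""
    return np.linalg.norm(sym_logm(D) - sym_logm(S), 'fro')
```

For example, the distance between Im and e·Im is the Frobenius norm of the identity, i.e. √m.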

We consider two SPD matrices D(x) and D(x0). For any U_{D(x0)} ∈ T_{D(x0)}Sym+(m), the parallel transport φ_{D(x0),L} : T_{D(x0)}Sym+(m) → T_{Im}Sym+(m) is defined by

φ_{D(x0),L}(U_{D(x0)}) = ∂_{D(x0)} log.(U_{D(x0)}) ∈ T_{Im}Sym+(m). (14)

    Combining (12) and (14) yields

Y(x) = φ_{D(x0),L}(Log_{D(x0),L}(D(x))) = log(D(x)) − log(D(x0)),

D(x) = exp(log(D(x0)) + Y(x)). (15)

In this case, E_{D(X)} = log(S) − log(D(X)) and E{log(S)|X = x} = log(D(x)).

Let vec(A) = (a11, ..., a1m, a21, ..., a2m, · · · , am1, · · · , amm)^T be the vectorization of an m × m

matrix A = (aij). Under the Log-Euclidean metric, Gn(D(x0)) in (4) can be written as

Gn(D(x0)) = ∑_{i=1}^n Kh(xi − x0) tr[{log(D(x0)) − log(Si)}^2]. (16)

To compute the ILPR estimator, we use a Taylor series expansion to expand log(D(x)) at

    x0 as follows:

log(D(x)) ≈ ∑_{k=0}^{k0} log(D(x0))^{(k)} (x − x0)^k = log(DL(x, αL(x0), k0)), (17)

where αL(x0) contains all unknown parameters in log(D(x0))^{(k)} for k = 0, · · · , k0. We compute

    α̂IL(x0;h) by minimizing Gn(DL(x, αL(x0), k0)). It can be shown that α̂IL(x0;h) has the

    explicit expression as

α̂IL(x0;h) = vec({∑_{i=1}^n Kh(xi − x0) Xi(x0)^{⊗2}}^{−1} ∑_{i=1}^n Kh(xi − x0) Xi(x0) vecs(log(Si))^T), (18)

where Xi(x) = (1, (xi − x), · · · , (xi − x)^{k0})^T. By substituting α̂IL(x0;h) into DL(x, αL(x0), k0), we have D̂IL(x;h, k0) = DL(x, α̂IL(x0;h), k0).
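Since (18) is an ordinary weighted least squares fit to the matrix logarithms, the Log-Euclidean ILPRE is simple to implement. A sketch (we vectorize with the full vec rather than vecs, which leaves the fit unchanged; the Epanechnikov kernel and the function name are our choices):

```python
import numpy as np

def sym_logm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def sym_expm(A):
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

def ilpr_log_euclidean(xs, Ss, x0, h, k0=1):
    """Intrinsic local polynomial estimate of D(x0) under the Log-Euclidean
    metric: a weighted polynomial least squares fit to the matrix logs
    log(S_i), mapped back to Sym+(m); cf. the closed form (18)."""
    xs = np.asarray(xs, dtype=float)
    u = (xs - x0) / h
    w = np.where(np.abs(u) < 1, 0.75 * (1 - u**2) / h, 0.0)  # Epanechnikov Kh
    X = np.vander(xs - x0, k0 + 1, increasing=True)   # rows (1, xi-x0, ...)
    L = np.array([sym_logm(S).ravel() for S in Ss])   # vectorized matrix logs
    W = np.diag(w)
    # weighted least squares coefficients for each entry of log D(x)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ L)
    m = Ss[0].shape[0]
    return sym_expm(beta[0].reshape(m, m))            # intercept = log D(x0)
```

Exponentiating the fitted intercept recovers D̂IL(x0;h, k0).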

Let e_{k0+1,i} be the (k0 + 1) × 1 unit vector having 1 in the i-th entry and 0 elsewhere, and let

ai(x) = e_{k0+1,1}^T {∑_{j=1}^n Kh(xj − x) Xj(x)^{⊗2}}^{−1} Kh(xi − x) Xi(x). The cross-validation score CV(h)

    can be simplified as follows:

CV(h) = n^{−1} ∑_{i=1}^n gL(Si, D̂IL(xi;h))^2 / {1 − ai(xi)}^2. (19)

Replacing ai(xi) in (19) by the average of a1(x1), · · · , an(xn), we obtain the generalized cross-validation (GCV) score as follows:

GCV(h) = n^{−1} ∑_{i=1}^n gL(Si, D̂IL(xi;h))^2 / {1 − ∑_{i=1}^n ai(xi)/n}^2. (20)

Unless otherwise stated, for the Log-Euclidean metric, we use GCV(h) to select the bandwidth

    throughout this paper.
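The hat values ai(xi) and the GCV score (20) can be computed alongside the local fits. A sketch for k0 = 1 with the Epanechnikov kernel (function names are ours):

```python
import numpy as np

def sym_logm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def epan(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2)_+."""
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def gcv_score(xs, Ss, h, k0=1):
    """Generalized cross-validation score (20) for the Log-Euclidean ILPR.
    a_i(x_i) is the i-th 'hat' value of the local fit at x_i."""
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    L = np.array([sym_logm(S).ravel() for S in Ss])
    resid2, trace_a = 0.0, 0.0
    for i in range(n):
        w = epan((xs - xs[i]) / h) / h                # Kh(xj - xi)
        X = np.vander(xs - xs[i], k0 + 1, increasing=True)
        A = X.T @ (w[:, None] * X)
        beta = np.linalg.solve(A, X.T @ (w[:, None] * L))
        resid2 += np.sum((beta[0] - L[i])**2)         # g_L(S_i, Dhat(x_i))^2
        trace_a += np.linalg.solve(A, np.eye(k0 + 1))[0, 0] * w[i]  # a_i(x_i)
    return (resid2 / n) / (1.0 - trace_a / n)**2
```

In practice one evaluates gcv_score over a grid of bandwidths h and takes the minimizer.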

    3.2. Trace Metric

We review some basic facts about the theory of the trace metric (Schwartzman, 2006; Lang, 1999;

Terras, 1988; Fletcher et al., 2004; Batchelor et al., 2005; Pennec et al., 2006). We attach the

subscript 'T' to the necessary geometric quantities under the trace metric. Under the

    trace metric, an inner product of U and V in TD(x)Sym+(m) is defined as

〈U, V〉 = tr(U D(x)^{−1} V D(x)^{−1}). (21)

The geodesic γ_{D(x),T}(t;U) is given by G(x) exp(t G(x)^{−1} U G(x)^{−T}) G(x)^T for any t, where G(x)

is any square root of D(x) such that D(x) = G(x)G(x)^T. The Riemannian exponential and

    logarithm maps are, respectively, given by

Exp_{D(x),T}(U) = γ_{D(x),T}(1;U) = G(x) exp(G(x)^{−1} U G(x)^{−T}) G(x)^T,

Log_{D(x),T}(S) = G(x) log(G(x)^{−1} S G(x)^{−T}) G(x)^T. (22)

The geodesic distance between D(x) and S, denoted by gT(D(x), S), is given by

gT(D(x), S) = √tr{log^2(G(x)^{−1} S G(x)^{−T})} = √tr{log^2(S^{−1/2} D(x) S^{−T/2})}, (23)

where S^{1/2} is any square root of S.
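A sketch of (23) using the Cholesky factor of D as the square root G(x); any square root gives the same value, since the eigenvalues of G^{−1} S G^{−T} are those of D^{−1} S (the function name is ours):

```python
import numpy as np

def g_T(D, S):
    """Trace-metric geodesic distance (23): sqrt(tr log^2(G^{-1} S G^{-T}))
    with D = G G^T, computed from the eigenvalues of G^{-1} S G^{-T}."""
    G = np.linalg.cholesky(D)
    Gi = np.linalg.inv(G)
    M = Gi @ S @ Gi.T                   # SPD, similar to D^{-1} S
    lam = np.linalg.eigvalsh(M)
    return np.sqrt(np.sum(np.log(lam)**2))
```

A useful sanity check is affine invariance: g_T(A D A^T, A S A^T) = g_T(D, S) for any invertible A, which the Log-Euclidean distance does not share.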

We consider two SPD matrices D(x) and D(x0) = G(x0)G(x0)^T. For any U_{D(x0)} ∈ T_{D(x0)}Sym+(m), the parallel transport φ_{D(x0),T} is defined by

φ_{D(x0),T}(U_{D(x0)}) = G(x0)^{−1} U_{D(x0)} G(x0)^{−T} ∈ T_{Im}Sym+(m). (24)

Thus, combining (22) and (24) yields

Y(x) = φ_{D(x0),T}(Log_{D(x0),T}(D(x))) = log(G(x0)^{−1} D(x) G(x0)^{−T}),

D(x) = G(x0) exp(Y(x)) G(x0)^T. (25)

In this case, E_{D(X)} = log(G(X)^{−1} S G(X)^{−T}). To compute the ILPR estimator, we use a Taylor series expansion to expand Y(x) at x0

    as follows:

D(x) ≈ G(x0) exp(∑_{k=1}^{k0} Y^{(k)}(x0)(x − x0)^k) G(x0)^T = DT(x, αT(x0), k0), (26)

where αT(x0) contains all unknown parameters in G(x0) and Y^{(k)}(x0) for k = 1, · · · , k0. Thus,

    we can compute α̂IT (x0;h) by minimizing Gn(αT (x0)). Under the trace metric, minimizing

    Gn(αT (x0)) is computationally challenging when k0 > 0, since Gn(αT (x0)) is not convex and

    may have multiple local minimizers. Thus, standard gradient methods, which strongly depend

    on the starting value of αT (x0), do not perform well for optimizing Gn(αT (x0)) when k0 > 0.

    Hence, we develop an annealing evolutionary stochastic approximation Monte Carlo algorithm

(see Liang (2011) for a detailed discussion) for computing α̂IT(x0;h). Details can be found in the

    supplementary document.
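To convey the flavor of such stochastic optimizers, the following is a generic simulated-annealing sketch; it is not the annealing evolutionary stochastic approximation Monte Carlo algorithm of the paper (see the supplementary document for that), only a simplified stand-in for minimizing a non-convex objective such as Gn(αT(x0)), with cooling schedule and step size chosen by us:

```python
import numpy as np

def simulated_annealing(objective, theta0, n_iter=5000, t0=1.0, scale=0.1, seed=0):
    """Generic simulated-annealing minimizer. `objective` maps a parameter
    vector (e.g. a vectorized alpha_T(x0)) to the criterion value."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    f = objective(theta)
    best, fbest = theta.copy(), f
    for k in range(1, n_iter + 1):
        temp = t0 / np.log(k + 1.0)                 # logarithmic cooling
        prop = theta + scale * rng.standard_normal(theta.shape)
        fp = objective(prop)
        # Metropolis rule: always accept downhill moves, sometimes uphill
        if fp < f or rng.random() < np.exp(-(fp - f) / temp):
            theta, f = prop, fp
            if f < fbest:
                best, fbest = theta.copy(), f
    return best, fbest
```

Unlike plain gradient descent, the occasional uphill moves let the chain escape the local minimizers of Gn(αT(x0)) that make the trace-metric problem hard when k0 > 0.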

    To simplify the computation of CVT (h), we suggest the first-order approximation to CVT (h)

    as follows:

CVT(h) ≈ n^{−1} ∑_{i=1}^n gT(Si, D̂IT(xi;h, k0))^2 + 2pn(h), (27)

where D̂IT(x;h, k0) = DT(x, α̂IT(x0;h), k0). The CVT(h) is close to Akaike's information criterion (AIC) (Sakamoto et al., 1999), and pn(h) can be regarded as the number of degrees of

    freedom. The explicit form of pn(h) is presented in the supplementary document.

    4. Asymptotic Properties

We derive the asymptotic properties of ILPREs, such as asymptotic normality, under the Log-Euclidean and trace metrics. Furthermore, we systematically compare the intrinsic local constant and linear estimators under each metric and between the two metrics.

    4.1. Log-Euclidean Metric

Under the Log-Euclidean metric, ILPRE is almost equivalent to the LPR estimator for multivariate response in Euclidean space. Thus, we can generalize the existing theory of the local polynomial regression estimator (Fan and Gijbels, 1996; Wand and Jones, 1995). Moreover, we

    only present the consistency and asymptotic normality of ILPRE for interior points, since the

    asymptotic properties of ILPRE for boundary points are similar to those for interior points in

    Euclidean space (Fan and Gijbels, 1996).

To proceed, we need some additional notation. Let a^{⊗2} = aa^T for any vector or matrix

a, and let Iq be an identity matrix of size q = m(m + 1)/2. Let H = diag(1, h, · · · , h^{k0}) ⊗ Iq. Let u = (u1, · · · , uk0)^T and v = (v1, · · · , vk0)^T be k0 × 1 vectors, where uk = ∫ x^k K(x) dx and vk = ∫ x^k K(x)^2 dx for k ≥ 0. Let U0 = (u_{i+j}) and V0 = (v_{i+j}), 0 ≤ i, j ≤ k0, be two (k0 + 1) × (k0 + 1) matrices. Let fX(x) and f_X^{(1)}(x) be the marginal density function of X and its first-order derivative with respect to x, respectively. We define

M(x0;h) = (M1(x0;h)^T, · · · , M_{k0+1}(x0;h)^T)^T, in which we have

Mk(x0;h) = u_{k0+k} vecs(log{D(x0)}^{(k0+1)}) for even k0 + k with 0 < k ≤ k0 + 1, and

Mk(x0;h) = h u_{k0+k+1} vecs(log{D(x0)}^{(k0+1)} log(fX(x0))^{(1)} + log{D(x0)}^{(k0+2)}(k0 + 2)^{−1}) for odd k0 + k.

We have the following results, whose proof is similar to that of Theorem 2 in the supplementary document.

    Theorem 1. Suppose that x0 is an interior point of fX(.). Under the Log-Euclidean metric and

    conditions (C1)-(C4) in the appendix, we have the following results.

(i) H{α̂IL(x0;h) − αL(x0)} converges to 0 in probability as n → ∞.

(ii) For k0 = 0, under an additional condition (C10) in the appendix and assuming that f_X^{(1)}(x) is

continuous in a neighborhood of x0, we have

√(nh)[H{α̂IL(x0;h) − αL(x0)} − h^2 u2 vecs(0.5 log{D(x0)}^{(2)} + {f_X^{(1)}(x0)/fX(x0)} log{D(x0)}^{(1)})] →L N{0, Σ0(x0)}, (28)

where Σ0(x0) = f_X^{−1}(x0) v0 Σ_{E_D}(x0) with Σ_{E_D}(x) = Cov(vecs[log(S) − log{D(x)}]|X = x), and →L denotes convergence in distribution.

(iii) For k0 > 0, under the conditions of Theorem 1 (ii), we have

√(nh)[H{α̂IL(x0;h) − αL(x0)} − {h^{k0+1}/(k0 + 1)!}(U0^{−1} ⊗ Iq)M(x0;h)] →L N{0, Σ(x0)}, (29)

where Σ(x0) = f_X^{−1}(x0)(U0^{−1} V0 U0^{−1}) ⊗ Σ_{E_D}(x0).

Theorem 1 delineates the asymptotic properties of α̂IL(x0;h) for k0 ≥ 0, which covers the asymptotic properties of the intrinsic local constant and linear estimators of D(x0) as k0 = 0, 1.

In particular, the asymptotic bias and variance of D̂IL(x0;h, 0) are closely related to those of

the Nadaraya-Watson estimator when both response and covariate are in Euclidean space (Fan,

1992). Since vecs(log{D̂IL(x0;h, k0)}) is a subvector of α̂IL(x0;h), we calculate the asymptotic average mean squared error (AMSE) conditional on x = {x1, . . . , xn} as

AMSE(log{D̂IL(x0;h, k0)}) = E{tr([log{D̂IL(x0;h, k0)} − log{D(x0)}]^2)|x}.

    Furthermore, for a given weight function w(x), we may consider a constant bandwidth that

    minimizes the asymptotic average mean integrated squared error (AMISE) as

AMISE(log{D̂IL(·;h, k0)}) = ∫ AMSE(log{D̂IL(x;h, k0)}) w(x) dx.

Finally, we can calculate the asymptotically optimal local bandwidth, denoted by hopt,L(x0; k0),

for minimizing AMSE(log{D̂IL(x0;h, k0)}), and the optimal bandwidth, denoted by hopt,L(k0), for minimizing AMISE(log{D̂IL(·;h, k0)}).

By Theorem 1, AMSE(log{D̂IL(x0;h, 0)}) equals

v0{nh fX(x0)}^{−1} tr{Σ_{E_D}(x0)} + h^4 u2^2 tr{(vecs[0.5 log{D(x0)}^{(2)} + f_X^{(1)}(x0) fX(x0)^{−1} log{D(x0)}^{(1)}])^{⊗2}}.

For the intrinsic local linear estimator, AMSE(log{D̂IL(x0;h, 1)}) is given by

0.25 h^4 u2^2 tr{(vecs[log{D(x0)}^{(2)}])^{⊗2}} + v0{nh fX(x0)}^{−1} tr{Σ_{E_D}(x0)}.

The intrinsic local constant and linear estimators have the same asymptotic covariance, and their differences concern only their biases. The local constant estimator has one extra bias term, h^2 u2 f_X^{(1)}(x0) fX(x0)^{−1} vecs[log{D(x0)}^{(1)}], which depends on the marginal density fX(·). Subsequently, we can get the optimal bandwidths, whose detailed expressions can be found in the supplementary document.

    4.2. Trace Metric

    Under the trace metric, since ILPRE is different from the LPR estimator for multivariate

    response in Euclidean space, we study the consistency and asymptotic normality of ILPRE for

    both interior and boundary points.

We need to introduce some additional notation. Consider the function

ψ(S, G, Y) = gT(S, G exp(Y) G^T)^2, (30)

where G is an m × m lower triangular matrix, S ∈ Sym+(m), and Y ∈ Sym(m). Let α = (αG, αY), in which αG = vecs(G) and αY = vecs(Y). Let ∂_α ψ(S, G, Y) and ∂_α^2 ψ(S, G, Y) be the first- and second-order derivatives of ψ(S, G, Y) with respect to α, respectively. By substituting Y(X)

into ∂_α ψ(S, G, Y) and ∂_α^2 ψ(S, G, Y), and using the decomposition α = (αG, αY), we define the block matrices

[Ψ1(x) Ψ2(x); Ψ2(x)^T Ψ3(x)] = E{∂_α^2 ψ(S, G, Y(X))|X = x},

[Ψ11(x) Ψ12(x); Ψ12(x)^T Ψ22(x)] = E[{∂_α ψ(S, G, Y(X))}^{⊗2}|X = x],

where the expectation is taken with respect to S given X = x. Let 1_{k0} be a k0 × 1 column vector with all elements one. Let U2 = (u_{i+j}) and V2 = (v_{i+j}), 1 ≤ i, j ≤ k0, be two k0 × k0 matrices. We define

ℵ(x0;h) = (w1(x0;h)^T Ψ2(x0), w(x0;h){1_{k0} ⊗ Ψ3(x0)})^T

and w(x0;h) = (w2(x0;h)^T, · · · , w_{k0+1}(x0;h)^T), in which

wk(x0;h) = u_{k0+k} vecs(Y^{(k0+1)}(x0)) for even k0 + k with 0 < k ≤ k0 + 1, and

wk(x0;h) = h u_{k0+k+1} vecs(Y^{(k0+1)}(x0) log(fX(x0))^{(1)} + Y^{(k0+2)}(x0)(k0 + 2)^{−1}) for odd k0 + k.

Finally, let αT(x) = (vecs{G(x)}^T, vecs{Y^{(1)}(x)}^T, · · · , vecs{Y^{(k0)}(x)}^T)^T.

Theorem 2. Suppose that x0 is an interior point of fX(·). Under the trace metric and conditions (C1)-(C8) in the appendix, we have the following results.

(i) There exist solutions α̂IT(x0;h) to the equation ∂Gn(αT(x0))/∂αT(x0) = 0 such that

H{α̂IT(x0;h) − αT(x0)} converges to 0 in probability as n → ∞.

(ii) For k0 = 0, if f_X^{(1)}(x) is continuous in a neighborhood of x0, then we have

√(nh)[H{α̂IT(x0;h) − αT(x0)} − h^2 u2 vecs{G^{(1)}(x0) f_X^{(1)}(x0)/fX(x0) + 0.5 G^{(2)}(x0)}] →L N{0, Ω0(x0)}, (31)

where Ω0(x0) = u0^{−2} f_X^{−1}(x0) v0 Ψ1(x0)^{−1} Ψ11(x0) Ψ1(x0)^{−1}.

(iii) For k0 > 0, if condition (C9) in the appendix also holds, we have

√(nh)[H{α̂IT(x0;h) − αT(x0)} − {h^{k0+1}/(k0 + 1)!} N(x0)^{−1} ℵ(x0;h)] →L N{0, Ω(x0)}, (32)

where Ω(x0) = f_X^{−1}(x0) N(x0)^{−1} N*(x0) N(x0)^{−1}, and N(x) and N*(x) are, respectively, given by

N(x) = [u0 Ψ1(x), u ⊗ Ψ2(x); u^T ⊗ Ψ2(x)^T, U2 ⊗ Ψ3(x)],

N*(x) = [v0 Ψ11(x), v ⊗ Ψ12(x); v^T ⊗ Ψ12(x)^T, V2 ⊗ Ψ22(x)].

Theorem 2 delineates the asymptotic bias, covariance, and asymptotic normality of α̂IT(x0;h)

for k0 ≥ 0. Based on Theorem 2, it is straightforward to derive the asymptotic bias, covariance, and asymptotic normality of D̂IT(x0;h, k0) for k0 ≥ 0. Moreover, to have a direct

comparison between the trace and Log-Euclidean metrics, we calculate the asymptotic biases

and covariances of log{D̂IT(x0;h, k0)} under these two metrics. Subsequently, we calculate AMSE(log{D̂IT(x0;h, k0)}) and AMISE(log{D̂IT(·;h, k0)}) for a given weight function w(x). Minimizing AMSE(log{D̂IT(x0;h, k0)}) and AMISE(log{D̂IT(·;h, k0)}), respectively, leads to

the optimal bandwidths, whose detailed expressions can be found in the supplementary document.

We are interested in comparing the asymptotic properties of the intrinsic local constant estimator

D̂IT(x0;h, 0) and the intrinsic local linear estimator D̂IT(x0;h, 1). It follows from the delta method

that AMSE(log{D̂IT(x0;h, 0)}) can be approximated as

h^4 u2^2 tr([GD(x0)^T vecs{G^{(1)}(x0) f_X^{(1)}(x0) fX(x0)^{−1} + 0.5 G^{(2)}(x0)}]^{⊗2})

+ (nh)^{−1} tr{GD(x0)^{⊗2} Ω0(x0)} + o(h^4 + (nh)^{−1}), (33)

where GD(x0) = {∂vec(log(G(x0)^{⊗2}))/∂vecs(G(x0))^T}^T. The asymptotic bias and variance of D̂IT(x0;h, 0) are similar to those of the Nadaraya-Watson estimator when the response is in

Euclidean space (Fan, 1992). For the intrinsic local linear estimator, AMSE(log{D̂IT(x0;h, 1)})

equals 0.25 h^4 u2^2 tr[{GD(x0)^T Ψ1(x0)^{−1} Ψ2(x0)^T vecs(Y^{(2)}(x0))}^{⊗2}] + (nh)^{−1} tr{GD(x0)^{⊗2} Ω0(x0)}.

We consider ILPRE near the edge of the support of fX(x). Without loss of generality, we assume that the design density fX(·) has a bounded support [0, 1] and consider the left-boundary

point x0 = dh for some positive constant d. The asymptotic consistency and normality of ILPRE remain valid for the boundary points after slight modifications of the definitions of uk and vk.

Denote u_{k,d} = ∫_{−d}^∞ x^k K(x) dx and v_{k,d} = ∫_{−d}^∞ x^k K(x)^2 dx. Correspondingly, u, U2, V2, U0 and V0 are replaced by u_d, U_{2,d}, V_{2,d}, U_{0,d} and V_{0,d}, respectively. Let c_{k0+2,d} = (u_{k0+2,d}, · · · , u_{2k0+1,d})^T

and ℵ_d(0+) = (u_{k0+1,d} Ψ2(0+), c_{k0+2,d} ⊗ Ψ3(0+))^T vecs(Y^{(k0+1)}(0+)). For the boundary points, we have the following asymptotic results under the trace metric.

    Theorem 3. Suppose that x0 = dh is a left boundary p