
NON-PARAMETRIC GRADUATION USING KERNEL METHODS

BY PROFESSOR J. B. COPAS, B.Sc., PH.D.

University of Birmingham

AND

S. HABERMAN, M.A., PH.D., F.I.A.

The City University

1. INTRODUCTION

LET E be an event whose probability of occurrence depends on some continuous variable x.

For example, E may be death and x age, E may be incidence of lung cancer and x amount of smoking, or E may be reconviction of a parolee with x previous criminal convictions (with suitable definitions of the underlying time interval for the occurrence of E). Given observations on n individuals with characteristic x and the incidence of E, it is desired to estimate the function qx.

The simplest case is when the data are grouped—suppose x occurs in nx cases, of which E occurs sx times. Then the elementary crude estimate is

q̊x = sx/nx.

This paper describes a simple, non-parametric method of graduating observed rates or probabilities of the form q̊x. The technique has been used for smoothing data sets arising in medicine and criminology (Copas (1)(2)) and is extended here to an actuarial example, and the results are compared with more traditional approaches.

The approach to graduation described in this paper is non-parametric or distribution-free in that it does not involve functional forms or parameters of such forms. In general, non-parametric methods apply to very wide families of distributions rather than only to families specified by a particular functional form. Since fewer assumptions are made, these methods lead to conclusions which require fewer qualifications.

The technique described makes use of kernel methods which have been developed in the recent statistical literature to estimate a probability density from a sample of observations. Given observations y1, y2, . . . , yn from a density g(y),


JIA 110 (1983) 135-156

136 Non-Parametric Graduation Using Kernel Methods

which it is required to estimate, the general form of the kernel estimator of g(y) is given by

ĝ(y) = (1/nh) Σi K((y − yi)/h)

for some kernel function K, where h = h(n) is positive and h → 0 as n → ∞. This type of estimator may be viewed as a weighted average over the sample distribution function, using observations nearest to a point y to provide the most information about the density g(y) at that point. The earliest attempts at density estimation are due to John Graunt, who used a histogram approach for representing birth and death data(3). Kernel estimators start with the work of Rosenblatt(4), who extended this classical approach to incorporate 'moving histograms' as represented by the above equation. These features will be described in more detail in subsequent sections.
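As an illustrative sketch of this estimator (the normal kernel, the evaluation point and the sample data here are our own assumptions, not the paper's):

```python
import math

def kernel_density(y, sample, h):
    """Kernel density estimate g_hat(y) = (1/(n*h)) * sum_i K((y - y_i)/h),
    here with the standard normal density as the kernel K."""
    K = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)
    n = len(sample)
    return sum(K((y - yi) / h) for yi in sample) / (n * h)

# illustrative (invented) sample and window width
sample = [1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 4.1]
print(kernel_density(2.0, sample, h=0.5))
```

Observations near y = 2.0 dominate the sum, which is the 'weighted average' interpretation described above.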

2. INTRODUCTION TO GRADUATION USING KERNEL METHODS

Graduation may be regarded as the principles and methods by which a set of observed (or crude) probabilities are adjusted in order to provide a suitable basis for inferences to be drawn and further practical computations to be made (e.g. life tables to be constructed from sets of probabilities of death by age).

The fundamental justification for the graduation of a set of observed probabilities like qx is the premise (suggested by experience of nature) that, if the number of individuals in the group on whose experience the data are based, nx, had been considerably larger, the set of observed probabilities would have displayed a much more regular progression with x. In the limit, with nx indefinitely large, the set of probabilities would thus have exhibited a smooth progression with x (we deliberately avoid a detailed discussion of the concept of ‘smoothness’). Therefore the observed data may be regarded as a sample from a large population so that the observed probabilities, derived therefrom, are subject to sampling errors. Providing these errors are random in nature, they may be reduced by increasing the size of the sample and thereby extending the scope of the investigation. A simpler, cheaper and more practicable alternative is often to use graduation to remove these random errors (rather than increasing the sample size).

Taking the case where nx is not large enough for the above estimate to be sufficiently reliable, a statistical method is needed which combines the information at related values of x. The actuarial methods of graduation, described for example by Benjamin and Pollard(5), can be described broadly as (a) graphical methods, (b) parametric methods and (c) summation and adjusted-average methods. Parametric methods involve the fitting of some mathematical function f(x) depending on a small number of parameters.


Thus f(·) may be a straight line, a cubic spline, a logistic function, or may refer to a standard set of q's, and the parameters will usually be determined by a formal procedure like maximum likelihood, minimum chi-square or weighted least squares. Although, in the context of the assumed function, such methods will be efficient, they are always liable to some degree of bias, since no preassigned function will fit reality exactly.

Methods (c) are non-parametric and aim to give more stable estimates than q̊x by combining data at different values of x, but without presupposing any particular form for qx. Like parametric methods, they too are liable to give biased estimates, but in such a way that it is possible to balance an increase in bias with a decrease in sampling variation. One method is a weighted average of the observed proportions:

q̂x = Σi wx,i q̊i

where wx,i is a set of weights, but again sampling fluctuations in cells with small ni can have an undue influence on the resulting estimate. A full description of graduation formulae of this form (e.g. Spencer's 21-term formula(6)) is given in Chapter 13 of Benjamin and Pollard(5). A variant which gives less weight (relatively) to cells with low frequency is

q̂x = Σi wx,i si / Σi wx,i ni.

This last estimate can be extended to the general case (with possibly ungrouped data) as

q̂x = Σ(E cases) K((xi − x)/h) / Σ(all i) K((xi − x)/h)        (1)

where the sum in the denominator is over all n cases but, in the numerator, over only those cases where E has occurred. In these expressions all cases are counted separately even if there are ties among the x's. In this estimate h is a constant, and K(·)

is a continuous 'kernel' function which is assumed to be positive, symmetric about zero, and with a single mode at the origin. The adjusted-average methods referred to above may be characterized by a graphical representation of the


coefficients or weights wx,i; the generalization to equation (1), with its use of a continuous kernel function, comes about by replacing the discrete plot of wx,i by a continuous (coefficient) curve.
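A minimal sketch of estimate (1), assuming a Laplace kernel; the data values are invented for illustration:

```python
import math

def q_hat(x, xs, zs, h, K=lambda t: math.exp(-abs(t))):
    """Kernel graduation estimate (1): the numerator sums K over cases
    where the event occurred (z = 1), the denominator over all cases.
    Default kernel is the Laplace kernel K(t) = exp(-|t|)."""
    num = sum(K((xi - x) / h) for xi, zi in zip(xs, zs) if zi == 1)
    den = sum(K((xi - x) / h) for xi in xs)
    return num / den

# ties among the x's are allowed: each case counts separately
xs = [40, 40, 41, 42, 42, 43]
zs = [0, 1, 0, 0, 1, 1]
print(q_hat(41.0, xs, zs, h=1.0))
```

The estimate is a ratio of kernel sums, so it automatically lies in [0, 1].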

A simple example of a kernel function is

K(t) = 1 − |t|/2 for |t| ≤ 2, and K(t) = 0 otherwise,

which has the triangular shape displayed in Figure 1. Other examples of suitable kernels are

K(t) = exp(−t²/2) (normal kernel) and K(t) = exp(−|t|) (Laplace kernel).
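These kernels are easily written down for experimentation; the algebraic form of the triangular kernel (linear decay, vanishing at |t| = 2) is our reading of Figure 1 and the ±2h window remark later in the text, so treat it as an assumption:

```python
import math

def triangular(t):
    # simple kernel of Figure 1: linear decay to zero at |t| = 2 (assumed form)
    return max(0.0, 1.0 - abs(t) / 2.0)

def normal(t):
    return math.exp(-0.5 * t * t)

def laplace(t):
    return math.exp(-abs(t))

# each kernel is positive on its support, symmetric about zero,
# and has a single mode at the origin
for K in (triangular, normal, laplace):
    assert K(0.5) == K(-0.5)            # symmetric about zero
    assert K(0) >= K(0.5) >= K(1.5)     # single mode at the origin
```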

The more immediate properties of q̂x (as defined in (1)) are the following:

(a) it always lies in [0, 1];
(b) h controls the degree of smoothing—if h is very small, q̂x is essentially the proportion of E cases at x; if h becomes larger, data at other values of x have greater influence (in fact, as h → ∞, q̂x tends to the overall proportion of occurrence of E);
(c) in the grouped case, q̂x is equivalent to the variant Σi wx,i si / Σi wx,i ni with an obvious choice of weights, and to the simple weighted average Σi wx,i q̊i when the group sizes are all equal.

An unusual feature of equation (1) is that the degree to which the estimated curve responds to features of the data can be controlled in a continuous fashion. By contrast, the smoothness of parametric methods can only be regulated in discrete steps, for example by increasing the degree of a polynomial, increasing the number of knots in a cubic spline or changing from one family of curves to another. The mathematical properties of such curves will also tend to change

Figure 1. The simple kernel function


abruptly. The estimate (1) is given by a single formula, and no statistical fitting of parameters is needed. Similarly, neither the graphical approach to graduation nor the adjusted-average methods shows such flexible adherence to the observed data.

The value of h can be given a physical interpretation as a 'window width' on an x-axis. Thus for the simple kernel all the information for estimating q̂x comes from those data in the range x ± 2h, while for the normal kernel h is the standard deviation of the distribution of weights, so that to estimate q̂x nearly all the information comes from those data in the range x ± 2h. Estimation at any given value of x can be likened to that of a binomial proportion with effective sample size of approximately

(Σi K((xi − x)/h))² / Σi K((xi − x)/h)²

as shown in (6) below. This is an increasing function of h: the larger is h the more data are being used for the estimate at each individual x, and the more highly correlated will be the estimates at neighbouring values of x.
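The effective-sample-size interpretation can be explored numerically; the formula (sum of weights)² / (sum of squared weights) is the standard one for a weighted binomial proportion and is our assumption here, as is the one-case-per-age grid:

```python
import math

def effective_n(x, xs, h, K=lambda t: math.exp(-0.5 * t * t)):
    """Approximate effective sample size of the kernel estimate at x:
    (sum of weights)^2 / (sum of squared weights), normal kernel assumed."""
    w = [K((xi - x) / h) for xi in xs]
    return sum(w) ** 2 / sum(wi * wi for wi in w)

xs = list(range(35, 71))   # one case per age, purely for illustration
# effective n grows with h: more data contribute to each estimate
assert effective_n(50, xs, 2.0) > effective_n(50, xs, 0.5)
```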

An alternative motivation for estimate (1) will be given in the next section (§ 3). Statistical properties of the estimate are then discussed further in §§4 and 5 (together with Appendices I and II). A generalization is given in §6 and then in §7 the method is applied to a standard set of actuarial data, including a comparison with existing graduation methods.

3. DENSITY ESTIMATES AND NON-PARAMETRIC REGRESSION

A problem of current interest in mathematical statistics is that of 'density estimation'—given a random sample y1, y2, . . . , yn from an unknown continuous probability density function g(y), we wish to find a smooth function which estimates g(y). A recent review of the large literature on this problem is Bean and Tsokos(3). Much attention is given to the so-called kernel estimates (originally proposed by Rosenblatt(4) and Parzen(7)) given by the following (as in §1):

ĝ(y) = (1/nh) Σi K((y − yi)/h).        (2)

The similarity with equation (1) is immediate. Suppose that, in the situation of the last section, x is assumed to arise from a random variable X, for example X might be the age of a person selected at random from a particular population.

Then

qx = P(E) gE(x) / g(x)

where g(x) is the probability density function of X and gE(x) is the density function of X conditional on those cases in which E has occurred. Using (2) on g(x) and gE(x) separately, and estimating P(E) by the observed proportion of E-cases, gives precisely the estimate (1).

Experience with (2) shows that the choice of h is much more important than the choice of K. As before, h governs the degree of smoothing in the method and may be likened to the choice of the class-width in a histogram. No general method for determining h has been developed; the difficulty is that the maximum likelihood estimate of g(y) for given y1, y2, . . . , yn is a discrete distribution giving equal probability to the observed values, a solution which is not very helpful in the search for smooth estimates. Although a number of ad hoc methods have been discussed in the literature, the choice of h to use in practice is basically a subjective one, involving a compromise between reflecting important features of the data and yet not over-reacting to spurious chance fluctuations. The dichotomy between fidelity to the observed data and smoothness underlies much of the discussion of graduation in the actuarial literature(5). This suggests that the same considerations apply to K in (1), namely that the choice of the form of K is relatively unimportant but that h needs to be chosen carefully in the light of both the data and whatever (subjective) experience is available.

It is also worth noting that if we define the indicator variable Z to be 1 if E occurs and 0 otherwise, then qx is the conditional expectation of Z given x, and so is a regression function in the technical sense. Then estimate (1) is a special kind of non-parametric regression estimate. Non-parametric methods of estimating regression functions in the more usual case of Z being a continuous variate have been discussed in another recent review paper, Collomb(8).

4. PROPERTIES OF THE NON-PARAMETRIC ESTIMATE

Supposing that each of the n individuals recorded in the data has a value xi together with an associated indicator value Zi (1 if E occurs, 0 otherwise), the estimate (1) can be written

q̂x = Σi wi Zi

where

wi = K((xi − x)/h) / Σj K((xj − x)/h).

If, for convenience, we let qi = qxi, the true probability for the i'th case, then the Zi's are independent and have a binomial distribution with mean qi and


variance qi(1 − qi). Hence the expected value of the estimate can be written as follows:

E(q̂x) = qx + Σi wi (qi − qx).        (3)

The second term of the right-hand side is the bias of the estimate. If h is sufficiently small so that only cases with xi close to x (and hence with qi close to qx) contribute substantially to the summations in (3), this bias term will be small. More exactly, using a Taylor series expansion and denoting derivatives with respect to x by ′:

qi − qx ≈ (xi − x) q′x + ½ (xi − x)² q″x        (4)

and so the bias term in (3) is approximately

q′x Σi wi (xi − x) + ½ q″x Σi wi (xi − x)².        (5)

If the xi's are symmetrically placed in the neighbourhood of x, the coefficient of q′x will be zero; in practice this coefficient will be small except near the ends of the range of the x's. The second term in (5) will be small if q″x is small, i.e. if the curve of the population values qx is approximately linear in the neighbourhood of x, or if h is small so that the effective size of (xi − x)² in the appropriate sum is limited. It is important to note that the coefficient of q″x is always positive, so that q̂x will tend to overestimate qx if the curve is convex and underestimate if the curve of qx is concave (as must be so by considering the geometry of moving averages).
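The final geometric remark can be checked with a toy calculation (the particular probabilities and weights below are invented): symmetric weights applied to a convex set of true probabilities always pull the central value upwards.

```python
qs = [0.01, 0.04, 0.09]    # convex: second difference 0.09 - 2*0.04 + 0.01 > 0
w  = [0.25, 0.50, 0.25]    # symmetric weights summing to 1
smoothed = sum(wi * qi for wi, qi in zip(w, qs))
assert smoothed > qs[1]    # the smoothed value overestimates at the centre
```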

The variance of q̂x may be written

var(q̂x) = Σi wi² qi(1 − qi).        (6)

This expression can be approximated as follows:

var(q̂x) ≈ qx(1 − qx) Σi wi².

This expression will decrease as h increases, and so sampling fluctuations will tend to diminish as greater smoothing is used, as expected. A full discussion of the properties of the exact equation (6) is provided in Appendix I.

It is worth noting that the sum of the squares of the weights needed in the above formula for the approximate variance of q̂x is, in cases of practical interest, equivalent to a second application of the same method but with a different choice of h: for example, for the normal kernel h is replaced by h/√2, and for the Laplace kernel h is replaced by ½h.
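This closing identity is easy to verify numerically (the values of h and t are arbitrary):

```python
import math

h, t = 1.5, 0.7
normal  = lambda t, h: math.exp(-0.5 * (t / h) ** 2)
laplace = lambda t, h: math.exp(-abs(t) / h)

# squaring a normal-kernel weight = normal kernel with h / sqrt(2)
assert math.isclose(normal(t, h) ** 2, normal(t, h / math.sqrt(2)))
# squaring a Laplace-kernel weight = Laplace kernel with h / 2
assert math.isclose(laplace(t, h) ** 2, laplace(t, h / 2))
```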


5. SAMPLING PROPERTIES OF THE ESTIMATE (1): A SPECIAL CASE

Exact sampling properties of the estimate clearly depend on the configuration of the x's in the data, and on the nature of the true function qx. A simple case which can be investigated further is that of equally spaced and equally sized groups, say with m cases each, with x equal to each of the integers 1, 2, . . . , k (thus n = mk). If vi is the observed proportion of cases at x = i for which E occurs (thus vi = si/ni in the notation of §2), the estimate evaluated at x = j is

q̂j = Σi wij vi / Σi wij

where wij = K((i − j)/h), the explicit mention of h being omitted for simplicity of notation. Let each vi have expectation qi, the true probability for the ith group.

In order to develop asymptotic properties of q̂j it will be assumed that m is large (so that vi has mean qi and small variance qi(1 − qi)/m) and that h is small (so that wij decreases rapidly as i moves away from j). Without loss of generality, the weights will be scaled so that wjj = 1; the neighbouring weight K(1/h) will be denoted by ε. For the normal and Laplace kernels (inter alia) it can then be shown that

(7)

Since the properties of qx are only required on the grid 1, 2, . . . , k, the role of the derivatives of qx is taken by the associated finite differences which, for definiteness, are taken as the central differences

q′j = ½(qj+1 − qj−1) and q″j = qj+1 − 2qj + qj−1.

Then it is easy to show that, to first order terms in ε,

and

(8)


this last expression being the variance of the elementary estimate vj of qj. The mean squared error for j = 2, 3, . . . , k − 1 is therefore

which is less than qj(1 – qj)/m provided that

Thus, for the improvement in variance to outweigh the increase in bias, the curvature of qx (as represented by the second derivative) must not be too large, involving a balance between q″j and the small quantities ε and 1/m. A slightly different expression applies, of course, at the ends of the range, j = 1 and k.

In the case of grouped data, it is usual to assess the goodness of fit of any particular set of estimated probabilities q̂x by calculating the χ² statistic, this being

χ² = Σx (sx − nx q̂x)² / (nx q̂x (1 − q̂x)).

When, in the usual parametric method, q̂x is formed by fitting r parameters, this quantity is distributed asymptotically as χ² on k − 1 − r degrees of freedom, and so has expectation also equal to k − 1 − r. It is of interest to compare this expectation with that obtaining for the non-parametric method (1).

Using the notation adopted here, the χ² statistic may be written

χ² = m Σj (vj − q̂j)² / (q̂j(1 − q̂j)).

Now if m is large, then it is approximately true that

(9)

where


Using the results of equations (7) and (8) and retaining only the lowest order terms in ε, it may be shown that

(10)

where

and

A fuller derivation is provided in Appendix II.

The first and second terms in (10) are respectively the bias and variance components of the expectation of χ². If k is sufficiently large so that the end corrections can be ignored, the bias depends on the second differences of qx and will vanish as the true curve qx approaches local linearity. In that case the variance term will be the dominant one, with value at most ε²(6k − 8). If this were to equal the expected value for a parametric model involving a small number r of fitted parameters, then the following would approximately hold for large k:

ε²(6k − 8) ≈ k − 1 − r, i.e. ε ≈ (1/6)^½ ≈ ·4.

This suggests that for the non-parametric method to give any reasonable fit in terms of χ², ε should be of the order of magnitude of ·4, implying, for a Laplace kernel for example, the sequence of weights . . . , ·06, ·16, ·4, 1, ·4, ·16, ·06, . . .
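The quoted weight sequence follows from taking the central weight as 1 and multiplying by ε = ·4 at each grid step; a quick check (with h recovered from ε = e^(−1/h) for the Laplace kernel):

```python
import math

eps = 0.4
h = -1 / math.log(eps)    # since eps = exp(-1/h) for the Laplace kernel
# weight at |i| steps from the centre is eps ** |i|
weights = [round(eps ** abs(i), 2) for i in range(-3, 4)]
print(weights)   # -> [0.06, 0.16, 0.4, 1.0, 0.4, 0.16, 0.06]
```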

It should be emphasized that these are asymptotic considerations in the equal-groups case only, and the previous paragraph cannot be taken as a definitive recommendation for a particular choice of weights (as discussed earlier in §2). Further, non-parametric methods leading to estimates such as (1) are likely to be most useful in cases with sparse groups or when there are only a small number of ties amongst the x's, both being situations in which the value of χ² is not particularly informative.

6. A MODIFIED NON-PARAMETRIC ESTIMATE

It follows from (5) that the bias in the non-parametric estimate is negligible provided h is sufficiently small. If, however, a substantial degree of smoothing is


required because of a small sample size or for some other reason, this bias may be quite noticeable, particularly at the ends of the range of the data. In such cases it is possible to reduce the bias by replacing estimate (1) by

q̂*x = q0(x) + Σi wi (Zi − q0(xi))        (11)

where the wi are the weights defined in §4 and q0(x) is some given function of x to be discussed shortly. Essentially, (11) amounts to first subtracting q0 from the data, then smoothing what is left using equation (1), and then adding q0(x) back to the result to form the final estimate.

If h is very small, estimates (1) and (11) are virtually identical irrespective of the choice of the initial function q0. In this sense, q0 can be chosen arbitrarily: estimate (1) is in fact the special case of (11) with q0(x) = 0 for all x. By analogy with (5), the bias of (11) is given by

(q − q0)′x Σi wi (xi − x) + ½ (q − q0)″x Σi wi (xi − x)².

This will be smaller than (5) if q0 is sufficiently close to the true curve qx. Thus it is suggested that q0 be taken as an a priori estimate of qx, perhaps using the results of a previous study if such are available. Alternatively, q0 can be taken as a simple parametric function fitted to the available data, provided the number of parameters used in the fitting procedure is sufficiently small that the sampling variability in q0 can be ignored.

The variance of q̂*x is the same as that of q̂x (as given by expression (6)), provided the same value of h is used. But because of the reduction in bias of the modified method a larger value of h can be taken, and therefore a correspondingly smaller sampling variance achieved. If q0 happens to be exactly correct then h can be increased indefinitely, since q̂*x will then converge to q0(x) = qx.

The improved statistical performance of the estimate is only achieved at a cost, namely that of having to specify the initial function q0. If the sample size is large and the curve is required for descriptive or exploratory use only, then the modification is unnecessary. However, as the example in the next section shows, there are cases where judicious choice of q0 can greatly increase the scope of the method in its ability to accommodate important features of a practical situation.
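A sketch of the modified estimate of this section (the kernel choice, data and initial curve below are illustrative assumptions): subtract an initial curve from the indicator data, smooth the residuals with the weights of (1), then add the curve back.

```python
import math

def q_mod(x, xs, zs, h, q0, K=lambda t: math.exp(-abs(t))):
    """Modified kernel graduation (11): smooth z_i - q0(x_i) with the
    kernel weights, then add q0(x) back. Reduces to (1) when q0 is zero."""
    w = [K((xi - x) / h) for xi in xs]
    resid = sum(wi * (zi - q0(xi)) for wi, zi, xi in zip(w, zs, xs))
    return q0(x) + resid / sum(w)

xs = [40, 41, 42, 43, 44]
zs = [0, 0, 1, 0, 1]
q0 = lambda x: 0.0          # with q0 = 0 this is just estimate (1)
plain = q_mod(42.0, xs, zs, 2.0, q0)
assert 0.0 <= plain <= 1.0
```

Note that any constant initial curve leaves the estimate unchanged, since the constant cancels; it is the shape of q0 that matters.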

7. EXAMPLE: MATERIAL AND RESULTS

This section describes the application of the graduation methods discussed in §§2–6 to a set of mortality rates. The data have been taken from the example introduced in Chapter 12 of Benjamin and Pollard(5) and used for illustrative purposes in their very full description of graduation techniques. Use of such a standard example has facilitated the comparison of results.


Table 1 presents the raw data for this actuarial example. The second and third columns show the number of cases (nx) and number of deaths (sx) respectively at the ages (x) given in the first column. The fourth column gives the crude death rates per 10,000 population. Scrutiny of these crude rates indicates an approximately exponential increase with age in the crude rates beyond age 50. At younger ages the pattern is affected by the small numbers of deaths; however, the crude rates at ages 44 and 45 are much higher than at neighbouring ages. This is an unexpected feature which will be discussed further in subsequent paragraphs.

Table 1. A Mortality Experience: crude death rates per 10,000

Age   Population    No. of     Crude death rate
 x    exposed nx    deaths sx  per 10,000
 35      1,051          1           10
 36        940          6           64
 37      1,048          2           19
 38        716          3           42
 39        719          2           28
 40      1,051          4           38
 41      1,042          4           38
 42      1,804         12           67
 43      1,468          7           48
 44      1,576         16          102
 45      1,647         18          109
 46      1,861         16           86
 47      1,669         13           78
 48      1,624         14           86
 49      1,157         11           95
 50      2,193         19           87
 51      1,803         20          111
 52      2,402         31          129
 53      2,120         27          127
 54      2,406         38          158
 55      1,975         37          187
 56      2,564         38          148
 57      1,798         36          200
 58      2,536         51          201
 59      2,511         71          283
 60      1,858         32          172
 61      1,835         54          294
 62      1,393         47          337
 63      1,462         40          274
 64      1,245         34          275
 65      1,064         46          432
 66      1,502         74          493
 67        875         36          411
 68        927         38          410
 69        497         29          584
 70        983         60          614


In Table 2 the crude rates are repeated and successive columns present the results of various graduations of these data. The final rows give the values of χ² and the degrees of freedom (where appropriate) for these various graduation methods.

Inspection of graphs of estimate (1) when calculated for these data confirms that there is little to choose between the normal and Laplace kernels. There is a slight suggestion that the normal gives a smoother curve than the Laplace for a

Table 2. Graduation of Mortality Data: Rates per 10,000

[Table body: for each age x = 35 to 70, the crude death rate and the graduated death rates per 10,000 under each of the methods A to M described in the text.]

χ²:  A 27·3  B 35·1  C 37·7  D 39·8  E 33·2  F 40·5  G 33·5  H 34·6  I 31·3  J 33·6  K 29·9  L 33·1  M 30·2

Degrees of freedom:  B 30?  C 33  D 32  E 32?  F 34?  G 32  H 34


given value of χ², and so there is perhaps a marginal case in favour of the normal kernel. This is due to the different tail behaviour of the two distributions—for a given value of h the Laplace gives more weight to distant points than does the normal.

As the group sizes in this example are of the same order of magnitude (most of the nx's are between 1,000 and 2,000), it might be expected that the formulae in §5 would give at least a rough indication of the relevant sampling properties. When h = ·5 (a small amount of smoothing), ε = ·135 for both kernels, whence

ε²(6k − 8) = 3·79. The bias terms in (10) are very small for these data and so this last figure should correspond to E(χ²): the actual value of χ² is 2·91 for the normal and 3·30 for the Laplace. However, as h, and hence ε, increases, the higher order terms in (10) become progressively more important and the actual value of χ² can be substantially smaller than predicted. When ε = ·40, h = ·74 for the normal kernel, but inspection of the resulting curve shows that it is considerably under-smoothed (χ² = 12·4). A value twice this amount, h = 1·50, appears sensible, the resultant graduated rates being given in the third column (marked A) of Table 2.
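The quoted values of ε can be reproduced from the kernels' functional forms (writing ε for the weight one age-step away when the central weight is scaled to 1, as in §5):

```python
import math

def eps_normal(h):
    # normal kernel exp(-t^2/2): weight one step away is exp(-1/(2 h^2))
    return math.exp(-1 / (2 * h * h))

def eps_laplace(h):
    # Laplace kernel exp(-|t|): weight one step away is exp(-1/h)
    return math.exp(-1 / h)

# at h = 0.5 both kernels give eps = e^{-2}, approximately 0.135
assert abs(eps_normal(0.5) - 0.135) < 0.001
assert abs(eps_laplace(0.5) - 0.135) < 0.001
# eps = 0.40 corresponds to h close to 0.74 for the normal kernel
assert abs(eps_normal(0.74) - 0.40) < 0.01
```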

The next seven columns of Table 2 (marked B to H) correspond to various other graduation techniques which are given here for the purposes of compari- son. These are (following the letter headings of the columns in the table):

B Graphical graduation C Parametric graduation—Gompertz D Parametric graduation—Double Gompertz E Parametric graduation—Lidstone’s Method F Parametric graduation—Reference to Standard Table G Parametric graduation—Natural Cubic Splines H Parametric graduation—Logistic Regression

These methods are described in detail by Benjamin and Pollard(5). Some details including parameter values and the methods of fitting employed are included in Appendix III, together with the exact page references in (5). The reader will note that the results for methods B, C, D, E and G have been taken directly from Benjamin and Pollard(5) while those for F and H have been derived independently by the present authors.

Method B leaves scope for individual judgment and thereby admits the possibility of individual bias, so that equally competent exponents may produce different graduations for the same set of crude data. Graduation method E also suffers from a serious disadvantage in that the Lidstone transformation often leads to transformed values which are small and widely varying (as in Appendix III) and hence in need of adjustment before the graduation can be applied.


From the viewpoint of rigorous statistical theory it is unsatisfactory to use as an independent variable in a regression equation a probability which is constrained to lie between 0 and 1, as the estimated value from such a regression equation could fall outside these bounds (and in particular be negative). This criticism applies to graduation methods C, D and F. It would be more satisfactory to use the logit function, i.e. loge(q/(1 − q)), as an independent variable, as in method H. This function asymptotes to −∞ as q approaches zero and to +∞ as q approaches unity. Use of the logit transformation in H ensures that the graduated values of q will all lie between 0 and 1.

In Table 2 three values of the degrees of freedom are annotated by a question mark: for methods B, E and F. For the graphical graduation the number of degrees of freedom lost by the process of drawing a smooth curve near to the observed age-specific rates is problematic. Benjamin and Pollard(5) suggest a reduction of 2 or 3 degrees of freedom for each section of the curve consisting of about 10–15 ages. These lost degrees of freedom can be attributed to the fitted height of the curve, the slope and (possibly) the curvature. For methods E and F, which both involve reference to a standard life table (English Life Tables No. 12—Males(9)), the number of degrees of freedom lost is problematic because the standard mortality rates were themselves derived from a parametric graduation (here involving seven parameters) and it is unclear how much allowance to make for this: for example Benjamin and Pollard(5) suggest no allowance.

Table 2 shows that these methods give remarkably similar results for the upper half of the age range, although some modest differences emerge at the lower ages with, for example, evidence that the graphical graduation (B) slightly overestimates the rates for ages under 41.

Given the above comments it is interesting that the linear logistic (H) gives a value near the centre of the range of estimates at all ages. A selection of these, along with the crude rates and the normal kernel estimates (A), are displayed in Figure 2.* Clearly, the biggest difference between A and the standard methods is the behaviour of the curve near x = 44. If a curve such as H were in fact correct, then the probability of obtaining two consecutive positive residuals of the size observed at x = 44 and x = 45 is extremely small; in this sense the 'kink' at these values is statistically significant. Thus it is not surprising that a non-parametric method reacts to a statistically significant feature of the data (since it depends only on the data and not on any preconceived mathematical curve) whereas the parametric methods do not (as these are guaranteed in advance to smooth out any kinks of this kind). Of course the kink can be made less marked by increasing h; h = 2·0 gives the figures in column I (also shown in Fig. 3). Evidently, increasing h by ·5 makes the kink only slightly flatter, showing that the kernel estimate (1)

* For ease of interpreting Figures 2–4, the results of the various graduation methods are shown as continuous curves. However, the values for plotting these curves are taken from Table 2; each is rounded to the nearest whole number per 10,000 and defined at integral values of age only.

Small oscillations in these computer drawn curves are due to rounding errors and to the particular interpolation procedure used in the graphics computer software rather than to the graduation methods themselves.

Figure 2 Log (death rate per 10,000) against age: four graduation methods.

Figure 3. Log (death rate per 10,000) against age: new method with h= 1·5 and 2·0.

Non-Parametric Graduation Using Kernel Methods 151

changes only slowly with changes in the smoothing parameter. It is noteworthy that the concentration of Benjamin and Pollard(5) on parametric methods leads to an absence of comment on the ‘kink’ at ages 44 – 45.
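The insensitivity of the kernel estimate to small changes in h is easy to reproduce numerically. The following sketch is not the authors' program: it assumes only that estimate (1) is a normal-kernel weighted ratio of observed deaths to exposures at neighbouring ages, with smoothing parameter h.

```python
import numpy as np

def kernel_graduate(ages, deaths, exposures, h=1.5):
    # Kernel graduation in the spirit of estimate (1): each graduated rate
    # is a Gaussian-kernel weighted ratio of deaths to exposures across
    # neighbouring ages; normalizing constants of the kernel cancel.
    ages = np.asarray(ages, dtype=float)
    u = (ages[:, None] - ages[None, :]) / h
    w = np.exp(-0.5 * u ** 2)  # K((x - x_i)/h) for a normal kernel
    return (w @ np.asarray(deaths, float)) / (w @ np.asarray(exposures, float))
```

A crude rate with a local ‘kink’ is pulled towards its neighbours, and moving h from 1·5 to 2·0 flattens the kink only gradually, as in columns A and I of Table 2.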

A choice has to be made about the behaviour of the rates near x = 44, either

(a) the kink should be regarded with suspicion and hence graduated out (despite being statistically significant), or

(b) the kink is a real feature of the data and should be reflected in any graduation.

For case (a), a much larger value of h is needed, and so for the reasons explained in §6 the modified non-parametric estimate (11) should be used instead of (1). Taking to be the linear logistic fit (H) and a value of h = 4·0 gives the graduated rates shown in column J in Table 2 (also shown in Fig. 4). This curve is very smooth and appears to fit well. Incidentally, if the same is used but with a modest h such as h = 2·0 (column K), the values are quite close to those given by equation (1) with the same h (column I), demonstrating that for smaller values of h the role of is unimportant. The necessity for using a parametric fit in (11) can be avoided by using a priori values; taking to be the standard rates (as used in methods E and F(9)) gives the rates in column L (also shown in Fig. 4), where h = 4·0 has again been used. These values are close to those for the method using the linear logistic (J).

Figure 4. Log (death rate per 10,000) against age: modified method with different
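The exact form of (11) is not reproduced above, so the following sketch rests on an explicit assumption: that (11) kernel-smooths the ratio of actual deaths to the deaths expected under a prior curve (a parametric fit or standard-table rates) and scales the prior by that smoothed ratio. Under that reading, as h grows the estimate collapses towards the shape of the prior curve rather than towards a flat average, which is why large values of h become usable.

```python
import numpy as np

def modified_graduate(ages, deaths, exposures, prior, h=4.0):
    # Hedged sketch of a modified kernel estimate in the spirit of (11):
    # kernel-smooth actual deaths against the deaths expected under a prior
    # curve q0, then scale q0 by the smoothed actual/expected ratio.
    ages = np.asarray(ages, dtype=float)
    u = (ages[:, None] - ages[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    expected = np.asarray(exposures, float) * np.asarray(prior, float)
    return np.asarray(prior, float) * (w @ np.asarray(deaths, float)) / (w @ expected)
```

With a small h the prior hardly matters (the ratio is dominated by local data), matching the closeness of columns K and I noted above.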

Turning to case (b), the original method (1) with h = 1·5 or 2·0 can be used. If greater smoothing is required at ages other than those near x = 44, then again (11) should be used but with values for which allow for the presence of the kink. One possibility is to take the linear logistic at all values of x except for x between 42 and 47 (inclusive) where a more detailed curve such as A is assumed. This, together with h = 4·0, results in the rates in column M (also shown in Fig. 4). These values are extremely smooth and yet accommodate the kink in the data; to obtain such a curve using a parametric method would require an unrealistically large number of parameters.

8. CONCLUSIONS

A method of non-parametric graduation using kernel methods has been described for application in the graduation of vital rates. The statistical properties of the estimates have been outlined. The method is applied to a standard data set and the results are compared with those available from graphical and parametric graduations of the same data. The method compares favourably in terms of flexibility, simplicity, adherence to the data and smoothness. A generalization, which enables prior information to be incorporated in the graduation process, leads to a particularly good fit when unusual waves are present in the data which tend to be smoothed out by parametric graduations unless an unwieldy number of parameters is introduced.

9. ACKNOWLEDGEMENT

We are grateful to Mrs V. Payne for programming the calculations of the kernel graduation method and the computer-drawn diagrams in this paper.

REFERENCES

(1) COPAS, J. B. (1982). Plotting p against x. To appear in Applied Statistics.

(2) COPAS, J. B. (1982). Regression, prediction and shrinkage. To appear in Journal of the Royal Statistical Society, Series B.

(3) BEAN, S. J. & TSOKOS, C. P. (1980). Developments in non-parametric density estimation. Int. Statist. Rev. 48, 267.

(4) ROSENBLATT, M. (1956). Remarks on some non-parametric estimates of a density function. Ann. Math. Stat. 27, 832.

(5) BENJAMIN, B. & POLLARD, J. H. (1980). The Analysis of Mortality and Other Actuarial Statistics. Heinemann, London.

(6) SPENCER, J. (1904). On the graduation of the rates of sickness and mortality presented by the experience of the Manchester Unity of Oddfellows during the period 1893–1897. J.I.A. 38, 334.

(7) PARZEN, E. (1962). On the estimation of a probability density function and the mode. Ann. Math. Stat. 33, 1065.

(8) COLLOMB, G. (1981). Estimation non-paramétrique de la régression: revue bibliographique. Int. Statist. Rev. 49, 75.

(9) Registrar General (1968). Decennial Supplement, English Life Tables No. 12. H.M.S.O., London.

(10) LIDSTONE, G. J. (1892). On an application of the graphic method to obtain a graduated mortality table. J.I.A. 30, 212.

(11) McCUTCHEON, J. J. & EILBECK, J. C. (1977). Experiments in the graduation of the English Life Tables (No. 13) data. T.F.A. 35, 281.

APPENDIX I

From equation (4)

It can therefore be shown that

Equation (6) may therefore be re-written as follows:

(A1)

(A2)

By arguments similar to those in §4, the coefficient of will usually be small, the term in can only make the variance smaller, and the term involving will be small provided that either qx is locally linear or else h is sufficiently small.

APPENDIX II

The derivation of (10) begins with the approximate formula for the application of χ² given in (9):

(A3)

where

Elementary differentiation gives

(A4)

and

(A5)

where

and

Using the specific wij given by (7) and the finite differences defined by (8), and retaining only the lowest order terms in , the following approximations for j = 2, 3, ..., k – 1 are obtained:

(A6)

Substituting the values in (A6) into (A4) and (A5) and then into (A3) gives, for j = 2, 3, ..., k – 1, the following

Similar calculations for j = 1 and j = k give

Summing over values of j from 1 to k then gives

where are as defined in §5 of the text.


APPENDIX III

Below are further details for the seven standard graduation methods used on the illustrative data of Table 1.

B Graphical graduation (reference: p. 258–60(5)) Following the construction of a smooth curve near to the crude rates, the first and higher differences of the smoothed rates are examined. The first differences are adjusted until their progression is roughly geometric, thereby producing the graduated rates.

C Gompertz/Makeham (reference: p. 316 – 17(5)) The equation here is

The method of fitting is by weighted least squares. Since

so that weights of can be used. Following a graphical fitting of Gompertz’s

and y(0) = ·09 were obtained. Using an iterative procedure, the fitted values were

D Double Gompertz (reference: p. 318–19(5)) Using a similar approach to C with initial values of m(0) = 1·10–5, = ·06, n(0) = 1·1·10–5 and ß(0) = ·10, fitted parameter values of m = 2·268·10–5 were obtained.

E Lidstone’s Method (reference: p. 328–31(5) and (10)) The equation is

where refer to the rates from a standard table, here taken as English Life Tables No. 12 (Males)(9). The values of the variables on the left-hand side are small and vary considerably—some of this variation can be removed by taking a running average. A seven-term average has been used before using the above equation for fitting. A weighted least squares method was used, with the same weights as for C and D since the variance of is approximately equal to the variance of Originally the Lidstone transformation was proposed for use in

conjunction with a graphical graduation rather than the above parametric approach.

F Reference to Standard Table (reference: p. 332–3(5)) The equation is

where is as for E. An iterative weighted least squares procedure was used. As a first step, weights of were used as above to give graduated rates, which were then used to form the weights in a second regression to give the final values shown in Table 2, with final parameter values of = 1·0392 and ß = 1·931·10–³. Further iteration is unlikely to produce substantial changes in the graduated rates. If such an iterative approach were carried out and it converged, the resulting graduated rates would be the minimum chi-square solution to the original estimation problem.

G Natural Cubic Splines (reference: p. 345–54(5) and (11)) Using knots equally spaced at ages 35, 47, 59 and 71, the equation to be fitted is

in the notation of (5) and (11), where fitting is carried out by a weighted least squares procedure with weights of nx/qsx for convenience. The parameter values are a0 = –1712, a1 = 55·07, b1 = 1·333·10–² and b2 = 1·852·10–¹.
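The iterative re-weighting scheme described for method F can be sketched as follows. Since the fitted equation is not reproduced above, the sketch assumes, for illustration only, a linear relation qx ≈ α·qsx + ß between the observed and standard-table rates; the point illustrated is the re-weighting cycle, not the functional form.

```python
import numpy as np

def fit_to_standard(q_crude, q_std, exposures, iters=3):
    # Iterative weighted least squares in the spirit of method F, assuming
    # (hypothetically) the linear relation q_x ~ alpha * qs_x + beta.
    # Weights start at n_x / qs_x and are re-formed from each round's
    # graduated rates, approaching the minimum chi-square solution if
    # iterated to convergence.
    q_crude = np.asarray(q_crude, float)
    q_std = np.asarray(q_std, float)
    n = np.asarray(exposures, float)
    w = n / q_std                                   # initial weights
    X = np.column_stack([q_std, np.ones_like(q_std)])
    for _ in range(iters):
        sw = np.sqrt(w)
        (alpha, beta), *_ = np.linalg.lstsq(X * sw[:, None], q_crude * sw,
                                            rcond=None)
        w = n / (alpha * q_std + beta)              # re-weight with graduated rates
    return alpha, beta
```

As the text notes, further iteration changes the graduated rates very little once the weights are formed from a reasonable first fit.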

H Logistic Regression (reference: p. 312–13(5)) The equation is

where the parameters are estimated using the method of maximum likelihood. Using initial trial values of a = –3·0 and b = ·08, the log likelihood function is found to be maximized at a = –2·8033 and b = 8·560·10–². Incorporating an extra term of the form c(x – 70)² only marginally increased the maximum and the resulting value of c is negligible.