Using a mixture model to predict the occurrence of diabetic retinopathy



STATISTICS IN MEDICINE, VOL. 14, 2599-2608 (1995)

USING A MIXTURE MODEL TO PREDICT THE OCCURRENCE OF DIABETIC RETINOPATHY

PHILIP YOUNG
Department of Applied Statistics, University of Reading, P.O. Box 238, Reading, RG6 2AL, U.K.

BYRON MORGAN
Institute of Mathematics and Statistics, University of Kent, Canterbury, Kent, CT2 7NF, U.K.

AND

PETER SONKSEN, SEBASTIAN TILL AND CHARLES WILLIAMS
Department of Endocrinology and Chemical Pathology, University of London, St. Thomas’s Hospital, Lambeth Palace Road, London, SE1 7EH, U.K.

SUMMARY Diabetes mellitus is a common condition which has several serious complications associated with it. In this paper a mixture model, based on one previously used to predict the onset of AIDS, is used to predict the onset of one of these complications, diabetic retinopathy, the major cause of adult blindness in the U.K. This model differs from the previous AIDS model by introducing covariates into the model and using a wider choice of mixture distributions. The fit and distributional assumptions of the model are then discussed for this example. The model is fitted to the data by maximum likelihood. It is important that the training set contains balanced numbers of individuals with and without retinopathy.

1. INTRODUCTION

Diabetes mellitus (DM) is a severe but common medical problem, affecting between 1 and 2 per cent of the U.K. population. DM is caused by a deficit in the secretion or action of the hormone insulin, which regulates the blood glucose level, and, as a consequence of this deficiency, a higher than normal concentration of glucose is maintained in the body. Since the discovery and subsequent isolation of insulin in the 1920s, DM is no longer life-threatening for the majority of sufferers. Unfortunately, since insulin is a protein molecule, it is easily digested, and therefore cannot be taken orally but must be injected into the patient’s bloodstream. An alternative treatment is to take tablets that increase the secretion of insulin. However, this preferred treatment is only available for patients who still have some ability to secrete insulin. Diabetics are often divided into two groups. The so-called type I’s, who are typically young when diagnosed, are insulin dependent and tend to be underweight. The other group, the so-called type II’s, are the opposite; they are typically obese, develop diabetes in middle age and have only an impaired ability to secrete insulin. The type II patients form a very heterogeneous group, and as a consequence we shall restrict attention to the much more homogeneous type I group.

Unfortunately, although treatment with either tablets or insulin should stop the blood glucose level rising to dangerous levels, it cannot control the blood glucose level as well as the body’s own homeostatic mechanism. Chronic elevation of blood glucose levels in diabetics over a period of years, or decades, can lead to microvascular complications.

CCC 0277-6715/95/232599-10 © 1995 by John Wiley & Sons, Ltd.

Received June 1994 Revised February 1995

Some microvascular complications are life-threatening, for example nephropathy (renal damage) and cardiac problems. Yet many microvascular complications are debilitating rather than immediately life-threatening, for example, neuropathy, where the nerves in the extremities become damaged. In this case, the diabetic loses sensation and infection can easily take hold, which in severe cases can lead to amputation.

This paper concentrates on one such debilitating complication - retinopathy, for which suitable data are available. Retinopathy results from damage to the retinal lining of the eye and is the commonest cause of blindness in people of working age in the western world.

Owing to these debilitating complications, looking after diabetics is a major health service cost in the U.K.¹ A means of predicting which patients are likely to develop retinopathy, and roughly when, is therefore of great interest to clinicians.

The model of this paper will be developed cross-sectionally on a large diabetic database recorded during the routine consultation of staff in the endocrine clinic at St. Thomas’ Hospital.² In this paper we concentrate on the assessment made when the patient first attended the clinic. This first attendance may be either several years after diagnosis (for example, the patient may have been previously diagnosed and treated by their GP or another hospital) or the time when a positive diagnosis of diabetes is made. Type I patients were identified as being less than 40 years of age at diagnosis, and on insulin treatment. Of course, many of the variables to be used are time-dependent, and because the model used in this paper will not account for any time-dependence it may well favour those variables which are relatively time-independent.

Arising from the work of Young,⁴ five key variables were selected as possible covariates for use in the model, out of a list of approximately 170 recorded on the database. These five variables were selected by first excluding those variables with a high proportion of missing data. The clinical staff were then consulted to determine which of the remaining variables were likely to be associated with retinopathy. Next, stepwise linear discriminant analysis was used to assess which variables were actually associated with retinopathy. Finally, a pilot analysis was conducted using the method in this paper to determine the exact list of variables to use. It was necessary to use only a few variables because of the large amount of missing data and the prohibitive amount of computation that would have been required otherwise.

Excluding all cases with missing data in these five variables left 554 patients, of whom 123 had retinopathy. The population of patients attending the St. Thomas’ diabetic clinic is drawn mainly from the Lambeth and surrounding districts of London, with a high representation of various racial groups. Of the 554 patients in our cohort, 66 were Asian, 39 were Negroid and 38 were of undetermined, but non-Caucasian racial origin. However, neither racial origin nor gender appears to be important in the development of retinopathy in this cohort of type I diabetics, or indeed elsewhere.³ The median age of the patients at diagnosis was 24, with an inter-quartile range of 13 to 34. The median duration of diabetes was 7 years, with an inter-quartile range of 0 (newly diagnosed) to 15 years. Yet those patients who had retinopathy had on average 12.5 years longer durations of diabetes than those without retinopathy. However, while it is widely known that the duration of diabetes is a pivotal factor in the development of retinopathy, it is also known that occurrence of retinopathy is influenced by other factors.¹

The five variables were: the body mass index (BMI), which is a measure of obesity; the haemoglobin A1c (HbA1), which is a measure of the average blood glucose level over an approximate three month period prior to the measure being taken; the patient’s current dosage of insulin (Insq); the patient’s systolic blood pressure when lying down (bpsf); and the result of using a biothesiometer (an electronic vibration threshold measuring device) on the toe (biotoe). The last covariate gives a measure of the amount of sensation the patients have in their toes, thus acting directly as a measure of the degree of neuropathy.

2. THE MODEL

The following model was developed by Struthers and Farewell (1989)⁵ to predict the onset of AIDS. Predictions of onset in AIDS and in retinopathy are essentially similar problems; in both cases we have an 'infected' population, either HIV infected or diabetics, some of whom may go on to develop problems after the start of the disease, either AIDS or retinopathy. In both cases the length of time of 'infection' is critical, but unknown. For type I diabetics, the view is generally held that the period between onset of diabetes and diagnosis of diabetes is very short. However, views do differ on the exact definition of the onset of diabetes, and it could be argued that the metabolic changes leading to diabetes happen slowly over a fairly long occult period. In the work that follows, we give this period a distribution and estimate the parameter of this distribution.

A survival analysis approach is adopted. Here T is defined as the time between the (unknown) onset of diabetes and the development of retinopathy. For patients who will not develop retinopathy, T is effectively infinite. It is assumed that for those patients who do develop retinopathy that T has a Weibull distribution,

$$\Pr(T > t) = \exp[-(\rho t)^{\delta}]$$

where $\rho$ is known as the spread parameter. Unlike the AIDS model of Struthers and Farewell,⁵ $\rho$ will be taken to be a function of a vector of covariates $X$. Specifically, $\rho = \exp(-\alpha' X)$, where $\alpha$ is a vector of coefficients. The other parameter, $\delta$, is known as the shape parameter and is restricted to being positive.
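As a minimal sketch of the Weibull component described above, the following Python function evaluates the survival probability with a covariate-dependent spread parameter. The function name, the coefficient values and the covariate values are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math

def weibull_survival(t, alpha, x, delta):
    """Pr(T > t) = exp[-(rho * t) ** delta], with rho = exp(-alpha'x)."""
    rho = math.exp(-sum(a * xi for a, xi in zip(alpha, x)))
    return math.exp(-((rho * t) ** delta))

# Hypothetical coefficients: a constant term plus one covariate.
alpha = [2.0, -0.05]
# x[0] = 1.0 is the constant term; x[1] is the covariate value.
print(weibull_survival(5.0, alpha, [1.0, 10.0], delta=2.0))
print(weibull_survival(5.0, alpha, [1.0, 30.0], delta=2.0))
```

With a negative coefficient, a larger covariate value gives a larger spread parameter and hence a smaller probability of remaining free of retinopathy at any given time.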

However, not all diabetics develop retinopathy, which is accounted for by introducing a probability $\lambda$ that the patient develops retinopathy, which is known as a long term survivor parameter.⁶ We should also let $\lambda$ be a function of the covariate vector $X$, and as $\lambda$ is a probability, we set $\lambda = [1 + \exp(-\beta' X)]^{-1}$, where $\beta$ is another vector of coefficients.

The basic problem is that we are unsure of the exact time of onset of diabetes, and values of $t$ adopted by $T$ are therefore unknown. To overcome this problem we assume that $t = \tau + U$, where $\tau$ is the known time between diagnosis and time of examination, and $U$ is a positive random variable, with density over the hidden time period between the start of diabetes and the time of diagnosis. Then if the density function of $U$ is $g(u)$, where $u \in [0, a]$ for some $a$, then

$$\Pr(\tilde{T} > \tau) = (1 - \lambda) + \lambda \int_0^a g(u) \exp\{-[\rho(\tau + u)]^{\delta}\}\, du \qquad (1)$$

where $\tilde{T}$ is a random variable representing the observable time between diagnosis and development of retinopathy. Struthers and Farewell used a uniform mixture model, that is, $g(u) = a^{-1}$, claiming that the choice of mixture distribution was not important. Using a uniform mixture has the advantage of simplifying (1), but has the disadvantage, especially for this example, of being unrealistic. Hence we shall consider the more general power distribution with the probability density function $g(u)$ given by

$$g(u) = p u^{p-1} a^{-p} \qquad (2)$$

where $p$ is positive and is known as the power parameter. Figure 1 illustrates the range of shapes of density function that can be obtained from a power distribution. Note that the uniform distribution is then obtained as a special case when $p = 1$. Others⁷,⁸ have considered the use of finite mixture models; however, the use of such models is particularly suited to the case of competing risks and will not be considered further here.

Figure 1. Probability density function of the power distribution of equation (2) for various values of p, with a = 1 (horizontal axis: u)

Since there is little known about the age of onset of diabetes, the value of $a$ was chosen as the age of diagnosis. This allows diabetes to start at any time between birth and diagnosis, and so $a$ varies between patients. As mentioned above, it is expected that the onset of diabetes occurs close in time to diagnosis for type I diabetics.¹ If this is true we would expect most values of $U$ to be small, and hence to have a small value of $p$. We shall see later that this is precisely what we obtain for the maximum likelihood estimate of $p$.
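The probability in equation (1) can be sketched numerically as follows. This is an illustrative Python version, not the authors' FORTRAN code. It assumes the substitution v = (u/a)^p, which turns g(u) du into dv on [0, 1] and so avoids the singularity of the power density at u = 0 when p < 1, followed by a composite Simpson rule. The function names and the number of subintervals are hypothetical choices.

```python
import math

def logistic(z):
    """Long term survivor parameter: lambda = 1 / (1 + exp(-z)), z = beta'x."""
    return 1.0 / (1.0 + math.exp(-z))

def mixture_integral(tau, a, rho, delta, p, n=2000):
    """Integral over [0, a] of g(u) * exp{-[rho * (tau + u)] ** delta},
    where g(u) = p * u**(p - 1) * a**(-p).

    Uses v = (u / a) ** p so that the integrand is bounded on [0, 1],
    then a composite Simpson rule with n (even) subintervals.
    """
    def f(v):
        u = a * v ** (1.0 / p)
        return math.exp(-((rho * (tau + u)) ** delta))
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3.0

def prob_free(tau, a, lam, rho, delta, p):
    """Equation (1): probability of no retinopathy tau years after diagnosis."""
    return (1.0 - lam) + lam * mixture_integral(tau, a, rho, delta, p)

# Illustrative values only: diagnosed at age 24, examined 7 years later.
print(prob_free(tau=7.0, a=24.0, lam=logistic(0.5), rho=0.05, delta=2.0, p=0.3))
```

For p = 1 and delta = 1 the integral has the closed form exp(-rho tau) [1 - exp(-rho a)] / (rho a), which gives a simple check on the quadrature.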

3. ESTIMATION OF THE PARAMETERS

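The unknown parameters ($\alpha$, $\beta$, $\delta$ and $p$) are estimated using maximum likelihood estimation. Patients are denoted using the subscript $i$, $i = 1, 2, \ldots, n$; then without loss of generality, if we assume the first $m$ have retinopathy and the remainder do not at the time of examination, then the log-likelihood is defined as

$$l = \sum_{i=1}^{m} \log\left[\lambda_i\left(1 - \int_0^{a_i} \frac{p u^{p-1}}{a_i^{p}} \exp\{-[\rho_i(\tau_i + u)]^{\delta}\}\, du\right)\right] + \sum_{i=m+1}^{n} \log\left[(1 - \lambda_i) + \lambda_i \int_0^{a_i} \frac{p u^{p-1}}{a_i^{p}} \exp\{-[\rho_i(\tau_i + u)]^{\delta}\}\, du\right]$$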
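The log-likelihood can be sketched in Python as below. This is a hedged illustration rather than the authors' implementation: the record layout (tau, a, x, has_retinopathy), the function names, and the use of a single covariate vector x for both linear predictors (with zero coefficients for variables that do not enter a term) are assumptions made here. The integral uses the substitution v = (u/a)^p and a Simpson rule instead of a library quadrature routine.

```python
import math

def integral(tau, a, rho, delta, p, n=400):
    """Simpson rule for the mixture integral after substituting v = (u / a) ** p."""
    f = lambda v: math.exp(-((rho * (tau + a * v ** (1.0 / p))) ** delta))
    h = 1.0 / n
    s = f(0.0) + f(1.0) + sum((4 if k % 2 else 2) * f(k * h) for k in range(1, n))
    return s * h / 3.0

def log_likelihood(alpha, beta, log_delta, log_p, patients):
    """Sum of log contributions; delta and p are parameterized on the log scale."""
    delta, p = math.exp(log_delta), math.exp(log_p)
    total = 0.0
    for tau, a, x, has_retinopathy in patients:
        rho = math.exp(-sum(c * xi for c, xi in zip(alpha, x)))
        lam = 1.0 / (1.0 + math.exp(-sum(c * xi for c, xi in zip(beta, x))))
        surv = integral(tau, a, rho, delta, p)
        if has_retinopathy:
            total += math.log(lam * (1.0 - surv))          # developed retinopathy
        else:
            total += math.log((1.0 - lam) + lam * surv)    # still retinopathy-free
    return total

# Two made-up patients: (years since diagnosis, age at diagnosis, covariates, status).
patients = [(12.0, 20.0, [1.0, 9.5], True), (3.0, 30.0, [1.0, 7.0], False)]
print(log_likelihood([1.5, -0.05], [0.5, 0.0], 0.5, -1.0, patients))
```

A function of this form could be passed, with its sign reversed, to any general-purpose minimizer; the choice of optimizer is left open here.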

Optimization of $l$ was carried out using a FORTRAN 77 program which used the NAG⁹ subroutine E04JAF, which is a quasi-Newton routine, on a Sun 4 workstation. Explicit solutions of (1) were, except in some trivial special cases, unobtainable, and so numerical integration of (1) was carried out using the NAG function D01ATF, which is a general purpose quadrature routine. A forward stepwise procedure was used to introduce the covariates one at a time into either $\rho$ or $\lambda$ as required. At each step the variable which increased the likelihood the most was introduced, provided the improvement in the likelihood was significant at the standard 5 per cent level, using a maximum likelihood ratio test (MLRT). The use of Bonferroni type adjustments to the significance levels was also considered but this did not alter the outcome of the stepwise procedure. It was computationally expedient to parameterize the model in terms of the logarithms of $p$ and $\delta$ because both parameters are constrained to be positive.

The inverse of the observed information matrix was then calculated to provide approximate estimates of the standard deviations of the parameter estimates.

4. RESULTS

4.1. Balanced data set

We randomly selected a group of 123 patients without retinopathy from the 431 patients without retinopathy. For the balanced data set of 123 patients with and 123 patients without retinopathy, the stepwise procedure introduced biotoe and HbA1 into the long term survivor term, while Insq was introduced into the spread parameter. This meant the following seven coefficients were used in the model. For the spread parameter vector $\alpha$, there are coefficients for the constant term and the Insq variable. For the long term survivor parameter vector $\beta$, there are coefficients for the constant term, the biotoe variable and the HbA1 variable. Finally, there are two coefficients for the log of the shape parameter $\delta$ and the log of the power parameter $p$. The estimates of these coefficients, along with their approximate standard deviations, calculated by inversion of the information matrix, are given in Table I.
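The construction of the balanced data set can be sketched as a simple random undersampling of the patients without retinopathy. The function name, the seed and the placeholder records are hypothetical; only the group sizes (123 and 431) come from the text.

```python
import random

def balanced_sample(cases, controls, seed=0):
    """Return all cases plus an equally sized random subset of controls,
    together with the unused controls as a hold-out set."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(controls)), len(cases)))
    train_controls = [c for i, c in enumerate(controls) if i in chosen]
    held_out = [c for i, c in enumerate(controls) if i not in chosen]
    return cases + train_controls, held_out

# Placeholder records standing in for the patient data.
cases = [("case", i) for i in range(123)]
controls = [("control", i) for i in range(431)]
train, held_out = balanced_sample(cases, controls)
print(len(train), len(held_out))  # 246 308
```

The 308 unused patients without retinopathy form a natural test set for the fitted model.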

The stepwise procedure was conducted throughout using the power distribution as the mixture distribution. At the end of the stepwise procedure the use of a uniform mixture instead of a power mixture was investigated, but found to be unsatisfactory as it reduced the log-likelihood from -98.68 to -108.08, which was highly significant (P < 0.001). Figure 2 shows the profile log-likelihood plot of log p. This is obtained by fixing p at a particular value and maximizing the log-likelihood over the other parameters. Note that, as expected, the value of p is significantly less than unity, the value which corresponds to the uniform distribution.
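The comparison of the power and uniform mixtures can be checked with a one degree of freedom likelihood ratio test. The sketch below takes -98.68 as the log-likelihood of the power mixture and -108.08 as that of the uniform mixture (p fixed at 1), which is the ordering a nested restriction requires. The function name is hypothetical, and the chi-squared tail is computed from the identity Pr(X > x) = erfc(sqrt(x/2)) for one degree of freedom.

```python
import math

def lrt_p_value_1df(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and its P-value on one degree of freedom.

    Assumes loglik_full >= loglik_reduced, as holds for nested models.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, math.erfc(math.sqrt(stat / 2.0))

stat, p_value = lrt_p_value_1df(-98.68, -108.08)
print(round(stat, 2), p_value < 0.001)  # 18.8 True
```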

An approximate 95 per cent confidence interval for log p can either be deduced from inversion of the information matrix (Table I), or the profile likelihood (Figure 2) using the standard asymptotic theory of the maximum likelihood ratio test;¹⁰ in this case the difference between the two methods is minimal. Similarly, at the end of the stepwise procedure the use of the exponential distribution as an alternative to the Weibull distribution was investigated, but this was also found to significantly reduce the log-likelihood. The constant terms for the spread parameter $\alpha$, and the long term survivor parameter $\beta$, were routinely included in the stepwise procedure. However, it is worth noting that the $\alpha$ constant does not seem to be needed at the end of the stepwise procedure. It is useful however to retain this term for general use of the model, especially with regard to patients on low doses of insulin. All of the above ties in with the results of naive t-tests of the seven coefficients in Table I, which are all significant with the exception of the $\alpha$ constant.
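The information-matrix interval for log p can be reproduced from the Table I values (estimate -1.203, standard deviation 0.189). This is a minimal sketch assuming the usual normal approximation with multiplier 1.96; the function name is hypothetical.

```python
import math

def wald_interval(estimate, sd, z=1.96):
    """Approximate 95 per cent interval: estimate +/- z * sd."""
    return estimate - z * sd, estimate + z * sd

lo, hi = wald_interval(-1.203, 0.189)
print(round(lo, 3), round(hi, 3))  # -1.573 -0.833
# Back-transform to the scale of p itself.
print(math.exp(lo), math.exp(hi))
```

The upper limit for log p lies below zero, so the interval for p excludes 1, the value corresponding to the uniform mixture.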

The interpretation of the effect of the covariates is that a rise in either HbA1, biotoe or Insq is a bad prognosis. This is exactly what would be expected. A high HbA1 implies that for the last three months a patient has had very poor control over their blood glucose level. A raised biotoe value indicates poor sensation in a patient’s toe, which is indicative of neuropathy. A high insulin dose roughly reflects the severity of a patient’s diabetes, although this is a complex issue as the amount of insulin a patient receives depends on several possible factors. The way the variables are used in the model is also of interest. As the Insq variable is introduced into the spread parameter it implies that, given that a patient will develop retinopathy, it is the Insq variable that determines how long it takes for the retinopathy to develop. It is the HbA1 and biotoe variables that are used to determine the chance of a patient developing retinopathy. In addition, from Table II, the approximate correlation matrix, we can see that none of the coefficients relating to the three variables are particularly correlated, indicating that the three variables are effectively acting independently within the model.

Table I. Coefficient estimates with approximate standard deviations for the model fitted to the balanced data set

Name                  Estimate    Standard deviation
$\alpha$ constant      -0.267      0.334
$\alpha$ Insq          -0.053      0.015
$\beta$ constant        1.440      0.403
$\beta$ biotoe         -0.057      0.023
$\beta$ HbA1           -0.052      0.020
log $\delta$            1.786      0.457
log $p$                -1.203      0.189

Figure 2. Log-likelihood profile plot of the log of the power parameter, p, for the model fitted to the balanced data set. The solid line represents the profile log-likelihood of log p, while the dotted line represents the upper limit of 95 per cent confidence interval for p based on the asymptotic theory of the maximum likelihood ratio test (vertical axis: -log likelihood maximum over all other parameters; horizontal axis: log power parameter, p)

Table II. Approximate correlation matrix for model parameters fitted to the balanced data set, obtained by inversion of the observed information matrix

                      $\alpha$ constant   $\alpha$ biotoe   $\alpha$ HbA1   $\beta$ constant   $\beta$ Insq   log $\delta$   log $p$
$\alpha$ constant      1.000               0.660            -0.189          -0.049             -0.169          0.233         -0.033
$\alpha$ biotoe                            1.000             0.151          -0.063              0.004          0.351          0.314
$\alpha$ HbA1                                                1.000          -0.060              0.026          0.110          0.102
$\beta$ constant                                                             1.000              0.547         -0.084         -0.141
$\beta$ Insq                                                                                    1.000          0.096          0.045
log $\delta$                                                                                                   1.000         -0.089
log $p$                                                                                                                       1.000

Now we consider the fit of the model. In very crude terms the model correctly predicted 106 of the 123 patients with retinopathy (86 per cent correct) and 98 of the 123 without retinopathy (80 per cent correct), giving an overall success rate of 83 per cent. However, this is a crude and potentially optimistic assessment, having been measured on the data set used to construct the model. Yet, when a full cross-validated¹¹ assessment of the success rate was carried out, identical predictions were made, indicating that the above success rate has minimal bias.

When the balanced training set was formed, 308 patients without retinopathy were not used. We can now use these as a test data set for validation of the model. We find that the model correctly predicts 239 of these 308 patients without retinopathy, a success rate of 78 per cent, compared with the original 80 per cent crude estimate of success for patients without retinopathy. This again suggests that any optimistic bias in the crude success rate is very small. This level of bias is closely in agreement with previous work on fitting these models.⁴

We can also check the fit of the model by comparing the expected and observed hazard plots, as follows. The patients were split up into groups with similar durations of diabetes since diagnosis. The number who had developed retinopathy was divided by the total number in the group to obtain the observed ‘hazard’ for each group. The hazard here does not have the usual interpretation, as we are using a balanced data set. Similarly, the expected hazard is obtained by dividing the expected number in the group with retinopathy (obtained from equation (1)) by the number in the group. The expected and observed hazards are plotted in Figure 3. A visual inspection of Figure 3 reveals that the expected and observed hazard functions are similar.
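The grouping step described above can be sketched as follows. The function name, the five-year group width and the toy inputs are assumptions for illustration; in practice each patient's expected probability would be one minus the value of equation (1) under the fitted model.

```python
from collections import defaultdict

def hazard_by_group(durations, has_retinopathy, expected_prob, width=5.0):
    """Group patients by duration of diabetes since diagnosis.

    Returns {group start: (observed proportion, expected proportion, size)}.
    """
    groups = defaultdict(lambda: [0, 0.0, 0])
    for d, y, e in zip(durations, has_retinopathy, expected_prob):
        g = groups[int(d // width)]
        g[0] += int(y)   # observed number with retinopathy
        g[1] += e        # expected number with retinopathy
        g[2] += 1        # group size
    return {k * width: (obs / n, expd / n, n)
            for k, (obs, expd, n) in sorted(groups.items())}

# Toy data: four patients in two duration groups.
table = hazard_by_group([1, 2, 7, 8], [0, 1, 1, 1], [0.2, 0.4, 0.6, 0.8])
for start, (observed, expected, size) in table.items():
    print(start, observed, round(expected, 3), size)
```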

However, it is possible to carry out a chi-square test to assess the difference between the two curves.¹² Using the fitted model the test statistic is $X^2 = 18.2$. Unfortunately, due to the low observed frequencies of retinopathy for patients with long diagnosed durations of diabetes, the approximation of the distribution of the test statistic $X^2$ to a chi-squared distribution seemed questionable. Instead, the P-value of the test statistic was approximated using a Monte Carlo simulation of 100,000 realizations of the expected distribution of the number of patients developing retinopathy. The approximate P-value was calculated as 0.091, which implies a reasonably good fit of the model to the data. For comparison, the resulting P-value obtained using the chi-squared approximation was 0.976, and thus would be highly inaccurate.
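A Monte Carlo P-value of this kind can be sketched as below. The exact form of the test statistic used in the paper is not given here, so this sketch assumes a Pearson statistic over the two cells (retinopathy, no retinopathy) of each duration group, and simulates the number with retinopathy in each group as a binomial count. The function names, group sizes and probabilities are hypothetical.

```python
import random

def chi_square(observed, expected_prob, sizes):
    """Pearson statistic summed over groups; requires 0 < probability < 1."""
    total = 0.0
    for o, p, n in zip(observed, expected_prob, sizes):
        e = n * p
        total += (o - e) ** 2 / e + ((n - o) - (n - e)) ** 2 / (n - e)
    return total

def monte_carlo_p_value(observed, expected_prob, sizes, reps=10000, seed=1):
    """Proportion of simulated statistics at least as large as the observed one."""
    rng = random.Random(seed)
    x2_obs = chi_square(observed, expected_prob, sizes)
    exceed = 0
    for _ in range(reps):
        # One binomial draw per group, built from Bernoulli trials.
        sim = [sum(rng.random() < p for _ in range(n))
               for p, n in zip(expected_prob, sizes)]
        if chi_square(sim, expected_prob, sizes) >= x2_obs:
            exceed += 1
    return x2_obs, exceed / reps

# Made-up groups: observed counts, model probabilities and group sizes.
x2, p_value = monte_carlo_p_value([3, 9, 4], [0.2, 0.5, 0.6], [20, 15, 5], reps=2000)
print(round(x2, 2), p_value)
```

Simulation avoids relying on the chi-squared approximation when some groups have very small expected counts.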

4.2. Unbalanced data set

Using all the patients, the stepwise procedure introduced the same three variables that were introduced when using the balanced data set, namely, Insq, biotoe and HbA1. However, on this occasion all three variables were introduced into the spread parameter. The estimates of these coefficients, along with their approximate standard deviations, calculated by inversion of the information matrix, are given in Table III.

The interpretation of the variables is much the same as when using the balanced data set, that is, a rise in any of the three variables is a bad prognosis. The only difference lies in the way these variables are used, that is, they are all used in the spread parameter.

As when using the balanced data set, reduction of the model by either using a uniform mixture instead of the power mixture or using an exponential distribution instead of a Weibull distribution was found to significantly reduce the log-likelihood. Naive t-tests were all significant, except for $\alpha$ biotoe and $\beta$ constant.

If we now consider the fit of the model, although the overall naive success rate is comparable to that obtained using the balanced sample data (80 per cent), we have lost sensitivity. Although the model successfully predicted 410 of the 431 patients without retinopathy (95 per cent correct), it only predicted 35 out of the 123 patients with retinopathy as having retinopathy (28 per cent correct).
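The success rates quoted for the two fitted models follow directly from the reported counts. The short sketch below recomputes them, using the standard terms sensitivity (patients with retinopathy correctly predicted) and specificity (patients without retinopathy correctly predicted); the function name is hypothetical.

```python
def rates(true_pos, n_pos, true_neg, n_neg):
    """Return (sensitivity, specificity, overall success rate)."""
    sensitivity = true_pos / n_pos
    specificity = true_neg / n_neg
    overall = (true_pos + true_neg) / (n_pos + n_neg)
    return sensitivity, specificity, overall

for label, counts in [("balanced", (106, 123, 98, 123)),
                      ("unbalanced", (35, 123, 410, 431))]:
    sens, spec, overall = rates(*counts)
    print(f"{label}: {sens:.1%} {spec:.1%} {overall:.1%}")
# balanced: 86.2% 79.7% 82.9%
# unbalanced: 28.5% 95.1% 80.3%
```

The overall rates are similar, but the model fitted to all patients achieves its rate almost entirely through the larger group without retinopathy.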

If we consider the comparison of the hazard functions, displayed in Figure 4, then it is obvious that the fit is not as appealing visually as that obtained using the balanced data set, illustrated in Figure 3. Indeed, using the same chi-square test as in Section 4.1, $X^2 = 38.9$, indicating that there is a significant difference between the observed and expected hazard functions (P = 0.015).


Figure 3. A comparison of the observed and expected hazard plots for the balanced data set (vertical axis: proportion in group with retinopathy; horizontal axis: duration of diabetes in years; curves: expected and observed)

Table III. Coefficient estimates with approximate standard deviations for the model fitted to the complete data set

Name                  Estimate    Standard deviation
$\alpha$ constant       4.006      0.062
$\alpha$ Insq          -0.024      0.012
$\alpha$ biotoe        -0.038      0.028
$\alpha$ HbA1          -0.201      0.083
$\beta$ constant        0.445      0.292
log $\delta$            2.018      0.587
log $p$                -1.051      0.294

Figure 4. A comparison of the observed and expected hazard plots for the unbalanced data set (vertical axis: proportion in group with retinopathy; horizontal axis: duration of diabetes in years; curves: expected and observed)

5. DISCUSSION

Fitting the model of this paper to the full data set by maximum likelihood provides too great an emphasis on the dominant group of patients. Similar findings have been reported elsewhere.¹³ This renders the fitted model useless for prediction. By contrast, the model fitted to the balanced data set works very well. The covariates selected, and how they are incorporated into the model, provide an interesting objective verification of medical beliefs, for example, that poor diabetic control may lead to retinopathy.

Attempts at modifying the likelihood function by penalizing the likelihood contribution of those patients without retinopathy were made using a variety of penalties. However, although in the best cases the modified likelihood procedure gave better prediction than the unbalanced data set, it never achieved the levels of prediction obtained using a balanced data set, and the choice of the optimum penalty factor lacked any obvious interpretation. In addition, the properties of adapting the likelihood in this way are not understood, and it was decided not to proceed further with the use of this weighted likelihood function.

The model of this paper extends the work of Struthers and Farewell in two ways: by using a non-uniform mixture distribution, and through the introduction and selection of covariates. Experiments with removing the long term survivor term from the model resulted in a highly significant reduction in the likelihood, demonstrating that such a term, with its obvious biological implications, is clearly needed in the model.

We also experimented with further model extensions, generalizing the Weibull distribution to the generalized gamma,¹⁴ which as well as the Weibull also contains the gamma and log-normal distributions as special cases, and replacing the power distribution by a beta distribution, but for the data used there was no significant improvement in either case. This also gives some justification that the choice of distribution used here was sufficiently flexible.

The approach of this paper may be applicable to other complications, and may be generalized to when more than one complication is present. The St. Thomas’ data base also records data from numerous follow-up visits for patients, and work is currently under way to use this follow-up data to further assess the effectiveness of the model in making future predictions on the occurrence of retinopathy.

ACKNOWLEDGEMENTS

We thank Dr. D. Jerwood, Dr. B. Balkau, Professor N. Keiding and Professor D. M. Titterington for their comments on an earlier draft of this paper.

REFERENCES

1. Krolewski, A. S., Warram, J. H., Rand, L. I. and Kahn, C. R. ‘Epidemiologic approach to the etiology of type I diabetes mellitus and its complications’, New England Journal of Medicine, 317, 1390-1398 (1987).

2. Till, S., Williams, C. D., Fowle, A. J. and Sonksen, P. H. ‘Datascan: An easy interactive method of interrogation and analysis of a diabetic database’, Diabetic Medicine, 2, 328A (1985).

3. Klein, R., Klein, B. E. K. and Moss, S. E. ‘Epidemiology of proliferative diabetic retinopathy’, Diabetes Care, 15, 1875-1891 (1992).

4. Young, P. J. Some statistical models for the occurrence of microvascular complications in diabetics, unpublished PhD thesis of the University of Kent, U.K., 1991.

5. Struthers, C. A. and Farewell, V. T. ‘A mixture model for time to AIDS data with left truncation and an uncertain origin’, Biometrika, 76, 814-817 (1989).

6. Farewell, V. T. ‘The use of mixture models for the analysis of survival data with long term survivors’, Biometrics, 38, 1041-1046 (1982).

7. Gordon, N. H. ‘Application of the theory of finite mixture for the estimation of “cure” rates for treated cancer patients’, Statistics in Medicine, 9, 393-407 (1990).

8. McLachlan, G. J. and McGiffin, D. C. ‘On the role of finite mixture models in survival analysis’, Statistical Methods in Medical Research, 3, 211-227 (1994).

9. Numerical Algorithms Group. NAG Fortran Library Manual, Mark 12, Volume 4, Numerical Algorithms Group, Oxford, 1987.

10. Cox, D. R. and Hinkley, D. V. Theoretical Statistics, Chapman and Hall, 1974, Chapter 4.

11. Stone, M. ‘Cross-validatory choice and assessment of statistical predictions’, Journal of the Royal Statistical Society, Series B, 36, 111-133 (1974).

12. Elandt-Johnson, R. C. and Johnson, N. L. Survival Models and Data Analysis, Wiley, 1980, p. 219-220.

13. Seber, G. A. F. Multivariate Observations, Wiley, 1984, p. 286.

14. Lawless, J. F. Statistical Models and Methods for Lifetime Data, Wiley, 1982, Chapter 1.