Analysis of diabetic patients through their examination history

7

Click here to load reader

Transcript of Analysis of diabetic patients through their examination history

Page 1: Analysis of diabetic patients through their examination history

1

2

3 Q1

45

67

9

10111213141516

1 7

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

Q2Q1

Expert Systems with Applications xxx (2013) xxx–xxx

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier .com/locate /eswa

Analysis of diabetic patients through their examination history

Dario Antonelli a, Elena Baralis b, Giulia Bruno a, Tania Cerquitelli b,⇑, Silvia Chiusano b, Naeem Mahoto b

a Department of Production Systems and Economics, Politecnico di Torino, Turin, Italyb Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy

a r t i c l e i n f o

1819202122

Keywords:Data miningCluster analysisPatient examination historyDiabetes

2324252627282930

0957-4174/$ - see front matter � 2013 Elsevier Ltd. Ahttp://dx.doi.org/10.1016/j.eswa.2013.02.006

⇑ Corresponding author. Tel.: +39 3487659476; faxE-mail addresses: [email protected] (D. An

it (E. Baralis), [email protected] (G. Bruno),(T. Cerquitelli), [email protected] (S. Chiusan(N. Mahoto).

Please cite this article in press as: Antonelli, D(2013), http://dx.doi.org/10.1016/j.eswa.2013.0

a b s t r a c t

The analysis of medical data is a challenging task for health care systems since a huge amount of inter-esting knowledge can be automatically mined to effectively support both physicians and health careorganizations. This paper proposes a data analysis framework based on a multiple-level clustering tech-nique to identify the examination pathways commonly followed by patients with a given disease. Thisknowledge can support health care organizations in evaluating the medical treatments usually adopted,and thus the incurred costs. The proposed multiple-level strategy allows clustering patient examinationdatasets with a variable distribution. To measure the relevance of specific examinations for a given dis-ease complication, patient examination data has been represented in the Vector Space Model using theTF-IDF method. As a case study, the proposed approach has been applied to the diabetic care scenario.The experimental validation, performed on a real collection of diabetic patients, demonstrates the effec-tiveness of the approach in identifying groups of patients with a similar examination history and increas-ing severity in diabetes complications.

� 2013 Elsevier Ltd. All rights reserved.

31

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

1. Introduction

Nowadays, large amount of medical data, storing the medicalpatient history, is collected by health care organizations. Datamining, which focuses on studying effective and efficient algorithmsto transform large amounts of data into useful knowledge(Pang-Ning, Steinbach, & Kumar, 2006), may provide valuableinsight into these huge data collections. For example, data miningtechniques can be used to extract a variety of information on thepatient history such as the medical protocols usually adopted forpatients with a given disease. Healthcare organizations can exploitthis knowledge to improve their current processes, assess newmedical guidelines, or enrich the existing ones. Medical guidelinesrepresent standard medical pathways specifying the actionsnecessary to treat, with optimal effectiveness and efficiency,patients with a given disease.

This study addresses the problem of analyzing patients’examination data to identify the examination pathways commonlyfollowed by patients. This issue is crucial for health care organiza-tions, because it can significantly impact on the effectiveness of themedical treatments as well as on the costs incurred by theorganizations.

76

77

78

79

80

81

ll rights reserved.

: +39 0110907099.tonelli), elena.baralis@polito.

[email protected]), [email protected]

., et al. Analysis of diabetic pat2.006

The data analysis framework proposed in this paper exploits amultiple-level clustering approach to discover, in a data collectionwith a variable distribution, cohesive and well-separated groups ofpatients with a similar examination history. Cluster analysis is anexploratory data mining technique that partitions a data objectcollection into groups based on object properties, without the sup-port of additional a priori knowledge (in contrast with classifica-tion algorithms using class label information) (Pang-Ning et al.,2006). To cluster data collections with a variable distribution, theproposed multiple-level clustering strategy iteratively focuses ondisjoint dataset portions and locally identifies clusters. Amongthe state-of-the-art clustering techniques, the density-basedDBSCAN algorithm (Ester, Kriegel, Sander, & Xu, 1996) has beenadopted due to its properties. DBSCAN allows the identificationof arbitrarily shaped clusters, is less susceptible to noise andoutliers, and does not require the specification of the number ofexpected clusters in the data. To highlight the relevance of specificexaminations for a given clinical condition, in the proposedframework patient examination data has been represented in theVector Space Model (VSM) (Salton, 1971) using the TF-IDF method(Pang-Ning et al., 2006).

As a reference case study, the proposed framework has been ap-plied to a real dataset of diabetic patients provided by the NationalHealth Center (NHC) of the Asti province (Italy). The resultsshowed that, starting from a large collection of raw examinationdata, the framework allows the identification of clusters containingpatients with a similar examination history. More specifically, clus-ters contain patients with increasing disease severity, as patients

ients through their examination history. Expert Systems with Applications

Page 2: Analysis of diabetic patients through their examination history

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

2 D. AntonelliQ1 et al. / Expert Systems with Applications xxx (2013) xxx–xxx

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

are tested with more and more specific examinations to diagnosediabetes complications. The results were discussed with the sup-port of clinical domain experts, showing a fairly good correlationamong the examination pathways suggested by the clusters andthe guidelines for diabetes disease (ICD-9-CM, 2011).

The paper is organized as follows. Section 2 describes the state-of-the-art clustering methods and explains the selection of theDBSCAN algorithm for this study. Section 3 analyzes previousrelated work. Section 4 presents the proposed framework anddescribes its building blocks, while the results obtained for the realdiabetic patient dataset are discussed in Section 5. Finally, Section6 draws conclusions and future work.

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

2. Selection of the clustering algorithm

Cluster analysis partitions objects into groups (clusters) so thatobjects within the same group are more similar to each other thanthose objects assigned to different groups (Pang-Ning et al., 2006).Clustering algorithms require the definition of a metric to evaluatethe similarity (or distance) between objects based on the featuresdescribing objects.

Clustering algorithms can be classified into the following fourcategories (Pang-Ning et al., 2006): (i) center, (ii) density, (iii) mod-el, and (iv) hierarchical-based methods.

In center-based methods (e.g., K-means (Juang & Rabiner, 1990)),a cluster is a set of data objects in which each object is closer (moresimilar) to the prototype that defines the cluster than to the proto-type of any other cluster. The prototype is the most representativepoint in the cluster. These methods find spherical-shaped clusters,unless clusters are well separated, and are sensitive to outliers.

In density-based methods (e.g., DBSCAN (Ester et al., 1996)), acluster is a dense area of data objects surrounded by an area oflow density. These approaches are less sensitive to the presenceof outliers than center-based techniques and can identify non-spherical shaped clusters.

Model-based methods (e.g. EM (McLachlan & Krishnan, 1997),COBWEB (Fisher, 1987a; Fisher, 1987b)) hypothesize a mathemat-ical model for each cluster, and then determine the best fitbetween the model and the object collection. These methods candeal with outliers and noise. However, similarly to K-means, EMrequires the specification of the number of expected clusters.

Hierarchical-based methods exploit an agglomerative or divisiveapproach to generate a hierarchical collection of clusters. The(most common) agglomerative approach (Pang-Ning et al., 2006)initially assigns each data object to a singleton cluster. The twoclosest clusters are then iteratively merged using a cluster proxim-ity measure (e.g., single-link, complete-link, or group average aver-age) (Pang-Ning et al., 2006). These methods are often used whenthe underlying application requires the creation of a taxonomy,which is not the case for our application scenario.

Density-based methods showed remarkable properties for clus-tering patients based on their examination history. More specifi-cally, the very effective density-based algorithm DBSCAN (Esteret al., 1996) has been selected in this study. Differently from otheralgorithms (e.g., K-means), DBSCAN is less sensitive to outliers andcan find arbitrarily shaped clusters. Outliers, when not identifiedand isolated as in DBSCAN, are clustered together with the otherdata objects, thus decreasing cluster cohesion. In addition, DBSCANdoes not require an a priori specification of the number of clustersin the data, as opposed to K-means and EM.

Medical datasets can include outliers as specific examinationpathways for some disease conditions and clusters can be non-spherical shaped. Besides, since our aim is discovering the exami-nation pathways usually adopted for a given disease through anexplorative data analysis, the expected number of clusters can be

Please cite this article in press as: Antonelli, D., et al. Analysis of diabetic pat(2013), http://dx.doi.org/10.1016/j.eswa.2013.02.006

hardly guessed a priori. For these reasons, the DBSCAN algorithmhas been adopted for the cluster analysis in this study. The maincharacteristics of DBSCAN are reported in the following Section 2.1.

To discover clusters in datasets with a variable distribution,state-of-the-art clustering algorithms can be applied in a multi-ple-level fashion to focus on different dataset portions and locallyidentify cluster (e.g., the bisecting K-means algorithm (Pang-Ninget al., 2006)). The multiple-level strategy adopted in this studyexploits the DBSCAN algorithm to select the dataset part analyzedat each iteration and locally cluster it.

2.1. The DBSCAN algorithm

The DBSCAN algorithm (Ester et al., 1996) relies on two inputparameters, named Eps and MinPts, to define a density thresholdin the data space. A dense region in the data space is an-dimensional sphere with radius Eps and containing at leastMinPts objects.

The DBSCAN algorithm iterates over the data objects in thecollection by analyzing their neighborhood. It classifies objects asbeing (i) in the interior of a dense region (a core point), (ii) onthe edge of a dense region (a border point), or (iii) in a sparselyoccupied region (a noise or outlier point). Any two core points thatare close enough (within a distance Eps of one another) are put inthe same cluster. Any border point close enough to a core point isput in the same cluster as the core point. Outlier points (i.e., pointsfar from any core point) are isolated.

DBSCAN can discover arbitrarily shaped clusters and identifyoutliers as objects in a low density area in the data space. Theeffectiveness of DBSCAN is affected by the selection of the Epsand MinPts values. Section 4.2 discusses how this issue has beenaddressed in this study.

3. Related work

Several works exploited data mining techniques to analyzemedical data by addressing different pathologies and facets ofvarious diseases.

Clustering algorithms have been widely exploited to analyzemedical data for patients affected by different diseases (Ahmed &Funk, 2001; Buczak, Moniz, Feighner, & Lombardo, 2009; Choong,Langford, Dowsey, & Santamaria, 2000; Mulroy, Gronley, Weiss,Newsam, & Perry, 2003;Santamaria, Houghton, Kimmel, & Graham,2003; Van Rooden et al., 2010). Van Rooden et al. (2010) reviewedthe cluster methods used to identify Parkinson’s disease subtypes.It showed that the K-means algorithm (Juang & Rabiner, 1990) wasmostly adopted, but also highlighted its two major limitations, i.e.,the sensitiveness to outliers and the need of defining the expectednumber of clusters. To overcome these issues, the DBSCAN algo-rithm has been used in this study, being able to internally evaluatethe optimal number of clusters and automatically identify outliers.

In Buczak et al. (2009), patients were represented by trackingthe number of occurrences for each clinical event (e.g., hospital vis-it, lab order). Patients were then grouped using a hierarchicalagglomerative clustering method with Ward’s linkage as proximitymeasure (Pang-Ning et al., 2006). Differently from Buczak et al.(2009), in this study we exploit the TF-IDF scheme to weight therelevance of specific examinations for each diabetes condition. Inaddition, a multiple-level cluster approach is used to identifygroups of patients in datasets with a variable data distribution.

Concerning the analysis of medical data for diabetic patients,several works, mainly exploiting classification techniques, havebeen proposed (Karegowda, Jayaram, & Manjunath, 2012; Meng,Huang, Rao, Zhang, & Liu, 2012; Mohamudally & Khan, 2011;Zhong, Chow, & He, 2012). Classification is a supervised data

ients through their examination history. Expert Systems with Applications

Page 3: Analysis of diabetic patients through their examination history

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

D. AntonelliQ1 et al. / Expert Systems with Applications xxx (2013) xxx–xxx 3

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

mining approach that assigns new unlabeled data to a class labelby means of a model built from data with known class label(Pang-Ning et al., 2006).

In Karegowda et al. (2012), diabetic patients were categorizedwith a K-nearest neighbor (KNN) classifier (Pang-Ning et al.,2006). To enhance classification accuracy, in a pre-processing stepa genetic algorithm and a feature selection technique (i.e., the cor-relation method (Pang-Ning et al., 2006)) are used to identify therelevant features for classification. In Meng et al. (2012) theauthors compared three prediction models (i.e., logistic regression,decision tree, and artificial neural networks) for diabetic patientsclassification. The comparison was based on common risk factorscollected from both diabetic and pre-diabetic patients with a stan-dard questionnaire. This work showed that the decision tree model(C5.0 (Pang-Ning et al., 2006)) yielded the best accuracy, followedby logistic regression, and artificial neural networks. The work inZhong et al. (2012) proposed a multi-level support vector machineapproach to classify and predict clinical charge profiles as well asthe length of hospital stay for patients affected by heart, diabetes,and cancer diseases. In Mohamudally and Khan (2011), differentdata mining algorithms (the K-means algorithm, C4.5 decision treeclassifier, artificial neural networks, and 2D graphs for data visual-ization) are integrated to predict, classify, and visualize a medicaldiabetic dataset.

A parallel effort has been devoted to exploit clustering tech-niques for diabetic patients by addressing different issues as foodanalysis (Phanich, Pholkul, & Phimoltares, 2010), gait patterns(Sawacha, Guarneri, Avogaro, & Cobelli, 2010), and relationshipsamong diabetes and risk factors (Chaturvedi, 2003). Food cluster-ing analysis for diabetic patients has been proposed in Phanichet al. (2010). Using Self-Organizing Map (SOM) and the K-meansalgorithm, this work provided a Food Recommendation Systemsuggesting proper substituted foods. The work in Sawacha et al.(2010) proposed a cluster analysis of biomechanical data to grouppatients with similar diabetic gait patterns. In Chaturvedi (2003)the spatial clusters of diabetes prevalence in Texas has been ana-lyzed. The relationship of risk factors (i.e., age and obesity) associ-ated with diabetes have also been analyzed. Differently from theseworks, this study exploits clustering techniques to identify groupsof patients with similar examinations history.

4. Methodology

The proposed framework to analyze the patient examinationhistory contains four main steps: (i) data collection, (ii) data

292

293

294

295296298298

299

300

301302304304

305

306307

309309

310

311

312Fig. 1. The proposed framework for the cluster analysis of patient examinationdata.

Please cite this article in press as: Antonelli, D., et al. Analysis of diabetic pat(2013), http://dx.doi.org/10.1016/j.eswa.2013.02.006

transformation, (iii) cluster analysis, and (iv) cluster evaluation.The building blocks of the framework are shown in Fig. 1 and de-tailed in the following subsections.

The patient examination log data is first collected and thentransformed using the Vector Space Model (VSM) representation(Salton, 1971). Examination frequencies are weighted throughthe Term Frequency (TF) – Inverse Document Frequency (IDF)scheme (Pang-Ning et al., 2006).

The multiple-level clustering approach is applied to identify, ina dataset with a variable distribution, groups of patients with asimilar examination history. The DBSCAN algorithm has beenexploited for the cluster analysis.

Finally, clustering results are evaluated through a quality indexbalancing both intra-cluster homogeneity and inter-cluster separa-tion. Silhouette (Rousseeuw, 1987) has been considered as refer-ence index. Cluster sets are also analyzed together with a domainexpert to assess their significance. To analyze the examinationpathways represented by the cluster set, each cluster has beencharacterized with the examinations appearing in it.

4.1. Representation of the patient examination history

The data recording the patient examination history is repre-sented using the Vector Space Model (VSM) (Salton, 1971). Eachpatient is a vector in the examination space. Each vector elementcorresponds to a different examination and is associated with aweight describing the examination relevance for the patient. Morespecifically, it reports the weighted number of times the examina-tion was repeated by the patient. The Term Frequency (TF) – In-verse Document Frequency (IDF) scheme (Pang-Ning et al., 2006)has been adopted to weight examination frequency. Both theVSM representation and the TF-IDF scheme have been applied inprevious works to represent text documents.

The adopted data representation allows highlighting the rele-vance of specific examinations for a given patient condition. TheTF-IDF value increases proportionally to the number of times anexamination appears in the patient history, but is offset by the fre-quency of the examination in the patient collection, which helps tocontrol the fact that some examinations are generally more com-mon than others. Unweighted examination frequencies do notproperly characterize the patient condition, since standard routinetests usually appear with high frequency, while more specific testsmay appear with lower frequency.

More formally, let D be a collection of patient records andR = {e1, . . . ,ek} the set of examinations done by at least one patientin D. Each patient pi in D is represented by a weighted examinationfrequency vector vpi

of jRj cells. Each element vpi½j� of vector vpi

reports the weighted frequency wpi ;ejof examination ej for patient

pi, i.e.,

vpi¼ ½wpi ;e1 ; . . . ;wpi ;ejRj �: ð1Þ

The TF-IDF weight wpi ;ejfor the pair (pi,ej) is computed as the

product of two terms, called Term Frequency (TFpi ;ej) and Inverse

Document Frequency (IDFej),

wpi ;ej¼ TFpi ;ej

� IDFej: ð2Þ

The Term Frequency TFpi ;ejfor the pair (pi,ej) represents the relative

frequency of examination ej for patient pi. It is given by

TFpi ;ej¼ fpi ;ej

X16k6jRj

,fpi ;ek

; ð3Þ

where fpi ;ejis the number of times patient pi underwent examination

ej andP

16k6jRjfpi ;ekis the total number of examinations done by

patient pi.

ients through their examination history. Expert Systems with Applications

Page 4: Analysis of diabetic patients through their examination history

313

314315317317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366367

369369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

4 D. AntonelliQ1 et al. / Expert Systems with Applications xxx (2013) xxx–xxx

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

The Inverse Document Frequency IDFejfor examination ej repre-

sents the frequency of ej in the patient collection. It is computed as

IDFej¼ Log½jDj=jpk 2 D : fpk ;ej

– 0j�; ð4Þ

where jDj is the number of patients in the collection D andjpk 2 D : fpk ;ej

– 0j is the number of patients in D who underwent(at least once) examination ej. Mathematically, the base of the logfunction does not matter and constitutes a constant multiplicativefactor towards the overall result.

The TF-IDF weight wpi ;ejfor the pair (pi,ej) is high when

examination ej appears with high frequency in patient pi and lowfrequency in patients in the collection D. When examination ej

appears in more patients, the ratio inside the IDF’s log functionapproaches 1, and the IDFej

value and TF-IDF weight wpi ;ejbecome

close to 0. Hence, the approach tends to filter out commonexaminations.

391

392

393

394

395

396

397

398

399

400

401

402

403

404

405

407407

408

409

410

411

412

413

414

415

4.2. The multiple-level DBSCAN approach for cluster analysis

Density-based algorithms can effectively discover clusters ofarbitary shape and filter out outliers, thus increasing cluster homo-geneity. Clusters are identified as dense areas of data objects sur-rounded by an area of low density.

In the DBSCAN algorithm, density is evaluated based on theuser-specified parameters Eps and MinPts. One single executionof DBSCAN discovers dense groups of patients according to onespecific setting for these parameters. Patients in lower densityareas are labeled as outliers and not assigned to any cluster. Hence,different parameter settings are needed to discover clusters indatasets with a variable data distribution as the one consideredin this study. Groups of patients with close examination historiesmay have both different cardinalities and densities, i.e., groupsmay contain examination histories with different degrees ofsimilarity.

The proposed multiple-level clustering approach allows cluster-ing datasets with a variable distribution by iteratively applyingthe DBSCAN algorithm on different (disjoint) dataset portions.The whole original dataset is clustered at the first level. Then, ateach subsequent level, patients labeled as outliers in the previouslevel are re-clustered. The DBSCAN parameters Eps and MinPtsare properly set at each level.

In the following, Section 4.2.1 presents the measure used toevaluate the similarity between patient examination histories,while Section 4.2.2 describes how the DBSCAN parameters andthe number of clustering levels have been selected.

416

417

418

419

420

421

422

423

424

425

426

427

428

429

430

431

432

4.2.1. Similarity between patient examination historiesThe cosine similarity measure has been adopted to evaluate the

similarity between the weighted examination frequency vectorsrepresenting the patient examination histories. This measure hasbeen often used to compare documents in text mining (Steinbach,Karypis, & Kumar, 2000).

Let pi and pj be two arbitrary patients in the collection D. Let vi

and vj be the corresponding weighted examination frequency vec-tors as in Eq. (1). The cosine similarity between vi and vj is com-puted as

cosðv i; v jÞ ¼v i � v j

kv ikkv jk¼

P16k6jRjv i½k�v j½k�ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP

16k6jRjv i½k�2q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP

16k6jRjv j½k�2q ; ð5Þ

where cos(vi,vj) is in the range [0,1]. cos(vi,vj) equal to 1 describesthe exact similarity of examination histories for patients pi and pj,while cos(vi,vj) equal to 0 points out that patients have complemen-tary histories (i.e., the sets of their examinations are disjoint).

Please cite this article in press as: Antonelli, D., et al. Analysis of diabetic pat(2013), http://dx.doi.org/10.1016/j.eswa.2013.02.006

4.2.2. Number of clustering levels and DBSCAN parametersThe dataset density is preliminarily analyzed using the k-dist

graph (Pang-Ning et al., 2006) to select both the number of itera-tions for the multiple-level clustering approach and the Eps andMinPts values for each iteration.

For each patient in the collection, the k-dist graph plots the dis-tance to its kth nearest neighbor according to their examinationhistory. On the x-axis patients are sorted by the distance to thekth nearest neighbor, while on the y-axis distances to the kth near-est neighbor are reported. The k value corresponds to the MinPtsparameter. The y-axis represents possible values of the Eps param-eter. By cutting the graph at a given value on the y-axis, the corre-sponding px value on the x-axis partitions the patient collectioninto the following two subsets. Patients placed on the left handside of px are labeled by DBSCAN as core points, and those on therigth side of px as outlier or border points.

Sharp changes in the k-dist graph identify dataset portions witha different density (Pang-Ning et al., 2006). The multiple-levelstrategy analyzes these dataset portions in different iterations.The Eps value to cluster each dataset part is selected in correspon-dence with the sharp change appearing in the graph. Section 5.2discusses the selection of DBSCAN parameters and number of iter-ation levels for the diabetes dataset considered in this study.

4.3. Cluster evaluation

The discovered cluster set is evaluated using the Silhouetteindex (Rousseeuw, 1987). Silhouette allows evaluating theappropriateness of the assignment of a data object to a clusterrather than to another by measuring both intra-cluster cohesionand inter-cluster separation.

The silhouette value for a given patient pi in a cluster C iscomputed as

sðpiÞ ¼bðpiÞ � aðpiÞ

maxfaðpiÞ; bðpiÞg; sðpiÞ 2 ½�1;1�; ð6Þ

where a(pi) is the average distance of patient pi from all otherpatients in the cluster C, and b(pi) is the smallest of averagedistances from its neighbor clusters. The silhouette value for acluster C is the average silhouette value on all its patients. Negativesilhouette values represent wrong patient placements, whilepositive silhouette values a better patient assignments. Clusterswith silhouette values in the range [0.51,0.70] and [0.71,1] respec-tively show that a reasonable and a strong structure have beenfound (Kaufman & Rousseeuw, 1990). The cosine similarity metrichas been used for silhouette evaluation, since this measure wasused to evaluate patient similarity in the cluster analysis (seeSection 4.2.1).

4.4. Data mining tool

The DBSCAN algorithm available in the RapidMiner system hasbeen used for the cluster analysis within the proposed framework.The RapidMiner toolkit (Rapid Miner Project, 2013) is an open-source system consisting of a number of data mining algorithmsto automatically analyze a large data collection and extract usefulknowledge.

The procedures for data transformation and cluster evaluationhave been developed in the Phyton programming language (Py-thon Software Foundation, 2013). These procedures transformthe patient examination log data into the VSM representationusing the TF-IDF scheme and compute the silhouette values forthe cluster set provided by the cluster analysis.

ients through their examination history. Expert Systems with Applications

Page 5: Analysis of diabetic patients through their examination history

433

434

435

436

437

438

439

440

441

442

443

444

445

446

447

448

449

450

451

452

453

454

455

456

457

458

459

460

461

462

463

464

465

466

467

468

469

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486

487

488

489

490

491

492

493

494

495

496

497

498

499

500

501

502

503

504

505

506

507

508

509

510

511

512

513

514

515

516

517

518

519

520

521

522

523

524

525

526

527

528

529

530

531

532

533

534

535

536

537

538

539

540

Table 1Examination frequencies (%) in first-level clusters containing patients undergoingroutine tests (Eps = 0.3, MinPts = 30).

Category Examination C11 C21 C31 C41 C51

Routine Checkup visit 78 96 58 100 100Glucose level 78 98 63 – 100Urine test 72 97 58 – –Venous blood 96 75 35 – –Capillary blood 72 97 58 – –Haemoglobin 100 – – – –Specialistic visit – 13 100 – –

Cardiovascular Electrocardiogram – – 100 – –Eye Fundus oculi – – 28 – –Number of patients 223 1764 43 110 41Silhouette 0.67 0.55 0.85 0.99 1.0

D. AntonelliQ1 et al. / Expert Systems with Applications xxx (2013) xxx–xxx 5

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

5. Results and discussion

This section presents and discusses the results obtained whenanalyzing a real collection of examination log data for diabeticpatients with the proposed framework. In the following,Section 5.1 describes the dataset considered in this study, whileSection 5.2 specifies the framework configuration for the clusteranalysis. The cluster results are presented and discussed in Section5.3. Finally, Section 5.4 evaluates the performance of the multiple-level clustering approach in terms of execution time.

5.1. Diabetic patient dataset

The dataset considered in this study was collected by the LocalHealth Center (LHC) of the Asti province in Italy. The LHC databasestores all the accesses to the health care system in the Astiprovince in the year 2007. From this database the examinationlog data of all patients with overt diabetes were extracted. Rawdata contain 95,788 records with examinations performed by6380 patients. They contain both (i) routine and (ii) more specificexaminations to analyze diabetes complications on various degreesof severity. The dataset includes both male and female patients in awide age range (between 4 and 95 years). The diagnostic and ther-apeutic procedures are defined using the ICD 9-CM (InternationalClassification of Diseases, 9th revision, Clinical Modification)(ICD-9-CM, 2011).

5.2. Selection of parameters for clustering diabetic patients

To select the number of iterations for the multiple-level cluster-ing strategy and the DBSCAN parameter for each level, we relied onthe k-dist graph as discussed in Section 4.2.2. In selecting theseparameters, we addressed the following issues. To discover repre-sentative examination pathways for the diabetes, we aim at avoid-ing clusters including few patients. In addition, to consider alldifferent patient examination histories, we aim at limiting thenumber of patients labeled as outliers and thus unclustered.

We plotted the k-dist graph, and analyzed the cluster results, byvarying the MinPts (i.e., k) value. Low MinPts values were not suit-able for our purpose, because DBSCAN identifies small groups ofpatients. At the same time, large MinPts values may increase thenumber of outlier points included in the clusters. The experimentalresults showed that MinPts in the range [25,35] provided similarresults, that were compliant with the above issues. We selectedMinPts = 30. From the k-dist graph for k = 30, we observed threemain dataset portions with a different density for Eps in the range[0.2–0.3], [0.4–0.5], and [0.6–0.7], respectively. Consequently, weadopted a three-level clustering approach, with each level focusingon one among the three dataset parts. The selected Eps values were0.3, 0.4, and 0.6 for the first, second, and third clustering level,respectively.

5.3. Analysis of the clustering results

Starting from a large collection of raw examination data, theproposed framework allows the discovery of a set of clusterscontaining patients with a similar examination history. Themultiple-level DBSCAN approach, iterated for three levels, com-puted clusters that progressively contain patients with increasingseverity in diabetes, because patients are tested using more andmore specialized examinations. More specifically, first-levelclusters contain patients mainly undergoing routine tests tomonitor diabetes conditions, or some basic tests to diagnosedisease complications. Second-level clusters collect patients thatare tested using an increasing number of examinations to diagnose

Please cite this article in press as: Antonelli, D., et al. Analysis of diabetic pat(2013), http://dx.doi.org/10.1016/j.eswa.2013.02.006

some diabetes complications. Examinations become progressivelymore numerous and specific in third-level clusters, indicating pa-tients that can have diabetes complications of increasing severity.Since at each level clusters contain more specific examinations, alower number of patients is contained in each cluster and the clus-ter size tends to reduce progressively. Clusters show good cohesionand separation as they are characterized by high silhouette values.The results, discussed with the support of a clinical domain expert,show a fairly good correlation among the examination pathwayssuggested by the clusters and the guidelines for diabetes disease(ICD-9-CM, 2011).

Cluster properties are discussed in detail in the followingsubsections. Tables 1–3 report, for each first- and second-levelcluster, the examination frequencies computed as the percentageof patients in the cluster tested by each examination. Clusters arenamed as Cij in the tables, where j denotes the level of themultiple-level DBSCAN approach providing the cluster and i locallyidentifies the cluster at each level j.

5.3.1. First-level cluster setFirst-level clusters are reported in Tables 1 and 2. Clusters can

be partitioned into the following two main groups: clusters con-taining patients (i) with standard examinations to monitor diabe-tes conditions (clusters C11 –C51 , in Table 1), (ii) coupled withbasic examinations to diagnose disease complications (clustersC61 –C111 in Table 2).

The two largest clusters (C11 and C21 ) contain patients whomostly performed standard routine tests. Besides routine examin-ations, all patients in cluster C31 had a specialistic visit and weretested with usual basic examinations to diagnose the most fre-quent diabetes complications, as risks for cardiovascular diseaseand eye problems. All patients in clusters C41 and C51 only had acheckup visit, together with the glucose level test in cluster C51 .These two clusters may include patients usually tested in privatestructures and periodically reporting test results to NHC.

Patients in clusters C61 –C111 were additionally tested to diagnosediabetes complications in the (a) eye (C61 ), (b) cardiovascular sys-tem ðC71 Þ, (c) both eye and cardiovascular system ðC81 Þ, (d) carotid(C91 ), and (e) limb ðC101 Þ. Finally, (f) cluster C111 includes tests forthe liver, kidneys, and in particular cardiovascular system. Differ-ently from second- and third-level clusters, diabetes complicationswere monitored in clusters C61 –C111 using a few (quite standard)tests which showed a limited degree of severity. Only the cardio-vascular system has been thoroughly tested in cluster C111 . Stan-dard routine examinations still appear in clusters C61 –C111 eventhough (usually) with a lower frequency than in clusters C11 –C51 .

Patients with an examination history significantly dissimilarfrom all the others are labeled as outliers and not included inany cluster. The DBSCAN algorithm was re-applied on thesepatients only (3509 patients), with different parameter values.The results are discussed in Section 5.3.2.

ients through their examination history. Expert Systems with Applications

Page 6: Analysis of diabetic patients through their examination history

541

542

543

544

545

546

547

548

549

550

551

552

553

554

Table 2Examination frequencies (%) in first-level clusters including basic tests to diagnose diabetes complications (Eps = 0.3, MinPts = 30).

Category Examination C61 C71 C81 C91 C101 C111

Routine Checkup visit 77 78 66 62 68 97Glucose level 74 74 64 62 59 97Urine test 74 74 64 57 56 92Venous blood 57 60 53 48 44 97Capillary blood 74 73 63 55 56 92Haemoglobin – – – 14 12 100Specialistic visit – – – – – –Complete blood count – – – – 3 3

Cardiovascular Electrocardiogram – 100 100 – – 42Cholesterol – – – – – 100HDL cholesterol – – – – – 100Triglycerides – – – – – 100

Eye Fundus oculi 100 – 100 – – 39

Liver ALT – – – – – 100AST – – – – – 100

Kidney Creatinine – – – – – 3Creatinine clearance – – – – – 100Culture urine – – – – – 100Microscopic urine analysis – – – – – 100Uric acid – – – – – 100

Carotid ECO doppler carotid – – – 100 – –Limb ECO doppler limb – – – – 100 –Number of patients 294 144 140 42 34 36Silhouette 0.65 0.74 0.79 0.95 0.97 0.90

Table 3Examination frequencies (%) in second-level clusters, including examinations to diagnose more severe diabetes complications (Eps = 0.4, MinPts = 30).

Category Examination C12 C22 C32 C42 C52

Routine Checkup visit 65 90 22 98 71Glucose level 52 92 22 100 95Urine test 37 90 19 98 68Venous blood 35 84 100 100 98Capillary blood 37 85 17 98 63Haemoglobin 13 34 100 98 83Specialistic visit – 13 – – –Complete blood count 3 15 45 7 93

Cardiovascular Electrocardiogram 17 21 70 47 20Cholesterol 7 25 100 97 93HDL cholesterol 4 26 100 99 92Triglycerides 7 26 100 97 90

Eye Fundus oculi 50 38 53 49 27Angioscopy – 30 – – –Eye examination 100 5 – – –Renital photocoagulation – 100 – – –

Liver ALT – 21 10 98 98AST – 21 8 97 98Bilirubin – 2 – – 100Gamma GT – – 100 – 95

Kidney Creatinine 4 20 99 11 78Creatinine clearance 2 10 – 99 –Culture urine 2 11 67 97 44Microscopic urine analysis 4 16 2 82 73Uric acid – 13 3 97 68Microalbuminuria – 7 100 60 37

Carotid ECO doppler carotid – 5 – – 2Limb ECO doppler limb – – – – –Number of patients 46 61 139 283 41Silhouette 0.92 0.88 0.72 0.69 0.87

6 D. AntonelliQ1 et al. / Expert Systems with Applications xxx (2013) xxx–xxx

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

5.3.2. Second-level cluster setSecond-level clusters contain patients with more diversified

examination histories. More specifically, the following two maincategories of clusters can be identified: (i) clusters containing pa-tients tested with specific examinations to diagnose a given diabe-tes complication (clusters C12 –C22 ); (ii) clusters with patients whounderwent various examinations to diagnose different diabetes

Please cite this article in press as: Antonelli, D., et al. Analysis of diabetic pat(2013), http://dx.doi.org/10.1016/j.eswa.2013.02.006

complications (clusters C32 –C52 ). These two categories respectivelyindicate patients that can be seriously affected by a particular dis-ease complication or by more than one disease complication at thesame time. Second-level clusters are reported in Table 3.

Clusters C12 and C22 include specific examination to diagnoseeye complications. More specifically, all patients in cluster C12

underwent a battery of tests to assess vision and ability to focus

ients through their examination history. Expert Systems with Applications

Page 7: Analysis of diabetic patients through their examination history

555

556

557

558

559

560

561

562

563

564

565

566

567

568

569

570

571

572

573

574

575

576

577

578

579

580

581

582

583

584

585

586

587

588

589

590

591

592

593

594

595

596

597

598

599

600

601

602

603

604

605

606

607

608

609

610

611

612

613

614

615

616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685

686

D. AntonelliQ1 et al. / Expert Systems with Applications xxx (2013) xxx–xxx 7

ESWA 8424 No. of Pages 7, Model 5G

27 February 2013

Q1

on objects (called ‘‘Eye examination’’ in Table 3). Instead, all pa-tients in cluster C22 had Retinal photocoagulation, a laser operationdone in cases of long-term eye complications, such as proliferativeretinopathy.

All patients in clusters C32 –C52 may suffer complications on thecardiovascular, liver, and kidneys systems, but with differentdegrees of severity. Patients in cluster C52 are at risk of more severeliver complications than those in clusters C32 and C42 , since theirexamination history includes more tests in this category.Analogously, cluster C32 is characterized by more severe renal com-plications than clusters C42 and C52 . Cardiovascular complicationshave similar severity in clusters C32 –C52 , because examinations inthis category appear with similar frequency.

At this stage, 2939 patients are classified as outliers and notassigned to any cluster. We further applied the DBSCAN algorithmon them by modifying the parameter setting. The results areanalyzed in Section 5.3.3.

5.3.3. Third-level cluster setThe results collected at this stage (with DBSCAN parameter

Eps = 0.6 and MinPts = 30) show a similar trend to second-levelclusters. Third-level clusters contain patients that may suffer morecomplications at the same time (e.g., complications on carotid,liver, and cardiovascular systems) and are tested with morespecific examinations (e.g., transcutaneous oxygen and carbondioxide monitor or upper abdominal ultrasound). By stopping themultiple-level DBSCAN approach at this level, only 1239 patientslabeled as outliers remain unclustered, with respect to the initialcollection of 6380 patients considered at the first level (i.e., about19% of patients). By further applying the DBSCAN algorithm on thisoutlier set, fragmented groups of patients can be identified. Theseclusters can represent patients affected by more rare diabetescomplications, and thus with examinations different from thosedone by most patients.

5.4. Execution time

The run time of DBSCAN at the first, second, and third level is9 min 10 s, 2 min 25 s, and 1 min 45 s, respectively. The run timeprogressively reduces because less patients are considered at eachsubsequent level.

6. Conclusion

To get the most out of large and complex medical databases,innovative data analysis techniques are needed to extract usefulknowledge in a timely fashion. In this paper a data analysis frame-work based on a multiple-level clustering approach has been pro-posed to identify groups of patients with a similar examinationhistory in a dataset with a variable data distribution. The TF-IDFmethod has been exploited to represent patient examination data,thus highlighting the relevance of the different examinations. Theproposed methodology has been validated in the diabetic care sce-nario on a real dataset of diabetic patients. The analysis identifiedcohesive and well-separated groups of patients with standard ormore specific examinations for diabetes, showing a good correla-tion with medical guidelines for diabetes (ICD-9-CM, 2011). Futuredevelopments of the proposed approach will explore the correla-tion between patient examination data and additional aspects ofthe medical treatments such as pharmaceutical drug therapies.Furthermore, we plan to apply the proposed approach to differentmedical contexts (e.g., cardiac patients).

Please cite this article in press as: Antonelli, D., et al. Analysis of diabetic pat(2013), http://dx.doi.org/10.1016/j.eswa.2013.02.006

Acknowledgments

The authors thank Dr. Baudolino Mussa and Dr. Dario Bellomofor their advice and fruitful discussions.

References

Ahmed, M. U., & Funk, P. (2011). Mining rare cases in post-operative pain by meansof outlier detection. In IEEE symposium on signal processing and informationtechnology (ISSPIT) (pp. 35–41).

Buczak, A. L., Moniz, L. J., Feighner, B. H., & Lombardo, J. S. (2009). Mining electronicmedical records for patient care patterns. In IEEE symposium on computationalintelligence and data mining (CIDM) (pp. 146–153).

Chaturvedi, K. (2003). Geographic concentrations of diabetes prevalence clusters intexas and their relationship to age and obesity. <http://www.ucgis.org/summer03/studentpapers/kshitijchaturvedi.pdf>. Retrieved, 9, 2010.

Choong, P. F. M., Langford, A. K., Dowsey, M. M., & Santamaria, N. M. (2000). Clinicalpathway for fractured neck of femur: A prospective, controlled study. TheMedical Journal of Australia, 172, 423–426.

Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm fordiscovering clusters in large spatial databases with noise. In Knowledgediscovery and data mining (KDD) (pp. 226–231).

Fisher, D.H. (1987a). Improving inference through conceptual clustering. In Nationalconference on artificial intelligence (AAAI) (pp. 461–465).

Fisher, D. H. (1987b). Knowledge acquisition via incremental conceptual clustering.Machine Learning, 2, 139–172.

McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. Wiley series inprobability and statistics. John Wiley and Sons.

ICD-9-CM. (2011). International classification of diseases, Ninth revision, Clinicalmodification. <http://icd9cm.chrisendres.com>. Accessed March, 2011.

Juang, B.-H., & Rabiner, L. (1990). The segmental k-means algorithm for estimatingparameters of hidden markov models. IEEE Transactions on Acoustics, Speech andSignal Processing, 38, 1639–1641.

Karegowda, A., Jayaram, M., & Manjunath, A. (2012). Cascading k-means clusteringand k-nearest neighbor classifier for categorization of diabetic patients.International Journal of Engineering and Advanced Technology (IJEAT), 147–151.

Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction tocluster analysis. Wiley.

Meng, X.-H., Huang, Y.-X., Rao, D.-P., Zhang, Q., & Liu, Q. (2012). Comparison of threedata mining models for predicting diabetes or prediabetes by risk factors. TheKaohsiung Journal of Medical Sciences, 93–99.

Mohamudally, N., & Khan, D. M. (2011). Application of a unified medical data miner(umdm) for prediction, classification, interpretation and visualization onmedical datasets: The diabetes dataset case. In International conference onadvances in data mining: Applications and theoretical aspects (ICDM) (pp. 78–95).Berlin, Heidelberg: Springer-Verlag.

Mulroy, S., Gronley, J., Weiss, W., Newsam, C., & Perry, J. (2003). Use of clusteranalysis for gait pattern classification of patients in the early and late recoveryphases following stroke. Gait Posture, 18, 114–125.

Pang-Ning, T., Steinbach, M., & Kumar, V. (2006). Introduction to data mining.Addison-Wesley.

Phanich, M., Pholkul, P., & Phimoltares, S. (2010). Food recommendation systemusing clustering analysis for diabetic patients. In IEEE international conference oninformation science and applications (ICISA) (pp. 1–8).

Python Software Foundation. (2013). Python programming language officialwebsite. <http://www.python.org/>. Accessed January, 2013.

Rapid Miner Project. (2013). The rapid miner project for machine learning. <http://rapid-i.com/>. Accessed January, 2013.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation andvalidation of cluster analysis. Computational and Applied Mathematics, 53–65.

Salton, G. (1971). The SMART retrieval system: Experiments in automatic documentprocessing. Prentice-Hall.

Santamaria, N., Houghton, L., Kimmel, A., & Graham, L. (2003). Clinical pathways forfractured neck of femur: A cohort study of health related quality of life, patientsatisfaction and clinical outcome. Australian Journal of Advanced Nursing, 20,24–29.

Sawacha, Z., Guarneri, G., Avogaro, A., & Cobelli, C. (2010). A new classification ofdiabetic gait pattern based on cluster analysis of biomechanical data. Journal ofDiabetes Science and Technology, 4, 1127–1138.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clusteringtechniques. In KDD workshop on text mining.

Van Rooden, S. M., Heiser, W. J., Kok, J. N., Verbaan, D., van Hilten, J. J., & Marinus, J.(2010). The identification of Parkinson’s disease subtypes using cluster analysis:A systematic review. Movement Disorders, 25, 969–978.

Zhong, W., Chow, R., & He, J. (2012). Clinical charge profiles prediction for patientsdiagnosed with chronic diseases using multi-level support vector machine.Expert Systems with Applications, 39, 1474–1483.

ients through their examination history. Expert Systems with Applications