Data Analysis Using XLMiner

BA 682 Data Mining, 20123820 강준현

Page 1: Data Analysis Using XLMiner

Page 2

I. Building a Logistic Regression Model Using the Universal Bank Data

Page 3

Data Exploration

[Charts: Income, Family Size, Credit Card Avg, Education]

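Page 4

Data Exploration

Correlation matrix (lower triangle):

| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1 | | | | | | | | | | | | | |
| Age | -0.00847 | 1 | | | | | | | | | | | | |
| Experience | -0.00833 | 0.994215 | 1 | | | | | | | | | | | |
| Income | -0.01769 | -0.05527 | -0.04657 | 1 | | | | | | | | | | |
| ZIP Code | 0.013432 | -0.02922 | -0.02863 | -0.01641 | 1 | | | | | | | | | |
| Family | -0.0168 | -0.04642 | -0.05256 | -0.1575 | 0.011778 | 1 | | | | | | | | |
| CCAvg | -0.02467 | -0.05203 | -0.05009 | 0.645993 | -0.00407 | -0.10928 | 1 | | | | | | | |
| Education | 0.021463 | 0.041334 | 0.013152 | -0.18752 | -0.01738 | 0.064929 | -0.13614 | 1 | | | | | | |
| Mortgage | -0.01392 | -0.01254 | -0.01058 | 0.206806 | 0.007383 | -0.02044 | 0.109909 | -0.03333 | 1 | | | | | |
| Personal Loan | -0.0248 | -0.00773 | -0.00741 | 0.502462 | 0.000107 | 0.061367 | 0.366891 | 0.136722 | 0.142095 | 1 | | | | |
| Securities Account | -0.01697 | -0.00044 | -0.00123 | -0.00262 | 0.004704 | 0.019994 | 0.015087 | -0.01081 | -0.00541 | 0.021954 | 1 | | | |
| CD Account | -0.00691 | 0.008043 | 0.010353 | 0.169738 | 0.019972 | 0.01411 | 0.136537 | 0.013934 | 0.089311 | 0.316355 | 0.317034 | 1 | | |
| Online | -0.00253 | 0.013702 | 0.013898 | 0.014206 | 0.01699 | 0.010354 | -0.00362 | -0.015 | -0.00599 | 0.006278 | 0.012627 | 0.175880016 | 1 | |
| CreditCard | 0.017028 | 0.007681 | 0.008967 | -0.00239 | 0.007691 | 0.011588 | -0.00669 | -0.01101 | -0.00723 | 0.002802 | -0.01503 | 0.278644365 | 0.00421 | 1 |

Age and Experience are almost perfectly correlated (0.994215).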
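As a rough illustration of how one cell of such a correlation matrix is obtained, the sketch below computes a Pearson coefficient in plain Python. It is not XLMiner's implementation. The function name `pearson` is hypothetical, and the input is only the Age and Experience values of the 13 sample rows that appear in the data partition output of this deck, so the result will not equal the full-data value of 0.994215.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Age and Experience of the 13 sample rows shown in the partition output
# (a small excerpt, not the 5000-row data set).
age = [25, 35, 35, 37, 35, 34, 29, 38, 42, 46, 55, 56, 29]
experience = [1, 9, 8, 13, 10, 9, 5, 14, 18, 21, 28, 31, 5]

r = pearson(age, experience)
print(f"corr(Age, Experience) on the excerpt: {r:.3f}")
```

Even on this small excerpt the coefficient is close to 1, which is the pattern that motivates keeping only one of the two variables.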

Page 5

Data Dimension Reduction

Principal Components

| Variable | Component 1 | Component 2 | Component 3 | Component 4 | Component 5 |
|---|---|---|---|---|---|
| Age | 0.01554224 | 0.70662385 | 0.08264883 | 0.54202592 | -0.44701025 |
| Experience | 0.01338275 | 0.70728117 | -0.07694203 | -0.54321295 | 0.44561642 |
| Income | -0.99977607 | 0.02043895 | 0.00444289 | 0.00272685 | 0.00167583 |
| Family | 0.0039179 | -0.0041944 | 0.98944801 | -0.1447722 | 0.00090257 |
| Education | 0.00342416 | 0.00085297 | 0.09067582 | 0.6246289 | 0.77563143 |
| Variance | 2119.944092 | 261.3847961 | 1.2882899 | 0.89438522 | 0.53321916 |
| Variance% | 88.92215729 | 10.96392059 | 0.05403799 | 0.03751545 | 0.02236615 |
| Cum% | 88.92215729 | 99.88607788 | 99.94011688 | 99.97763062 | 99.99999237 |
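The Variance% and Cum% rows follow directly from the Variance row: each component's variance is divided by the total, and the shares are accumulated. A minimal sketch that reproduces them from the five variances reported above (the variable names are illustrative only):

```python
from itertools import accumulate

# Component variances as reported in the principal components output.
variances = [2119.944092, 261.3847961, 1.2882899, 0.89438522, 0.53321916]

total = sum(variances)
variance_pct = [100 * v / total for v in variances]   # share of total variance
cumulative_pct = list(accumulate(variance_pct))       # running sum of the shares

for i, (p, c) in enumerate(zip(variance_pct, cumulative_pct), start=1):
    print(f"Component {i}: {p:.4f}%  (cumulative {c:.4f}%)")
```

The first component is almost entirely Income (loading -0.99977607) and accounts for about 89% of the variance, and the first two components together account for about 99.9%.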

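Page 6

Data Processing

Of Age and Experience, only Experience is kept as a variable, because the two are almost perfectly correlated (0.994215).

Records with negative Experience values are deleted (52 records).

The nominal variables ID and ZIP Code are also excluded.

The data set is randomly split 60:40 into a training set and a validation set.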
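The cleaning and partitioning steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `clean_and_split` is a hypothetical helper, the records are toy data, the order of operations (filter first, then split) is assumed, and Python's random generator will not reproduce XLMiner's partition even with the same seed value.

```python
import random

def clean_and_split(records, train_fraction=0.6, seed=12345):
    """Drop records with negative Experience, then split randomly into
    a training part and a validation part (60:40 by default)."""
    kept = [r for r in records if r["Experience"] >= 0]
    rng = random.Random(seed)          # fixed seed for a repeatable split
    shuffled = kept[:]                 # copy so the input list is untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical toy records: Experience runs from -2 to 9, so two are negative.
records = [{"ID": i, "Experience": i - 2} for i in range(12)]

train, valid = clean_and_split(records)
print(len(train), len(valid))
```

With the 12 toy records, 10 survive the filter and are split into 6 training and 4 validation records.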

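XLMiner: Data Partition Sheet (Ver: 12.5.3E)

Date: 31-Oct-2013 21:57:13

Data

Data source: Data!$A$5:$N$5004

Selected variables: ID, Age, Experience, Income, ZIP Code, Family, CCAvg, Education, Mortgage, Personal Loan, Securities Account, CD Account, Online, CreditCard

Partitioning Method: Randomly chosen

Random Seed: 12345

# training rows: 3000

# validation rows: 2000

Partitioned data (excerpt):

| Row Id. | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.40 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 9 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.60 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 10 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.90 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 12 | 12 | 29 | 5 | 45 | 90277 | 3 | 0.10 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 17 | 17 | 38 | 14 | 130 | 95010 | 4 | 4.70 | 3 | 134 | 1 | 0 | 0 | 0 | 0 |
| 18 | 18 | 42 | 18 | 81 | 94305 | 4 | 2.40 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 19 | 46 | 21 | 193 | 91604 | 2 | 8.10 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 20 | 20 | 55 | 28 | 21 | 94720 | 1 | 0.50 | 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 21 | 21 | 56 | 31 | 25 | 94015 | 4 | 0.90 | 2 | 111 | 0 | 0 | 0 | 1 | 0 |
| 23 | 23 | 29 | 5 | 62 | 90277 | 1 | 1.20 | 1 | 260 | 0 | 0 | 0 | 1 | 0 |

Records with negative Experience (52 records):

| ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2619 | 23 | -3 | 55 | 92704 | 3 | 2.40 | 2 | 145 | 0 | 0 | 0 | 1 | 0 |
| 3627 | 24 | -3 | 28 | 90089 | 4 | 1.00 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4286 | 23 | -3 | 149 | 93555 | 2 | 7.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4515 | 24 | -3 | 41 | 91768 | 4 | 1.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 316 | 24 | -2 | 51 | 90630 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 452 | 28 | -2 | 48 | 94132 | 2 | 1.75 | 3 | 89 | 0 | 0 | 0 | 1 | 0 |
| 598 | 24 | -2 | 125 | 92835 | 2 | 7.20 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 794 | 24 | -2 | 150 | 94720 | 2 | 2.00 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 890 | 24 | -2 | 82 | 91103 | 2 | 1.60 | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2467 | 24 | -2 | 80 | 94105 | 2 | 1.60 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2718 | 23 | -2 | 45 | 95422 | 4 | 0.60 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2877 | 24 | -2 | 80 | 91107 | 2 | 1.60 | 3 | 238 | 0 | 0 | 0 | 0 | 0 |
| 2963 | 23 | -2 | 81 | 91711 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3131 | 23 | -2 | 82 | 92152 | 2 | 1.80 | 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3797 | 24 | -2 | 50 | 94920 | 3 | 2.40 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3888 | 24 | -2 | 118 | 92634 | 2 | 7.20 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4117 | 24 | -2 | 135 | 90065 | 2 | 7.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4412 | 23 | -2 | 75 | 90291 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4482 | 25 | -2 | 35 | 95045 | 4 | 1.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 90 | 25 | -1 | 113 | 94303 | 4 | 2.30 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 227 | 24 | -1 | 39 | 94085 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 525 | 24 | -1 | 75 | 93014 | 4 | 0.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 537 | 25 | -1 | 43 | 92173 | 3 | 2.40 | 2 | 176 | 0 | 0 | 0 | 1 | 0 |
| 541 | 25 | -1 | 109 | 94010 | 4 | 2.30 | 3 | 314 | 0 | 0 | 0 | 1 | 0 |
| 577 | 25 | -1 | 48 | 92870 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 584 | 24 | -1 | 38 | 95045 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 650 | 25 | -1 | 82 | 92677 | 4 | 2.10 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 671 | 23 | -1 | 61 | 92374 | 4 | 2.60 | 1 | 239 | 0 | 0 | 0 | 1 | 0 |
| 687 | 24 | -1 | 38 | 92612 | 4 | 0.60 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 910 | 23 | -1 | 149 | 91709 | 1 | 6.33 | 1 | 305 | 0 | 0 | 0 | 0 | 1 |
| 1174 | 24 | -1 | 35 | 94305 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1429 | 25 | -1 | 21 | 94583 | 4 | 0.40 | 1 | 90 | 0 | 0 | 0 | 1 | 0 |
| 1523 | 25 | -1 | 101 | 94720 | 4 | 2.30 | 3 | 256 | 0 | 0 | 0 | 0 | 1 |
| 1906 | 25 | -1 | 112 | 92507 | 2 | 2.00 | 1 | 241 | 0 | 0 | 0 | 1 | 0 |
| 2103 | 25 | -1 | 81 | 92647 | 2 | 1.60 | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2431 | 23 | -1 | 73 | 92120 | 4 | 2.60 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2546 | 25 | -1 | 39 | 94720 | 3 | 2.40 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2849 | 24 | -1 | 78 | 94720 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2981 | 25 | -1 | 53 | 94305 | 3 | 2.40 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3077 | 29 | -1 | 62 | 92672 | 2 | 1.75 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3158 | 23 | -1 | 13 | 94720 | 4 | 1.00 | 1 | 84 | 0 | 0 | 0 | 1 | 0 |
| 3280 | 26 | -1 | 44 | 94901 | 1 | 2.00 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3285 | 25 | -1 | 101 | 95819 | 4 | 2.10 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3293 | 25 | -1 | 13 | 95616 | 4 | 0.40 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3395 | 25 | -1 | 113 | 90089 | 4 | 2.10 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3426 | 23 | -1 | 12 | 91605 | 4 | 1.00 | 1 | 90 | 0 | 0 | 0 | 1 | 0 |
| 3825 | 23 | -1 | 12 | 95064 | 4 | 1.00 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3947 | 25 | -1 | 40 | 93117 | 3 | 2.40 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4016 | 25 | -1 | 139 | 93106 | 2 | 2.00 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4089 | 29 | -1 | 71 | 94801 | 2 | 1.75 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4583 | 25 | -1 | 69 | 92691 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4958 | 29 | -1 | 50 | 95842 | 2 | 1.75 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |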

Data

Training data used for building the model: ['UniversalBank_Logistic

# Records in the training data: 2969

Validation data: ['UniversalBank_Logistic

# Records in the validation data: 1979

Page 7

Logistic Regression

Set confidence level 95%

Best subset selection: Exhaustive search

Page 8

Logistic Regression

Best subset selection (models with up to 11 coefficients; Constant present in all models):

| #Coeffs | RSS | Cp | Probability | Model |
|---|---|---|---|---|
| 2 | 3183.687744 | 219.7645264 | 0 | Constant, Income |
| 3 | 3100.568359 | 138.6170197 | 0 | Constant, Income, Education |
| 4 | 3039.182617 | 79.21051788 | 0 | Constant, Income, Education, CD Account |
| 5 | 2990.404297 | 32.41569901 | 0.00001378 | Constant, Income, Family, Education, CD Account |
| 6 | 2981.825928 | 25.83442879 | 0.00019169 | Constant, Income, Family, Education, CD Account, Online |
| 7 | 2972.419922 | 18.42524338 | 0.00426458 | Constant, Income, Family, Education, CD Account, Online, CreditCard |
| 8 | 2964.378662 | 12.38126278 | 0.06197376 | Constant, Income, Family, Education, Securities Account, CD Account, Online, CreditCard |
| 9 | 2959.063965 | 9.06476879 | 0.35692081 | Constant, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard |
| 10 | 2957.014404 | 9.01451492 | 0.9041543 | Constant, Experience, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard |
| 11 | 2957 | 11.00010586 | 1 | Constant, Experience, Income, Family, CCAvg, Education, Mortgage, Securities Account, CD Account, Online, CreditCard |

Best subset selection (models with up to 10 coefficients; Constant present in all models):

| #Coeffs | RSS | Cp | Probability | Model |
|---|---|---|---|---|
| 2 | 3212.504639 | 217.5792542 | 0 | Constant, Income |
| 3 | 3108.05542 | 115.0950928 | 0 | Constant, Income, Education |
| 4 | 3059.695313 | 68.71881104 | 0 | Constant, Income, Education, CD Account |
| 5 | 3019.44751 | 30.45754433 | 0.00001909 | Constant, Income, Family, Education, CD Account |
| 6 | 3008.855957 | 21.86244774 | 0.00062266 | Constant, Income, Family, Education, CD Account, CreditCard |
| 7 | 2999.336426 | 14.33973217 | 0.01660405 | Constant, Income, Family, Education, CD Account, Online, CreditCard |
| 8 | 2994.993164 | 11.99501705 | 0.05081629 | Constant, Income, Family, Education, Securities Account, CD Account, Online, CreditCard |
| 9 | 2991.976807 | 10.97765064 | 0.08504453 | Constant, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard |
| 10 | 2989.000244 | 10.00009251 | 1 | Constant, Experience, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard |

The Regression Model

| Input variables | Coefficient | Std. Error | p-value | Odds | 95% Confidence Interval (lower) | 95% Confidence Interval (upper) |
|---|---|---|---|---|---|---|
| Constant term | -13.981266 | 0.8205896 | 0 | 8.47253E-07 | 8.16778E-07 | 8.77729E-07 |
| Experience | 0.01231669 | 0.00861601 | 0.15285712 | 1.01239288 | 0.99544007 | 1.02963436 |
| Income | 0.05771863 | 0.00359579 | 0 | 1.05941689 | 1.0519768 | 1.06690955 |
| Family | 0.73089772 | 0.0997033 | 0 | 2.07694435 | 1.70827281 | 2.52518058 |
| CCAvg | 0.13275796 | 0.05343928 | 0.01298148 | 1.14197361 | 1.02841508 | 1.26807117 |
| Education | 1.72314227 | 0.15280582 | 0 | 5.60210371 | 4.15224123 | 7.55822325 |
| Mortgage | 0.00008745 | 0.00073139 | 0.90482515 | 1.0000875 | 0.99865484 | 1.00152206 |
| Securities Account | -1.11542439 | 0.39214 | 0.00444875 | 0.32777616 | 0.15198027 | 0.70691556 |
| CD Account | 4.12018013 | 0.4292928 | 0 | 61.5703392 | 26.54341698 | 142.8190918 |
| Online | -0.79940081 | 0.20818347 | 0.00012309 | 0.44959828 | 0.29896376 | 0.67613083 |
| CreditCard | -0.95952284 | 0.2626732 | 0.00025928 | 0.38307562 | 0.22892682 | 0.64102113 |
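The Odds and confidence-interval columns of the regression model can be recovered from the coefficients: the odds are exp(coefficient), and the 95% interval is exp(coefficient ± 1.96 × standard error). The sketch below checks this for Income and then scores one customer. It assumes the eleven coefficient rows line up with the variables in the order Constant term, Experience, Income, Family, CCAvg, Education, Mortgage, Securities Account, CD Account, Online, CreditCard, and the helper names are hypothetical. The customer is ID 10 from the partition excerpt.

```python
import math

# Coefficients from the regression output; the variable-to-row mapping is an
# assumption based on the order Constant, Experience, ..., CreditCard.
CONSTANT = -13.981266
COEFFICIENTS = {
    "Experience": 0.01231669, "Income": 0.05771863, "Family": 0.73089772,
    "CCAvg": 0.13275796, "Education": 1.72314227, "Mortgage": 0.00008745,
    "Securities Account": -1.11542439, "CD Account": 4.12018013,
    "Online": -0.79940081, "CreditCard": -0.95952284,
}

def odds(coef):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coef)

def odds_interval(coef, std_error, z=1.96):
    """Approximate 95% confidence interval for the odds."""
    return math.exp(coef - z * std_error), math.exp(coef + z * std_error)

def predict_probability(customer):
    """P(Personal Loan = 1) from the logistic model."""
    logit = CONSTANT + sum(COEFFICIENTS[k] * customer[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))

income_odds = odds(COEFFICIENTS["Income"])
income_low, income_high = odds_interval(COEFFICIENTS["Income"], 0.00359579)

# Customer with ID 10 in the partition excerpt (Personal Loan = 1 in the data).
customer_10 = {
    "Experience": 9, "Income": 180, "Family": 1, "CCAvg": 8.9, "Education": 3,
    "Mortgage": 0, "Securities Account": 0, "CD Account": 0,
    "Online": 0, "CreditCard": 0,
}
p_loan = predict_probability(customer_10)
print(f"Income odds {income_odds:.6f}, CI ({income_low:.6f}, {income_high:.6f})")
print(f"P(loan) for customer 10: {p_loan:.3f}")
```

Under these assumptions the Income row of the output is reproduced, and customer 10 gets a predicted probability well above a 0.5 cutoff, which agrees with the recorded Personal Loan value of 1.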

Page 9

Performance Evaluation

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5 (Updating the value here will NOT update value in detailed report)

Classification Confusion Matrix

| Actual Class | Predicted 1 | Predicted 0 |
|---|---|---|
| 1 | 179 | 107 |
| 0 | 43 | 2640 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| 1 | 286 | 107 | 37.41 |
| 0 | 2683 | 43 | 1.60 |
| Overall | 2969 | 150 | 5.05 |

Validation Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix

| Actual Class | Predicted 1 | Predicted 0 |
|---|---|---|
| 1 | 130 | 64 |
| 0 | 30 | 1755 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| 1 | 194 | 64 | 32.99 |
| 0 | 1785 | 30 | 1.68 |
| Overall | 1979 | 94 | 4.75 |

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.3

Classification Confusion Matrix

| Actual Class | Predicted 1 | Predicted 0 |
|---|---|---|
| 1 | 213 | 73 |
| 0 | 93 | 2590 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| 1 | 286 | 73 | 25.52 |
| 0 | 2683 | 93 | 3.47 |
| Overall | 2969 | 166 | 5.59 |

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.2

Classification Confusion Matrix

| Actual Class | Predicted 1 | Predicted 0 |
|---|---|---|
| 1 | 235 | 51 |
| 0 | 148 | 2535 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| 1 | 286 | 51 | 17.83 |
| 0 | 2683 | 148 | 5.52 |
| Overall | 2969 | 199 | 6.70 |
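Each error report is derived from its confusion matrix: the errors for a class are the cases of that class predicted as something else. A minimal sketch of that calculation, using the training counts at cutoff 0.5 (the `error_report` function and the nested-dictionary layout are illustrative choices, not XLMiner's):

```python
def error_report(confusion):
    """confusion[actual][predicted] holds the number of cases.
    Returns {class: (cases, errors, pct_error)} plus an "Overall" entry."""
    report = {}
    total_cases = 0
    total_errors = 0
    for actual, row in confusion.items():
        cases = sum(row.values())
        errors = cases - row[actual]      # everything off the diagonal
        report[actual] = (cases, errors, 100 * errors / cases)
        total_cases += cases
        total_errors += errors
    report["Overall"] = (total_cases, total_errors,
                         100 * total_errors / total_cases)
    return report

# Training data, cutoff 0.5 (counts from the summary report).
training_050 = {1: {1: 179, 0: 107}, 0: {1: 43, 0: 2640}}
report = error_report(training_050)
for label, (cases, errors, pct) in report.items():
    print(f"{label}: {cases} cases, {errors} errors, {pct:.2f}% error")
```

Applying the same function to the cutoff 0.3 and 0.2 matrices shows the trade-off in the reports: lowering the cutoff reduces the error on class 1 (37.41% to 17.83%) while raising the error on class 0 and the overall error.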

Page 10

Performance Evaluation

[Chart: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative. Series: Cumulative Personal Loan when sorted using predicted values; Cumulative Personal Loan using average]

[Chart: Decile-wise lift chart (validation dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]

[Chart: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative. Series: Cumulative Personal Loan when sorted using predicted values; Cumulative Personal Loan using average]

[Chart: Decile-wise lift chart (training dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
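The quantity plotted in a decile-wise lift chart is the mean outcome in each decile (after sorting cases by predicted value, highest first) divided by the global mean. The sketch below is one plain-Python way to compute it. The function name and the toy data are hypothetical, and XLMiner's handling of ties and uneven group sizes may differ.

```python
def decile_lift(scores, actuals, n_bins=10):
    """Decile mean / global mean, with cases sorted by score, highest first.
    Assumes at least n_bins cases and at least one positive outcome."""
    ranked = [a for _, a in sorted(zip(scores, actuals),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    global_mean = sum(ranked) / n
    lifts = []
    for i in range(n_bins):
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts

# Hypothetical toy data: 20 cases, the 4 highest-scored ones are positives.
toy_scores = list(range(20, 0, -1))
toy_actuals = [1] * 4 + [0] * 16
toy_lifts = decile_lift(toy_scores, toy_actuals)
print(toy_lifts)
```

In the toy data the top two deciles contain only positives while the global rate is 20%, so their lift is 5 and every later decile has lift 0.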

Page 11

II. Building a Naïve Bayes Model Using the Flight Delay Data

Page 12

Data Exploration

[Chart: Weather. Categories: 0 delayed, 0 ontime, 1 delayed]

[Chart: Week. Delayed flights by day of week (1 to 7)]

[Chart: Origin. Delayed and ontime flights for BWI, DCA, IAD]

[Chart: Destination. Delayed and ontime flights for EWR, JFK, LGA]

Page 13

Data Exploration

[Chart: Carrier. Delayed and ontime flights for CO, DH, DL, MQ, OH, RU, UA, US]

Page 14

Data Exploration

Scheduled departure time

| Time block | Share |
|---|---|
| 600-700 | 4% |
| 700-800 | 5% |
| 800-900 | 6% |
| 900-1000 | 3% |
| 1000-1100 | 3% |
| 1100-1200 | 1% |
| 1200-1300 | 5% |
| 1300-1400 | 5% |
| 1400-1500 | 15% |
| 1500-1600 | 9% |
| 1600-1700 | 8% |
| 1700-1800 | 15% |
| 1800-1900 | 3% |
| 1900-2000 | 9% |
| 2000-2100 | 2% |
| 2100-2200 | 8% |

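Page 15

Data Processing

Records with departure times of 10 and 109 fall outside the 600 to 2200 range, so they are treated as outliers and deleted.

The scheduled departure time is regrouped into 16 time blocks.

The actual departure time is excluded because it cannot be known in advance at prediction time. Distance is also excluded because every flight runs between Washington DC and New York, so the values are all at a similar level (mean 211.87, median 214, mode 214, standard deviation 13.31).

The nominal variables tail number and flight number are excluded.

The flight date is excluded because it is less useful for future prediction than the day of the week.

The data set is randomly split 60:40 into a training set and a validation set.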
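The regrouping of scheduled departure times into 16 hourly blocks, together with the outlier rule for times outside 600 to 2200, can be sketched as below. The function name and the choice to return None for out-of-range times are assumptions made for illustration, and the block labels follow the "600-700" style used in the conditional probability table.

```python
def bin_departure_time(hhmm, start=600, end=2200, width=100):
    """Map a scheduled departure time (e.g. 1455) to an hourly block label.
    Times outside [start, end) are treated as outliers and return None."""
    if hhmm < start or hhmm >= end:
        return None
    lower = (hhmm // width) * width
    return f"{lower}-{lower + width}"

print(bin_departure_time(1455))   # block containing 14:55
print(bin_departure_time(10))     # outside the 600 to 2200 range
```

Here 1455 falls in the "1400-1500" block, while 10 and 109 are outside the range and would be dropped.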

Page 16

Naïve Bayes

Prior class probabilities (according to relative occurrences in training data)

| Class | Prob. |
|---|---|
| ontime (Success Class) | 0.814253222 |
| delayed | 0.185746778 |

Conditional probabilities

| Input Variables | Value | Prob (ontime) | Prob (delayed) |
|---|---|---|---|
| CARRIER | CO | 0.036312849 | 0.06122449 |
| CARRIER | DH | 0.231843575 | 0.306122449 |
| CARRIER | DL | 0.188081937 | 0.118367347 |
| CARRIER | MQ | 0.118249534 | 0.163265306 |
| CARRIER | OH | 0.013035382 | 0.012244898 |
| CARRIER | RU | 0.174115456 | 0.244897959 |
| CARRIER | UA | 0.016759777 | 0.004081633 |
| CARRIER | US | 0.22160149 | 0.089795918 |
| DEST | EWR | 0.273743017 | 0.387755102 |
| DEST | JFK | 0.176908752 | 0.187755102 |
| DEST | LGA | 0.549348231 | 0.424489796 |
| ORIGIN | BWI | 0.057728119 | 0.102040816 |
| ORIGIN | DCA | 0.645251397 | 0.502040816 |
| ORIGIN | IAD | 0.297020484 | 0.395918367 |
| Weather | 0 | 1 | 0.930612245 |
| Weather | 1 | 0 | 0.069387755 |
| DAY_WEEK | Mon | 0.131284916 | 0.220408163 |
| DAY_WEEK | Tue | 0.14990689 | 0.130612245 |
| DAY_WEEK | Wed | 0.148044693 | 0.151020408 |
| DAY_WEEK | Thur | 0.181564246 | 0.130612245 |
| DAY_WEEK | Fri | 0.170391061 | 0.159183673 |
| DAY_WEEK | Sat | 0.111731844 | 0.069387755 |
| DAY_WEEK | Sun | 0.10707635 | 0.13877551 |
| Binned_CRS_DEP_TIME | 600-700 | 0.058659218 | 0.032653061 |
| Binned_CRS_DEP_TIME | 700-800 | 0.055865922 | 0.053061224 |
| Binned_CRS_DEP_TIME | 800-900 | 0.082867784 | 0.06122449 |
| Binned_CRS_DEP_TIME | 900-1000 | 0.047486034 | 0.016326531 |
| Binned_CRS_DEP_TIME | 1000-1100 | 0.044692737 | 0.032653061 |
| Binned_CRS_DEP_TIME | 1100-1200 | 0.040968343 | 0.016326531 |
| Binned_CRS_DEP_TIME | 1200-1300 | 0.0716946 | 0.065306122 |
| Binned_CRS_DEP_TIME | 1300-1400 | 0.083798883 | 0.048979592 |
| Binned_CRS_DEP_TIME | 1400-1500 | 0.090316574 | 0.146938776 |
| Binned_CRS_DEP_TIME | 1500-1600 | 0.067970205 | 0.085714286 |
| Binned_CRS_DEP_TIME | 1600-1700 | 0.081005587 | 0.07755102 |
| Binned_CRS_DEP_TIME | 1700-1800 | 0.104283054 | 0.13877551 |
| Binned_CRS_DEP_TIME | 1800-1900 | 0.044692737 | 0.028571429 |
| Binned_CRS_DEP_TIME | 1900-2000 | 0.047486034 | 0.089795918 |
| Binned_CRS_DEP_TIME | 2000-2100 | 0.019553073 | 0.024489796 |
| Binned_CRS_DEP_TIME | 2100-2200 | 0.058659218 | 0.081632653 |

Example: flying RU (Continental Express Airline) from IAD to LGA on a Wednesday, departing between 15:00 and 16:00, in good weather (Weather = 0).

Ontime = 0.814 × 0.174 × 0.148 × 0.068 × 0.297 × 0.549 × 1 ≈ 0.000233

Delay = 0.186 × 0.245 × 0.424 × 0.396 × 0.151 × 0.0857 × 0.931 ≈ 0.0000921

Ontime probability = 0.000233 / (0.000233 + 0.0000921) ≈ 72% (products computed from the unrounded table values)

This exceeds the 50% cutoff value, so the flight is classified as ontime.
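The worked example can be recomputed from the prior and conditional probability tables with a few lines of Python. This is a sketch of the standard naïve Bayes calculation, not XLMiner's code, and the dictionary keys and function name are illustrative. With the full-precision table values the two class scores come out near 0.000233 (ontime) and 0.0000921 (delayed), which gives an ontime probability of about 0.72.

```python
import math

PRIOR = {"ontime": 0.814253222, "delayed": 0.185746778}

# Conditional probabilities for the example flight, taken from the table:
# carrier RU, destination LGA, origin IAD, Weather 0, Wednesday, 1500-1600.
CONDITIONAL = {
    "ontime": {"CARRIER=RU": 0.174115456, "DEST=LGA": 0.549348231,
               "ORIGIN=IAD": 0.297020484, "Weather=0": 1.0,
               "DAY_WEEK=Wed": 0.148044693, "DEP_TIME=1500-1600": 0.067970205},
    "delayed": {"CARRIER=RU": 0.244897959, "DEST=LGA": 0.424489796,
                "ORIGIN=IAD": 0.395918367, "Weather=0": 0.930612245,
                "DAY_WEEK=Wed": 0.151020408, "DEP_TIME=1500-1600": 0.085714286},
}

def naive_bayes_posterior(prior, conditional):
    """Return (unnormalised class scores, normalised posterior probabilities)."""
    scores = {c: prior[c] * math.prod(conditional[c].values()) for c in prior}
    total = sum(scores.values())
    return scores, {c: s / total for c, s in scores.items()}

scores, posterior = naive_bayes_posterior(PRIOR, CONDITIONAL)
predicted = "ontime" if posterior["ontime"] >= 0.5 else "delayed"
print(scores)
print(posterior, predicted)
```

The flight is still classified as ontime at a 0.5 cutoff, but with a noticeably smaller margin than a figure in the mid-90s would suggest.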

Page 17

Performance Evaluation

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix

| Actual Class | Predicted ontime | Predicted delayed |
|---|---|---|
| ontime | 1049 | 25 |
| delayed | 205 | 40 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| ontime | 1074 | 25 | 2.33 |
| delayed | 245 | 205 | 83.67 |
| Overall | 1319 | 230 | 17.44 |

Validation Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix

| Actual Class | Predicted ontime | Predicted delayed |
|---|---|---|
| ontime | 685 | 14 |
| delayed | 155 | 26 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| ontime | 699 | 14 | 2.00 |
| delayed | 181 | 155 | 85.64 |
| Overall | 880 | 169 | 19.20 |

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.3

Classification Confusion Matrix

| Actual Class | Predicted ontime | Predicted delayed |
|---|---|---|
| ontime | 1074 | 0 |
| delayed | 228 | 17 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| ontime | 1074 | 0 | 0.00 |
| delayed | 245 | 228 | 93.06 |
| Overall | 1319 | 228 | 17.29 |

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.8

Classification Confusion Matrix

| Actual Class | Predicted ontime | Predicted delayed |
|---|---|---|
| ontime | 672 | 402 |
| delayed | 83 | 162 |

Error Report

| Class | # Cases | # Errors | % Error |
|---|---|---|---|
| ontime | 1074 | 402 | 37.43 |
| delayed | 245 | 83 | 33.88 |
| Overall | 1319 | 485 | 36.77 |

Page 18

Performance Evaluation

[Chart: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative. Series: Cumulative FlightStatus when sorted using predicted values; Cumulative FlightStatus using average]

[Chart: Decile-wise lift chart (training dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]

[Chart: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative. Series: Cumulative FlightStatus when sorted using predicted values; Cumulative FlightStatus using average]

[Chart: Decile-wise lift chart (validation dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]

Page 19

End of presentation