PKDD Discovery Challenge
(not only) on Financial Data
Petr Berka
Laboratory for Intelligent Systems
University of Economics
DMLL Workshop, ICML 2002 Petr Berka, LISp, 2002
Cups, Challenges, Competitions
  KDD Cups (since 1997)
  KDD Sisyphus at ECML 1998
  PKDD Discovery Challenges (since 1999)
  COIL Competition 2000
  PAKDD Challenge 2000
  PT Challenge 2000, 2001
  JSAI KDD Challenge 2001
  EUNITE Competition 2001, 2002
  . . .
PKDD Discovery Challenge Idea
  Realistic data mining conditions
    collaborative rather than competitive nature
    rather vague specification of the problem
  Differences to real KDD projects
    short time for analysis (2-3 months)
    only indirect access to domain and data experts during the KDD process
Challenge Settings
  Data and their full description available on the web for all participants
  Submissions evaluated by domain experts (but no ordering, no winners and losers)
  Workshop at PKDD to present the results and discuss them with domain experts
  Results and comments of experts available on the web (after the workshop)
PKDD Challenges
http://lisp.vse.cz/challenge
  1999, Prague: financial data, thrombosis data
  2000, Lyon: financial data, modified thrombosis data
  2001, Freiburg: modified thrombosis data
  2002, Helsinki: atherosclerosis data, hepatitis data
Financial Challenge Background
  Czech bank offering private accounts
  Available data for pilot study (29000 clients)
    personal characteristics
    basic info about accounts
    transactions for three months
  Proposed tasks
    segmentation (defining different types of clients w.r.t. debt)
    early detection of debts
Financial Challenge Data
(tables and the keys that link them)
  Disposition:      disp_id, client_id, account_id
  Credit Card:      disp_id
  Account:          account_id, district_id
  Permanent order:  account_id
  Loan:             account_id
  Person:           client_id, district_id
  Transactions:     account_id
  Demograph.:       district_id
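The linking keys listed on this slide suggest how the tables can be combined into one flat table for analysis. The Python sketch below follows Loan -> Account -> Disposition -> Person using plain dictionaries. Only the key names come from the slide; every row value, the `status` column placement and the function name `flatten_loans` are illustrative assumptions, not challenge data.

```python
# Toy rows keyed the same way as the tables on the slide.
# All values are invented for illustration only.
loans = [
    {"account_id": 10, "status": "good"},
    {"account_id": 11, "status": "bad"},
]
accounts = {10: {"district_id": 1}, 11: {"district_id": 2}}
dispositions = [
    {"disp_id": 100, "client_id": 1000, "account_id": 10},
    {"disp_id": 101, "client_id": 1001, "account_id": 11},
    {"disp_id": 102, "client_id": 1002, "account_id": 11},
]
persons = {
    1000: {"district_id": 1},
    1001: {"district_id": 2},
    1002: {"district_id": 3},
}


def flatten_loans():
    """Return one row per (loan, client) pair, joined on the shared keys."""
    rows = []
    for loan in loans:
        account = accounts[loan["account_id"]]
        for disp in dispositions:
            if disp["account_id"] != loan["account_id"]:
                continue
            person = persons[disp["client_id"]]
            rows.append({
                "account_id": loan["account_id"],
                "status": loan["status"],
                "account_district": account["district_id"],
                "client_id": disp["client_id"],
                "client_district": person["district_id"],
            })
    return rows


flat = flatten_loans()
```

Note that an account with several dispositions yields several rows for the same loan, so any per-loan statistic computed on the flat table has to account for that duplication.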
Contributions
  Method oriented
    show a method/system working on the data
  Problem oriented (prototype solutions)
    loan and/or credit cards description
    loan and/or credit cards classification
    initial exploration
    relation between branches
    clients segmentation
Description of loans
  Relations between loan category and account characteristics
  [Coufal et al, 1999 - GUHA] [Mikšovský et al, 1999 - EXCEL]

  #  LHS                          loan.status  Fisher      support  confidence
  1  avg_sanction_interest(no)    good         6.12e-024   603      0.9234
  2  avg_sanction_interest(yes)   bad          6.12e-024   26       0.8966
  3  perm_ord_household(yes)      good         5.03e-013   421      0.9546
  4  perm_ord_household(no)       bad          5.03e-013   56       0.2324
  5  credit_card(yes)             good         1.38e-005   165      0.9706
  6  monthly_payment(<2000)       good         3.33e-004   125      0.9690
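To make the columns of the rule table concrete, the sketch below recomputes support and confidence for rules 1 and 2 and a Fisher exact test on the underlying 2x2 table. The cell counts 50 and 3 are not on the slide; they are inferred from support/confidence (603/0.9234 is about 653 cases, 26/0.8966 is about 29). The one-sided form of the test is also an assumption, so the p-value need not match the GUHA figure exactly.

```python
from math import comb


def confidence(hits: int, covered: int) -> float:
    """Share of cases covered by the rule's left-hand side that match its class."""
    return hits / covered


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    total = comb(n, row1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, k_max + 1))
    return tail / total


# Rows: avg_sanction_interest = no / yes; columns: loan status good / bad.
# 603 and 26 are the supports from the slide; 50 and 3 are inferred (assumption).
a, b, c, d = 603, 50, 3, 26

conf_rule1 = confidence(a, a + b)   # avg_sanction_interest(no)  -> good
conf_rule2 = confidence(d, c + d)   # avg_sanction_interest(yes) -> bad
p_value = fisher_one_sided(a, b, c, d)
```

Rules 1 and 2 describe the same 2x2 table from two sides, which is consistent with both rows of the slide carrying the same Fisher value.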
Classification of loans
  Detecting risky clients before they are granted a loan
  [Mikšovský et al, 1999 - C5.0]
    decision tree to find the relevance of attributes
    decision tree for classification (using misclassification costs)
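One common way to use misclassification costs with a tree is to label each leaf with the class that minimizes the expected cost rather than with the majority class. The sketch below shows that decision rule only. The cost values and leaf probabilities are hypothetical, and this is not necessarily how C5.0 applies costs internally.

```python
# COSTS[true_class][predicted_class]; hypothetical values in which granting
# a loan to a bad client is five times as costly as refusing a good one.
COSTS = {
    "good": {"good": 0.0, "bad": 1.0},
    "bad": {"good": 5.0, "bad": 0.0},
}


def min_cost_label(class_probs, costs=COSTS):
    """Pick the prediction with the lowest expected misclassification cost."""
    def expected_cost(predicted):
        return sum(p * costs[true][predicted] for true, p in class_probs.items())
    return min(costs, key=expected_cost)


# A leaf where 80% of training loans were good is still labelled "bad":
# expected cost 0.2 * 5.0 = 1.0 for "good" versus 0.8 * 1.0 = 0.8 for "bad".
risky_leaf = min_cost_label({"good": 0.8, "bad": 0.2})
safe_leaf = min_cost_label({"good": 0.9, "bad": 0.1})
```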
Credit Cards Promotion
  Description - find characteristics of a card holder
    deviation detection
  Classification - predict score for "card value"
    k-nearest neighbour
  [Putten, 1999]
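As a rough illustration of scoring with k-nearest neighbours, the sketch below scores a client by the share of card holders among the k most similar known clients. The two features, the toy examples and the name `knn_score` are assumptions made for illustration; the slide does not describe the features or the scoring used in [Putten, 1999].

```python
from math import dist


def knn_score(query, examples, k=3):
    """Fraction of card holders (label 1) among the k nearest examples."""
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


# (feature vector, holds_card) pairs; features are assumed to be pre-scaled.
EXAMPLES = [
    ((1.0, 1.0), 1),
    ((1.2, 0.9), 1),
    ((0.9, 1.1), 0),
    ((5.0, 5.0), 0),
    ((5.5, 4.8), 0),
]

score_near_holders = knn_score((1.0, 1.0), EXAMPLES)
score_far_from_holders = knn_score((5.0, 5.0), EXAMPLES)
```

Because the method relies on Euclidean distance, attributes on very different scales (for example amounts and counts) would need normalizing first.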
Clients Segmentation
  Description - segmentation of clients according to transactions
  [Hotho, Maedche, 2000]
    Kohonen map + decision trees

  Rule #1 for Cluster 3
  If ATTR5 > 9945 and ATTR13 > 0 Then -> Cluster 3 (115, 0.983)
Challenge Organizing Lessons
  To get and prepare real data is difficult
  The time for analysis should be as long as possible
  The response rate was rather low (~10%)
  No synergy effect observed
DM Lessons (1/4)
  Cooperate with experts
    domain experts
    data experts
    . . .
  … and with users
DM Lessons (2/4)
  Use knowledge intensive preprocessing methods
    …
    compute age and sex from birth_number
    set flags for different types of operations
    compute monthly characteristics of transactions (sum, avg, min, max)
    $balance = \frac{1}{30} \sum_i balance(i) \cdot days(i)$
    …
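The preprocessing steps listed on this slide can be sketched in a few lines of Python. The sketch assumes the birth_number encoding documented for the challenge data (YYMMDD for men, with 50 added to the month for women) and that all birth years fall in the 1900s. The function names, the reference date and the sample values are illustrative assumptions.

```python
from datetime import date


def decode_birth_number(birth_number: str):
    """Return (birth date, sex) from a YYMMDD string; month + 50 marks women."""
    yy = int(birth_number[0:2])
    mm = int(birth_number[2:4])
    dd = int(birth_number[4:6])
    sex = "F" if mm > 50 else "M"
    if mm > 50:
        mm -= 50
    return date(1900 + yy, mm, dd), sex  # assumption: births in the 1900s


def age_on(birth: date, ref: date) -> int:
    """Age in completed years on the reference date."""
    return ref.year - birth.year - ((ref.month, ref.day) < (birth.month, birth.day))


def monthly_stats(amounts):
    """Sum, average, minimum and maximum of one month's transaction amounts."""
    return {"sum": sum(amounts), "avg": sum(amounts) / len(amounts),
            "min": min(amounts), "max": max(amounts)}


def time_weighted_balance(segments, period_days=30):
    """(1/30) * sum_i balance(i) * days(i), for (balance, days) segments."""
    return sum(balance * days for balance, days in segments) / period_days


birth, sex = decode_birth_number("706213")          # made-up example value
age = age_on(birth, date(1998, 12, 31))             # hypothetical reference date
stats = monthly_stats([100.0, 250.0, 50.0])
avg_balance = time_weighted_balance([(1000.0, 10), (4000.0, 20)])
```

Weighting each balance by the number of days it was held gives a monthly figure that is not distorted by how many transactions happened to occur.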
DM Lessons (3/4)
  Make the results understandable
  [Werner, Fogarty 2001]

(- ACLIGM (* (* (+ (+ LAC (* TAT (- (- (+ (* (* IGM (/ LDH LAC)) (* UA (+
(/ KCT IGG) ALB))) (* UA (/ LDH LAC))) (/ (+ (+ PT C4) C4) UN)) ACLIGM)))
ANA) (+ (+ C3 (- LDH (+ UA IGG))) LAC)) (+ (+ (* (* (* IGM (/ LDH LAC)) (/
(/ (/ UA (* (+ (/ KCT ACLIGG) (* RF IGA)) (+ (* (/ PLT PIC) (+ LDH TCHO))
(+ (- (* UA APTT) (* IGA TAT2)) (/ ACLIGG HGB))))) IGG) (* WBC UN))) HCT)
(/ (* (* TAT (- (/ ALP UA) IGG)) (- (- (* (/ LDH LAC) (- TP C3)) (/ (+ (+
PT C4) C4) UN)) ACLIGM)) (* (/ IGA (- GOT RBC)) (/ (* TAT2 HCT) (/ (/ (/
UPRO SM) (+ (+ UA (+ (+ TCHO (- CENTROMEA LAC)) ACLIGG)) (- (* (- (* UA
APTT) (* IGA TAT2)) (+ (* TAT (+ PT (+ RBC (+ UA IGG)))) TP)) (+ UA
IGG)))) (+ ACLIGM (+ (* (+ (+ (+ (* (* IGM (/ LDH (+ (/ (+ RBC (/ LDH LAC))
RF) (* UA APTT)))) TP) CENTROMEA) (* (+ PT C4) (- (+ (/ (- LDH (+ (/ KCT
ACLIGG) (* RF IGA))) (/ ACLIGA SSB)) C3) dt))) (* (+ (* (+ DNAII IGA) HCT)
(/ HCT LAC)) (+ RBC (/ (+ RBC (- (* (/ (- TG WBC) GOT) (- (+ (/ 0.08
ACLIGA) (+ HGB PT)) dt)) (/ (+ (* (* IGM (/ LDH LAC)) TP) CENTROMEA) C3)))
RF)))) HCT) (* IGG GPT)))))))) (+ (* TAT (- (+ (+ C3 (- LDH (+ UA IGG)))
(- (* C4 TAT2) LDH)) (+ UA (/ (+ (+ (/ (/ (+ SM GOT) (* WBC UN)) (+ (/ (*
(* GLU 0.03) (/ ALP UA)) RF) (* UA APTT))) (+ (+ (- (+ RBC (+ TG (/ (+ (*
(* RF IGA) HCT) C4) (+ ACLIGM (- (+ (- TP C3) (/ C3 HGB)) (/ (- TG WBC)
GOT)))))) ACLIGM) (+ TAT (+ (/ (* CENTROMEA (/ (* C4 TAT2) (/ (+ RBC (* dt
ACLIGA)) (* SM SC170)))) (* (/ HGB (- (/ ALP UA) RBC)) (/ ALP UA))) TP)))
(/ (/ UPRO SM) (/ (+ RBC (* ACLIGM HGB)) GOT)))) (- (* (- (* UA APTT) (*
IGA TAT2)) (+ (* TAT (+ PT (+ RBC (+ (+ PT (+ (+ (* (* IGM (/ LDH LAC))
RNP) CENTROMEA) (* (/ (- TG WBC) GOT) (- (+ (/ (/ (+ DNAII IGA) (/ GPT
ACLIGM)) RF) C3) dt)))) (+ (/ (+ (- (* (+ (* C4 TAT2) (- (* C4 TAT2) PT))
(+ RBC (+ (+ PT C4) (- CENTROMEA LAC)))) UA) C4) (+ ACLIGM (- (+ (-
CENTROMEA LAC) (/ LDH LAC)) (- CENTROMEA LAC)))) (* (* ACLIGM HGB) (/ HCT
LAC))))))) TP)) (+ UA IGG))) RF)))) (- (/ UPRO SM) LAC)))))
DM Lessons (4/4)
  Show some (even preliminary) results soon
    experts are interested in solutions, not in applying sophisticated methods
Discovery Challenge Benefits
  Experts
    deeper insight into the data
  Participants
    experience with analyzing large real data
    motivations for further research
  ML/KDD Community
    prototype tasks/solutions (like the MiningMart project?)
  Organizers
    … invitation to DMLL Workshop :-)
Thank You
Contributions

1st author  KDD task                               KDD steps                                    DM method
Coufal      loan                                   preprocessing, description                   association rules
Levin       loans + credit cards                   description                                  association rules, ranking objects
Mikšovský   relations among branches               preprocessing, description, visualization    ILP
            loans                                  preprocessing, classification, visualization classification rules
Pijls       initial insight                        summarization
Putten      credit cards                           preprocessing, description                   deviation detection
                                                   preprocessing, prediction                    k-NN
Spenke      loans                                  visualization                                display correlations
Weber       loans + credit cards                   preprocessing, description                   association rules
Coufal      loans                                  description, classification                  association rules + tree
Hotho       client profiles based on transactions  preprocessing, clustering, classification    SOM, tree
Suzuki      loans                                  preprocessing, description                   exception rules