Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data...
-
Upload
juliet-harrington -
Category
Documents
-
view
227 -
download
0
Transcript of Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data...
Pattern Aided Regression Modeling & Pattern Aided Problem Solving
Guozhu DongProfessor, PhD
Data Mining Research Lab
CSE & Kno.e.sis Center
Wright State University
Please cite this paper on CPXR: Guozhu Dong & Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis.To appear in IEEE TKDE.
Prediction is difficult, even when it is about the past.
Overview Introduction Pattern aided regression modeling: PXR
new regression model type
Contrast pattern aided regression algorithm: CPXR Diverse predictor-response relationships CPXR(Log): Traumatic brain injury (TBI) and heart failure (HF) outcome
prediction Potential applications of the CPXR methodology Other pattern aided problem solving results/apps
Most are contrast pattern aided results Some use patterns only; some use something extra
Concluding remarks
2
Prediction is difficult
Prediction is difficult, especially if it is about the future Nils Bohr, Nobel laureate in Physics Danish Proverb
Those who have knowledge, don't predict. Those who predict, don't have knowledge. Lao Tzu, 6th Century BC Chinese Philosopher
Prediction is difficult, even when it is about the past. Guozhu Dong
Guozhu Dong: Pattern Aided Regression Modeling 3
Preliminaries on prediction using regression Training dataset: {(xi,yi) | 1 <= i <=n}
xi: vector of predictor variables
yi: value of response variable
Regression model evaluation
Pattern Aided Regression ModelingGuozhu Dong 4
LR: Linear regression
Teaser 1: Performance of contrast pattern aided regression (CPXR)
CPXR: highest accuracy in 41 out of 50 datasets Average RMSE reduction (relative to LR) of 42% in 50 datasets, much higher than that of best competing methodCPXR achieved 60+% RMSE reduction in 10 out the 50. CPXR is better than LR in all 50 datasets.
5
LR: Linear regressionGBM: Gradient Boosting (generalization of AdaBoost)
RMSE reduction =[RMSE(LR)-RMSE(M)]
/ RMSE(LR)
Teaser 2: AUC of ROC curves for CPXR(Log) vs other methods on HF and TBI
classification results
Guozhu Dong: Pattern Aided Regression Modeling 6
HF (with Mayo researchers)TBI
CPXR is good because
PXR can effectively capture diverse predictor-response relationships to find good PXR models
Definition: The data (for given application) contains diverse predictor-response relationships if it contains different subgroups whose best-fit local models are highly different [Dong+TaslimiteheraniTKDE15]
Diverse predictor-response relationships are the main reason why best state-of-the-art regression methods perform often poorly
Guozhu Dong: Pattern Aided Regression Modeling 7
PreliminariesPatterns & contrast patterns
Pattern: condition describing set of objects EG: age <= 35 & rank = full professor It describes all full professors with age <=35. A pattern describes a region of data in a low dimensional
subspace, of high dimensional data
Contrast patterns: conditions that distinguish objects in different classes/conditions
Contrast patterns are useful: They are strongly associated with issues of importance
8
Contrast patterns thru example
9
• CP: A1=b & A3=e• It matches all C1 objects.• It matches
no C2 objectsIts mds={t1,t2}Generally: A pattern is CP if it matches many more objects in
one class than in other classes (aka emerging patterns)
mds(P): the set of objects matching P.An equivalence class: A set of patterns with same mds (having
same behavior). Pick one as representative.
Why contrast pattern based approaches are successful & have big potentials? Contrasting is meaningful
Contrastingrisk indicatoradvantages in survival/wellbeing Contrasting is built into human (animal) instincts
Focus on contrast patterns focus on important issues High (3—7) dim contrast patterns capture important novel
multi-variable interactions related to goals Opportunity: Humans often use low dimensional CPs (?
brain’s computing power is low & lack of data?) They use independence assumption for high dimension apps WE WANT TO DO BETTER! WE CAN DO BETTER!!!
10
Pattern aided regression model A PXR model is represented by a tuple
Each Pi is a pattern
Each fi is a local regression model for data satisfying Pi,
fd is a default local regression model
Each wi is weight for Pi
The regression function of PM is defined by
11
A pictorial illustration of a simple PXR model
Different patterns can involve different sets of variables [describing data regions in different subspaces]
Matching datasets of different patterns can overlapGuozhu Dong: Pattern Aided Regression
Modeling 12
An example PXR model u,v z: predictor variables, y: response variable ((P1, f1, w1), (P2, f2, w2), fd) gives a PXR model
Pattern Aided Regression ModelingGuozhu Dong 13
Discussion PXR is a strict generalization of Piecewise Linear
Regression (PLR) PLR can be viewed as trying to model diverse PR
relationships, but it is limited in modeling capabilities and computing algorithms [and they didn’t see DPR]
Often a PXR model uses few patterns (e.g. 7) Local regression models
model type can be complex or simple we often use simple ones such as linear or piecewise
linear models
Pattern Aided Regression ModelingGuozhu Dong 14
Diversity of predictor-response relationships Different pattern-model pairs emphasize
different sets of variables Different pattern-model pairs use highly
different regression functions Each pattern-model pair captures a highly
distinct kind of behavior Diverse predictor-response relationships
may be neutralized at the global level
Pattern Aided Regression ModelingGuozhu Dong 15
DPR in TBI (traumatic brain injury)
Pattern Aided Regression ModelingGuozhu Dong 16
How CPXR builds PXR models
17
Training Data Dfor regression
Local RegressionModels for CPs
PXR model
D = LE U SELE: Large ErrorSE: Small Error
Baseline RegressionModel f0
Mine CPs RepresentativeCPs of LE & SE
How CPXR builds PXR models: (D,f0) (LE,SE) Ps and fs PXR Starting with training dataset Build a baseline regression model f0 (or use given f0) Split data into LE (large error) and SE (small error),
based on f’s prediction error Mine CPs; Remove some CPs Build corresponding local regression models for
remaining CPs, and select patterns to construct PXR Many technical details, including variable binning,
splitting data, search objectives, baseline model type, local model type …
Pattern Aided Regression ModelingGuozhu Dong 18
Control overritting: don’t use P if |mds(P)| <= #vars
Summary of empirical results CPXR is highly accurate for building regression
models Outperforms other regression methods, often by big
margins• On accuracy• On overfitting• On. sensitivity to noise
Exp says: Diverse predictor-response relationships occur often in real life, for data with >=3 dimensions
We used 50 real datasets, and 20+ synthetic onesPattern Aided Regression Modeling
Guozhu Dong 19
Previous regression methods Linear regression (LR): uses a linear function Piecewise linear regression (PLR): splits one
variable into intervals, uses a different linear function for each interval
Support vector regression (SVR): SVM like, but minimizing prediction error
Bayesian additive regression trees (BART): ensemble of (hundreds of) decision trees
Neural networks, Gradient Boosting …
Pattern Aided Regression ModelingGuozhu Dong 20
Interpretability is low
Experiments used 50 real datasets used in previous regression studies
Pattern Aided Regression ModelingGuozhu Dong 21
6 example datasets
CPXR achieved large RMSE reduction (accuracy improvement) consistently
CPXR: highest accuracy in 41 out of 50 datasets (4 competitors)
Average RMSE reduction (relative to LR) of 42% in 50 datasets, much higher than that of best competing methodCPXR achieved 60+% RMSE reduction in 10 out the 50. CPXR is better than LR in all 50 datasets.
22
GBM: Gradient Boosting (generalization of AdaBoost) We also tried other competitors but they are not competitiveCPXR is not better than other methods on random data
RMSE(LR)-RMSE(M)-----------------------------
RMSE(LR)
Box plot of RMSE reduction
Pattern Aided Regression ModelingGuozhu Dong 23
CPXR(LP)’s median > Q3 of PLR,LR,BART
CPXR(LP)’s Q1 > median of PLR,LR,BART
CPXR(LL) is a variant of CPXR
Evaluation on overfitting CPXR is more accurate than PLR, LR and BART on testing
data; its model complexity is fairly low. CPXR has smaller relative accuracy drop (from training to
test)
Pattern Aided Regression ModelingGuozhu Dong 24
Evaluation on sensitivity to noise
Pattern Aided Regression ModelingGuozhu Dong 25
Build PXR on clean training data
Compare accuracies on •training data•noise-added test data
CPXR’s outperformance vs degree of diversity of PR-Relationships
Pattern Aided Regression ModelingGuozhu Dong 26
PIP: Positive impact pattern Diff = ratio of largest coefficient local model making big improvement of pairs of local models
High: CPXR has large RMSE reduction; Low: CPXR has low RMSE reduction
Diverse Predictor-Response Relationships May Neutralize Each Other at High Level
Example: We considered a set S1 of 4 variables, S2=S1 + 2 variables, on soil water content data
For LR, S2 does not give improvement over S1 Ditto for PLR, SVR, BART
For CPXR, PXR model on S2 gives 20% RMSE improvement over PXR model on S1 The new variables are involved in most of the diverse PR
relationships (the patterns in the PXR model) These relationships somehow cancelled each other’s
effect, at the whole data set level. missed by LR etc.
Pattern Aided Regression ModelingGuozhu Dong 27
Other apps of CPXR, besides building accurate prediction models
Analysis on a given prediction model On what kinds of data it make large prediction errors How to correct those prediction errors
Do important models in science and medicine have systematic mistakes?
Analysis on comparing two given prediction models, w.r.t. their differences
Discovering policy errors, niche opportunities, … Discovering true importance of variables (medicine) Discovering intricate multi-variable interactions ……
Pattern Aided Regression ModelingGuozhu Dong 28
CPXR for logistic regression modeling and results on outcome prediction for HF/TBI The PXR-CPXR approach is not limited to linear
regression. We adapted it for logistic regression to get CPXR(Log) We used CPXR(Log) for outcome prediction for
traumatic brain injury patients & heart failure patients. CPXR(Log) is much more accurate than standard
logistic regression and SVM CPXR(Log) also identifies important variables that are
considered unimportant by standard logistic regression
Guozhu Dong: Pattern Aided Regression Modeling 29
Results on TBI
Guozhu Dong: Pattern Aided Regression Modeling 30
AUC of ROC curves for CPXR(Log) and other methods (on HF and TBI)
Guozhu Dong: Pattern Aided Regression Modeling 31
HF TBI
Details on CPXR(Log) Model for TBI
Guozhu Dong: Pattern Aided Regression Modeling 32
Guozhu Dong: Pattern Aided Regression Modeling 33
Work on Heart Failure Patient Risk Prediction
Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam Panahiazar, Jyotishman Pathak: Developing an EHR-driven Heart Failure Risk Prediction Model using CPXR(Log). Submitted to journal
Mayo’s EHR Data: Patient’s demographic data -- age, gender, race and ethnicity. Lab results -- cholesterol, sodium, hemoglobin and lymphocytes, and EF. Medications -- Angiotensin Converting Enzyme (ACE) inhibitors,
Angiotensin Receptor Blockers (ARBs), β-adrenoceptor antagonists (β-blockers), Statins, and Calcium Channel Blocker (CCB).
26 major chronic conditions (co-morbidities). Many variables; many are important/needed
Guozhu Dong: Pattern Aided Regression Modeling 34
Finished CPXR-based projects/papersupto March 2015 Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam Panahiazar,
Jyotishman Pathak: Developing an EHR-driven Heart Failure Risk Prediction Model using CPXR(Log). Submitted to journal
Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov A. Pachepsky. Measurement Scale Effect on Prediction of Soil Water Retention
Curve and Saturated Hydraulic Conductivity. Submitted. Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic
Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. In Proceedings of IEEE International Conference on BioInformatics and BioEngineering (BIBE) 2014
Building accurate loan default risk models for a private company. 2014-2015. Looking for collaborators
Guozhu Dong: Pattern Aided Regression Modeling 35
Summary of strength of PXR/CPXR
More accurate than state-of-the-art methods Philosophy: Using patterns to identify data groups
where given model makes large prediction errors that can be corrected systematically
Using different pattern-model pairs to model diverse predictor-response variable relationships
The approach is better suited to high dimensional data than other methods
Offering insights to mistakes in business strategies …
Guozhu Dong: Pattern Aided Regression Modeling 36
I have focused on PXR/CPXR. I will now discuss other pattern aided problem solving methods and applications
Other methods: Contrast pattern aided clustering Contrast pattern based classification and improvement of
traditional methods Contrast pattern aided gene ranking for complex diseases Contrast pattern aided outlier detection
Applications: Diagnosis of diseases, study of complex diseases, blog analysis, compound selection for drug design, crime environ analysis, apartment rental price prediction, activity recognition, …
37
38
We published a book on CDM in 2012; 3 out of 6 parts on applications
1.Preliminaries and Measures on Contrasts2.Contrast Mining Algorithms3.Mining Generalized Contrasts4.Contrast Mining for Classification & Clustering5.Contrast Mining for Bioinformatics & Chemoinformatics6.Contrast Mining for Special Application Domains44 contributing authors, from ~dozen countries; not comprehensiveMethods used by many scientists.
39
My recent results in this area1.CAEP-style classification: discriminative power aggregation of emerging patterns 2.Outlier detection / intrusion detection: almost model free; using discriminative pattern length3.Clustering quality evaluation using patterns (quality, abundance, diversity): no distance function4.CP based clustering and cluster description: no distance func needed5.Interaction based gene/SNP ranking for complex diseases6.Contrast pattern aided regression: Effectively handling diverse predictor-response relationships
Key Challenges for Pattern Aided Problem Solving
For each problem to solve, our general approach is to use a selected pattern set to help reach our goals.
Q: (1) What kinds of pattern sets? (2) How to use the patterns in the set? (3) How to efficiently search for desired pattern sets?
We need effective techniques There are millions of (contrast) patterns The search space is huge
Pattern Aided Regression ModelingGuozhu Dong 40
Using CPs more efficient
CPCQ Clustering Quality Index•CPCQ Rationale: A high-quality clustering, capturing natural
concepts in data, should have many diversified high-quality contrast
patterns (CPs) contrasting its clusters.
•A CP characterizes its home cluster and discriminates its home
cluster against other clusters.
•Home cluster of a CP: the cluster where it has highest
frequency among all clusters
•Think of a cluster as a class.
CPC AlgorithmContrast Pattern Based Clustering – aimed to maximize CPCQ
C2
tuples
1. Select Seed CPs
S1
S2
items .....
S1
S2
2. Assign Patterns to CP Group G1 of Clusters, Using MPQ
3. Assign Patterns as CPs of Clusters, Using Tuple Overlap
4. Assign Tuples to Clusters, Using Tuple Overlap
S1
S2
C1
PS(C1)
PS(C2)
Clustering data using CPC into groups, each having succinct informative group descriptions
43
• EG: Given a collection of texts/blogs (collected at ASU). • We cluster the blogs into four groups
• each group is associated with a small set of patterns • the patterns clearly indicate what the groups are about.
A special feature is ECP (an adaptation of CAEP)’s ability to accurately classify molecules on the basis of very small training sets containing only a few (e.g. 3 per class) compounds.
This feature is highly relevant for virtual compound screening when very few experimental hits are available as templates.
Reference: Jens Auer et al. Simulation of sequential screening experiments using emerging chemical patterns. Medicinal Chemistry, 4(1):80–90, 2008. [from its abstract]
44
CAEP: Semi-supervised chemical compound screening with few training samples
IBIG vs IG & FC for Colon Cancer
45
Traditional generanking methods:
Fold change & entropy:rank genes by considering impact of one gene at a time.
Are they suitable for complex diseases?
Genes lowly ranked by FC could be highly ranked by IBIG
Quote from our Contrast Data Mining book
… the most important contribution of contrast mining will come when we no longer need … simplifying approaches to handle … challenge of high dimensional data, when we have developed the methodology to systematically analyze, and accurately use, sets of multi-feature contrast patterns …. … contrast mining has made useful progress …. Success … will have a large impact on … handling of intrinsically complex processes, such as complex diseases whose behaviors are influenced by the interaction of multiple … factors.
46
Potentials (1): Develop New Pattern Aided Methods
Selecting set of patterns can help solve existing challenging problem in much better way Identify such problems, work on them …
Using sets of patterns can lead to systematic ways of handling multiple multi-variable interactions
Contrast mining, pattern aided problem solving, & pattern aided data analytics, have potential to help effectively handle challenges of high dimensional data
47
Potentials (2): Use Current Pattern Aided Methods to Solve Problems
Our developed methods can help perform regression more accurately, characterize & correct errors of given models, for vital applications in science, medicine, & economics Improving scientific models Changing what we believe in?
Our developed clustering, classification, outlier detection, gene ranking, multiple multi-variable interaction mining/selection methods can help solve challenging problems and offer new insights
48
April 19, 2023 49
www.cs.wright.edu/~gdong
QuestionsQuestions
Next step – wish to find collaborators To work on high impact prediction modeling problems