III 1
Sorin Alexe
RUTCOR, Rutgers University, Piscataway, NJ
e-mail: [email protected]
URL: rutcor.rutgers.edu/~salexe
Datascope - a new tool for Logical Analysis of Data (LAD)
DIMACS Mixer Series, September 19, 2002
III 4
LAD - Theories, Models, Classifications
Positive Theory
Negative Theory
Model
III 5
Datascope Functions
Support Set Identification
Space Discretization
Pattern Detection
Model Construction
Discriminant / Prognostic Index
Classification
Feature Analysis
III 6
Datascope Dataflow
[Diagram; box labels: Raw Data, Pre-Processing, Discretization, Pandect Generation, Theories/Models, Discriminant Construction, Matlab Solver, Internal Solver, Significant Features, Cutpoints, Support Set, Feature Analysis, Pattern Space, Pattern Report, Diagnosis, Prognosis, Risk Stratification, User Excel Model]
III 7
1. Support Set Identification
Selects Small Subset of Significant Features
Preserves Hidden Knowledge
Feature Ranking Criteria:
Statistical Correlation with Outcome
Combinatorial Entropy
Distribution Monotonicity
Class Separation
Envelope Eccentricity
E.g., 10 proteins selected out of 15,144
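The first ranking criterion, statistical correlation with the outcome, can be sketched as follows. This is a minimal illustration and not Datascope's implementation: the function names `pearson` and `rank_features`, the toy data, and the choice of Pearson correlation as the statistic are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)

def rank_features(columns, outcome, k):
    """Return the names of the k features with the largest |correlation|."""
    scores = {name: abs(pearson(col, outcome)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: three features, four observations, 0/1 outcome.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0],   # tracks the outcome closely
    "f2": [5.0, 5.0, 5.0, 5.0],   # constant, so uninformative
    "f3": [2.0, 1.0, 1.0, 3.0],   # weakly related
}
outcome = [0, 0, 1, 1]
top = rank_features(columns, outcome, k=2)
```

Selecting 10 proteins out of 15,144 would correspond to calling `rank_features` with `k=10` on a much wider table; the other criteria on the slide would plug in as alternative scoring functions.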
III 8
Data
Spreadsheet Oriented: OLE (via Clipboard) / Excel Spreadsheet / dBase tables
Training / Test Generation: Bootstrap, k-Folding, Jackknife
New Features: Correlation
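The three training/test generation schemes named on this slide can be sketched as index generators. This is an illustration under assumptions: the function names are hypothetical, jackknife is read as leave-one-out, and bootstrap is read as sampling with replacement with the unsampled observations held out for testing.

```python
import random

def k_fold(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    idx = list(range(n))
    for f in range(k):
        test = idx[f::k]                      # every k-th index, offset f
        held_out = set(test)
        yield [i for i in idx if i not in held_out], test

def jackknife(n):
    """Leave-one-out: each observation is the test set exactly once."""
    return k_fold(n, n)

def bootstrap(n, rng):
    """Sample n training indices with replacement; the rest form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    return train, [i for i in range(n) if i not in chosen]

folds = list(k_fold(10, 5))
loo = list(jackknife(4))
boot_train, boot_test = bootstrap(10, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the bootstrap sample reproducible across experiments.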
III 10
2. Space Discretization
Criteria:
Entropy
Correlation with Output
Bins (equipartitioning)
Intervals
Clustered
Class Separation
Parameter Choice: User Defined, Minimizing Support Set
Quality Measures: Entropy, Separability
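Two of the discretization criteria above, entropy and bins, can be sketched for a single numeric feature. The sketch assumes that the entropy criterion picks the cutpoint minimizing the size-weighted class entropy of the two sides, and that "equipartitioning" means equal-width intervals; both readings, and all function names, are assumptions rather than Datascope's documented behavior.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_cutpoint(values, labels):
    """Return (cutpoint, weighted entropy) minimizing class entropy after the split."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                          # no cut between equal values
        cut = (a + b) / 2
        left = [y for x, y in pairs if x < cut]
        right = [y for x, y in pairs if x >= cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

def equal_width_cutpoints(values, bins):
    """Cutpoints that split the observed range into `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / bins for i in range(1, bins)]

# Hypothetical toy feature: the two classes separate cleanly between 3 and 7.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
cut, score = best_cutpoint(values, labels)
```

A weighted entropy of zero means the cutpoint separates the classes perfectly on this feature, which relates to the "Separability" quality measure.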
III 12
3. Generation of Maximal Patterns
Pattern Type Selection:
Prime
Cones
Intervals
Spanned
Parameter Bound Settings:
Prevalence: % of positive observations, % of negative observations
Homogeneity: on positive patterns, on negative patterns
Degree
Post-Generation Filters:
By Characteristics
Maximality
Strongness
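The three bounded parameters can be made concrete with a small sketch. It assumes the usual LAD reading: a pattern is a conjunction of interval conditions, its degree is the number of conditions, its prevalence is the share of same-class observations it covers, and its homogeneity is the share of covered observations that belong to its class. The dictionary representation and the function names are hypothetical.

```python
def covers(pattern, obs):
    """A pattern maps feature -> (low, high); it covers obs if every bound holds."""
    return all(lo <= obs[f] <= hi for f, (lo, hi) in pattern.items())

def pattern_stats(pattern, positives, negatives):
    """Degree, prevalence and homogeneity of a candidate positive pattern."""
    pos_hits = sum(covers(pattern, o) for o in positives)
    neg_hits = sum(covers(pattern, o) for o in negatives)
    covered = pos_hits + neg_hits
    return {
        "degree": len(pattern),
        "prevalence": pos_hits / len(positives),
        "homogeneity": pos_hits / covered if covered else 0.0,
    }

# Hypothetical toy data with two features.
positives = [{"x1": 5.0, "x2": 1.0}, {"x1": 6.0, "x2": 2.0}, {"x1": 2.0, "x2": 1.0}]
negatives = [{"x1": 1.0, "x2": 3.0}, {"x1": 5.5, "x2": 9.0}]
# Candidate pattern: x1 >= 4 and x2 <= 2.5
pattern = {"x1": (4.0, float("inf")), "x2": (float("-inf"), 2.5)}
stats = pattern_stats(pattern, positives, negatives)
```

Here the pattern covers two of the three positive observations and no negative ones, so it would pass a homogeneity bound of 100% and a prevalence bound of up to about 66%.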
III 13
Positive Patterns
[Table; columns: Pattern Definition, Training Set, Test Set]
III 14
Negative Patterns
[Table; columns: Pattern Definition, Training Set, Test Set]
III 15
4. Theories and Models
Pandect
Theory Selection via:
Greedy
Bottleneck Greedy
Lexicographic Greedy
Set Covering Heuristics
Model Selection:
2 Set-Covering Problems
Quadratic Set-Covering Problem
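The plain greedy option for theory selection can be read as the classic greedy set-cover heuristic: repeatedly add the pattern that covers the most still-uncovered observations. The sketch below assumes that reading; the function name, the pattern names, and the coverage sets are hypothetical, and the bottleneck and lexicographic variants are not shown.

```python
def greedy_cover(coverage, universe):
    """Pick patterns greedily until every coverable observation is covered.

    coverage maps a pattern name to the set of observation ids it covers.
    Returns the chosen pattern names and any observations left uncovered.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        gain = coverage[name] & uncovered
        if not gain:
            break                             # remaining observations are not coverable
        chosen.append(name)
        uncovered -= gain
    return chosen, uncovered

# Hypothetical pandect of four positive patterns over five positive observations.
coverage = {
    "P1": {1, 2, 3},
    "P2": {3, 4},
    "P3": {4, 5},
    "P4": {1, 5},
}
theory, missed = greedy_cover(coverage, universe={1, 2, 3, 4, 5})
```

Running the same routine once on the positive patterns and once on the negative patterns is one way to read "2 Set-Covering Problems" under Model Selection, though the slide does not spell this out.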
III 21
5. Discriminants
Weight Selection Methods:
Direct
1. Prognostic Index
2. Weighted Prognostic Index
LP-Based
3. Distance Maximizing Separator (SVM)
4. Cost Minimizing Separator
5. Expected Value Separator
NLP-Based
6. Regression in Pattern Space (ANN)
7. Best Correlation with Output
(weighted sums of patterns)
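Since every method above is a weighted sum of patterns, the two direct methods can be sketched in a few lines. The sketch assumes a common LAD formulation: the prognostic index is the share of positive patterns triggered by an observation minus the share of negative patterns triggered, and the weighted variant replaces equal weights with user-supplied ones. The exact formulas used by Datascope are not given on the slide, so the function names, the normalization, and the zero threshold are assumptions. Both pattern lists are assumed to be non-empty.

```python
def discriminant(pos_hits, neg_hits, pos_weights=None, neg_weights=None):
    """Weighted share of triggered positive patterns minus that of negative patterns.

    pos_hits / neg_hits are 0/1 lists: does each pattern cover the observation?
    With no weights given, every pattern counts equally (plain prognostic index).
    """
    pos_weights = pos_weights or [1.0] * len(pos_hits)
    neg_weights = neg_weights or [1.0] * len(neg_hits)
    pos = sum(w * h for w, h in zip(pos_weights, pos_hits)) / sum(pos_weights)
    neg = sum(w * h for w, h in zip(neg_weights, neg_hits)) / sum(neg_weights)
    return pos - neg

def classify(score, threshold=0.0):
    """Map a discriminant value to a class, leaving ties unclassified."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "unclassified"

# Hypothetical observation covered by 2 of 3 positive and 1 of 2 negative patterns.
plain = discriminant([1, 1, 0], [0, 1])
weighted = discriminant([1, 1, 0], [0, 1], pos_weights=[3.0, 1.0, 1.0])
label = classify(plain)
```

The LP-based and NLP-based methods on the slide would differ only in how the weights are chosen, not in how the weighted sum is evaluated.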
III 22
[Figure panels: Prognostic Index; Weighted Prognostic Index; Expected Value Separator; Distance Maximizing Separator; Cost Minimizing Separator; Best Correlation with Output]
III 25
Reporting
Cutpoints
Discretized Space
Pandect
Coverage of Observations by Patterns
Pattern Report (Compact/Full Versions)
Theories/Models
Attribute Analysis
Log File
III 26
Pattern Space
[Figure: Training and Test panels, each showing observations against positive (+) and negative (-) patterns; legend: Positive Observations, Unclassified Observations, Negative Observations]
III 28
Validation Procedures
Raw Data
Stratified Random Partition (Bootstrap, K-Folding, Jackknife)
LAD Model on Training Set
Performance Evaluation (Accuracy, Sensitivity, Specificity)
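The validation loop on this slide (stratified partition, then performance evaluation) can be sketched as below. The function names and toy labels are hypothetical, labels are assumed to be 0/1 with 1 as the positive class, and both classes are assumed to be present in the evaluated set so that sensitivity and specificity are defined.

```python
import random

def stratified_split(labels, test_fraction, rng):
    """Split indices so each class keeps roughly the same share in train and test."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test

def evaluate(truth, predicted):
    """Accuracy, sensitivity and specificity for 0/1 labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, predicted))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, predicted))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, predicted))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, predicted))
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

labels = [1] * 6 + [0] * 4
train, test = stratified_split(labels, test_fraction=0.5, rng=random.Random(1))
metrics = evaluate(truth=[1, 1, 1, 0, 0], predicted=[1, 1, 0, 0, 1])
```

In a full experiment the model would be built on `train`, its predictions on `test` passed to `evaluate`, and the whole procedure repeated over many partitions.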
III 29
Special Features
User Model Generation (Excel Files)
Datascope Macro Language (Multiple and Complex Experiments)
Interface with Other Applications (Datascope Server)
III 30
Performance
Comparative results for 5 datasets from the Irvine repository: LAD and 33 other algorithms

Dataset   Naive   Best (B)   Worst (W)   LAD (L)   WBL     Accuracy
bcw       35      3          9           3.5        0.08    99.48%
bld       42      28         43          27.8      -0.01   100.28%
hea       44      14         34          14.7       0.04    99.19%
pid       33      22         31          21.5      -0.06   100.64%
vot       39      4          6           4.6        0.30    99.38%
average                                             0.07    99.79%

1: Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms, Machine Learning, 40, 203-229 (2000)
http://www.ics.uci.edu/~mlearn/MLRepository.html
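The slide does not define the two derived columns of the comparative table (the small values 0.08, -0.01, ... and the Accuracy percentages). The sketch below checks one reading that reproduces every tabulated value: the first is (L - B) / (W - B), the position of LAD between the best and worst algorithm, and the second is (100 - L) / (100 - B), LAD's accuracy relative to the best. Both formulas are inferred from the numbers, not stated in the source, and they presuppose that B, W and L are error rates in percent.

```python
from decimal import Decimal, ROUND_HALF_UP

# (best B, worst W, LAD L) as transcribed from the comparative table
rows = {
    "bcw": ("3", "9", "3.5"),
    "bld": ("28", "43", "27.8"),
    "hea": ("14", "34", "14.7"),
    "pid": ("22", "31", "21.5"),
    "vot": ("4", "6", "4.6"),
}

def two_places(x):
    """Round to two decimals, half up, as a printed table would."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def wbl(best, worst, lad):
    """Position of LAD between best (0) and worst (1): (L - B) / (W - B)."""
    return (lad - best) / (worst - best)

def relative_accuracy(best, lad):
    """LAD accuracy as a percentage of the best accuracy."""
    return 100 * (100 - lad) / (100 - best)

recomputed = {}
for name, triple in rows.items():
    b, w, l = (Decimal(v) for v in triple)
    recomputed[name] = (two_places(wbl(b, w, l)), two_places(relative_accuracy(b, l)))
```

`Decimal` is used instead of floats because two of the entries (hea and vot) fall exactly on a rounding boundary, where binary floating point would round the wrong way. A negative first value or an accuracy above 100% means LAD's error rate was below that of the best of the 33 algorithms on that dataset.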
III 31
LAD Case Studies
Assessing Long-Term Mortality Risk After Exercise Electrocardiography
Ovarian Cancer Detection Using Proteomic Data
Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays
Cell Proliferation on Medical Implants
Country Risk Rating