Introduction to Predictive Learning
Transcript of Introduction to Predictive Learning
Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 9: Nonstandard Learning Approaches
OUTLINE
• Motivation for non-standard approaches
  - Learning with sparse high-dimensional data
  - Formalizing application requirements
  - Philosophical motivation
• New Learning Settings
  - Transduction
  - Universum Learning
  - Learning Using Privileged Information
  - Multi-Task Learning
• Summary
Sparse High-Dimensional Data
• Recall the standard inductive learning setting
• High-dimensional, low sample size (HDLSS) data:
  - gene microarray analysis
  - medical imaging (e.g., sMRI, fMRI)
  - object and face recognition
  - text categorization and retrieval
  - web search
• Sample size smaller than dimensionality: d ~ 10K–100K, n ~ 100
• Standard learning methods do not work well for such data
• How can generalization be improved?
• How do humans learn?
How to improve generalization for HDLSS?
Conventional approaches use standard inductive learning + a priori knowledge:
• preprocessing and feature selection (preceding learning)
• model parameterization (~ selection of good kernels)
• informative prior distributions (in statistical methods)
Non-standard learning formulations:
• seek new generic formulations (not methods!) that better reflect application requirements
• a priori knowledge + additional data are used to derive new problem formulations
The Approach Known as Big Data
The SRM inductive principle guarantees convergence to optimal/good generalization when:
• the training sample size grows large
• model complexity is optimally controlled
Big Data paradigm:
• the main focus/issue is how to design powerful learning algorithms for large data sets: DATA = SIGNAL + NOISE
• the standard inductive learning setting is usually assumed
Formalizing Application Requirements
• Classical statistics: a parametric model is given (by experts)
• Modern applications: a complex iterative process
Non-standard (alternative) formulations may be better.

[Diagram: APPLICATION NEEDS (loss function; input, output, and other variables; training/test data; admissible models) → FORMAL PROBLEM STATEMENT → LEARNING THEORY]
Philosophical Motivation
• Philosophical view 1 (Realism):
Learning ~ search for the truth (estimation of the true dependency from available data)
System identification ~ inductive learning, where a priori knowledge is about the true model
Philosophical Motivation (cont’d)
• Philosophical view 2 (Instrumentalism):
Learning ~ search for instrumental knowledge (estimation of a useful dependency from available data)
VC-theoretical approach ~ focus on the learning formulation
VC-theoretical approach
• Focus on the learning setting rather than on learning methods
• The learning formulation depends on:
(1) available data
(2) application requirements
(3) a priori knowledge (assumptions)
• Factors (1)–(3), combined using Vapnik’s Keep-It-Direct (KID) Principle, yield a learning formulation
Contrast these two approaches
• Conventional (statistics, data mining):
a priori knowledge typically reflects properties of a true (good) model $f(\mathbf{x}, w)$, i.e.,
a priori knowledge ~ parameterization
• Why should a priori knowledge be about the true model?
• VC-theoretic approach:
a priori knowledge ~ how to incorporate available data into the problem formulation;
often a priori knowledge ~ available data samples of a different type → new learning settings
Examples of Nonstandard Settings
• Standard inductive setting, e.g., digits 5 vs. 8:
finite training set $(\mathbf{x}_i, y_i)$; predictive model derived using only training data; prediction for all possible test inputs
• Possible modifications:
- Transduction: predict only for given test points
- Universum Learning: available labeled data ~ examples of digits 5 and 8; unlabeled examples ~ other digits
- Learning Using Privileged Information: training data provided by t different persons; the group label is known only for training data, but not available for test data
- Multi-Task Learning: training data ~ t groups (from different persons); test data ~ t groups (group label is known)
SVM-style Framework for New Settings
• Conceptual reasons: additional info/data → a new type of SRM structure
• Technical reasons: new knowledge is encoded as additional constraints on complexity (in an SVM-like setting)
• Practical reasons:
- new settings are directly comparable to standard SVM
- standard SVM is a special case of a new setting
- optimization software for new settings may require only minor modification of existing standard SVM software
OUTLINE
• Motivation for non-standard approaches
• New Learning Settings
  - Transduction
  - Universum Learning
  - Learning Using Privileged Information
  - Multi-Task Learning
Note: all settings assume binary classification
• Summary
Transduction (Vapnik, 1982, 1995)
• How to incorporate unlabeled test data into the learning process? Assume binary classification.
• Estimating a function at given points
Given: labeled training data $(\mathbf{x}_i, y_i),\ i = 1,\dots,n$, and unlabeled test points $\mathbf{x}_j^*,\ j = 1,\dots,m$
Estimate: class labels $\mathbf{y}^* = (y_1^*, \dots, y_m^*)$ at these test points
Goal of learning: minimization of risk on the test set:

$$R(\mathbf{y}^*) = \frac{1}{m}\sum_{j=1}^{m}\int L(y_j^*, y)\, dP(y \mid \mathbf{x}_j^*)$$

where $(y_1^*, \dots, y_m^*) = (f(\mathbf{x}_1^*), \dots, f(\mathbf{x}_m^*))$
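With the 0/1 loss, the empirical version of this test-set risk is just the misclassification rate over the m working points. A minimal numpy sketch (the function name and toy labels are illustrative, not from the slides):

```python
import numpy as np

def transductive_risk(y_pred, y_true):
    """Empirical 0/1 risk on the m working (test) points:
    R = (1/m) * sum_j L(y*_j, y_j), with L the 0/1 loss."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

# Hypothetical labels for m = 4 working points: one mistake -> risk 0.25
print(transductive_risk([1, -1, 1, 1], [1, -1, -1, 1]))  # 0.25
```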
Transduction vs Induction
[Diagram: a priori knowledge (assumptions) and training data → (induction) → estimated function → (deduction) → predicted output; transduction goes directly from the training data to the predicted output]
Transduction based on size of margin
• Binary classification, linear parameterization, joint set of (training + working) samples
Note: working sample ~ unlabeled test point
• Simplest case: a single unlabeled (working) sample
• Goals of learning:
(1) explain well the available data (~ joint set)
(2) achieve maximum falsifiability (~ large margin)
~ classify the (training + working) samples by the largest possible margin hyperplane (see below)
Margin-based Local Learning• Special case of transduction: single working point
• How to handle many unlabeled samples?
Transduction based on size of margin
• Transductive SVM learning has two objectives:
(TL1) separate labeled data using a large-margin hyperplane (as in standard SVM)
(TL2) separate working data using a large-margin hyperplane
Loss function for unlabeled samples
• Non-convex loss function (shown in the figure)
• Transductive SVM constructs a large-margin hyperplane for labeled samples AND forces this hyperplane to stay away from unlabeled samples
Optimization formulation for SVM transduction
• Given: joint set of (training + working) samples
• Denote slack variables $\xi_i$ for training samples, $\xi_j^*$ for working samples
• Minimize

$$R(\mathbf{w}, b) = \frac{1}{2}(\mathbf{w}\cdot\mathbf{w}) + C\sum_{i=1}^{n}\xi_i + C^*\sum_{j=1}^{m}\xi_j^*$$

subject to

$$y_i[(\mathbf{w}\cdot\mathbf{x}_i) + b] \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n$$
$$y_j^*[(\mathbf{w}\cdot\mathbf{x}_j^*) + b] \ge 1 - \xi_j^*,\quad \xi_j^* \ge 0,\quad j = 1,\dots,m$$

where $y_j^* = \operatorname{sign}((\mathbf{w}\cdot\mathbf{x}_j^*) + b),\ j = 1,\dots,m$.
Solution (~ decision boundary): $f(\mathbf{x}) = (\mathbf{w}\cdot\mathbf{x}) + b$
• Unbalanced situation (small training set / large test set) → all unlabeled samples get assigned to one class
• Additional balancing constraint:

$$\frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{m}\sum_{j=1}^{m}[(\mathbf{w}\cdot\mathbf{x}_j^*) + b]$$
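Since the working labels are implied by sign(w·x + b), the primal T-SVM cost can be evaluated directly: a standard hinge loss on labeled samples plus a symmetric hinge on the unlabeled ones. A minimal numpy sketch (function name and toy numbers are illustrative, not from the slides):

```python
import numpy as np

def tsvm_objective(w, b, X, y, X_work, C, C_star):
    """Primal T-SVM objective: margin term + hinge loss on labeled
    samples + symmetric hinge loss on working samples, whose labels
    are taken as sign(w.x + b), so their hinge term is 1 - |w.x + b|."""
    margins = y * (X @ w + b)
    labeled_loss = np.maximum(0.0, 1.0 - margins).sum()
    # working samples: penalized while they project inside the margin
    unlabeled_loss = np.maximum(0.0, 1.0 - np.abs(X_work @ w + b)).sum()
    return 0.5 * (w @ w) + C * labeled_loss + C_star * unlabeled_loss
```

Because of the absolute value in the unlabeled term, this objective is non-convex in (w, b), which is exactly the non-convexity discussed on the next slide.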
Optimization formulation (cont’d)
• Hyperparameters $C$ and $C^*$ control the trade-off between explanation and falsifiability
• Soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks $\xi_j^* = 0$
• A dual + kernel version of SVM transduction exists
• Transductive SVM optimization is not convex (~ non-convexity of the loss for unlabeled data): different optimization heuristics lead to different solutions
• An exact solution (via exhaustive search over test labels) is possible only for a small number of test samples m, but such a solution is NOT very useful (~ inductive SVM)
Many applications for transduction
• Text categorization: classify text documents into a number of predetermined categories
• Email classification: spam vs. non-spam
• Web page classification
• Image database classification
• All these applications share:
- high-dimensional data
- a small labeled training set (human-labeled)
- a large unlabeled test set
Example application
• Prediction of molecular bioactivity for drug discovery
• Training data ~ 1,909 samples; test data ~ 634 samples
• Input space ~ 139,351-dimensional
• Prediction accuracy: SVM induction ~ 74.5%; transduction ~ 82.3%
Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003
Semi-Supervised Learning (SSL)
• SSL assumes availability of labeled + unlabeled data (similar to transduction)
• SSL has the goal of estimating an inductive model for predicting new (test) samples – different from transduction
• In machine learning, SSL and transduction are often used interchangeably; e.g., transduction can be used to estimate an SSL model
• SSL methods usually combine supervised and unsupervised learning methods into one algorithm
SSL and the Cluster Assumption
• Cluster Assumption: real-life application data often have clusters, due to (unknown) correlations between input variables. Discovering these clusters using unlabeled data helps supervised learning.
• Example: document classification and information retrieval
- individual words ~ input features (for classification)
- uneven co-occurrence of words implies clustering of the documents in the input space
- unlabeled documents can be used to identify this cluster structure, so that just a few labeled examples are sufficient for constructing a good decision rule
Toy Example for Text Classification
• Data set: 5 documents to be classified into 2 topics ~ ‘Economy’ and ‘Entertainment’
• Each document is defined by 6 features (words)
- two labeled documents (L1, L2)
- three unlabeled documents (U3, U4, U5) need to be classified
• Applying clustering → 2 clusters: (x1, x3) and (x2, x4, x5)
Word-occurrence matrix (rows: documents, columns: words):

| Doc | Budget deficit | Music | Seafood | Sex | Interest rates | Movies |
|-----|----------------|-------|---------|-----|----------------|--------|
| L1  | 1              |       |         |     |                |        |
| L2  |                | 1     |         | 1   |                | 1      |
| U3  | 1              |       |         |     | 1              |        |
| U4  |                | 1     |         |     |                |        |
| U5  |                |       | 1       |     |                | 1      |
Self-Learning Method (an example of SSL)
Given: an initial labeled data set L and an unlabeled set U
• Repeat:
(1) estimate a classifier using L
(2) classify a randomly chosen unlabeled sample using the decision rule estimated in step (1)
(3) move this newly labeled sample from U to L
Iterate steps (1)–(3) until all unlabeled samples are classified
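The loop above can be sketched with a 1-NN decision rule, the classifier used in the following illustration. A minimal numpy version, assuming Euclidean distance (names and data are illustrative):

```python
import numpy as np

def self_learning_1nn(X_lab, y_lab, X_unl, seed=0):
    """Self-learning SSL sketch (steps 1-3 above): pick a random
    unlabeled sample, classify it with the 1-NN rule on the current
    labeled set L, and move it into L; repeat until U is empty."""
    rng = np.random.default_rng(seed)
    X_lab = [np.asarray(x, float) for x in X_lab]
    y_lab = list(y_lab)
    pool = [np.asarray(x, float) for x in X_unl]
    while pool:
        x = pool.pop(rng.integers(len(pool)))        # step (2): random unlabeled sample
        dists = [np.linalg.norm(x - xi) for xi in X_lab]
        y = y_lab[int(np.argmin(dists))]             # step (1): 1-NN decision rule on L
        X_lab.append(x)                              # step (3): move the sample to L
        y_lab.append(y)
    return X_lab, y_lab
```

Note that early labeling mistakes propagate: once a sample enters L with a wrong label, later 1-NN decisions can inherit the error.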
Illustration (using a 1-nearest-neighbor classifier)
Hyperbolas data: 10 labeled and 100 unlabeled samples (green)
[Figure: Iteration 1 – scatter plot of unlabeled samples (green), class +1, class −1]
Illustration (after 50 iterations)
[Figure: Iteration 50 – scatter plot of unlabeled samples (green), class +1, class −1]
Illustration (after 100 iterations)
All samples are labeled now:
[Figure: Iteration 100 – scatter plot of class +1 and class −1 samples]
Comparison: SSL vs T-SVM
• Comparison 1, low-dimensional data:
- Hyperbolas data set (10 labeled, 100 unlabeled samples)
- 10 random realizations of training data
• Comparison 2, high-dimensional data:
- Digits 5 vs 8 (100 labeled, 1,866 unlabeled samples)
- 10 random realizations of training/validation data
Note: the validation data set is used for SVM model selection
• Methods used:
- self-learning algorithm (using 1-NN classification)
- nonlinear T-SVM (needs parameter tuning)
Comparison 1: SSL vs T-SVM and SVM
Methods used:
- self-learning algorithm (using 1-NN classification)
- nonlinear T-SVM (polynomial kernel, d = 3)
• The self-learning method is better than SVM or T-SVM – why?

Hyperbolas data set. Setting: σ = 0.025; number of training and validation samples = 10 (5 per class); number of test samples = 100 (50 per class).

|                      | Polynomial SVM (d=3) | T-SVM (d=3)  | Self-Learning |
|----------------------|----------------------|--------------|---------------|
| Prediction error (%) | 11.8 (2.78)          | 10.9 (2.60)  | 4.0 (3.53)    |
| Typical parameters   | C = 700–7000         | C*/C = 10⁻⁶  | N/A           |
Comparison 2: SSL vs T-SVM and SVM
Methods used:
- self-learning algorithm (using 1-NN classification)
- nonlinear T-SVM (RBF kernel)
• SVM or T-SVM is better than the self-learning method – why?

MNIST data set. Setting: number of training and validation samples = 100; number of test samples = 1,866.

|                                   | RBF SVM                    | RBF T-SVM       | Self-Learning |
|-----------------------------------|----------------------------|-----------------|---------------|
| Test error (%)                    | 5.79 (1.80)                | 4.33 (1.34)     | 7.68 (1.26)   |
| Typical selected parameter values | C ≈ 0.3–7, γ ≈ 5⁻³–5⁻²     | C*/C ≈ 0.01–1   | N/A           |
Explanation of T-SVM for the digits data set
Histograms of projections of labeled + unlabeled data:
- for standard SVM (RBF kernel) ~ test error 5.94%
- for T-SVM (RBF kernel) ~ test error 2.84%
Histogram for RBF SVM (with optimally tuned parameters):
Explanation of T-SVM (cont’d)
Histogram for T-SVM (with optimally tuned parameters)
Note: (1) test samples are pushed outside the margin borders
(2) most labeled samples project away from the margin
Universum Learning (Vapnik 1998, 2006)
• Motivation: what is a priori knowledge?
- info about the space of admissible models
- info about admissible data samples
• Labeled training samples (as in the inductive setting) + unlabeled samples from the Universum
• Universum samples encode info about the region of input space where the application data lives:
- they come from a different distribution than the training/test data
- U-samples ~ neither positive nor negative class
• Examples of Universum data follow
• Large improvement for small sample size n
Cultural Interpretation of the Universum
[Image: a face that is neither Hillary nor Obama, but looks like both]
• Absurd examples, jokes, some art forms
Cultural Interpretation of the Universum
Marc Chagall: FACES
Cultural Interpretation of the Universum
Marcel Duchamp (1919): Mona Lisa with Mustache
• Some art forms: surrealism, dadaism
More on Marcel Duchamp
Rrose Sélavy (Marcel Duchamp), 1921, Photo by Man Ray.
Main Idea of Universum Learning
Fig. courtesy of J. Weston (NEC Labs)
• Handwritten digit recognition: digit 5 vs 8
Learning with the Universum
• Inductive setting for binary classification
Given: labeled training data $(\mathbf{x}_i, y_i),\ i = 1,\dots,n$, and unlabeled Universum samples $\mathbf{x}_j^*,\ j = 1,\dots,m$
Goal of learning: minimization of prediction risk (as in the standard inductive setting)
• Two goals of Universum learning:
(UL1) separate/explain the labeled training data using a large-margin hyperplane (as in standard SVM)
(UL2) maximize the number of contradictions on the Universum, i.e., Universum samples inside the margin
Goal (UL2) is achieved by using a special loss function for Universum samples
Inference through contradictions
Universum SVM Formulation (U-SVM)
• Given: labeled training samples + unlabeled Universum samples
• Denote slack variables $\xi_i$ for training samples, $\xi_j^*$ for Universum samples
• Minimize

$$R(\mathbf{w}, b) = \frac{1}{2}(\mathbf{w}\cdot\mathbf{w}) + C\sum_{i=1}^{n}\xi_i + C^*\sum_{j=1}^{m}\xi_j^*$$

subject to

$$y_i[(\mathbf{w}\cdot\mathbf{x}_i) + b] \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n \quad \text{(labeled data)}$$
$$|(\mathbf{w}\cdot\mathbf{x}_j^*) + b| \le \varepsilon + \xi_j^*,\quad \xi_j^* \ge 0,\quad j = 1,\dots,m \quad \text{(Universum)}$$

where the Universum samples use the $\varepsilon$-insensitive loss.
• Convex optimization
• Hyperparameters $C, C^* \ge 0$ control the trade-off between minimization of errors and maximization of the number of contradictions
• When $C^* = 0$ → standard soft-margin SVM
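Written as a penalty, the ε-insensitive loss on a Universum output f(x) = w·x + b is max(0, |f| − ε): zero while the sample projects inside the ε-band around the hyperplane (a contradiction), linear outside it. A minimal numpy sketch:

```python
import numpy as np

def universum_loss(f, eps):
    """Epsilon-insensitive loss for Universum samples: zero while the
    decision value f = w.x + b lies inside the band [-eps, eps]
    (the sample counts as a contradiction), |f| - eps outside it."""
    return np.maximum(0.0, np.abs(f) - eps)

print(universum_loss(np.array([0.0, 0.1, -0.5, 1.2]), eps=0.25))
# -> [0.   0.   0.25 0.95]
```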
ε-insensitive loss for Universum samples
[Figure: loss vs. the SVM output f(x) for a Universum sample – zero inside the interval [−ε, ε] around the hyperplane, growing linearly outside it]
Application Study (Vapnik, 2006)
• Binary classification of handwritten digits 5 and 8
• For this binary classification problem, the following Universum sets were used:
U1: randomly selected other digits (0, 1, 2, 3, 4, 6, 7, 9)
U2: randomly mixing pixels from images of 5 and 8
U3: averages of randomly selected examples of 5 and 8
• Training set sizes tried: 250, 500, …, 3,000 samples
• Universum set size: 5,000 samples
• Prediction error: improved over standard SVM, e.g., for 500 training samples: 1.4% vs. 2% (SVM)
Universum U3 via random averaging (RA)
[Figure: one sample from class +1 and one from class −1 are averaged; the averaged sample falls near the separating hyperplane]
Random Averaging for digits 5 and 8
• Two randomly selected examples
• Universum sample: their pixel-wise average
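This averaging step can be sketched in a few lines of numpy (function name and arguments are illustrative; inputs are flattened images or any feature vectors):

```python
import numpy as np

def random_averaging_universum(X_pos, X_neg, n_u, seed=0):
    """U3-style Universum via random averaging (RA): each Universum
    sample is the element-wise mean of one randomly chosen example of
    class +1 (e.g., a '5') and one of class -1 (e.g., an '8')."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_pos), size=n_u)   # random picks from class +1
    j = rng.integers(len(X_neg), size=n_u)   # random picks from class -1
    return 0.5 * (np.asarray(X_pos)[i] + np.asarray(X_neg)[j])
```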
Application Study: gender of human faces
• Binary classification setting
• Difficult problem:
- dimensionality ~ large (10K–20K)
- labeled sample size ~ small (~10–20)
• Humans perform very well on this task
• Issues:
- possible improvement (vs. standard SVM)
- how to choose the Universum?
- model parameter tuning
Male Faces: examples
Female Faces: examples
Universum Faces: neither male nor female
Empirical Study (Bai and Cherkassky, 2008)
• Gender classification of human faces (male/female)
• Data: 260 pictures of 52 individuals (20 females and 32 males, 5 pictures per individual) from the Univ. of Essex
• Data representation and pre-processing: image size 46x59, converted to gray-scale, followed by standard image processing (histogram equalization)
• Training data: 5 female and 8 male photos
• Test data: remaining 39 photos (of other people)
• Experimental procedure: randomly select 13 training samples (and 39 test samples). Estimate and compare an inductive SVM classifier with a U-SVM classifier using N Universum samples (N = 100, 500, 1000).
- Results are reported for 4 partitions (of training and test data)
Empirical Study (cont’d)
• Universum generation:
U1 Average: averages of male and female samples randomly selected from the training set
U2 Empirical distribution: estimate the pixel-wise distribution of the training data and generate a new picture from this distribution
U3 Animal faces
Universum generation: examples
• U1 Averaging
• U2 Empirical distribution
Results of gender classification
• Classification accuracy improves vs. standard SVM: by ~2% with the U1 Universum and by ~1% with U2
• The Universum by averaging gives better results for this problem when the number of Universum samples is N = 500 or 1,000 (shown above for 4 partitionings of the data)
Results of gender classification
• Universum ~ animal faces:
- degrades classification accuracy by 2–5% (vs. standard SVM)
- animal faces are not relevant to this problem
Random Averaging Universum
• How to come up with a good Universum?
- usually application-specific
But:
• The RA Universum is generated from the training data itself
• Under what conditions is RA U-SVM effective, i.e., better than standard SVM?
• Solution approach: analyze the histogram of projections of training samples onto the normal direction vector w of the SVM model (for high-dimensional data)
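This analysis can be sketched directly: compute the decision values f(x) = w·x + b for the training samples (with the usual SVM scaling, the margin borders fall at ±1) and histogram them. A minimal numpy sketch (function name and toy data are illustrative):

```python
import numpy as np

def projection_histogram(X, w, b, bins=20, lim=3.0):
    """Histogram of projections: SVM decision values f(x) = w.x + b
    for a set of samples; with the standard SVM scaling the margin
    borders fall at -1 and +1, as in the histograms on these slides."""
    f = np.asarray(X, float) @ np.asarray(w, float) + b
    counts, edges = np.histogram(f, bins=bins, range=(-lim, lim))
    return f, counts, edges
```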
Histogram of projections and RA Universum
• Typical histogram (for high-dimensional data), Case 1:
Histogram of projections and RA Universum
• Typical histogram (for high-dimensional data), Case 2:
Histogram of projections and RA Universum
• Typical histogram (for high-dimensional data), Case 3:
Example: Handwritten digits 5 vs 8
• MNIST data set: each digit ~ 28x28 = 784 pixels
• Training and validation sets ~ 1,000 samples each; test set ~ 1,866 samples
• For U-SVM ~ 1,000 Universum samples (via RA)
• Comparison of test error rates for SVM and U-SVM:

|                       | SVM           | U-SVM         |
|-----------------------|---------------|---------------|
| MNIST (RBF kernel)    | 1.37% (0.22%) | 1.20% (0.19%) |
| MNIST (linear kernel) | 4.58% (0.34%) | 4.62% (0.37%) |
Histogram of projections for MNIST data
[Figure left: MNIST data, distribution of the RBF-kernel SVM output]
[Figure right: MNIST data, distribution of the linear SVM output]
Analyzing the ‘other digits’ Universum
The same set-up, but using digit-‘1’ or digit-‘3’ Universa.
[Figures: histograms of projections for the Digit 1 Universum (left) and the Digit 3 Universum (right)]
Test error: RBF SVM ~ 1.47%; Digit 1 ~ 1.31%; Digit 3 ~ 1.01%
Discussion
• Technical aspects: why does Universum data improve generalization when d >> n?
- U-SVM is solved in a low-dimensional subspace implicitly defined by the Universum data
• How to specify a good Universum?
- no formal procedure exists
- conditions for the effectiveness of a particular Universum data set: see http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5936738
• Philosophical aspects:
- relation to human learning (the cultural Universum)?
- a new type of inference? Impact on psychology?
Learning Using Privileged Info (LUPI)
Given: training data (x, x*, y), where x* is privileged info available only for the training data
Estimate: predictive model y = f(x)
This additional info x* is common in many applications:
• Handwritten digit recognition: x* ~ person’s label
• Brain imaging (fMRI): x* ~ human subject label (in a group of subjects performing the same cognitive task)
• Medical diagnosis: x* ~ medical history after the initial diagnosis
• Time series prediction: x* ~ future values of the training data
Learning Using Privileged Info (LUPI)
Given: training data (x, x*, y), where x* is privileged info available only for the training data
Estimate: predictive model y = f(x)
Main idea: standard learning uses x; LUPI uses x + x*
LUPI (Vapnik, 2006)
• Goal of learning for both settings: find a function f(x) providing small error for test inputs
• Standard setting: training set $(\mathbf{x}_i, y_i),\ i = 1,\dots,n$
• LUPI setting: training set $(\mathbf{x}_i, \mathbf{x}_i^*, y_i),\ i = 1,\dots,n$, where $\mathbf{x}_i^*$ is additional privileged info
• Privileged info enters in the form of additional constraints on the errors (slack variables)
Many Application Settings for LUPI
• Medical diagnosis: privileged info ~ expensive test results, patient history after diagnosis, etc.
• Time series prediction: privileged info ~ future values of the training time series
• Additional group information (~ structured data)
• Feature selection, etc.
SVM+ is a formal mechanism for LUPI training
LUPI and SVM+ (Vapnik, 2006)
SVM+ is a new learning technology for LUPI:
• Improved prediction accuracy (proven both theoretically and empirically)
• Similar to learning with a human teacher: ‘privileged info’ ~ feedback from an experienced teacher (which can be very beneficial)
• SVM+ is the most promising new technology for ‘difficult’ problems
• Computationally complex; no public-domain software is currently available
![Page 71: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/71.jpg)
71
LUPI Formalization and Challenges
• LUPI training data in the form (x_i, x_i*, y_i), i = 1, 2, …, n
• Privileged info x_i*
  - assumed to have informative value (for prediction)
  - is not available for the predictive model ŷ = f(x)
• Interpretation of privileged info
  - additional feedback from an expert teacher, different from 'correct answers' or y-labels
  - this feedback is provided in a different space x*
• Challenge: how to handle/combine learning in two different feature spaces?
![Page 72: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/72.jpg)
72
SVM+ Technology for LUPI
• LUPI training data in the form (x_i, x_i*, y_i), i = 1, 2, …, n
• SVM+ learning takes place in two spaces:
  - decision space Z, where the decision rule ŷ = f(z) is estimated
  - correction space Z*, where the privileged info is encoded in the form of additional constraints on errors (slack variables) in the decision space
• The SVM+ approach implements two (kernel) mappings:
  - z(x): x → Z into the decision space
  - z*(x): x → Z* into the correction space
![Page 73: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/73.jpg)
73
Illustration of SVM+ (privileged info = group label)
[Figure: training samples from Class 1 and Class −1, belonging to Group 1 and Group 2, are mapped into the decision space, where a single decision function is estimated, and into the correcting space, where correcting functions model the slack variables ξ^r for each group r.]
![Page 74: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/74.jpg)
74
Two Objectives of SVM+ Learning
• LUPI training data in the form (x_i, x_i*, y_i), i = 1, 2, …, n
SVM+ learning achieves two goals:
(SVM+ 1) separate labeled training data in the decision space Z using a large-margin hyperplane ŷ = f(z) (as in standard SVM)
(SVM+ 2) incorporate privileged info in the form of additional constraints on errors (slack variables) in the decision space. These constraints are modeled as (linear) functions in the correction space Z*
![Page 75: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/75.jpg)
75
SVM+ Formulation

$$\min_{\mathbf{w},\,b,\,\mathbf{w}_1,\dots,\mathbf{w}_t,\,d_1,\dots,d_t}\;\; \frac{1}{2}(\mathbf{w},\mathbf{w}) + \frac{\gamma}{2}\sum_{r=1}^{t}(\mathbf{w}_r,\mathbf{w}_r) + C\sum_{r=1}^{t}\sum_{i\in T_r}\xi_i^r$$

subject to:

$$y_i[(\mathbf{w},\mathbf{z}_i)+b] \ge 1-\xi_i^r, \qquad i\in T_r,\; r=1,\dots,t$$

$$\xi_i^r \ge 0, \qquad i\in T_r,\; r=1,\dots,t$$

$$\xi_i^r = (\mathbf{w}_r,\mathbf{z}_i^r)+d_r, \qquad i\in T_r,\; r=1,\dots,t$$

Decision space: mapping $\mathbf{z}_i = \mathbf{z}(\mathbf{x}_i)\in Z$; decision rule $y = \mathrm{sign}[f_Z(\mathbf{x})] = \mathrm{sign}[(\mathbf{w},\mathbf{z}(\mathbf{x}))+b]$
Correcting space: mapping $\mathbf{z}_i^r = \mathbf{z}_r(\mathbf{x}_i)\in Z_r$
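The optimization above is a convex quadratic program. As a rough illustration of how it might be solved (my own sketch, not software from the slides — the function name, the single-group simplification, and the choice of scipy's SLSQP solver are all assumptions), here is a linear SVM+ where the slacks are modeled as a linear function of the privileged features:

```python
import numpy as np
from scipy.optimize import minimize

def svm_plus_linear(X, Xs, y, C=1.0, gamma=1.0):
    """Toy sketch of the linear SVM+ primal for a single group:
    slacks are modeled as xi_i = (w*, x*_i) + d in the correcting space."""
    n, d = X.shape
    ds = Xs.shape[1]
    # parameter vector theta = [w (d), b, w* (ds), d0]
    unpack = lambda th: (th[:d], th[d], th[d + 1:d + 1 + ds], th[-1])
    slack = lambda th: Xs @ unpack(th)[2] + unpack(th)[3]

    def objective(th):
        w, b, ws, d0 = unpack(th)
        # (1/2)(w, w) + (gamma/2)(w*, w*) + C * sum_i xi_i
        return 0.5 * w @ w + 0.5 * gamma * ws @ ws + C * np.sum(slack(th))

    constraints = [
        # margin constraints: y_i ((w, x_i) + b) >= 1 - xi_i
        {"type": "ineq",
         "fun": lambda th: y * (X @ unpack(th)[0] + unpack(th)[1]) - 1 + slack(th)},
        # nonnegative slacks: xi_i >= 0
        {"type": "ineq", "fun": slack},
    ]
    res = minimize(objective, np.zeros(d + ds + 2),
                   constraints=constraints, method="SLSQP")
    w, b, _, _ = unpack(res.x)
    return w, b

# toy usage: labels depend on the first feature;
# privileged info = distance to the true boundary (a hypothetical choice)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))
y = np.sign(X[:, 0])
Xs = np.abs(X[:, 0:1])
w, b = svm_plus_linear(X, Xs, y)
```

For realistic problems a dedicated QP solver over the dual (with kernels in both spaces) would be used instead; this sketch only shows how the privileged features enter through the slack parameterization.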
![Page 76: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/76.jpg)
76
SVM+ Formulation
• Map the input vectors simultaneously into:
  - decision space (standard SVM classifier)
  - correcting space (where correcting functions model slack variables for different groups)
• Decision space/function ~ the same for all groups
• Correcting functions ~ different for each group (but the correcting space may be the same)
• SVM+ optimization formulation incorporates:
  - the capacity (w, w) of the decision function
  - the capacity (w_r, w_r) of the correcting function for group r
  - the relative importance (weight γ) of these two capacities
![Page 77: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/77.jpg)
77
Application Study (Liang and Cherkassky, 2007)
• SVM+ technology is fairly recent; just a few known empirical comparison studies
• fMRI data analysis problem (CMU data set)
  - six subjects presented with {picture or sentence}
  - fMRI data is recorded over 16 time intervals
  - need to learn a binary classifier: fMRI image → class
• Data preprocessing (Wang et al, 2004)
  - extract 7 input features (Regions of Interest) from the high-dim. fMRI image → input vector has 16×7 = 112 components
• Comparison between
  - standard SVM (data from all 6 subjects is pooled together)
  - SVM+ method (6 groups, where each group has 80 samples)
![Page 78: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/78.jpg)
78
fMRI Application Study (cont'd)
• Experimental protocol: (randomly) split the data
  - 60% ~ training
  - 20% ~ validation (for tuning parameters)
  - 20% ~ test (for estimating prediction error)
• Details of methods used:
  - linear SVM classifier (single parameter C)
  - SVM+ classifier (3 parameters: linear kernel for decision space, RBF kernel for correcting space, and parameter γ)
• Comparison results (over 10 trials):
  - standard SVM ~ ave test error: 22.2% (st. dev. 3.8%)
  - SVM+ ~ ave test error: 20.2% (st. dev. 3.1%)
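The 60/20/20 protocol above can be sketched in a few lines of numpy (the function name is mine; this is just an illustration of the split, not the slides' code):

```python
import numpy as np

def split_60_20_20(n_samples, rng):
    """Randomly split sample indices into 60% training / 20% validation / 20% test,
    following the experimental protocol described above."""
    idx = rng.permutation(n_samples)
    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    return (idx[:n_train],                 # training
            idx[n_train:n_train + n_val],  # validation (for tuning C, gamma)
            idx[n_train + n_val:])         # test (for estimating prediction error)

# 6 groups x 80 samples = 480 samples total, as in the study
train, val, test = split_60_20_20(480, np.random.default_rng(0))
```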
![Page 79: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/79.jpg)
79
Multi-Task Learning (MTL)
• Application: Handwritten digit recognition
  Labeled training data provided by t persons (t > 1)
  Goal 1: estimate a single classifier that will generalize well for future samples generated by these persons ~ LUPI
  Goal 2: estimate t related classifiers (one for each person) ~ MTL
• Application: Medical diagnosis
  Labeled training data provided by t groups of patients (t > 1), say men and women (t = 2)
  Goal 1: estimate a classifier to predict/diagnose a disease using training data from t groups of patients ~ LUPI
  Goal 2: estimate t classifiers specialized for each group of patients ~ MTL
Note: for MTL the group info is known for training and test data
![Page 80: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/80.jpg)
80
Multi-task Learning
[Figure: (a) Single-task learning — each task (Task 1, Task 2, …, Task t) has its own training data and its own predictive model, estimated independently. (b) Multi-task learning (MTL) — the models for all tasks are estimated jointly via task relatedness modeling.]
![Page 81: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/81.jpg)
81
Contrast Inductive Learning, MTL and LWSD
[Figure: training data for Group 1, …, Group t are used under three settings. Inductive learning ignores the group info and estimates a single model f, applied to test data with no group info. Multi-task learning uses the group info and estimates t models f_1, …, f_t, applied to test data for each group. Learning with structured data (LWSD) uses the group info during training but estimates a single model f, applied to test data with no group info.]
![Page 82: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/82.jpg)
82
Different Ways of Using Group Information
[Figure:
sSVM: data from both groups pooled → standard SVM → single model f(x)
SVM+: group info used as privileged info → SVM+ → single model f(x)
mSVM: separate SVM trained on each group → models f_1(x), f_2(x)
svm+MTL: joint training on both groups → models f_1(x), f_2(x)]
![Page 83: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/83.jpg)
83
Problem setting for MTL
• Assume binary classification problem
• Training data ~ a union of t related groups (tasks). Each group r has n_r i.i.d. samples (x_i^r, y_i^r), i = 1, 2, …, n_r, generated from a joint distribution P_r(x, y)
• Test data: task label for test samples is known
• Goal of learning: estimate t models {f_1, f_2, …, f_t} such that the sum of expected losses for all tasks is minimized:

$$R(w) = \sum_{r=1}^{t} \int L(y, f_r(\mathbf{x}, w))\, dP_r(\mathbf{x}, y)$$
![Page 84: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/84.jpg)
8484
SVM+ for Multi-Task Learning
• New learning formulation: SVM+MTL
• Define the decision function for each group as

$$f_r(\mathbf{x}) = [(\mathbf{w}, \mathbf{z}(\mathbf{x})) + b] + [(\mathbf{w}_r, \mathbf{z}_r(\mathbf{x})) + d_r], \qquad r = 1, \dots, t$$

• The common decision function (w, z(x)) + b models the relatedness among groups
• The correcting functions fine-tune the model for each group (task)
![Page 85: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/85.jpg)
85
SVM+MTL Formulation

$$\min_{\mathbf{w},\,b,\,\mathbf{w}_1,\dots,\mathbf{w}_t,\,d_1,\dots,d_t}\;\; \frac{1}{2}(\mathbf{w},\mathbf{w}) + \frac{\gamma}{2}\sum_{r=1}^{t}(\mathbf{w}_r,\mathbf{w}_r) + C\sum_{r=1}^{t}\sum_{i\in T_r}\xi_i^r$$

subject to:

$$y_i^r[(\mathbf{w},\mathbf{z}_i) + (\mathbf{w}_r,\mathbf{z}_i^r) + b + d_r] \ge 1-\xi_i^r, \qquad i\in T_r,\; r=1,\dots,t$$

$$\xi_i^r \ge 0, \qquad i\in T_r,\; r=1,\dots,t$$

Decision space: $\mathbf{z}_i = \mathbf{z}(\mathbf{x}_i)\in Z$; correcting space: $\mathbf{z}_i^r = \mathbf{z}_r(\mathbf{x}_i)\in Z_r$

$$f_r(\mathbf{x}) = \mathrm{sign}[(\mathbf{w},\mathbf{z}(\mathbf{x})) + b + (\mathbf{w}_r,\mathbf{z}_r(\mathbf{x})) + d_r], \qquad r=1,\dots,t$$
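With linear kernels in both spaces, the per-task decision rule reduces to a shared linear part plus a task-specific correction. A minimal sketch (all names are mine, and the parameters in the usage example are hand-picked purely for illustration):

```python
import numpy as np

def mtl_predict(x, r, w, b, W_corr, d_corr):
    """SVM+MTL decision rule for task r (linear kernels):
    f_r(x) = sign[(w, x) + b + (w_r, x) + d_r]."""
    return np.sign(w @ x + b + W_corr[r] @ x + d_corr[r])

# toy usage: one shared direction, two per-task corrections
w = np.array([1.0, 0.0]); b = 0.0
W_corr = np.array([[0.0, 0.5], [0.0, -0.5]])  # w_1, w_2
d_corr = np.array([0.1, -0.1])                # d_1, d_2
x = np.array([0.2, 0.4])
pred_task0 = mtl_predict(x, 0, w, b, W_corr, d_corr)  # shared part + task-0 correction
pred_task1 = mtl_predict(x, 1, w, b, W_corr, d_corr)  # same x, different correction
```

The same input x can receive different labels under different tasks, which is exactly what distinguishes MTL from the single-model settings.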
![Page 86: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/86.jpg)
86
Empirical Validation
• Different ways of using group info → different learning settings:
  - which one yields better generalization?
  - how is performance affected by sample size?
• Empirical comparisons:
  - synthetic data set
![Page 87: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/87.jpg)
87
Different Ways of Using Group Information
[Figure, repeated from page 82:
sSVM: data from both groups pooled → standard SVM → single model f(x)
SVM+: group info used as privileged info → SVM+ → single model f(x)
mSVM: separate SVM trained on each group → models f_1(x), f_2(x)
svm+MTL: joint training on both groups → models f_1(x), f_2(x)]
![Page 88: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/88.jpg)
88
Synthetic Data
• Generate x ∈ R^20 with each component x_i ~ uniform(−1, 1), i = 1, …, 20
• The coefficient vectors of the three tasks are specified as
  w_1 = [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]
  w_2 = [1,1,1,1,1,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0]
  w_3 = [1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0]
• For each task r and each data vector, y = sign(w_r · x − 0.5)
• Details of methods used:
  - linear SVM classifier (single parameter C)
  - SVM+, SVM+MTL classifiers (3 parameters: linear kernel for decision space, RBF kernel for correcting space, and parameter γ)
  - independent validation set for model selection
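The data-generation step can be sketched in numpy. The coefficient vectors and the 0.5 threshold below follow my reading of the slide and should be treated as illustrative assumptions rather than the study's exact specification:

```python
import numpy as np

# assumed task coefficient vectors (reconstructed from the slide; illustrative)
W_TASKS = np.array([
    [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],
    [1,1,1,1,1,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0],
    [1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0],
], dtype=float)

def make_task_data(n, task, rng):
    """Generate n samples for one task:
    x ~ uniform(-1, 1)^20, y = sign(w_r . x - 0.5)."""
    X = rng.uniform(-1.0, 1.0, size=(n, 20))
    y = np.sign(X @ W_TASKS[task] - 0.5)
    return X, y

rng = np.random.default_rng(0)
X1, y1 = make_task_data(100, 0, rng)   # 100 samples for task 1
```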
![Page 89: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/89.jpg)
89
Experimental Results
• Comparison results (over 10 trials): ave test accuracy (%) ± std. dev.

| # per group | sSVM | SVM+ | mSVM | svm+MTL |
|---|---|---|---|---|
| n=100 | 88.11±0.65 | 88.31±0.84 | 91.18±1.26 | 91.47±1.03 |
| n=50 | 85.74±1.36 | 86.49±1.69 | 85.09±1.40 | 87.39±2.29 |
| n=15 | 80.10±3.42 | 80.84±3.16 | 70.73±2.69 | 79.24±2.81 |

Note 1: relative performance depends on sample size → the 'correct' problem setting depends on sample size
Note 2: SVM+ is always better than SVM; SVM+MTL is always better than mSVM
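The sample-size effect in the table — per-task models win with ample data per group, while pooling becomes competitive as groups shrink — can be explored qualitatively with a simple stand-in classifier. This sketch uses least squares instead of an SVM (a deliberate simplification; all names and parameters are mine), contrasting the pooled sSVM-style approach with the per-task mSVM-style approach:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear classifier with bias (a stand-in for an SVM, not equivalent)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.sign(Xb @ w) == y))

rng = np.random.default_rng(1)
# two hypothetical related tasks in R^5 with overlapping coefficient vectors
w_true = [np.array([1, 1, 1, 1, 0.0]), np.array([1, 1, 1, 0.0, 1])]

def make(n, wt):
    X = rng.uniform(-1, 1, (n, 5))
    return X, np.sign(X @ wt - 0.25)

train = [make(100, wt) for wt in w_true]
test = [make(1000, wt) for wt in w_true]

# sSVM-style: pool all groups, fit one model
Xp = np.vstack([X for X, _ in train])
yp = np.concatenate([y for _, y in train])
acc_pooled = np.mean([accuracy(fit_linear(Xp, yp), X, y) for X, y in test])

# mSVM-style: fit one model per group
acc_per_task = np.mean([accuracy(fit_linear(X, y), Xt, yt)
                        for (X, y), (Xt, yt) in zip(train, test)])
```

With n = 100 per group the per-task models typically come out ahead, mirroring the n=100 row of the table; shrinking n per group erodes their advantage.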
![Page 90: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/90.jpg)
90
OUTLINE
• Motivation for non-standard approaches
• New Learning Settings
• Summary
  - Advantages/limitations of non-standard settings
  - Vapnik's Imperative
![Page 91: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/91.jpg)
91
Advantages/Limitations of nonstandard settings
• Advantages
  - makes common sense
  - follows a methodological framework (VC-theory)
  - yields better generalization (but not always)
  - opens new directions for research
• Limitations
  - need to formalize application requirements → need to understand the application domain
  - generally more complex learning formulations
  - more difficult model selection
  - few known empirical comparisons (to date)
![Page 92: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/92.jpg)
92
Vapnik's Imperative
• Vapnik's Imperative: asking the right question (with finite data) → most direct learning problem formulation
• ~ Controlling the goals of inference, in order to produce a better-posed problem
• Three types of such restrictions (Vapnik, 2006)
  - regularization (constraining function smoothness)
  - SRM ~ constraints on the VC-dimension of models
  - constraints on the mode of inference
• Philosophical interpretation (of restrictions): find a model that explains well the available data and can be easily falsified
![Page 93: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/93.jpg)
93
Course Summary: Knowledge Discovery and Philosophical Connections
• Aristotle: All men by nature desire knowledge
• Einstein: Man has an intense desire for assured knowledge
• Assured Knowledge
  - religion
  - belief in reason
  - belief in science (e.g., genetic causes)
![Page 94: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/94.jpg)
94
Scientific Knowledge
• Scientific Knowledge (last 3-4 centuries):
  - objective
  - recurrent events (repeatable by others)
  - quantifiable (described by math models)
• Knowledge ~ causal, deterministic, logical; yet humans cannot reason well about
  - noisy/random data
  - multivariate (high-dimensional) data
![Page 95: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/95.jpg)
95
The Nature of Knowledge
• Idealism: knowledge is pure thought
• Realism: knowledge comes from experience
• Western history of knowledge
  - Ancient Greece ~ idealism
  - Classical science ~ realism
• But it is not possible to obtain assured knowledge from empirical data (Hume): Whatever in knowledge is of empirical nature is never certain
![Page 96: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/96.jpg)
96
Digital Age
• Most knowledge is in the form of data from sensors (not human sense perceptions)
• How to get assured knowledge from data?
• Naïve realism: data → knowledge?
• Wired Magazine, 16/07: We can stop looking for (scientific) models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot
![Page 97: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/97.jpg)
97
(Over)Promise of Science
• Archimedes: Give me a place to stand, and a lever long enough, and I will move the world
• Laplace: Present events are connected with preceding ones by a tie based upon the evident principle that a thing cannot occur without a cause that produces it.
• Digital Age: more data → more knowledge; more connectivity → new/better knowledge
![Page 98: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/98.jpg)
98
Classical Statistics
How to obtain assured knowledge from data?
• Classical approach:
  (1) specify a parametric model
  (2) estimate its parameters from data
  → more data → better (more accurate) model
• The parametric form is based on first-principle knowledge and therefore correct
![Page 99: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/99.jpg)
99
Philosophical Views on Science
• Karl Popper: Science starts from problems, and not from observations
• Werner Heisenberg: What we observe is not nature itself, but nature exposed to our method of questioning
• Albert Einstein:- God is subtle but he is not malicious.- Reality is merely an illusion, albeit a very persistent one.
![Page 100: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/100.jpg)
100
Predictive Data Modeling Philosophy
• Single ‘correct’ causal model cannot be estimated from data.
• Good predictive models:
  - are not unique
  - depend on the problem setting (~ our method of questioning)
• Predictive Learning methodology is the most important aspect of learning from data.
![Page 101: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/101.jpg)
101
Predictive Modeling: Technical Aspects
• The same methodology applies to financial, bio/medical, weather, marketing data, etc.
• Interpretation of predictive models:
  - a big problem (not addressed in this course)
  - reflects the need for assured knowledge
  - requires understanding of the predictive methodology and its assumptions, rather than just an interpretable parameterization
![Page 102: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/102.jpg)
102
Non-Technical Aspects
• Predictive methodology + philosophy can be applied to other areas
• Examples: writing assignments etc.
• Lessons from philosophy and history:
  - the amount of information grows exponentially, but the number of 'big ideas' remains quite small
  - critical-thinking ability becomes even more important in our digital age
  - freedom of information does not guarantee freedom of thought
![Page 103: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/103.jpg)
103
Predictive Learning and Humanities
• Specialization/separation between science and humanities is relatively recent, ~200 years ago
• Predictive modeling and humanities/culture: both advocate constraints/restrictions
• Cultural constraints: ~ ethical or religious principles
![Page 104: Introduction to Predictive Learning](https://reader031.fdocuments.net/reader031/viewer/2022012922/56815d96550346895dcbb4d4/html5/thumbnails/104.jpg)
104
References
• Vapnik, V., Estimation of Dependencies Based on Empirical Data. Empirical Inference Science: Afterword of 2006, Springer, 2006
• Cherkassky, V. and F. Mulier, Learning from Data, second edition, Wiley, 2007
• Cherkassky, V., Predictive Learning, available at http://vctextbook.com/, 2013
• Cherkassky, V., S. Dhar, and W. Dai, Practical Conditions for Effectiveness of the Universum Learning, IEEE Trans. on Neural Networks, 22, 8, 1241-1255, 2011
• Chapelle, O., Schölkopf, B., and A. Zien, Eds., Semi-Supervised Learning, MIT Press, 2006
• Weston, J., Collobert, R., Sinz, F., Bottou, L. and V. Vapnik, Inference with the Universum, Proc. ICML, 2006
• Wang, X., Hutchinson, R. and Mitchell, T., Training fMRI classifiers to discriminate cognitive states across multiple subjects, Proc. NIPS, 2004
• Liang, L. and Cherkassky, V., Learning using structured data: application to fMRI data analysis, Proc. IJCNN, 2007
• Ando, R. and Zhang, T., A framework for learning predictive structures from multiple tasks and unlabeled data, JMLR, 2005
• Evgeniou, T. and Pontil, M., Regularized multi-task learning, Proc. of 10th Conf. on Knowledge Discovery and Data Mining, 2004