All you wanted to know about Regression…
COSC 526 Class 9
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford)
2
Introducing your guest instructor (Feb 10-12)
• Dr. Sreenivas (Rangan) Sukumar
• Staff member at ORNL:
– Leader in graph analytics approaches
– UTK grad…
– “Healthcare Guru” at ORNL
Side bar:
• The class website location will be updated shortly. The original links will still work, but will be redirected to a new location at EECS.
• Approved for space on the EECS website!
• Hadoop server is working (finally) and your accounts (utk id) are also ready. More information on log-in procedures as well as access to data forthcoming…
3
Last class: Classification with SVMs
• We had a class variable y:
– Categorical in nature
– {x1, x2, …, xn} could be anything
• Formulated a quadratic programming problem that would eventually allow us to classify:
– stochastic gradient descent (SGD)
• Alterations for big datasets:
– Minimum enclosing ball (MEB)
– Shrinking the optimization problem
– Incremental and decremental SVM learning
4
This class: predicting a real valued y
• Instead of a categorical class value y, we are going to see how to predict a real valued y
• Various regression algorithms:
– Linear regression
– Regression with varying noise
– Non-linear regression
• Adapting regression for big data
5
Part I: Linear Regression
6
Regression
Living Area (sq. ft)   Price ($1000s)
2104                   400
1600                   300
2400                   370
1416                   200
3000                   540

[Figure: scatter plot of price vs. living area]

As a recent home buyer (or a buyer interested in the market):
• Can we predict the prices of other houses as a function of their living area?
Linear regression helps us with this analysis…
7
Linear regression
• Linear regression assumes that the expected value of the output, given some input, is linear
• Simplest way to think about this: y = wx for some unknown w

Living Area (sq. ft)   Price ($1000s)
2104                   400
1600                   300
2400                   370
1416                   200
3000                   540

[Figure: scatter plot of price vs. living area, with a fitted line of slope w]

Given the data, how do we estimate w?
8
Some formalism…
• Assume that our data is formed by: yi = w·xi + εi, where εi is a noise term
– Noise signals are independent
– Drawn from a Normal distribution: εi ~ N(0, σ²)
• p(y | w, x) has a normal distribution with:
– mean wx
– variance σ²
9
Linear Regression (1)
• We have a bunch of data {(x1, y1), (x2, y2), … (xn, yn)}, which is all evidence about w
• How to infer w (from the data)?
• Bayes rule to our rescue:
– Maximum likelihood estimate (MLE) of w
– Because you can do it on a computer!
10
MLE of w
• For which value of w is the data most likely to have this behavior?
– i.e., for what w is p(y1, …, yn | w, x1, …, xn) maximized?
– i.e., for what w is ∏i p(yi | w, xi) maximized?
Since we know the distribution, i.e., we assumed that the data came from a normal distribution, each factor is
p(yi | w, xi) ∝ exp(−(yi − w·xi)² / (2σ²))
11
MLE of w
• Now do the log-likelihood trick…
LL(w) = Σi log p(yi | w, xi) = const − (1/(2σ²)) Σi (yi − w·xi)²
• Equivalently: maximize the likelihood by minimizing E(w) = Σi (yi − w·xi)²
now we are in familiar territory… (least squares!)
12
All we have to do is …
• Take the derivative of E(w) w.r.t. w and set it to 0:
dE/dw = −2 Σi xi·(yi − w·xi) = 0  ⇒  w = (Σi xi·yi) / (Σi xi²)
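The closed-form answer above is easy to check numerically. A minimal sketch (Python; the numbers come from the housing table on slide 6, and the variable names are mine, not from the slides):

```python
# Closed-form MLE for y = w*x with Gaussian noise: w = sum(x_i*y_i) / sum(x_i^2).
xs = [2104, 1600, 2400, 1416, 3000]   # living area (sq. ft.), from the slide's table
ys = [400, 300, 370, 200, 540]        # price ($1000s)

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(f"estimated slope w = {w:.4f}")  # price (in $1000s) per sq. ft.
```

Here w comes out to roughly 0.17, i.e., about $170 per square foot for this toy dataset.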
13
What do we mean by this (graphically)?
• If x = sq. ft. and y = price, then w·2104 is the average price for x = 2104 sq. ft.
• If x = height and y = weight, then w·60 is the average weight for all people 60 in. tall.
14
Multi-linear Regression
• Now instead of a single x, let’s say we have x drawn from a d-dimensional space

Living Area (sq. ft)   No. of rooms   Price ($1000s)
2104                   2              400
1600                   2              300
2400                   3              370
1416                   2              200
3000                   4              540

How do we think of doing regression?
• Remember there are d dimensions (d = 2 here)
• Can we visualize our data in a way that is easy to “regress”?
15
Matrix algebra to our rescue…
• out(x) = wTx = w1x[1] + w2x[2] + … + wdx[d]
• How do we learn w?
• Let’s define a cost function: J(w) = ½ Σi (yi − wᵀxi)²
16
MLE is very similar to the simple regression story…
• MLE is given by: w = (XᵀX)⁻¹(Xᵀy)
• XᵀX is a d × d matrix:
– where the (i,j)th element is Σk xk[i]·xk[j]
• Xᵀy is a d-element vector:
– with ith element Σk xk[i]·yk
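A hedged sketch of this closed form (Python/NumPy; the data is the two-feature housing table above, and the variable names are mine). Solving the normal equations directly is preferable to forming the explicit inverse:

```python
import numpy as np

# Closed-form multivariate MLE: w solves the normal equations (X^T X) w = X^T y.
# Data: (living area, no. of rooms) -> price ($1000s), from the slide's table.
X = np.array([[2104, 2], [1600, 2], [2400, 3], [1416, 2], [3000, 4]], float)
y = np.array([400, 300, 370, 200, 540], float)

# np.linalg.solve is more numerically stable than computing (X^T X)^{-1} explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print("w =", w)
```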
17
How to solve this on a computer?
• Let’s say I have an initial guess for w
• I need to search for a suitable w that will make J(w) smaller
• Idea: use gradient descent!
Repeat until convergence:
  For every weight j = 1…d:
    Calculate the gradient: ∂J(w)/∂wj = −Σi xi[j]·(yi − wᵀxi)
    Update: wj ← wj − α·∂J(w)/∂wj
18
Problem(s) with gradient descent
• It will converge: for linear regression, J(w) is convex with a single global minimum, so GD will converge to the solution!
• Takes a long time if the number of training examples is large:
– Each iteration scans through the entire training dataset
– Can do stochastic gradient descent (SGD) in a similar way to what we discussed last time…
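The loop above can be sketched as follows (Python; the toy 1-D data and the learning rate α are my choices, not from the slides). The batch-gradient estimate should agree with the closed-form MLE:

```python
# Batch gradient descent for 1-D linear regression, J(w) = 0.5 * sum_i (y_i - w*x_i)^2.
xs = [0.5, 1.0, 2.0, 2.0, 3.0]   # toy data
ys = [0.5, 1.0, 1.0, 3.0, 2.0]

w, alpha = 0.0, 0.01
for _ in range(2000):                                     # "repeat until convergence"
    grad = -sum(x * (y - w * x) for x, y in zip(xs, ys))  # dJ/dw
    w -= alpha * grad                                     # update step

closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w, closed_form)   # the two estimates should agree
```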
19
Pesky detail…
• We always talked about the line as if it passed through the origin (0, 0)
• What if this is not the case?

[Figure: two price vs. living area scatter plots — one fitted line through the origin, one with a non-zero intercept]
20
Let’s fake it… neat trick!
• Create a fake input x0 with a value of 1
(always)
x1 x2 y
2104 2 400
1600 2 300
2400 3 370
1416 2 200
3000 4 540
x0 x1 x2 y
1 2104 2 400
1 1600 2 300
1 2400 3 370
1 1416 2 200
1 3000 4 540
y = w1x1 + w2x2 y = w0x0 + w1x1 + w2x2
= w0 + w1x1 + w2x2
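A minimal sketch of the trick (Python/NumPy; the data is from the table above, and the variable names are mine):

```python
import numpy as np

# The "fake input" trick: prepend x0 = 1 to every row so w0 becomes an intercept.
X = np.array([[2104, 2], [1600, 2], [2400, 3], [1416, 2], [3000, 4]], float)
y = np.array([400, 300, 370, 200, 540], float)

X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # columns: x0 = 1, x1, x2
w = np.linalg.lstsq(X1, y, rcond=None)[0]       # [w0, w1, w2]
print("intercept w0 =", w[0], " slopes =", w[1:])
```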
21
Let’s say we know something about the noise added to each data point
• E.g.: I know the variance of the noise added to each data point…
xi   σi²   yi
0.5 4 0.5
1 1 1
2 0.25 1
2 4 3
3 0.25 2
Now, how do we do the MLE?
22
MLE with varying noise
Assuming independence among the noise terms, plug in the Gaussian equation and simplify;
setting d(LL)/dw = 0 for the minimum gives:
w = (Σi xi·yi/σi²) / (Σi xi²/σi²)
23
Weighted Regression
• We just saw “weighted regression”
• Points with “higher confidence” and “lower noise” are the most important
• The rest are down-weighted according to the variance of their noise
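A minimal sketch of the varying-noise MLE for y = wx (Python; the data is from the table on slide 21, and the variable names are mine):

```python
# Varying-noise MLE for y = w*x: each point is weighted by 1/sigma_i^2, giving
# w = sum(x_i*y_i/s_i) / sum(x_i^2/s_i), where s_i = sigma_i^2.
xs  = [0.5, 1.0, 2.0, 2.0, 3.0]     # from the slide's table
var = [4.0, 1.0, 0.25, 4.0, 0.25]   # known noise variance per point
ys  = [0.5, 1.0, 1.0, 3.0, 2.0]

num = sum(x * y / s for x, y, s in zip(xs, ys, var))
den = sum(x * x / s for x, s in zip(xs, var))
w = num / den
print("weighted MLE w =", w)
```

Note how the low-variance points (σ² = 0.25) dominate both sums, exactly the “higher confidence” behavior described above.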
24
Part II: Non-linear Regression
25
Non-linear regression…
• Suppose y is related to a function of x in such a way that the predicted values have a non-linear relationship…
xi   yi
0.5 0.05
1 2.5
2 3
3 2
3 3
Assume some non-linear model y = f(x; w) + noise
26
Non-linear MLE
• Ugly, ugly algebra!!! What do we do?
– Line search
– Simulated annealing
– GD and SGD
– Newton’s method
– Expectation Maximization!
27
Polynomial Regression…
• All this while, we were talking about linear regression
• But, it may not be the best way to describe data
• Be careful about how to fit the data…
28
Suppose we add an additional term…
• Quadratic regression: each component is now called a term
• Each column is called a term column
• How many terms in a quadratic regression with p inputs?
– 1 constant term
– p linear terms
– (p+1)C2 quadratic terms ⇒ O(p²) terms in total
Solving our MLE:
• Similar to our linear regression: w = (XᵀX)⁻¹(Xᵀy)
• Cost will be O(p⁶)
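The term expansion can be sketched as follows (Python; the helper name quadratic_terms is mine, not from the slides):

```python
from itertools import combinations_with_replacement

# Build the quadratic "term columns" for p inputs: one constant term,
# p linear terms, and (p+1 choose 2) quadratic terms (squares + cross-products).
def quadratic_terms(x):
    terms = [1.0]                                   # constant term
    terms += list(x)                                # p linear terms
    terms += [a * b for a, b in combinations_with_replacement(x, 2)]
    return terms

row = quadratic_terms([2104.0, 2.0])                # p = 2 inputs
print(len(row), row)                                # 1 + 2 + 3 = 6 terms
```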
29
Generalizing: p inputs, Qth-degree polynomial… how many terms?
• = the number of unique terms of the form x1^q1 · x2^q2 ⋯ xp^qp with q1 + … + qp ≤ Q
• = the number of lists of non-negative integers [q0, q1, …, qp] with q0 + q1 + … + qp = Q (q0 absorbs the slack)
• = (Q+p)CQ terms!!
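The count is easy to sanity-check (Python; the helper name is mine):

```python
from math import comb

# The counting argument above: a degree-Q polynomial in p inputs has
# C(Q+p, Q) terms, including the constant term.
def n_terms(p, Q):
    return comb(Q + p, Q)

print(n_terms(2, 2))  # quadratic in 2 inputs: matches 1 + p + (p+1)C2 = 6
```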
30
Notes of caution…
• Is a polynomial of degree 2 better than one of degree 5?
• A linear fit underfits the data:
– the data shows structure not captured by the model
• A high-degree polynomial fit overfits the data:
– the model fits the data (including its noise) too strongly…
Moral of the story:
• Selecting the model is important
• Even more important is the selection of the features!!
31
Locally Weighted Regression (LWR)
• An approach to reduce the dependency on selecting features:
– Many datasets don’t have linear descriptors
• We have seen this before:
– In the weighted regression model
• How do we choose the right weights?
32
Using the Kernel Trick once again…
• out(x) = w1Φ1(x) + w2Φ2(x) + … + wQΦQ(x), where the Φj(x) are kernel (basis) functions
How do we estimate w?
33
Using the Kernel Trick once again…
• out(x) = Σj wj·Φj(x), where Φj(x) = KernelFn(|x − cj|² / KW²) is the kernel function
All ci are held constant. We will just initialize them at random or on a uniformly spaced grid in d dimensions…
KW – kernel width is also held constant. It will be some value that ensures good overlap between the basis functions…
34
How do we estimate w?
• Same as before…
– Given the Q basis functions, let’s define a matrix Z such that Zkj = Φj(xk)
– Here xk is the kth input vector…
• Now, we will solve: w = (ZᵀZ)⁻¹(Zᵀy)
• How to find the ci and KW?
– Use BGD / SGD…
– Other methods will work
Also referred to as radial basis functions (RBFs)
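A hedged sketch of RBF regression under the assumptions above (Python/NumPy; the Gaussian choice of KernelFn, the toy sine data, and the particular grid and width are my choices — the slides leave KernelFn and the constants open):

```python
import numpy as np

# RBF regression: Gaussian basis functions phi_j(x) = exp(-(x - c_j)^2 / (2*KW^2)),
# centers c_j on a uniform grid and kernel width KW both held constant.
# Then Z[k, j] = phi_j(x_k) and w = (Z^T Z)^{-1} Z^T y, as on the slide.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)   # toy noisy data

centers = np.linspace(0, 1, 8)        # c_j: uniformly spaced grid
KW = 0.2                              # kernel width: enough overlap between bases
Z = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * KW ** 2))

w = np.linalg.solve(Z.T @ Z, Z.T @ y)  # normal equations, as before
pred = Z @ w
print("train RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```

With 8 well-overlapping Gaussians the training error lands near the noise level; shrinking KW toward zero or spacing the centers too far apart reproduces the "too little overlap" failure sketched on the next slide.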
35
What are good radial basis function choices?
• We talked about overlaps…

[Figure: three scatter plots of no. of rooms vs. living area with RBF centers drawn — too little overlap? just about the right overlap… too much overlap?]
36
Robust Regression…
• Best quadratic fit:
– what is the problem here?
• What would we want?
– a better fit to the varying data!
– How can we find the better-fitting curve?
37
LOESS-based Robust Regression
• After the initial fit, score each data point by how well it is fitted

[Figure: fitted curve with annotated points — two “good” data points, one “not that bad either”, and one “horrible” data point]

Repeat until convergence:
  For every k = 1…m:
    Let (xk, yk) be the kth data point
    Let ŷk be the current estimate of yk
    Let wk be a weight that is large if the data point is fitted well and very small if it is not
  Redo the regression with the weighted data points

How do we know we have converged? Use expectation maximization (EM)
38
Multilinear Interpolation
• How do we create a piecewise-linear fit to the data?
Create a set of “knot points” equally spaced along the data…
Let’s assume that the data points are generated by a noisy function that is allowed to bend only at these knot points…
We can do a linear regression for every segment identified here…
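The knot-point idea can be sketched with triangular “hat” basis functions, which make the fit linear between knots and allow it to bend only at the knots (Python/NumPy; the hat-basis construction and the toy data are my choices, not from the slides):

```python
import numpy as np

# Piecewise-linear fit: regress y on triangular "hat" basis functions placed at
# equally spaced knot points; the fitted coefficients are the heights at the knots.
def hat_basis(x, knots):
    Z = np.zeros((len(x), len(knots)))
    step = knots[1] - knots[0]
    for j, q in enumerate(knots):
        Z[:, j] = np.clip(1 - np.abs(x - q) / step, 0, None)  # peak 1 at knot q
    return Z

x = np.linspace(0, 5, 50)
y = np.abs(x - 2.5)                       # toy data with a single bend
knots = np.linspace(0, 5, 6)              # equally spaced knot points
Z = hat_basis(x, knots)
h = np.linalg.lstsq(Z, y, rcond=None)[0]  # fitted heights at the knots
print("knot heights:", h)
```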
39
How to find the best fit?
• With some algebraic manipulations…
[Figure: equally spaced knot points q1…q6 along the x-axis, with fitted heights (e.g., h2, h3) at the knots]
40
41
Can we do classification with this?
• Map y to be {0, 1} – negative and positive class
• Function: logistic/sigmoid, g(z) = 1 / (1 + e⁻ᶻ), applied as g(θᵀx)
• Note g(θᵀx) → 1 as θᵀx → ∞
• and g(θᵀx) → 0 as θᵀx → −∞
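A minimal sketch of the squashing behavior (Python):

```python
import math

# The logistic/sigmoid squashing function g(z) = 1 / (1 + e^{-z}).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))        # 0.5 at the decision boundary
print(sigmoid(10.0))       # -> 1 as z -> +infinity
print(sigmoid(-10.0))      # -> 0 as z -> -infinity
```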
42
How do we do MLE on this?
43
Another approach to maximize L(θ)
• Using Newton’s method: finding a zero of a function (here, a zero of the gradient ∇θL(θ))
• Update: θ ← θ − H⁻¹∇θL(θ)
Hessian H: an n × n matrix keeping track of all the second partial derivatives
44
Generalizing further…
• Regression: y | x ~ Gaussian
• Classification: y | x ~ Bernoulli
• Begin by defining an exponential family of distributions:
p(y; η) = b(y)·exp(ηᵀT(y) − a(η))
– η: the natural parameter
– T(y): the sufficient statistic
– a(η): the log partition function
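The Bernoulli case on the next slide is easy to check numerically: with natural parameter η = log(φ/(1−φ)), sufficient statistic T(y) = y, log partition a(η) = log(1 + e^η), and b(y) = 1, the exponential-family form recovers the Bernoulli pmf φ^y(1−φ)^(1−y) (Python sketch; the function names are mine):

```python
import math

# Bernoulli as an exponential family: b(y)*exp(eta*T(y) - a(eta)) with
# eta = log(phi/(1-phi)), T(y) = y, a(eta) = log(1 + e^eta), b(y) = 1.
def exp_family(y, phi):
    eta = math.log(phi / (1 - phi))
    a = math.log(1 + math.exp(eta))
    return math.exp(eta * y - a)

phi = 0.3
for y in (0, 1):
    direct = phi ** y * (1 - phi) ** (1 - y)   # the ordinary Bernoulli pmf
    print(y, exp_family(y, phi), direct)       # the two should match
```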
45
Bernoulli and Gaussian as specific GLMs
46
Softmax Regression
• Instead of a response variable y taking values in {0, 1}, we can think of y taking one of k values {1, 2, …, k}
• Ex.: mail classification = {spam, personal mail, work mail, advertisement}
• GLM with a multinomial distribution…
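A minimal sketch of the softmax link (Python; the scores are arbitrary toy values):

```python
import math

# Softmax over k classes (here k = 4, e.g. spam / personal / work / advertisement):
# p(y = i) = exp(s_i) / sum_j exp(s_j).
def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5, -1.0])
print(probs, sum(probs))                     # a valid distribution over 4 classes
```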
47
Part III: What do we do with Big Data?
48
Can we make Regression Faster?
• At least O(p²m):
– where p is the number of features (columns)
– m is the number of training examples
• Usually only a subset k of the p features is relevant, with k << p
• What can we do to exploit this?
– Variance inflation factor (VIF) regression: O(pm)
49
VIF regression
• Evaluation step:
– approximate the partial correlation of each candidate variable (feature xi) with y using a small pre-sampled set of data [stagewise regression]
• Search step:
– Test each xi sequentially using an α-investing rule

D. Lin, D.P. Foster, L.H. Ungar, VIF Regression, arXiv 2012
50
Other standard approaches also work…
• MapReduce
• Gather/Apply/Scatter (GAS) [to be seen in the future]
• Spark!

What you need to know:
– Regression is one of the most commonly used ML algorithms
– Many flavors, and it can be generalized using GLMs
– Research still needs to be carried out for big datasets