Transcript of CS540 Introduction to Artificial Intelligence, Lecture 2
Generalized Linear Models Logistic Regression Gradient Descent
CS540 Introduction to Artificial Intelligence, Lecture 2
Young Wu
Based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer
June 15, 2020
Feedback (Admin)
Please give me feedback on lectures, homework, exams on Socrative, room CS540.
Please report bugs in homework, lecture examples and quizzes on Piazza.
Please do NOT leave comments on YouTube.
Email me (Young Wu) for personal issues.
Email the department chair (Dieter van Melkebeek) for issues with me.
Feedback from Last Year (Admin)
Too much math.
Time spent on math.
Cannot understand my handwriting.
Mistakes on slides.
More examples.
Supervised Learning (Motivation)
| Data | Features (Input) | Output | Goal |
|---|---|---|---|
| Training | $\{(x_{i1}, \dots, x_{im})\}_{i=1}^{n}$ (observable) | $\{y_i\}_{i=1}^{n}$ (known) | find the "best" $f$ |
| Test | $(x'_1, \dots, x'_m)$ (observable) | $y'$ (unknown) | guess $y' = f(x')$ |
Loss Function Diagram (Motivation)
Zero-One Loss Function (Motivation)
An objective function is needed to select the "best" $f$. An example is the zero-one loss:

$$\hat{f} = \arg\min_{f} \sum_{i=1}^{n} \mathbb{1}_{\{f(x_i) \neq y_i\}}$$

$\arg\min_{f} \text{objective}(f)$ outputs the function that minimizes the objective.

The objective function is called the cost function (or the loss function), and the objective is to minimize the cost.
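As a concrete illustration (not from the slides; the function name is made up), the zero-one objective can be evaluated for a fixed classifier by counting its mistakes:

```python
def zero_one_loss(predictions, labels):
    """Count the examples where the prediction disagrees with the label."""
    return sum(1 for p, y in zip(predictions, labels) if p != y)
```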
Squared Loss Function (Motivation)
Zero-one loss counts the number of mistakes made by the classifier. The best classifier is the one that makes the fewest mistakes.

Another example is the squared distance between the predicted and the actual $y$ value:

$$\hat{f} = \arg\min_{f} \frac{1}{2} \sum_{i=1}^{n} (f(x_i) - y_i)^2$$
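A matching sketch for the squared loss (illustrative name; it keeps the $\frac{1}{2}$ factor from the objective):

```python
def squared_loss(predictions, labels):
    """One half of the sum of squared differences: (1/2) sum (f(x_i) - y_i)^2."""
    return 0.5 * sum((p - y) ** 2 for p, y in zip(predictions, labels))
```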
Function Space Diagram (Motivation)
Hypothesis Space (Motivation)
There are too many functions to choose from, so $f$ should be chosen from a smaller set of functions:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{n} (f(x_i) - y_i)^2$$

The set $\mathcal{H}$ is called the hypothesis space.
Linear Regression (Motivation)
For example, $\mathcal{H}$ can be the set of linear functions. Then the problem can be rewritten in terms of the weights:

$$(\hat{w}_1, \dots, \hat{w}_m, \hat{b}) = \arg\min_{w_1, \dots, w_m, b} \frac{1}{2} \sum_{i=1}^{n} (a_i - y_i)^2$$

where $a_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_m x_{im} + b$.

This problem is called (least squares) linear regression.
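The linear prediction and its least squares cost translate directly to code; a minimal sketch (function names and data are illustrative, not from the slides):

```python
def linear_prediction(w, b, x):
    """a = w1*x1 + w2*x2 + ... + wm*xm + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def least_squares_cost(w, b, xs, ys):
    """(1/2) * sum over i of (a_i - y_i)^2."""
    return 0.5 * sum((linear_prediction(w, b, x) - y) ** 2
                     for x, y in zip(xs, ys))
```

A perfect fit gives cost 0; worse weights give a strictly larger cost.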
Binary Classification (Motivation)
If the problem is binary classification, $y$ is either 0 or 1, and linear regression is not a great choice.

This is because when the prediction is either too large or too small, the classification may be correct, but the cost is still large.
Binary Classification Linear Regression Diagram (Motivation)
Activation Function (Motivation)
Suppose $\mathcal{H}$ is the set of functions that are compositions of another function $g$ with linear functions:

$$(\hat{w}, \hat{b}) = \arg\min_{w, b} \frac{1}{2} \sum_{i=1}^{n} (a_i - y_i)^2$$

where $a_i = g(w^T x_i + b)$.

$g$ is called the activation function.
Linear Threshold Unit (Motivation)
One simple choice is to use the step function as the activation function:

$$g(z) = \mathbb{1}_{\{z \geq 0\}} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

This activation function is called the linear threshold unit (LTU).
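The step function is a one-liner (the name `ltu` is illustrative):

```python
def ltu(z):
    """Step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0
```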
Sigmoid Activation Function (Motivation)
When the activation function $g$ is the sigmoid function, the problem is called logistic regression:

$$g(z) = \frac{1}{1 + \exp(-z)}$$

This $g$ is also called the logistic function.
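The logistic function can be written directly (an illustrative sketch; its output always lies strictly between 0 and 1, with $g(0) = 0.5$):

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))
```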
Sigmoid Function Diagram (Motivation)
Cross-Entropy Loss Function (Motivation)
The cost function used for logistic regression is usually the log cost function:

$$C(f) = -\sum_{i=1}^{n} \left( y_i \log(f(x_i)) + (1 - y_i) \log(1 - f(x_i)) \right)$$

It is also called the cross-entropy loss function.
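A sketch of the cross-entropy cost (the name is illustrative; the activations must lie strictly between 0 and 1 so both logs are defined):

```python
import math

def cross_entropy_loss(activations, labels):
    """-sum(y*log(a) + (1-y)*log(1-a)) over all examples."""
    return -sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for a, y in zip(activations, labels))
```

A confident correct prediction costs less than an uncertain one.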
Logistic Regression Objective (Motivation)
The logistic regression problem can be summarized as the following:

$$(\hat{w}, \hat{b}) = \arg\min_{w, b} -\sum_{i=1}^{n} \left( y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right)$$

where $a_i = \frac{1}{1 + \exp(-z_i)}$ and $z_i = w^T x_i + b$.
Optimization Diagram (Motivation)
Logistic Regression (Description)
Initialize random weights.
Evaluate the activation function.
Compute the gradient of the cost function with respect to each weight and bias.
Update the weights and biases using gradient descent.
Repeat until convergence.
Gradient Descent Intuition (Definition)
If a small increase in $w_1$ causes the distances from the points to the regression line to decrease: increase $w_1$.

If a small increase in $w_1$ causes the distances from the points to the regression line to increase: decrease $w_1$.

The change in distance due to a change in $w_1$ is the derivative.

The change in distance due to a change in $\begin{pmatrix} w \\ b \end{pmatrix}$ is the gradient.
Gradient (Definition)
The gradient is the vector of derivatives.

The gradient of $f(x_i) = w^T x_i + b = w_1 x_{i1} + w_2 x_{i2} + \dots + w_m x_{im} + b$ is:

$$\nabla_w f = \begin{pmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_m} \end{pmatrix} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \end{pmatrix} = x_i, \qquad \nabla_b f = 1$$
Chain Rule (Definition)
The gradient of $f(x_i) = g(w^T x_i + b) = g(w_1 x_{i1} + w_2 x_{i2} + \dots + w_m x_{im} + b)$ can be found using the chain rule:

$$\nabla_w f = g'(w^T x_i + b) \, x_i, \qquad \nabla_b f = g'(w^T x_i + b)$$

In particular, for the logistic function $g$:

$$g(z) = \frac{1}{1 + \exp(-z)}, \qquad g'(z) = g(z)(1 - g(z))$$
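The identity $g'(z) = g(z)(1 - g(z))$ can be sanity-checked numerically against a central finite difference (an illustrative check, not part of the slides; the point $z = 0.7$ is arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Compare the analytic derivative with a finite-difference estimate.
z, h = 0.7, 1e-6
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```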
Logistic Gradient Derivation 1 (Definition)
Logistic Gradient Derivation 2 (Definition)
Gradient Descent Step (Definition)
For logistic regression, use the chain rule twice:

$$w \leftarrow w - \alpha \sum_{i=1}^{n} (a_i - y_i) \, x_i, \qquad b \leftarrow b - \alpha \sum_{i=1}^{n} (a_i - y_i)$$

where $a_i = g(w^T x_i + b)$ and $g(z) = \frac{1}{1 + \exp(-z)}$.

$\alpha$ is the learning rate. It is the step size for each step of gradient descent.
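One batch gradient descent step, translated directly from the update rule above (the function name and list-based representation are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, xs, ys, alpha):
    """One batch gradient descent step for logistic regression.
    w is a list of weights, xs a list of feature lists, ys are 0/1 labels."""
    acts = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in xs]
    grad_w = [sum((a - y) * x[j] for a, y, x in zip(acts, ys, xs))
              for j in range(len(w))]
    grad_b = sum(a - y for a, y in zip(acts, ys))
    return ([wj - alpha * gj for wj, gj in zip(w, grad_w)],
            b - alpha * grad_b)
```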
Perceptron Algorithm (Definition)
Update weights using the following rule:

$$w \leftarrow w - \alpha (a_i - y_i) \, x_i, \qquad b \leftarrow b - \alpha (a_i - y_i), \qquad a_i = \mathbb{1}_{\{w^T x_i + b \geq 0\}}$$
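The perceptron update for a single example, as a sketch (illustrative names; note the activation here is the step function rather than the sigmoid):

```python
def perceptron_step(w, b, x, y, alpha):
    """Perceptron update on one example; the activation is the step function."""
    a = 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0
    w_new = [wj - alpha * (a - y) * xj for wj, xj in zip(w, x)]
    b_new = b - alpha * (a - y)
    return w_new, b_new
```

A correctly classified example leaves the weights unchanged, since $a_i - y_i = 0$.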
Learning Rate Diagram (Definition)
Logistic Regression, Part 1 (Algorithm)
Inputs: instances $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$

Outputs: weights and biases: $w_1, w_2, \dots, w_m$ and $b$

Initialize the weights:

$$w_1, \dots, w_m, b \sim \text{Unif}[-1, 1]$$

Evaluate the activation function:

$$a_i = g(w^T x_i + b), \qquad g(z) = \frac{1}{1 + \exp(-z)}$$
Logistic Regression, Part 2 (Algorithm)
Update the weights and bias using gradient descent:

$$w \leftarrow w - \alpha \sum_{i=1}^{n} (a_i - y_i) \, x_i, \qquad b \leftarrow b - \alpha \sum_{i=1}^{n} (a_i - y_i)$$

Repeat the process until convergence:

$$|C - C^{\text{prev}}| < \varepsilon$$
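Putting Parts 1 and 2 together, a minimal pure-Python training loop under the stated stopping rule (all names are illustrative; the activations are clamped away from 0 and 1 so the logs in the cost stay defined, a detail not on the slides):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, alpha=0.1, eps=1e-8, max_iter=10000):
    """Batch gradient descent for logistic regression.
    Stops when |C - C_prev| < eps, or after max_iter iterations."""
    rng = random.Random(0)  # fixed seed so the run is reproducible
    w = [rng.uniform(-1, 1) for _ in range(len(xs[0]))]
    b = rng.uniform(-1, 1)
    c_prev = float("inf")
    for _ in range(max_iter):
        acts = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in xs]
        # Clamp so log(a) and log(1 - a) are always defined.
        acts = [min(max(a, 1e-12), 1 - 1e-12) for a in acts]
        c = -sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                 for a, y in zip(acts, ys))
        if abs(c - c_prev) < eps:
            break
        c_prev = c
        w = [wj - alpha * sum((a - y) * x[j] for a, y, x in zip(acts, ys, xs))
             for j, wj in enumerate(w)]
        b -= alpha * sum(a - y for a, y in zip(acts, ys))
    return w, b
```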
Stopping Rule and Local Minimum (Discussion)
Start with multiple random weights.

Use smaller or decreasing learning rates. One popular choice is $\frac{\alpha}{\sqrt{t}}$, where $t$ is the iteration count.

Use the solution with the lowest cost $C$.
Other Non-linear Activation Functions (Discussion)
Activation function (hyperbolic tangent): $g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

Activation function (inverse tangent): $g(z) = \arctan(z)$

Activation function (rectified linear unit): $g(z) = z \, \mathbb{1}_{\{z \geq 0\}}$

All these functions lead to objective functions that are convex and differentiable (almost everywhere). Gradient descent can be used.
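These activations are available directly in Python's `math` module (the wrapper names are illustrative):

```python
import math

def tanh_act(z):
    """Hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z)."""
    return math.tanh(z)

def arctan_act(z):
    """Inverse tangent activation."""
    return math.atan(z)

def relu(z):
    """Rectified linear unit: z * 1{z >= 0}."""
    return max(z, 0.0)
```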
Convexity Diagram (Discussion)
Convexity (Discussion)
If a function is convex, gradient descent with any initialization will converge to the global minimum (given a sufficiently small learning rate).

If a function is not convex, gradient descent with different initializations may converge to different local minima.

A twice-differentiable function is convex if and only if its second derivative is non-negative.

In the multivariate case, this means the Hessian matrix is positive semidefinite.
Positive Semidefinite (Discussion)
The Hessian matrix is the matrix of second derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

A matrix $H$ is positive semidefinite if $x^T H x \geq 0$ for all $x \in \mathbb{R}^n$.

A symmetric matrix is positive semidefinite if and only if all of its eigenvalues are non-negative.
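For a symmetric 2x2 matrix the eigenvalue test can be carried out in closed form (an illustrative sketch using the quadratic formula for the characteristic polynomial; the function name is made up):

```python
import math

def is_psd_2x2(a, b, d):
    """Check whether the symmetric matrix [[a, b], [b, d]] is positive
    semidefinite via its smallest eigenvalue:
    eigenvalues = (trace +/- sqrt(trace^2 - 4*det)) / 2."""
    trace, det = a + d, a * d - b * b
    disc = math.sqrt(max(trace * trace - 4 * det, 0.0))
    return (trace - disc) / 2 >= 0
```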
Convex Function Example 1 (Discussion)
Convex Function Example 2 (Discussion)