Machine Learning (CSE 446): Practical Issues
Noah Smith © 2017
University of [email protected]
October 18, 2017
scary words
Outline of CSE 446
We’ve already covered stuff in blue!
- Problem formulations: classification, regression
- Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, probabilistic models, neural networks, kernel methods
- Unsupervised techniques: clustering, linear dimensionality reduction
- “Meta-techniques”: ensembles, expectation-maximization
- Understanding ML: limits of learning, practical issues, bias & fairness
- Recurring themes: (stochastic) gradient descent, bullshit detection
Today: (More) Best Practices
You already know:
- Separating training and test data
- Hyperparameter tuning on development data
Understanding machine learning is partly about knowing algorithms and partly about the art of mapping abstract problems to learning tasks.
Features Matter
Irrelevant Features
One irrelevant feature isn’t a big deal; what we’re worried about is when irrelevant features outnumber useful ones!
- Decision trees (not too deep)? Somewhat protected, but beware spurious correlations!
- K-nearest neighbors? ☹
- Perceptron? ☺
What about redundant features φ_j and φ_j′ such that φ_j ≈ φ_j′?
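A minimal sketch of why nearest neighbors suffer here, assuming Euclidean distance and K = 1. The data values and the helper name `nearest_label` are made up for illustration; they are not from the lecture.

```python
import math

def nearest_label(query, train):
    """Return the label of the training point closest to `query` (1-NN, Euclidean)."""
    return min(train, key=lambda pair: math.dist(query, pair[0]))[1]

# One useful feature: the label is +1 when the feature is large.
train_useful = [((0.9,), +1), ((0.1,), -1)]
query_useful = (0.8,)

# The same points with three irrelevant features appended (hypothetical values).
train_noisy = [((0.9, 5.0, 5.0, 5.0), +1), ((0.1, 0.0, 0.0, 0.0), -1)]
query_noisy = (0.8, 0.0, 0.0, 0.0)

label_useful = nearest_label(query_useful, train_useful)  # decided by the useful feature
label_noisy = nearest_label(query_noisy, train_noisy)     # decided by the irrelevant ones
```

With only the useful feature the query is matched to the +1 point; once the irrelevant features dominate the distance, the same query is matched to the -1 point.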
Technique: Feature Pruning
If a binary feature is present in too small or too large a fraction of D, remove it.
Example: φ(x) = ⟦the word “the” occurs in document x⟧
Generalization: if a feature has variance (in D) lower than some threshold value, remove it.
Note: in lecture, I mistakenly said to remove high-variance features. Mea culpa.
$$\text{sample mean}(\phi; D) = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \qquad \text{(call it } \bar{\phi}\text{)}$$
$$\text{sample variance}(\phi; D) = \frac{1}{N-1} \sum_{n=1}^{N} \left( \phi(x_n) - \bar{\phi} \right)^2 \qquad \text{(call it } \mathrm{Var}(\phi)\text{)}$$
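A rough sketch of variance-based pruning using the two formulas above. The feature names, the toy data, and the threshold of 0.05 are assumptions chosen for illustration only.

```python
def sample_mean(values):
    """(1/N) * sum of the feature's values over the dataset."""
    return sum(values) / len(values)

def sample_variance(values):
    """(1/(N-1)) * sum of squared deviations from the sample mean."""
    mean = sample_mean(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def prune_low_variance(columns, threshold):
    """Keep only the features whose sample variance is at least `threshold`."""
    return {name: vals for name, vals in columns.items()
            if sample_variance(vals) >= threshold}

# Hypothetical binary features over N = 6 documents.
features = {
    "contains_the": [1, 1, 1, 1, 1, 1],   # present in every document
    "contains_cool": [1, 0, 1, 0, 0, 1],  # varies across documents
}
kept = prune_low_variance(features, threshold=0.05)
```

Here the feature for "the" has zero variance and is dropped, while the feature for "cool" survives.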
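Technique: Feature Normalization
Center a feature (subtract its sample mean $\bar{\phi}$):
$$\phi(x) \to \phi(x) - \bar{\phi}$$
(This was a required step for principal components analysis!)
Scale a feature. Two choices:
$$\phi(x) \to \frac{\phi(x)}{\sqrt{\mathrm{Var}(\phi)}} \qquad \text{(variance scaling)}$$
$$\phi(x) \to \frac{\phi(x)}{\max_n |\phi(x_n)|} \qquad \text{(absolute scaling)}$$
Remember that you’ll need to normalize test data before you test!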
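One way this might look in code: center, then variance-scale. The sketch assumes the mean and standard deviation are computed on the training data and then reused on the test data, which is common practice but is not spelled out on the slide. The function names and numbers are illustrative, and a constant feature (zero variance) would need to be pruned first.

```python
import math

def fit_normalizer(train_values):
    """Compute the sample mean and standard deviation of one feature on training data."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / (n - 1)
    return mean, math.sqrt(var)

def normalize(values, mean, std):
    """Center, then variance-scale, using statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0, 8.0]  # made-up training values for one feature
test = [5.0, 10.0]            # made-up test values for the same feature

mean, std = fit_normalizer(train)         # statistics come from training data only
train_norm = normalize(train, mean, std)
test_norm = normalize(test, mean, std)    # assumption: reuse the training statistics
```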
Techniques: Creating New Features
1. Consider two binary features, φ_j and φ_j′. A new conjunction feature can be defined by:
$$\phi_{j \wedge j'}(x) = \phi_j(x) \wedge \phi_{j'}(x)$$
The classic “xor” problem: these points are not linearly separable.
Define x[3] = x[1] ∧ x[2].
Rotating the view.
With the conjunction feature added, the plane 2 · x[1] + 2 · x[2] − 4 · x[3] − 1 = 0 separates the two xor classes.
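A quick check of the plane above on the four xor points, as a sketch. The helper names are illustrative, and the code uses 0-based indexing where the slide writes x[1], x[2], x[3].

```python
def with_conjunction(x1, x2):
    """Append the conjunction feature x[3] = x[1] AND x[2] to a binary point."""
    return (x1, x2, x1 & x2)

def side(x):
    """Which side of the plane 2*x[1] + 2*x[2] - 4*x[3] - 1 = 0 the point lies on."""
    score = 2 * x[0] + 2 * x[1] - 4 * x[2] - 1
    return 1 if score > 0 else -1

# xor labels: +1 when exactly one input is 1, otherwise -1.
xor_points = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}
predictions = {p: side(with_conjunction(*p)) for p in xor_points}
```

All four points land on the correct side, so a linear model can represent xor once the conjunction feature is available.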
Generalization: take the product of two features.
2. Even more generally, we can create conjunctions (or products) using as many features as we’d like. This is one view of what decision trees are doing!
   - Every leaf’s path (from root) is a conjunction feature.
   - Why not build decision trees, extract the features and toss them into the perceptron?
3. Transformations on features can be useful. For example,
$$\phi(x) \to \operatorname{sign}(\phi(x)) \cdot \log(1 + |\phi(x)|)$$
Example: φ(x) is the count of the word “cool” in document x.
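A small sketch of the transformation in item 3, applied to made-up word counts. The function name `signed_log` is an assumption for illustration.

```python
import math

def signed_log(value):
    """sign(v) * log(1 + |v|): compresses large magnitudes, keeps the sign, maps 0 to 0."""
    return math.copysign(math.log1p(abs(value)), value)

# Hypothetical counts of the word "cool" in four documents.
counts = [0, 1, 10, 100]
transformed = [signed_log(c) for c in counts]
```

The count of 100 ends up below 5 after the transformation, so a single very repetitive document no longer dominates the feature's range.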
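Evaluation
Accuracy:
$$\begin{aligned} A(f) &= \sum_{x} D(x, f(x)) \\ &= \sum_{x,y} D(x, y) \cdot \begin{cases} 1 & \text{if } f(x) = y \\ 0 & \text{otherwise} \end{cases} \\ &= \sum_{x,y} D(x, y) \cdot [\![ f(x) = y ]\!] \end{aligned}$$
where D is the true distribution over data. Error is 1 − A; earlier we denoted error “ε(f).”
This is estimated using a test dataset ⟨x_1, y_1⟩, . . . , ⟨x_N′, y_N′⟩:
$$\hat{A}(f) = \frac{1}{N'} \sum_{i=1}^{N'} [\![ f(x_i) = y_i ]\!]$$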
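The test-set estimate is just the fraction of test examples the classifier gets right. A minimal sketch, where the classifier and the four test pairs are hypothetical:

```python
def accuracy(f, test_data):
    """Fraction of (x, y) pairs in the test set for which f(x) == y."""
    correct = sum(1 for x, y in test_data if f(x) == y)
    return correct / len(test_data)

def threshold_classifier(x):
    """Hypothetical classifier: flag as spam when the single feature exceeds 0.5."""
    return "spam" if x > 0.5 else "not spam"

# Made-up test set of (feature value, true label) pairs.
test_data = [(0.9, "spam"), (0.2, "not spam"), (0.6, "not spam"), (0.1, "not spam")]
acc = accuracy(threshold_classifier, test_data)
```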
Issues with Test-Set Accuracy
- Class imbalance: if D(∗, not spam) = 0.99, then you can get A ≈ 0.99 by always guessing “not spam.”
- Relative importance of classes or cost of error types.
- Variance due to the test data.
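The class-imbalance point can be seen on a toy test set. This sketch assumes 100 test messages with exactly one spam message, mirroring the 0.99 figure; the variable names are illustrative.

```python
# Hypothetical test labels: 99 legitimate messages and 1 spam message.
labels = ["not spam"] * 99 + ["spam"]

# A "classifier" that ignores its input and always guesses the majority class.
guesses = ["not spam"] * len(labels)

baseline_accuracy = sum(g == y for g, y in zip(guesses, labels)) / len(labels)
spam_caught = sum(g == "spam" and y == "spam" for g, y in zip(guesses, labels))
```

The accuracy looks excellent even though no spam message is ever caught, which is why accuracy alone can mislead on imbalanced data.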
Evaluation in the Two-Class Case
Suppose we have two classes, and one of them, t, is a “target.”
- E.g., given a query, find relevant documents.
Precision and recall encode the goals of returning a “pure” set of targeted instances and capturing all of them.
[Venn diagram: A = instances actually in the target class (y = t); B = instances believed to be in the target class (f(x) = t); C = A ∩ B, the instances correctly labeled as t.]
$$P(f) = \frac{|C|}{|B|} = \frac{|A \cap B|}{|B|}$$
$$R(f) = \frac{|C|}{|A|} = \frac{|A \cap B|}{|A|}$$
$$F_1(f) = 2 \cdot \frac{P \cdot R}{P + R}$$
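The three quantities can be computed directly from the sets A, B, and C. In this sketch the labels are made up, and returning 0.0 when a denominator is empty is an assumed convention that the slide does not specify.

```python
def precision_recall_f1(gold, predicted, target):
    """Compute P, R, and F1 for one target class from parallel label lists."""
    a = {i for i, y in enumerate(gold) if y == target}       # A: actually in the target class
    b = {i for i, p in enumerate(predicted) if p == target}  # B: believed to be in the target class
    c = a & b                                                # C: correctly labeled as the target
    precision = len(c) / len(b) if b else 0.0  # assumed convention for empty B
    recall = len(c) / len(a) if a else 0.0     # assumed convention for empty A
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted labels for six instances.
gold = ["t", "t", "t", "t", "n", "n"]
predicted = ["t", "t", "n", "n", "t", "n"]
p, r, f1 = precision_recall_f1(gold, predicted, target="t")
```

Here |A| = 4, |B| = 3, and |C| = 2, so precision is 2/3, recall is 1/2, and F1 is 4/7.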
Another View: Contingency Table
              y = t                      y ≠ t                      total
f(x) = t      C (true positives)         B \ C (false positives)    B
f(x) ≠ t      A \ C (false negatives)    (true negatives)
total         A