Donald “Godel” Rumsfeld


Page 1

Donald “Godel” Rumsfeld

''Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns, there are things we know we know,'' Rumsfeld said. ''We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don't know we don't know.''

» Rumsfeld talking about the reported lack of WMDs in Iraq (News Conference, April 2003)

Winner of 2003 Foot in the Mouth Award

''We think we know what he means,'' said Plain English Campaign spokesman John Lister. ''But we don't know if we really know.''

Page 2

12/2: Decisions… Decisions

• Vote on final:
– In-class (16th, 2:40pm), OR
– Take-home (will be due by the 16th)
• Clarification on HW5
• Participation survey

Page 3

Learning

Dimensions:
• What can be learned?
– Any of the boxes representing the agent's knowledge
– Action descriptions, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor data interpretation models
• What feedback is available?
– Supervised, unsupervised, "reinforcement" learning
– The credit assignment problem
• What prior knowledge is available?
– "Tabula rasa" (the agent's head is a blank slate) or pre-existing knowledge

Page 4

Inductive Learning (Classification Learning)

• Given a set of labeled examples, and a space of hypotheses
– Find the rule that underlies the labeling
• (so you can use it to predict future unlabeled examples)
– Tabula rasa, fully supervised
• Idea:
– Loop through all hypotheses
• Rank each hypothesis in terms of its match to data
• Pick the best hypothesis

Closely related to function learning or curve-fitting (regression)
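The loop-rank-pick idea can be captured in a few lines. A minimal sketch, assuming hypotheses are boolean predicates over examples; the restaurant attributes and hypotheses below are illustrative, not from the slides:

```python
# Enumerate-and-rank inductive learning over a small, finite hypothesis space.
def learn(hypotheses, examples):
    """Return the hypothesis that misclassifies the fewest labeled examples."""
    def mismatches(h):
        return sum(1 for x, label in examples if h(x) != label)
    return min(hypotheses, key=mismatches)

# Illustrative usage (hypothetical attributes):
hypotheses = [
    lambda x: x["cuisine"] == "italian",                # an "H1"-style rule
    lambda x: x["cheap"] and x["cuisine"] == "french",  # an "H2"-style rule
]
examples = [({"cuisine": "italian", "cheap": True}, True),
            ({"cuisine": "french", "cheap": False}, False)]
best = learn(hypotheses, examples)
```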

Page 5

A classification learning example: predicting when Russell will wait for a table

-- similar to predicting credit card fraud, or predicting when people are likely to respond to junk mail

Page 6

Ranking hypotheses

A good hypothesis will have the fewest false positives (Fh+) and the fewest false negatives (Fh-)
[Ideally, we want both to be zero]

Rank(h) = f(Fh+, Fh-)
-- f depends on the domain
-- in a medical domain, false negatives are penalized more
-- in a junk-mailing domain, false negatives are penalized less

H1: Russell waits only in Italian restaurants. False +ves: X10; false -ves: X1, X3, X4, X8, X12
H2: Russell waits only in cheap French restaurants. False +ves: (none); false -ves: X1, X3, X4, X6, X8, X12

(A false positive is an example the hypothesis classifies as +ve but that is actually -ve.)
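A hedged sketch of the domain-dependent ranking function f: count false positives and false negatives, then weight them according to the domain (the weights here are illustrative):

```python
def rank(h, examples, fp_weight=1.0, fn_weight=1.0):
    """Lower is better: weighted count of false +ves and false -ves."""
    fp = sum(1 for x, label in examples if h(x) and not label)  # Fh+
    fn = sum(1 for x, label in examples if not h(x) and label)  # Fh-
    return fp_weight * fp + fn_weight * fn

# Medical-style domain: penalize false negatives much more heavily, e.g.
#   best = min(hypotheses, key=lambda h: rank(h, examples, fn_weight=10.0))
```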

Page 7

When do you know you have learned the concept well?

• You can classify all new instances (test cases) correctly, always
• "Always"?
– Maybe the training samples are not completely representative of the test samples
– So, we go with "probably"
• "Correctly"?
– May be impossible if the training data has noise (the teacher may make mistakes too)
– So, we go with "approximately"
• The goal of a learner, then, is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) ε and probability δ.
• When is a learner A better than learner B?
– For the same ε, δ bounds, A needs fewer training samples than B to reach PAC.

Learning Curves

Page 8

PAC learning

A learner is considered PAC (probably approximately correct) with respect to an error rate ε and a probability δ, with ε, δ ∈ (0,1), if the probability that the learner makes more than an ε fraction of errors is at most δ:

Pr(error(h) > ε) ≤ δ

It can be shown that in the worst case a learner needs N training samples to PAC-learn a concept, where N is related to ε, δ and the size of the hypothesis space H as follows:

N ≥ (1/ε) (ln |H| + ln (1/δ))

Note: This result only holds for finite hypothesis spaces (e.g. not valid for the space of line hypotheses!)
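Plugging the worst-case bound into code (the ε, δ and |H| values below are illustrative):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Worst-case N >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

# Conjunctive hypotheses over n = 6 boolean features: |H| = 3**6 = 729.
print(pac_sample_bound(3**6, epsilon=0.1, delta=0.05))  # 96 samples
```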

Page 9

Inductive Learning (Classification Learning)

• Given a set of labeled examples, and a space of hypotheses
– Find the rule that underlies the labeling
• (so you can use it to predict future unlabeled examples)
– Tabula rasa, fully supervised
• Idea:
– Loop through all hypotheses
• Rank each hypothesis in terms of its match to data
• Pick the best hypothesis
• Main variations:
• Bias: what "sort" of rule are you looking for?
– If you are looking for only conjunctive hypotheses, there are just 3^n
• Search:
– Greedy search (decision tree learner)
– Systematic search (version space learner)
– Iterative search (neural net learner)

The main problem is that the space of hypotheses is too large: given examples described in terms of n boolean variables, there are 2^(2^n) different hypotheses. For 6 features, that is 2^64 = 18,446,744,073,709,551,616 hypotheses (versus just 3^6 = 729 conjunctive ones).

It can be shown that the sample complexity of PAC learning is proportional to 1/ε, log(1/δ), and log |H| (see Page 8).

Page 10

Importance of Bias in Learning…

The "Gavagai" example: the "whole object" bias in language learning.

The more expressive the bias, the larger the hypothesis space, and the slower the learning
-- Line fitting is faster than curve fitting
-- Line fitting may miss non-line patterns

Page 11

Uses different biases in predicting Russell's waiting habits

[Figure: a Bayes net for "Russell waits" (RW) with attribute nodes Wait time?, Patrons?, Friday?, and a conditional probability table P(Patrons | RW):

Patrons | RW=T | RW=F
full    | 0.5  | 0.3
some    | 0.2  | 0.3
None    | 0.3  | 0.4 ]

Naïve Bayes (Bayes net learning)
-- Examples are used to learn topology and to learn CPTs

Neural nets
-- Examples are used to learn topology and to learn edge weights

Decision trees
-- Examples are used to learn topology and the order of questions
   If patrons=full and day=Friday then wait (0.3/0.7)
   If wait>60 and Reservation=no then wait (0.4/0.9)

Association rules
-- Examples are used to learn the support and confidence of association rules
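As a hedged illustration of "examples are used to learn CPTs": once the topology is fixed, CPT entries can be estimated by counting (the attribute names below are illustrative):

```python
from collections import Counter

def learn_cpt(examples, attr):
    """Estimate P(attr | RW) by counting attribute values separately
    among the RW=True and RW=False examples."""
    counts = {True: Counter(), False: Counter()}
    for attrs, rw in examples:
        counts[rw][attrs[attr]] += 1
    return {rw: {value: n / sum(ctr.values()) for value, n in ctr.items()}
            for rw, ctr in counts.items() if ctr}

# learn_cpt(examples, "patrons") -> e.g. {True: {"full": 0.5, ...}, False: {...}}
```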

Page 12

[This slide repeats the figure from Page 11.]

Mirror, mirror, on the wall,
which learning bias is the best of all?

Well, there is no such thing, silly!

--Each bias makes it easier to learn some patterns and harder (or impossible) to learn others:

-- A line-fitter can fit the best line to the data very fast, but won't know what to do if the data doesn't fall on a line

-- A curve-fitter can fit lines as well as curves, but takes longer to fit lines than a line-fitter

-- Different types of bias classes (Decision trees, NNs etc) provide different ways of naturally carving up the space of all possible hypotheses

So a more reasonable question is:

-- What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?

-- In this bias class, what is the most restrictive bias that still can capture the true pattern in the data?

-- Decision trees can capture all boolean functions
   -- but are faster at capturing conjunctive boolean functions

-- Neural nets can capture all boolean or real-valued functions
   -- but are faster at capturing linearly separable functions

-- Bayesian learning can capture all probabilistic dependencies
   -- but is faster at capturing single-level dependencies (naïve Bayes classifier)

Page 13

12/4

Interactive review next class!!
Minh's review: next Monday evening
Rao's review: reading day?

Vote on participation credit:
Should I consider participation credit or not?

Page 14

Fitting test cases vs. predicting future cases: the BIG TENSION….

Why Simple is Better

[Figure: training error vs. test (prediction) error, measured as the fraction incorrectly classified, for three hypotheses of increasing complexity (1, 2, 3). Why not the 3rd?]

Review

Page 15

Learning Decision Trees---How?

Basic idea:
-- Pick an attribute
-- Split examples in terms of that attribute
-- If all examples are +ve, label Yes. Terminate
-- If all examples are -ve, label No. Terminate
-- If some are +ve and some are -ve, continue splitting recursively

Which one to pick?

Page 16

Depending on the order we pick, we can get smaller or bigger trees

Which tree is better? Why do you think so??

Page 17

Basic idea:
-- Pick an attribute
-- Split examples in terms of that attribute
-- If all examples are +ve, label Yes. Terminate
-- If all examples are -ve, label No. Terminate
-- If some are +ve and some are -ve, continue splitting recursively
-- If no attributes are left to split on, label with the majority element
(the splitting recursion is sketched below)

Would you split on patrons or Type?
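A hedged sketch of the splitting recursion above. Attribute selection is stubbed out (it just takes the first attribute); a real learner would use the information-gain heuristic developed on the next slide:

```python
from collections import Counter

def learn_tree(examples, attributes):
    """examples: (attrs_dict, bool_label) pairs; attributes: list of keys.
    Returns "Yes"/"No" leaves or (attribute, {value: subtree}) nodes.
    Empty example sets default to "Yes" in this sketch."""
    labels = [label for _, label in examples]
    if all(labels):                       # all +ve
        return "Yes"
    if not any(labels):                   # all -ve
        return "No"
    if not attributes:                    # nothing left: majority element
        counts = Counter(labels)
        return "Yes" if counts[True] >= counts[False] else "No"
    a, rest = attributes[0], attributes[1:]   # stand-in for choose_attribute()
    tree = {}
    for value in {attrs[a] for attrs, _ in examples}:
        subset = [(attrs, l) for attrs, l in examples if attrs[a] == value]
        tree[value] = learn_tree(subset, rest)
    return (a, tree)
```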

Page 18

The Information Gain Computation

Splitting N+ positive and N- negative examples on feature fk partitions them into k groups (N1+, N1-), …, (Nk+, Nk-).

P+ = N+ / (N+ + N-);  P- = N- / (N+ + N-)

I(P+, P-) = -P+ log(P+) - P- log(P-)

The residual information after splitting on fk is

Σ(i=1..k)  [Ni+ + Ni-] / [N+ + N-] × I(Pi+, Pi-)

The difference is the information gain. So, pick the feature with the largest info gain, i.e. the smallest residual info.

Given k mutually exclusive and exhaustive events E1…Ek whose probabilities are p1…pk, the "information" content (entropy) is defined as Σi -pi log2 pi (the expected number of comparisons needed to tell whether a given example is +ve or -ve).
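The same computation as code. The Patrons counts in the usage line follow the standard restaurant example and are shown only as an illustration:

```python
import math

def entropy(pos, neg):
    """I(P+, P-) = -P+ log2 P+ - P- log2 P-, with 0*log(0) taken as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

def info_gain(pos, neg, splits):
    """splits: list of (Ni+, Ni-) pairs produced by splitting on a feature."""
    residual = sum((p + n) / (pos + neg) * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - residual

# 6 +ve / 6 -ve examples split by Patrons into None (0+,2-), some (4+,0-),
# full (2+,4-):
print(info_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))  # ~0.541 bits
```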

Page 19

Ex | Masochistic | Anxious | Nerdy | HATES EXAM
1  | F           | T       | F     | Y
2  | F           | F       | T     | N
3  | T           | F       | F     | N
4  | T           | T       | T     | Y

A simple example

V(M) = 2/4 * I(1/2,1/2) + 2/4 * I(1/2,1/2) = 1

V(A) = 2/4 * I(1,0) + 2/4 * I(0,1) = 0

V(N) = 2/4 * I(1/2,1/2) + 2/4 * I(1/2,1/2) = 1

So Anxious is the best attribute to split on. Once you split on Anxious, the problem is solved.

I(1/2,1/2) = -1/2 * log 1/2 - 1/2 * log 1/2 = 1/2 + 1/2 = 1

I(1,0) = -1 * log 1 - 0 * log 0 = 0
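Checking the slide's arithmetic with a small self-contained snippet:

```python
import math

def I(p_pos, p_neg):
    """Entropy in bits, with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2))  # V(M) = 1.0
print(2/4 * I(1, 0)     + 2/4 * I(0, 1))      # V(A) = 0.0
print(2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2))  # V(N) = 1.0
```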

Page 20

Page 21

Learning curves… Given N examples, partition them into Ntr (the training set) and Ntest (the test instances).

Loop for i = 1 to |Ntr|
  Loop for Ns in subsets of Ntr of size i
    Train the learner over Ns
    Test the learned pattern over Ntest and compute the accuracy (% correct)
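A sketch of that procedure, with one practical simplification: rather than looping over every subset of size i (exponentially many), it samples a few random subsets per size. `learner(train)` is assumed to return a classifier h with h(x) -> label:

```python
import random

def learning_curve(learner, examples, n_test, trials=20):
    """Return [(training size, mean accuracy)]; shuffles `examples` in place."""
    random.shuffle(examples)
    test, train_pool = examples[:n_test], examples[n_test:]
    curve = []
    for i in range(1, len(train_pool) + 1):
        accs = []
        for _ in range(trials):  # sample subsets instead of enumerating all
            subset = random.sample(train_pool, i)
            h = learner(subset)
            accs.append(sum(h(x) == y for x, y in test) / len(test))
        curve.append((i, sum(accs) / len(accs)))
    return curve
```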

Evaluating the Decision Trees

[Figure: learning curves in the Russell domain and for the "majority" function (say yes if a majority of attributes are yes)]

Lesson: Every bias makes something easier to learn and others harder to learn…

Page 22

Problems with Info-Gain Heuristics

• Feature correlation: the Costanza party problem
– No obvious solution…
• Overfitting: we may look too hard for patterns where there are none
– E.g. coin tosses classified by the day of the week, the shirt I was wearing, the time of the day, etc.
– Solution: don't consider splitting if the information gain given by the best feature is below a minimum threshold
• Can use the χ² test for statistical significance
• Will also help when we have noisy samples…
• We may prefer features with very high branching
– e.g. branch on the "universal time string" for the Russell restaurant example
– Branch on social security number to look for patterns on who will get an A
– Solution: "gain ratio", the ratio of the information gain from attribute A to the information content of answering the question "What is the value of A?"
• The denominator is smaller for attributes with smaller domains.
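A hedged sketch of the gain-ratio fix: divide the information gain by the information content of answering "What is the value of A?" (the entropy of the attribute's own value distribution):

```python
import math

def entropy_of(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(pos, neg, splits):
    """splits: (Ni+, Ni-) pairs after branching on attribute A."""
    residual = sum((p + n) / (pos + neg) * entropy_of([p, n]) for p, n in splits)
    gain = entropy_of([pos, neg]) - residual
    split_info = entropy_of([p + n for p, n in splits])  # info in "value of A?"
    return gain / split_info if split_info > 0 else 0.0
```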

Page 23

Neural Network Learning

• Idea: Since classification is really a question of finding a surface to separate the +ve examples from the -ve examples, why not directly search in the space of possible surfaces?

• Mathematically, a surface is a function
– Need a way of learning functions
– "Threshold units"

Page 24

A "neural net" is a collection of differentiable threshold units with interconnections.

Feed-forward (uni-directional connections): single-layer or multi-layer
– Any linear decision surface can be represented by a single-layer neural net
– Any "continuous" decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net

Recurrent (bi-directional connections)
– Can act as associative memory

[Figure: a threshold unit with inputs I1, I2, weights w1, w2 and threshold t=k; output = 1 if w1*I1 + w2*I2 > k, and 0 otherwise]
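The threshold unit from the figure, as code (the AND-gate weights in the usage lines are an illustrative choice):

```python
def threshold_unit(weights, inputs, k):
    """Output 1 if w1*I1 + ... + wn*In > k, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > k else 0

# An AND gate as a threshold unit: w1 = w2 = 1, k = 1.5.
assert threshold_unit([1, 1], [1, 1], 1.5) == 1
assert threshold_unit([1, 1], [1, 0], 1.5) == 0
```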

Page 25

A Threshold Unit

…is sort of like a neuron

[Figures: threshold functions, including a differentiable variant; "The Brain Connection"]

Page 26

Perceptron Networks

What happened to the "threshold"?
-- Can model it as an extra weight with a static input

[Figure: a unit with inputs I1, I2 (weights w1, w2) and threshold t=k is equivalent to a unit with the same inputs, threshold t=0, and an extra input I0 = -1 with weight w0 = k]
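The equivalence in the figure, spelled out in code:

```python
def unit_with_threshold(w, x, k):
    """Output 1 if the weighted sum of inputs exceeds the threshold k."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > k else 0

def unit_with_bias_weight(w, x, k):
    """Fold the threshold in as weight w0 = k on a static input I0 = -1,
    then threshold at 0. Behaves identically to unit_with_threshold."""
    return unit_with_threshold([k] + list(w), [-1] + list(x), 0)
```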

Page 27

Can Perceptrons Learn All Boolean Functions?
-- Are all boolean functions linearly separable?

Page 28

Perceptron Training in Action

A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html

Any line that separates the +ve and -ve examples is a solution
-- may want the line that is, in some sense, equidistant from the nearest +ve/-ve examples
-- need "support vector machines" for that
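The slides don't spell out the perceptron training rule; below is a sketch of the standard update (nudge the weights toward each misclassified example), reusing the bias-weight encoding from Page 26:

```python
def train_perceptron(examples, n_features, rate=0.1, epochs=100):
    """examples: (feature_tuple, label) pairs with label in {0, 1}.
    Returns weights; w[0] is the bias weight on the static input -1."""
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        converged = True
        for x, label in examples:
            xs = [-1.0] + list(x)
            out = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            if out != label:
                converged = False
                # Standard perceptron rule: move weights toward the example.
                w = [wi + rate * (label - out) * xi for wi, xi in zip(w, xs)]
        if converged:  # all examples classified correctly: a separating line
            break
    return w
```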

Page 29

Comparing Perceptrons and Decision Trees on the Majority Function and the Russell Domain

[Figure: learning curves for the perceptron and the decision tree learner on the majority function and on the Russell domain]

The majority function is linearly separable… the Russell domain is apparently not…

Encoding: one input unit per attribute. The unit takes as many distinct real values as the size of the attribute domain.