
Today’s Topics (only on final at a high level; Sec 19.5 and Sec 18.5 readings below are ‘skim only’)

12/8/15 CS 540 - Fall 2015 (Shavlik©), Lecture 30, Week 14

• HW5 must be turned in by 11:55pm Fri (soln out early Sat)

• Read Chapters 26 and 27 of textbook for Next Tuesday

• Exam (comprehensive, with focus on material since the midterm), Thurs 5:30-7:30pm, in this room; two pages of notes and a simple calculator (log, e, * / + -) allowed

• Next Tues We’ll Cover My Fall 2014 Final (Spring 2013 Next Weds?)

• A Short Introduction to Inductive Logic Programming (ILP) – Sec. 19.5 of textbook

  - learning FOPC ‘rule sets’

  - could, in a follow-up step, learn MLN weights on these rules (i.e., learn ‘structure’ then learn ‘weights’)

• A Short Introduction to Computational Learning Theory (COLT) – Sec 18.5 of text


Inductive Logic Programming (ILP)

• Use mathematical logic to

  – Represent training examples (goes beyond fixed-length feature vectors)

  – Represent learned models (FOPC rule sets)

• ML work in the late ’70s through early ’90s was logic-based, then statistical ML ‘took over’


Examples in FOPC (not all have same # of ‘features’)

on(ex1, block1, table)
on(ex1, block2, block1)
color(ex1, block1, blue)
color(ex1, block2, blue)
size(ex1, block1, large)
size(ex1, block2, small)

< a much larger number of facts are needed to describe example2 >

[Figure: two positive examples, PosEx1 and PosEx2, shown as block configurations]

Learned Concept:
  tower(?E) if on(?E, ?A, table), on(?E, ?B, ?A).
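To make this representation concrete, here is a minimal Python sketch (not the course's ILP system; the predicate and constant names follow the slide, the code itself is invented for illustration) that stores an example as a set of ground facts and checks the learned tower rule by trying bindings for ?A and ?B:

```python
# Minimal sketch: an example is a set of ground facts, and the learned rule
#   tower(?E) if on(?E, ?A, table), on(?E, ?B, ?A).
# is checked by searching for a binding of ?A and ?B.

pos_ex1 = {
    ("on", "ex1", "block1", "table"),
    ("on", "ex1", "block2", "block1"),
    ("color", "ex1", "block1", "blue"),
    ("color", "ex1", "block2", "blue"),
    ("size", "ex1", "block1", "large"),
    ("size", "ex1", "block2", "small"),
}

def is_tower(example_id, facts):
    """True iff some block A sits on the table and some other block B sits on A."""
    on_facts = [f for f in facts if f[0] == "on" and f[1] == example_id]
    for (_, _, a, support) in on_facts:          # candidate binding for ?A
        if support == "table":
            for (_, _, b, under) in on_facts:    # candidate binding for ?B
                if under == a and b != a:
                    return True
    return False

print(is_tower("ex1", pos_ex1))   # True: block1 is on the table, block2 is on block1
```

Because each example is just a set of facts, different examples can involve different numbers of facts, which is exactly what a fixed-length feature vector cannot express.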


Searching for a Good Rule (propositional-logic version)

(Figure: a general-to-specific search tree over candidate rules)

P is always true
  One condition:  P if A    P if B    P if C
  Two conditions: P if B and C    P if B and D
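A minimal, hypothetical sketch of this kind of top-down (general-to-specific) search, assuming we greedily add one condition at a time and score rules by positives minus negatives covered; the example data at the bottom is invented, and this is an illustration rather than the lecture's exact algorithm:

```python
# Hypothetical greedy top-down rule search over propositional conditions.
# Examples are dicts of boolean features; a rule is the set of conditions
# that must all hold for the rule to predict P.

def covers(rule, example):
    return all(example[cond] for cond in rule)

def score(rule, pos, neg):
    # simple scoring: positives covered minus negatives covered
    return (sum(covers(rule, e) for e in pos)
            - sum(covers(rule, e) for e in neg))

def greedy_rule_search(conditions, pos, neg, max_len=3):
    rule = set()                      # empty body: "P is always true"
    while len(rule) < max_len:
        candidates = [rule | {c} for c in conditions if c not in rule]
        best = max(candidates, key=lambda r: score(r, pos, neg))
        if score(best, pos, neg) <= score(rule, pos, neg):
            break                     # no single added condition helps; stop
        rule = best
    return rule

pos = [{"A": 1, "B": 1, "C": 0, "D": 1}, {"A": 0, "B": 1, "C": 1, "D": 0}]
neg = [{"A": 1, "B": 0, "C": 1, "D": 1}]
print(greedy_rule_search(["A", "B", "C", "D"], pos, neg))   # {'B'}
```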


All Possible Extensions of a Clause (capital letters are variables)

Assume we are expanding this node

q(X, Z), p(X, Y)

What are the possible extensions using r/3 ?

r(X,X,X) r(Y,Y,Y) r(Z,Z,Z) r(1,1,1)

r(X,Y,Z) r(Z,Y,X) r(X,X,Y) r(X,X,1)

r(X,Y,A) r(X,A,B) r(A,A,A) r(A,B,1)

and many more …

Choose from: old variables, constants, new vars

Huge branching factor in our search!
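To see why the branching factor is huge, this small, hypothetical sketch enumerates candidate r/3 literals whose argument slots are filled from the clause's old variables, one constant, and a couple of fresh variables (the particular limits are made up for illustration):

```python
from itertools import product

# Hypothetical enumeration of candidate literals r(_, _, _) to add to a clause.
# Each argument slot can be an old variable, a known constant, or a new variable.

old_vars = ["X", "Y", "Z"]        # variables already in the clause
constants = ["1"]                 # constants the learner may use
new_vars = ["A", "B"]             # cap on fresh variables, for illustration only

terms = old_vars + constants + new_vars
extensions = [f"r({a},{b},{c})" for a, b, c in product(terms, repeat=3)]

print(len(extensions))            # 6^3 = 216 candidates for one 3-ary predicate
print(extensions[:4])             # ['r(X,X,X)', 'r(X,X,Y)', 'r(X,X,Z)', 'r(X,X,1)']
```

Allowing more constants, more fresh variables, and more predicates multiplies this set further, which is the branching-factor problem the slide points out.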


Example: ILP in the Blocks World

Consider this training set

[Figure: block configurations labeled POS and NEG]


Can you guess an FOPC rule?


Searching for a Good Rule (FOPC version; cap letters are vars)

(Figure: the corresponding search tree over candidate FOPC rules)

true → POS
  on(X,Y) → POS    blue(X) → POS    tall(X) → POS

Assume we have: tall(X), wide(Y), square(X), on(X,Y), red(X), green(X), blue(X), block(X)

POSSIBLE RULE LEARNED: If on(X,Y) ∧ block(Y) ∧ blue(X) Then POS

- hard to learn with fixed-length feature vectors!


Covering Algorithms (learn a rule, then recur; so disjunctive)

[Figure: a scatter of + and – training examples, shown twice: once highlighting the examples covered by Rule 1, and once highlighting the examples still to cover, which are used to learn Rule 2]
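A minimal sketch of the covering loop, assuming the hypothetical greedy_rule_search and covers helpers from the earlier propositional sketch are already defined; it illustrates the general scheme rather than the lecture's exact algorithm:

```python
# Covering algorithm (sketch): learn one rule, drop the positives it covers,
# and repeat on what remains. The final model is the disjunction of the rules.
# Assumes greedy_rule_search and covers from the earlier sketch are defined.

def learn_rule_set(conditions, pos, neg, max_rules=10):
    rules, remaining_pos = [], list(pos)
    while remaining_pos and len(rules) < max_rules:
        rule = greedy_rule_search(conditions, remaining_pos, neg)
        newly_covered = [e for e in remaining_pos if covers(rule, e)]
        if not newly_covered:          # rule makes no progress; stop
            break
        rules.append(rule)
        remaining_pos = [e for e in remaining_pos if not covers(rule, e)]
    return rules    # an example is predicted POS if ANY learned rule covers it
```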


Using Background Knowledge (BK) in ILP

• Now consider adding some domain knowledge about the task being learned

• For example (sketched in code below)

    If Q, R, and W are all true
    Then you can infer Z is true

• Can also do arithmetic, etc in BK rule bodies

If SOME_TRIG_CALCS_OUTSIDE_OF_LOGIC Then openPassingLane(P1, P2, Radius, Angle)
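As a small, purely illustrative sketch of the first example above (the names Q, R, W, Z come from the slide; the function and data structures are invented), BK can be thought of as code that derives extra facts about an example before rule search begins:

```python
# Hypothetical sketch: background knowledge that derives the feature Z
# from the facts already known about an example.

def apply_background_knowledge(facts):
    """If Q, R, and W all hold for an example, infer Z as well."""
    derived = set(facts)
    if {"Q", "R", "W"} <= derived:
        derived.add("Z")
    return derived

example = {"Q", "R", "W", "B"}
print("Z" in apply_background_knowledge(example))   # True
```

The derived Z then becomes just another candidate condition during rule search, as the next slide shows.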


Searching for a Good Rule using Deduced Features (eg, Z)

(Figure: the same general-to-specific search tree, now with the deduced feature Z available as a condition)

P is always true
  One condition:  P if A    P if B    P if C    P if Z
  Two conditions: P if B and C    P if B and D    P if B & Z

Note that more BK can lead to slower learning!
But hopefully less search depth needed


Controlling the Search for a Good Rule

• Choose a ‘seed’ positive example, then only consider properties that are true about this example

• Specify argument types and whether arguments are ‘input’ (+) or ‘output’ (-)

  – Only consider adding a literal if all of its input arguments are already present in the rule

  – For example

      enemies(+person, -person)

    Only if a variable of type PERSON is already in the rule [eg, murdered(person)], consider adding that person’s enemies (see the short sketch below)
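Here is a small, hypothetical sketch of that mode-declaration filter (the data structures and names are invented for illustration): a candidate literal is admissible only if, for every '+' argument, the rule already contains a variable of the required type.

```python
# Hypothetical mode-declaration filter for candidate literals.
# A mode maps a predicate name to (direction, type) pairs,
# e.g. enemies(+person, -person).

modes = {
    "enemies": [("+", "person"), ("-", "person")],
    "murdered": [("+", "person")],
}

def admissible(pred, rule_var_types):
    """Allow adding pred only if every '+' argument's type is already
    bound by some variable in the rule."""
    return all(arg_type in rule_var_types
               for direction, arg_type in modes[pred]
               if direction == "+")

# Rule so far: murdered(P1) -- it binds one variable of type person.
print(admissible("enemies", {"person"}))   # True: P1 can fill the +person slot
print(admissible("enemies", set()))        # False: no person variable bound yet
```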


Formal Specification of the ILP Task

Given   a set of pos examples (P)

        a set of neg examples (N)

        some background knowledge (BK)

Do      induce additional knowledge (AK) such that

        BK ∧ AK allows all/most in P to be proved

        BK ∧ AK allows none/few in N to be proved

Technically, the BK also contains all the facts about the pos and neg examples plus some rules


ILP Wrapup

• Use best-first search with a large beam

• Commonly used scoring function: #posExCovered – #negExCovered – ruleLength (see the short sketch after this list)

• Performs ML without requiring fixed-length feature vectors

• Produces human-readable rules (straightforward to convert FOPC to English)

• Can be slow due to large search space

• Appealing ‘inner loop’ for probabilistic logic learning
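For concreteness, the scoring function above amounts to something like the following hypothetical snippet (reusing the covers helper from the earlier propositional sketch):

```python
# Hypothetical rendering of the slide's scoring function:
# reward positives covered, penalize negatives covered and long rules.
def rule_score(rule, pos, neg):
    return (sum(covers(rule, e) for e in pos)      # #posExCovered
            - sum(covers(rule, e) for e in neg)    # #negExCovered
            - len(rule))                           # ruleLength
```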


COLT: Probably Approximately Correct (PAC) Learning

PAC theory of learning (Valiant ’84)

Given
  C       class of possible concepts
  c ∈ C   target concept
  H       hypothesis space (usually H = C)
  ε, δ    correctness bounds
  N       polynomial number of examples


Probably Approximately Correct (PAC) Learning

• Do: with probability ≥ 1 - δ, return an h in H whose accuracy is at least 1 - ε

• Do this for any probability distribution for the examples

• In other words

Prob[error(h, c) > ε] < δ

[Figure: the target concept c and the hypothesis h drawn as overlapping regions; the shaded regions, where they disagree, are where errors occur]


How Many Examples Needed to be PAC?

Consider finite hypothesis spaces

Let Hbad ≡ { h1, …, hz }

• The set of hypotheses whose (‘test set’) error is > ε

• Goal: with high prob, eliminate all items in Hbad via (noise-free) training examples


How Many Examples Needed to be PAC?

How can an h look bad, even though it is correct on all the training examples?

• If we never see any examples in the shaded regions

• We’ll compute an N s.t. the odds of this are sufficiently low (recall, N = number of examples)

[Figure: the same c and h diagram as on the previous slide]


Hbad

• Consider H1 ∈ Hbad and ex ∈ { N }, where { N } is the set of N training examples

• What is the probability that H1 is consistent with ex?

  Prob[consistentA(ex, H1)] ≤ 1 - ε   (since H1 is bad, its error rate is at least ε)


Hbad (cont.)

What is the probability that H1 is consistent with all N examples?

Prob[consistentB({ N }, H1)] ≤ (1 - ε)^|N|

(by iid assumption)


Hbad (cont.)

What is the probability that some member of Hbad is consistent with the examples in { N } ?

Prob[consistentC({N}, Hbad)]

  = Prob[consistentB({N}, H1) ∨ … ∨ consistentB({N}, Hz)]

  ≤ |Hbad| × (1 - ε)^|N|   // P(A ∨ B) = P(A) + P(B) - P(A ∧ B); ignore the last term in an upper-bound calc

  ≤ |H| × (1 - ε)^|N|      // since Hbad ⊆ H


Solving for #Examples, |N|

We want

  Prob[consistentC({N}, Hbad)] ≤ |H| × (1 - ε)^|N| < δ

Recall that we want the prob of a bad concept surviving to be less than δ, our bound on learning a poor concept

Assume that if many consistent hypotheses survive, we get unlucky and choose a bad one (we’re doing a worst-case analysis)


Solving for |N| (number of examples needed to be confident of getting a good model)

Solving,

  |N| > [ ln(1/δ) + ln(|H|) ] / -ln(1 - ε)

Since ε ≤ -ln(1 - ε) over [0, 1), we get

  |N| > [ ln(1/δ) + ln(|H|) ] / ε
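Filling in the algebra that the slide skips (a standard derivation, written in LaTeX for clarity; it starts from the bound |H|(1-ε)^|N| < δ established on the previous slide):

```latex
\[
  |H|\,(1-\epsilon)^{|N|} < \delta
  \;\Longleftrightarrow\;
  |N|\,\ln(1-\epsilon) < \ln\delta - \ln|H|
  \;\Longleftrightarrow\;
  |N| > \frac{\ln(1/\delta) + \ln|H|}{-\ln(1-\epsilon)}
\]
\[
  \text{and since } \epsilon \le -\ln(1-\epsilon) \text{ on } [0,1),\qquad
  |N| \ge \frac{\ln(1/\delta) + \ln|H|}{\epsilon} \text{ examples suffice.}
\]
```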

(Aside: notice that this calculation assumed we could always find a hypothesis that fits the training data)

Notice we made NO assumptions about the prob dist of the data (other than it does not change)


Example: Number of Instances Needed

Assume
  F = 100 binary features
  H = all (pure) conjuncts
      [3^|F| possibilities (i.e., use f_i, use ¬f_i, or ignore f_i), so ln |H| = |F| ln 3 ≈ |F|]
  ε = 0.01
  δ = 0.01

N = [ ln(1/δ) + ln(|H|) ] / ε = 100 × [ ln(100) + 100 ] ≈ 10^4

But how many real-world concepts are pure conjuncts with noise-free training data?
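A tiny sketch of this calculation, using the consistent-learner bound from the previous slide (the function name is invented for illustration):

```python
import math

def pac_sample_bound(log_H, epsilon, delta):
    """Examples sufficient to PAC-learn a finite hypothesis space H in the
    consistent-learner case: N >= (ln(1/delta) + ln|H|) / epsilon."""
    return (math.log(1.0 / delta) + log_H) / epsilon

F = 100                              # binary features
log_H = F * math.log(3)              # |H| = 3^F pure conjuncts, so ln|H| = F ln 3
print(pac_sample_bound(log_H, epsilon=0.01, delta=0.01))   # ~11,447, i.e. about 10^4
```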


Agnostic Learning

• So far we’ve assumed we knew the concept class - but that is unrealistic on real-world data

• In agnostic learning we relax this assumption

• We instead aim to find a hypothesis arbitrarily close (i.e., < ε error) to the best* hypothesis in our hypothesis space

• We now need |N| ≥ [ ln(1/δ) + ln(|H|) ] / 2ε²
  (the denominator had been just ε before)

* i.e., closest to the true concept
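Where the 2ε² comes from (a standard fact, added here for context rather than taken from the lecture slides): Hoeffding's inequality bounds the chance that a single hypothesis's empirical error looks ε better than its true error, and a union bound over H then gives the sample size:

```latex
\[
  \Pr\!\bigl[\hat{e}(h) \le e(h) - \epsilon\bigr] \le e^{-2\epsilon^{2}|N|}
  \quad\Longrightarrow\quad
  |H|\,e^{-2\epsilon^{2}|N|} < \delta
  \;\Longleftrightarrow\;
  |N| > \frac{\ln|H| + \ln(1/\delta)}{2\epsilon^{2}} .
\]
```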


Two Senses of Complexity

Sample complexity (number of examples needed)

vs.

Time complexity (time needed to find h ∈ H that is consistent with the training examples)

- in CS, we usually only address time complexity


Complexity (cont.)

– Some concepts require a polynomial number of examples, but an exponential amount of time (in the worst case)

– Eg, optimally training neural networks is NP-hard (recall BP is a ‘greedy’ algorithm that finds a local min)


Some Other COLT Topics

• COLT + clustering, + k-NN, + RL, + SVMs, + ANNs, + ILP, etc.

• Average case analysis (vs. worst case)

• Learnability of natural languages (language innate?)

• Learnability in parallel


Summary of COLT

Strengths

• Formalizes learning task

• Allows for imperfections (eg, ε and δ in PAC)

• Work on boosting is an excellent case of ML theory influencing ML practice

• Shows what concepts are intrinsically hard to learn (eg, k-term DNF*)

* though a superset of this class is PAC learnable!


Summary of COLT

Weaknesses

• Most analyses are worst case

• Hence, bounds are often much higher than what works in practice (see the Domingos article assigned early this semester)

• Use of ‘prior knowledge’ not captured very well yet
