Topics in Artificial Intelligence Lecture outline - McGill CIM (dudek/424/lectures-all.pdf)


Topics in Artificial Intelligence

308-424A

Gregory Dudek

Office: MC 404

Lecture outline

Introductions

Administrative details

What is AI?

What the course will contain.

Overview of AI sub-topics.

Overview of AI applications.

What is AI?

Is artificial intelligence about solving problems, or about core scientific problems?

If we are working on artificial intelligence, how concerned do we have to be about the natural kind?

How can we decide that we have solved the problem?

Answer the following questions:

• What is Intelligence?

• What is Artificial Intelligence?

• When do you expect us to achieve artificial intelligence (already, soon, in 10 years, never)?

AI is about duplicating what the (human) brain DOES.

AI is about duplicating what the (human) brain SHOULD do.

Is AI computer science?

• Yes: it deals with algorithms, efficiency, tractability, etc.

• No: it includes philosophy, cognitive science, and engineering

• Maybe: we don't yet know how to define the area or the techniques. Maybe it's a field in its own right?

• Who cares: the problems are exciting and important; isn't this classification needless pedantry?

AI is truly interdisciplinary.

It relates to psychology, neurophysiology, mathematics, control theory (EE), etc.


Why AI (in its broadest sense) is the best part of science (a personal confession):

• Understanding the mind is one of the oldest and most challenging questions considered by modern science.

• It allows you to see an idea come to fruition in a tangible and useful way.

• You have wide latitude to select a preferred mix of theory, construction, data collection, and data analysis.

• It can have enormous potential practical impact.

Course contents: We will overview selected topics. This is not a comprehensive view of all of AI (it can't be).

Key topics include:

PART 1: Knowledge representation: predicate calculus; Search: A*, alpha-beta search

PART 2: Learning: a couple of flavors

PART 3: Perception: vision

What is AI today?

3 stereotypical components of actual systems:

Perception

Reasoning

Action

AI as a whole has fragmented: there are many sub-areas with reduced interaction between them. Perception, and vision in particular, has become a distinct community.

Robotics (i.e. action) has also become largely separate.

By and large, deliberative reasoning has held on to the title "(traditional) Artificial Intelligence".

Within each major branch, sub-areas have developed.

Within reasoning, different approaches have developed their own styles and even jargon. E.g. neural networks, learning, game playing, reasoning with uncertainty, randomized search.

Example AI system

Scheduling
• Perception: Trivial task description language.

• Reasoning: Constraint Satisfaction, Stochastic Optimization, Linear Programming, Genetic Algorithms

• Action: Trivial

Example AI system

Medical Diagnosis (e.g. Pathfinder by Heckerman at Microsoft)

• Perception: Symptoms, test results.

• Reasoning: Bayes Network inference, Machine Learning, Monte Carlo simulation

• Actions: Suggest tests, make diagnoses


There are big questions.....
• Can we make something that is as intelligent as a human?
• Can we make something that is as intelligent as a bee?
• Does intelligence depend on a model of the physical world?
• Can we get something that is really evolutionary and self-improving and autonomous and flexible....?

And little questions.....
• Can we save this plant $20 million a year by improved pattern recognition?
• Can we save this bank $50 million a year by automatic fraud detection?
• Can we start a new industry of handwriting recognition / software agents?

Historical Context
Computer-science-based AI is commonly agreed to have started with "the Dartmouth Conference" in 1956.

Some of the attendees:
• John McCarthy: LISP, time-sharing, application of logic to reasoning.

• Marvin Minsky: Popularized neural nets and showed limits of neural nets; slots and frames.

• Claude Shannon: Information Theory, Open-loop 5-ball juggling

• Allen Newell & Herb Simon: Bounded Rationality, Logic Theorist / General Problem Solver / SOAR

Historical context
• Reasoning was once seen as *the* AI problem.

• Chess, and related games, were once considered pivotal to understanding intelligence.
– They are now seen as a sub-domain of limited relevance to the bulk of AI research.
– While playing chess is a "solved problem", understanding how humans play chess (so well) is hardly solved at all.

• Vision (almost all of it) was once given to an MIT graduate student as a "summer project".
– More recently, a major figure said roughly: it is so hard that "if it were not for the human existence proof, we would have given up a long time ago".


Intelligence implies….

• Reasoning (plan)
– Modelling the world: objects and interactions
– Inferring implicit relationships
– Problem solving, search for an answer, planning

• Interaction with the outside world (sense & act)
– Perception: the inference of objects and relationships from what sensors deliver.
• Sensors deliver "arrays of numbers"

Early Chronology
• George Boole, Gottlob Frege, Alfred Tarski: human thought

• Alan Turing, John von Neumann, Claude Shannon:
– Cybernetics
– Equivalence/analogy between computation and thought !!!

• AI: The 40s and 50s
– McCulloch and Pitts: Described neural networks that could compute any computable function.
– Samuel: Checker-playing machine that learned to play better.
– "Dartmouth Conference" (1956): McCarthy coined the term "Artificial Intelligence".
– McCarthy: Defined LISP.
– Newell and Simon: The Logic Theorist. It was able to prove most of the theorems in Russell and Whitehead's Principia Mathematica. Bounded rationality; Logic Theorist becomes General Problem Solver.


• Early Successes

• Minsky: microworlds

• Evans's ANALOGY solved geometric analogy problems that appear on IQ tests

• Bobrow's STUDENT solved algebra word problems

• Gelernter: Geometry Theorem Prover used axioms plus diagram information.

Expert Systems and the Commercialization of AI

• Buchanan and Feigenbaum: DENDRAL (1969)

• MYCIN (1976): diagnose infections.

• LUNAR (1973): First natural language question/answer system used in real life

• Rejuvenation of neural nets
– In theory, they can learn almost any function.
• In practice, it might take a millennium.
– Neural nets, while having obvious limitations, have surpassed hand-crafted systems in some key domains.

State-of-the-Art

• Almost grandmaster chess (ask me about checkers).

• Real-time speech recognition

• Expert systems "aren't really AI anymore", but many exist

The Bad News
• Heavily oversold, with ensuing backlash.

• Almost every AI problem is NP-complete.
– Lighthill report (1973).

• Perceptrons (a kind of neural network) shown to have extremely limited representation ability (Minsky and Papert).

• Some of AI seen as poorly formalized hackery or as mathematical self-indulgence.

An early intelligent system: the brain

• "Intelligent" processing in the brain is carried out by neurons, mainly in the cerebral cortex.
– Roughly 10^12 neurons (10^11 if you participated in frosh week) and thousands of connections per neuron.
– "Clock speed" (refractory period): 1 to 10 milliseconds
– Processing involves massive parallelism and distributed representation.

Comparison

• Computers
– Roughly 10 million transistors per chip
– Parallel machines: hundreds of CPU elements, 10^10 bits of RAM
– Clock speed: roughly 1 nanosecond
– Recall rate (for stored data) appears much faster.

• Does the different hardware imply that fundamentally different approaches must be used?

What is intelligence?

• Stock answer: "the ability to learn and to solve problems" [Webster's]

– The ability to adapt to new situations.

• Your answers (paraphrased):
– "The ability to laugh at humorous situations."
– "Understanding how other agents behave…"
– "The ability to analyze and solve a problem" *
– "The ability to work with abstract concepts" *
– "The ability to recognize patterns…"


What does the future hold?

• Many of you thought artificially intelligent systems were a long way off.
– "Never"
– "In some timescale comparable to the evolution of intelligence in animals"
– "In 50 years"
– "Not very soon"

• Some of the same people said things like:
– "Intelligence is the ability to solve problems that would be complex for a human being"

• In 1997 the computer "Deep Blue" played the human world chess champion Garry Kasparov
– (whom some have claimed is the best chess player in history!)
– DB: roughly 200 million board positions per second.
• 11 ply
– GK: 7 ply?

Monty Newborn at SOCS/McGill has been a pioneer in computer chess. Moderated the …

Playing Chess

"We'll never really have artificial intelligence"

• Garry Kasparov:

"I could feel -- I could smell -- a new kind of intelligence across the table."

• Drew McDermott:

"Saying Deep Blue doesn't really think about chess is like saying an airplane doesn't really fly because it doesn't flap its wings."

• Robbins' problem:
– In 1932 E. V. Huntington presented a basis for Boolean algebra: commutativity, associativity, and the Huntington equation.
– Herbert Robbins conjectured it could be replaced by one simpler equation (the Robbins equation), leading (later) to Robbins algebras. Are all Robbins algebras Boolean algebras?
– Despite work by Robbins and Huntington and Tarski and others, no solution was found.

In November 1997, a computer solved the Robbins conjecture.

• First "creative" proof by computer.
– A qualitative difference from prior results based more heavily on exhaustive search, such as the four-color theorem:
• Any planar map can be colored using 4 colors so that no two edge-adjacent regions have the same color.
• Proven in 1976 with a combination of human effort and "sophisticated computing" that enumerated cases.

Learning

• Backgammon: TD-gammon [Tesauro]
– Plays world-champion level backgammon.
– Learns suitable strategies by playing games against itself.
– Plays millions of games.
– Based on a neural network trained using "backpropagation": incremental changes based on observed errors.

• The method has not generalized too well to other domains.


Important challenges

• Domain specificity:
– Successful systems are restricted to narrow domains and specific tasks.

• Coping with noisy data
– Most successes have been in domains where the objectives and the "rules" were closely specified and formalized.

• Incorporation of commonsense knowledge
– Does every little thing have to be encoded or derived explicitly?

Domain specificity

• Natural language systems work "well" only when the "domain of discourse" is restricted.

• If not, things get very hard very fast.

• Consider these alternative meanings of "give":
– John gave Pete a book. [tangible object delivered]
– John gave Pete a hard time. [mode of behavior]
– John gave Pete a black eye. [specific action]

Problem progression in Speech

• "Word spotting" is good today. Key is to ignore all of an utterance except keywords of interest.

• Speaker-dependent continuous speech: quite good.

• Speaker-independent continuous speech is getting good. Works well with a limited vocabulary.

Topic 2: Propositional logic

How do we explicitly represent our knowledge about the world?

References: Dean, Allen, Aloimonos, Chapter 3; Russell and Norvig, Chapter 6

One of two or three logical languages we will consider.

Logical languages are analogous to programming languages: systems for describing knowledge that have a rigid syntax.

Logical languages (unlike programming languages) emphasize syntax. In principle, the semantics is irrelevant (in a narrow sense).

Knowledge Representation

• Most programs are a set of procedures that accomplish something using rules and "knowledge" embedded in the program itself.

• This is an example of implicitly encoded information.
– If you want to change the way Microsoft Word implements variables in macros, you have to hack the code.
– When my tax program needs to be upgraded for …

Explicit knowledge

When we encode rules in a separate rule book or Knowledge Base (KB), we have explicitly encoded (some of) the information of interest.

i.e. the rules are separate from the procedures for interpreting them.
– Explicit knowledge encoding, in general, makes it easier to update and manipulate (assuming …)


Knowledge and reasoning

Objective: to explicitly represent knowledge about the world.
– So that a computer can use it efficiently….
• Simply to use the facts we have encoded
• To make inferences about things it doesn't know yet
– So that we can easily enter facts and modify our knowledge base.

• The combination of a formal language and a …

Wff’s

• In practice, with logical languages we combine symbols to express truths, or relationships, about the world.

• If we put the symbols together in a permitted way, we get a well-formed formula or wff.

• A proposition is another term for an allowed formula.

• A propositional variable is a proposition that is atomic: that is, it cannot be subdivided into other (smaller) propositions.

Terminology

• A set of wffs connected by ANDs is a conjunction.

• A set of wffs connected by ORs is a disjunction.

• Literals are plain propositional variables, or their negations: P and ¬P.

Semantics
• We attach meaning to wffs in 2 steps: 1. By …

Discovering “new” truths

• Want to be able to generate new sentences that must be true, given the facts in the KB.

• Generation of new true sentences from the KB is called entailment.

• We do this with an inference procedure.
• If the inference procedure works "right", we only get entailed sentences. Then the procedure is called sound.
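For small propositional knowledge bases, entailment can be checked directly by enumerating truth assignments. A minimal Python sketch (the `entails` function and the encoding of formulas as functions over an assignment are illustrative choices, not notation from the course):

```python
from itertools import product

def entails(kb, alpha, symbols):
    """KB |= alpha iff alpha is true in every model of the KB."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(f(model) for f in kb) and not alpha(model):
            return False  # found a model of the KB in which alpha fails
    return True

# KB: {P, P => Q}.  The KB entails Q (modus ponens) but not "not Q".
kb = [lambda m: m["P"], lambda m: (not m["P"]) or m["Q"]]
assert entails(kb, lambda m: m["Q"], ["P", "Q"])
assert not entails(kb, lambda m: not m["Q"], ["P", "Q"])
```

This enumeration is exponential in the number of propositional variables, which is why the complexity of satisfiability matters later in these notes.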

Knowing about knowing

• We would like to have knowledge both about the world, as well as about the state of our own knowledge (i.e. meta-knowledge).

• Ontological commitments refer to the guarantees given by our logic and KB regarding the real world.

• Epistemological commitments relate to the states of knowledge, or kinds of knowledge, that we can represent.

A particular set of truth assignments associated with propositional variables is a model IF THE ASSOCIATED FORMULA (or formulae) comes out with the value true.

e.g. For the formula

(A and B) implies (C and D)

the assignment

A=true B=true C=true D=true

is a model.

The assignment

A=false B=true C=true D=true

is also a model: the antecedent (A and B) is false, so the implication holds.
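These checks can be reproduced by encoding the formula as a Python function of an assignment (a hypothetical encoding, just for illustration):

```python
def formula(m):
    # (A and B) implies (C and D), written with the standard
    # equivalence:  X => Y  is  (not X) or Y
    return (not (m["A"] and m["B"])) or (m["C"] and m["D"])

# The two assignments discussed above are both models:
assert formula({"A": True,  "B": True, "C": True, "D": True})
assert formula({"A": False, "B": True, "C": True, "D": True})
# ...but a true antecedent with a false consequent is not:
assert not formula({"A": True, "B": True, "C": False, "D": True})
```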


Satisfiability

• If *no model is possible* for a formula, then the formula is NOT SATISFIABLE, otherwise it is satisfiable.

• A theory is a set of formulae (in the context of propositional logic).

• If no model is possible for the negation of a formula, then we say the original formula is valid (a formula that is always true is also called a tautology).
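Both definitions can be tested by brute-force enumeration of assignments. A small Python sketch (the helper names `models`, `satisfiable`, and `valid` are illustrative):

```python
from itertools import product

def models(f, symbols):
    """All truth assignments (models) that satisfy formula f."""
    return [dict(zip(symbols, vals))
            for vals in product([True, False], repeat=len(symbols))
            if f(dict(zip(symbols, vals)))]

def satisfiable(f, symbols):
    return len(models(f, symbols)) > 0

def valid(f, symbols):
    # f is valid iff its negation has no model
    return not satisfiable(lambda m: not f(m), symbols)

assert satisfiable(lambda m: m["P"] and not m["Q"], ["P", "Q"])
assert not satisfiable(lambda m: m["P"] and not m["P"], ["P"])  # contradiction
assert valid(lambda m: m["P"] or not m["P"], ["P"])             # tautology
```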



Completeness
• The set of steps used by a sound procedure to generate new sentences is a proof.

• If it is possible to find a proof for any sentence that is entailed, then the inference procedure is complete.

• A set of rules is refutation complete if: whenever a set of sentences cannot be satisfied, resolution will derive a contradiction, i.e. we can derive both P and not(P) for some variable P.

• Effective: can get an answer in a finite number of steps.


Rules of Inference

• Modus Ponens: from α ⇒ β and α, infer β
• And-Elimination: from α ∧ β, infer α
• Or-Introduction: from α, infer α ∨ β
• Double-Negation Elimination: from ¬¬α, infer α
• Unit Resolution: from α ∨ β and ¬β, infer α

Complexity

• Determination of the satisfiability of an arbitrary formula is a key hard problem. It is in the class of NP-complete problems.

• Except:
– For a formula in CNF, if each disjunct (clause) has only 2 literals, we can "efficiently" determine satisfiability.

• Note: A is valid only if not(A) is not satisfiable.
– Thus validity is a hard question too.

Automated Theorem Proving

• Assume proper axioms of the form

(P1 ∧ P2 ∧ … ∧ Pn) ⇒ Q

• A fact is a propositional variable that is given.

• If we want to prove goal Q, we can do that by proving (P1 ∧ P2 ∧ … ∧ Pn).
– Q is reduced to (P1 ∧ P2 ∧ … ∧ Pn).
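The goal-reduction step described above can be sketched as a tiny backward chainer in Python (the encoding of rules as premise-set/conclusion pairs is an assumption for illustration, not the course's notation):

```python
def prove(goal, facts, rules, seen=frozenset()):
    """Backward chaining: a goal Q holds if it is a given fact, or if some
    rule (premises => Q) has all of its premises provable."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular goal reduction
        return False
    return any(all(prove(p, facts, rules, seen | {goal}) for p in premises)
               for premises, conclusion in rules
               if conclusion == goal)

# (P1 and P2) => Q, with P1 and P2 given as facts:
rules = [({"P1", "P2"}, "Q")]
assert prove("Q", {"P1", "P2"}, rules)
assert not prove("Q", {"P1"}, rules)
```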

Predicate Calculus

• Also known as first order logic .

• A formal system with a "world" made up of
– Objects
– Properties of objects
– Relations between objects.
– Adds quantification over objects to propositional logic.

• Note: second order logic includes quantification over classes.

FOL components

• Relations can be functions

Hair_color_of()

Is_student()

Took_ai424()

But they don’t have to be

Son_of()

Owns_CD_titled()

FOL terminology

• Terms: represent objects, can be constants or expressions.

• Predicate symbols: a relation (sometimes functional).

• Sentences: as with propositional logic

• Arity: number of arguments to a relation

• Atomic sentence: predicate symbols and terms
Owns_printer_model(brother_of(Sue), HP_D550)


Complexity of ATP in FOL?

• First order logic is universal.
– Any inference or computation we know of can be described.
• We can describe the operation of a Turing machine.
– Thus, entailment is semidecidable.
• We can't tell if a computation halts except by running it and waiting… maybe forever.

• Much effort has gone into restricting FOL to assure it is decidable.
– It still may be "exponentially difficult".

Lecture 4

308-424A

Topics in AI

Gregory Dudek

Predicate calculus cont’d

• Review & new stuff (in PDF form).

Clausal Form

• Any predicate calculus proposition can be converted into clauses in 6 steps:
– Removing implications
– Moving negation inwards
– Skolemising
– Moving universal quantifiers outwards
– Distributing OR over AND (for CNF)
– Putting into clauses (notation only)

• Read Dean, Allen, Aloimonos Sec 3.5, pp. 96
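For the propositional fragment (no quantifiers, so no Skolemising or quantifier movement), the implication-removal, negation-pushing, and distribution steps can be sketched in Python. The nested-tuple encoding of formulas is a hypothetical choice for illustration:

```python
# Formulas as nested tuples: ("imp", a, b), ("and", a, b), ("or", a, b),
# ("not", a), or a plain string for an atomic proposition.

def elim_imp(f):
    """Step 1: rewrite a => b as (not a) or b."""
    if isinstance(f, str):
        return f
    if f[0] == "imp":
        return ("or", ("not", elim_imp(f[1])), elim_imp(f[2]))
    if f[0] == "not":
        return ("not", elim_imp(f[1]))
    return (f[0], elim_imp(f[1]), elim_imp(f[2]))

def push_not(f):
    """Step 2: move negation inwards (De Morgan, double negation)."""
    if isinstance(f, str):
        return f
    if f[0] == "not":
        g = f[1]
        if isinstance(g, str):
            return f
        if g[0] == "not":
            return push_not(g[1])
        if g[0] == "and":
            return ("or", push_not(("not", g[1])), push_not(("not", g[2])))
        if g[0] == "or":
            return ("and", push_not(("not", g[1])), push_not(("not", g[2])))
    return (f[0], push_not(f[1]), push_not(f[2]))

def distribute(f):
    """Step 5: distribute OR over AND to reach CNF."""
    if isinstance(f, str) or f[0] == "not":
        return f
    a, b = distribute(f[1]), distribute(f[2])
    if f[0] == "or":
        if isinstance(a, tuple) and a[0] == "and":
            return ("and", distribute(("or", a[1], b)), distribute(("or", a[2], b)))
        if isinstance(b, tuple) and b[0] == "and":
            return ("and", distribute(("or", a, b[1])), distribute(("or", a, b[2])))
    return (f[0], a, b)

def to_cnf(f):
    return distribute(push_not(elim_imp(f)))

# P => (Q and R)  becomes  (not P or Q) and (not P or R)
assert to_cnf(("imp", "P", ("and", "Q", "R"))) == \
    ("and", ("or", ("not", "P"), "Q"), ("or", ("not", "P"), "R"))
```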

Getting Horn(y)

• We have observed an equivalence between arbitrary sentences in FOL and CNF.
– This extends to HORN CLAUSES.

• Thus, we can use Horn clauses to express anything in FOL.

• The PROLOG language uses Horn clausesexplicitly as its notation.

PROLOG

• PROLOG: a logic programming language.
– Name derives from PROgramming in LOGic

• Based on theorem proving, first-order logic.

• Small, unusual, influential language.

• Developed in the 1970s.

• On-line: documentation and executables for SOCS (lisa) and home PCs.


PROLOG notation

, (comma)      AND
; (semicolon)  OR
:-             IMPLIES
not            NOT

Variables start with uppercase

Predicates & bound variables in lower case

Facts & rules

• In Prolog we can state facts like "natasha likes nicholas" by defining a suitable predicate:
– likes(natasha, nicholas).

• We can define rules that allow inference.

• Uses the closed world assumption: anything is false unless it is provably true.

Rules

• Rules:
– One predicate as conclusion.
• Implication works to the left.
• The left-hand predicate must be a positive literal.
– Resolution and unification are the "internal" mechanisms.

• Prolog is based on satisfying goals using aresolution theorem prover.

PROLOG Examples

likes(X,dudek)

likes(Everybody,cs424)
Everybody likes cs424

likes(richard,X), likes(eric,X)
Things liked by both richard and eric

likes(phil,X) :- likes(eric,X)

If eric likes something, so does phil.

The Montreal Student Domain

goodstudent(X) :- awakeinclass(X), csstudent(X).

csstudent(X) :- smart(X), (adventurous(X) ; sensible(X)).

adventurous(X) :- ( montrealer(X) ; rockclimber(X) ).

awakeinclass(X) :- drinks(X,Y), hasdrug(Y,Z), stimulant(Z).

smart(X) :- not(rockclimber(X)), reader(X).

Facts about people.

• montrealer(jane).

• smart(jane).

• nerd(jane).

• drinks(jane,coffee).

• montrealer(bob).

• nerd(bob).

• drinks(bob,sprite).

• owns(teapot,ted).


More facts...

• reader(ted).

• reader(mary).

• reader(helen).

• fatherof(mary,ted).

• drinks(helen,sprite).

hasdrug(tea,caffeine).
hasdrug(tea,tannin).
hasdrug(tea,theobromine).
hasdrug(coffee,caffeine).
hasdrug(coffee,oil).
hasdrug(quat,foo).
hasdrug(sprite,sugar).
stimulant(caffeine).
% stimulant(theobromine).

Simple stuff

• reader( ted ).

yes

• reader(X).

X = ted ;

X = mary ;

X = helen ;

no

On-line examples...

Run prolog

This is “open prolog” for the Mac.

Lecture 5

G. Dudek

Topics in AI

McGill University

Today’s lecture

• Administrative issues
– Comments on assignment

– PDF files

– Class notes

• Knowledge representation: wrap-up
– Prolog details

– Non-monotonic logic

– Forward and backward chaining

• Introduction to search

Don’t care

• Symbol _ (underscore) is used to match an argument that we don't plan to use on the right-hand side.

• It's like a dummy variable.

E.g. likes(a,b).
Would return true no matter what a & b are.

We can use likes(a,_).


Prolog (continued)

• Supports lists of items

[] - empty list

[1,2,3] - 3 items

[bob, ted, alice] - three objects

[[a], [1,2,3], []] - a list of lists

To examine a sub-part,

[ H | L ]

refers to a list decomposed into a head H (the first element) and a tail L (the rest of the list).

Lists: [H|T]

[1,2,3] - head is 1

[bob, ted, alice] - head is bob

[ [a], [1,2,3], [ ] ] - head is [a]

[ [ [1,2,3]], [1,2,3], [ ] ] - head is [ [1,2,3] ]

[ ] - cannot match

Testing membership

• Now we can easily define a predicate to test for list membership.

• Step 1: the head

member(H,[H|L]).
• First argument is an item.
• Second argument is a list.
– This matches if H is the head of the list.

member(bob,[bob,alice]) unifies with member(H,[H|L]) if we let bob match H and [bob,alice] match [H|L].

Membership: the body.

• Step 2: if it's not the head, then there must be a sublist for which it is the head.
– Recursive definition
• See if the item is the head of the tail portion.

member(Item, [Head|Tail]) :- member(Item, Tail).

Membership: complete

member(Item, [ Item | _ ]).

member(Item, [ _ | Rest ]) :- member(Item, Rest).
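The two Prolog clauses above translate almost line-for-line into a recursive Python function (shown only as an analogy; Prolog's version also enumerates members on backtracking, which this boolean sketch does not):

```python
def member(item, lst):
    """Mirror of the two Prolog clauses: succeed if item matches the
    head, otherwise recurse on the rest of the list."""
    if not lst:
        return False  # [] matches neither clause
    head, *rest = lst
    return head == item or member(item, rest)

assert member("bob", ["bob", "alice"])
assert member(3, [1, 2, 3])
assert not member("x", [])
```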

Unification: examining combinations

• Remember: prolog execution proceeds by repeated unifications, applied recursively.

• Consider:
– foo(X) :- bar(X).
– bar(X) :- foo(X).
– This will lead to a problem: the unification will not terminate!


Recursion fix

• How can we fix the infinite recursion?
– Never re-examine an already-considered unifier (i.e. solution).

1. Within the definition, save the previous solutions (unifications).

2. Check if the new unifier (solution) is one of those.

How?

Use a list!

Improved foo!

foo(y,[]).

member …

foo(X,L) :- not(member(X,L)), bar(X,[X|L]).

bar(X,L) :- not(member(X,L)), foo(X,[X|L]).

foo(a).
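The same trick, carrying a list of already-seen "solutions", can be mimicked in Python with mutually recursive functions (the names foo/bar follow the slide; the rest is an illustrative sketch, not a real unifier):

```python
def foo(x, seen=()):
    if x in seen:          # not(member(X, L)) failed: stop recursing
        return False
    return x == "y" or bar(x, seen + (x,))  # foo(y, []) is the base fact

def bar(x, seen=()):
    if x in seen:
        return False
    return foo(x, seen + (x,))

assert foo("y")        # succeeds via the base fact
assert not foo("a")    # fails finitely instead of recursing forever
```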

Concept Description Language

• A specialized language for efficient inference.

• Represent
– classes of objects,
– sub-classes of classes,
– instances of classes,
– properties of instances (and classes).

• Akin to inheritance in object-oriented programming.

Nonmonotonic Logic

A monotonic logic: Things that are theorems remain theorems as we add additional axioms.

Non-monotonic logic: formulas that were once theorems may not remain so as the theory is augmented.

• Idea: add "default" assumptions to the theory in the absence of complete knowledge.

• These assumptions may be retracted.

Minimal (nonmonotonic) models

• Can induce a preference ordering on interpretations.

Deductive retrieval
• Deductive retrieval uses a KB to store information, and uses rules to achieve goals.

• A system for maintaining a knowledge base. Includes retracting conclusions based on information that changes or which is deleted.

• Uses forward chaining and backward chaining.
– Forward: start from the KB and see what can be inferred (leading to the goal, we hope). Especially happens when a question involves new knowledge.
• Q. if pigs fly then will I pass?
– Backward: goal-directed deduction, working from the goal back toward known facts.
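Forward chaining as described can be sketched in a few lines of Python (the fact/rule encoding is an assumption for illustration):

```python
def forward_chain(facts, rules):
    """Repeatedly fire any rule whose premises are all known,
    until no new conclusions appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# "if pigs fly then will I pass?" as a one-rule KB:
rules = [({"pigs_fly"}, "i_pass")]
assert "i_pass" in forward_chain({"pigs_fly"}, rules)
assert "i_pass" not in forward_chain(set(), rules)
```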


Search

• Reference: DAA Chapter 4.

• The process of explicitly examining a set of objects to find a particular one, or satisfy some goal.

• In this context, the objects are typically possible configurations of some problem representation.

• Search is a central topic in AI.
– Originated with Newell and Simon's work on problem solving. Famous book: ``Human Problem Solving'' (1972)

• Automated reasoning is a natural search task. More recently: Given that almost all AI formalisms (planning, learning, etc.) are NP-complete or worse, some form of search is unavoidable.

Problem Definition

• [State space] -- described by an initial state and the set of possible actions available (operators).

• A path is any sequence of actions that lead from one state to another.

• [Goal test] -- applicable to a single state to determine if it is the goal state.

• [Path cost] -- relevant if more than one path leads to the goal, and we want the cheapest one.

Example I : Cryptarithmetic

SEND

+ MORE

--------

MONEY

• Find a substitution of digits for letters such that the resulting sum is arithmetically correct.

Cryptarithmetic, cont.

• [States:] a (partial) assignment of digits to letters.

• [Operators:] the act of assigning digits to letters.

• [Goal test:] all letters have been assigned digits and the sum is correct.

• [Path cost:] zero. All solutions are equally valid.
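Treating the puzzle as blind search over complete assignments, a brute-force Python sketch (the function name is illustrative) simply tries digit permutations:

```python
from itertools import permutations

def solve_send_more_money():
    """Blind search over digit assignments for SEND + MORE = MONEY."""
    for s, e, n, d, m, o, r, y in permutations(range(10), 8):
        if s == 0 or m == 0:  # no leading zeros
            continue
        send = 1000*s + 100*e + 10*n + d
        more = 1000*m + 100*o + 10*r + e
        money = 10000*m + 1000*o + 100*n + 10*e + y
        if send + more == money:
            return send, more, money
    return None

# The puzzle has a unique solution: 9567 + 1085 = 10652.
assert solve_send_more_money() == (9567, 1085, 10652)
```

Enumerating all 10!/2! digit assignments like this is exactly the kind of blind search the next slide names; informed methods prune most of it.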

Familiar ideas….

• Blind search: just examine "successive" possibilities.

• Depth first search (DFS)

• Breadth first search (BFS).
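DFS and BFS differ only in which frontier entry is expanded next. A sketch in Python, where the frontier is a deque and the pop policy is passed in (an illustrative framing, not the textbook's pseudocode):

```python
from collections import deque

def blind_search(graph, start, goal, pop):
    """Expand paths from a frontier; the pop policy decides DFS vs BFS."""
    frontier = deque([[start]])
    visited = set()
    while frontier:
        path = pop(frontier)
        state = path[-1]
        if state == goal:
            return path
        if state in visited:
            continue
        visited.add(state)
        for nxt in graph.get(state, []):
            frontier.append(path + [nxt])
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
bfs = blind_search(graph, "A", "D", lambda f: f.popleft())  # oldest first
dfs = blind_search(graph, "A", "D", lambda f: f.pop())      # newest first
assert bfs == ["A", "B", "D"]
assert dfs == ["A", "C", "D"]
```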


Lecture 6

• Conclusion of KR

• Search
– Blind search

Concept Description Language

• A specialized language for efficient inference.

• Represent:
– classes of objects,
– sub-classes of classes,
– instances of classes,
– properties of instances (and classes).

• Akin to inheritance in object-oriented programming.

Nonmonotonic Logic

A monotonic logic: things that are theorems remain theorems as we add additional axioms.

Non-monotonic logic: formulas that were once theorems may not remain so as the theory is augmented.

• Idea: add “default” assumptions to the theory in the absence of complete knowledge.

• These assumptions may be retracted as new information arrives.

Minimal (nonmonotonic) models

• Can induce a preference ordering on interpretations.

Deductive retrieval

• Deductive retrieval uses a KB to store information, and uses rules to achieve goals.

• A system for maintaining a knowledge base. Includes retracting conclusions based on information that changes or is deleted.

• Use forward chaining and backward chaining.
– Forward: start from the KB and see what can be inferred (leading to the goal, we hope). Especially useful when a question involves new knowledge.
• Q. If pigs fly, then will I pass?
– Backward: start from the goal and reason backward to facts in the KB.

Search

• Reference: DAA Chapter 4.

• The process of explicitly examining a set of objects to find a particular one, or satisfy some goal.

• In this context, the objects are typically possible configurations of some problem representation.


Search

• Search is a central topic in AI.
– Originated with Newell and Simon's work on problem solving. Famous book: ``Human Problem Solving'' (1972).

• Automated reasoning is a natural search task. More recently: given that almost all AI formalisms (planning, learning, etc.) are NP-complete or worse, some form of search is unavoidable.

State space

• In AI, search usually refers to search of a state space.

• State space: the ensemble of possible configurations of the domain of interest.
– Like phase space in physics.

• E.g.
– Chess: the set of allowed arrangements of pieces on a chess board.
– Speech understanding: the set of possible arrangements of words that make valid sentences.

Search: Problem Definition

State space -- described by an initial state and the set of possible actions available (operators).
– A path is any sequence of actions that leads from one state to another.

Goal test -- applicable to a single state to determine if it is a (the) goal state.

Path cost -- relevant if more than one path leads to the goal, and we want the cheapest one.

Graphs, Trees & Search

• We can visualize generic state space search in terms of searching a graph or tree.

• Graph search corresponds to looking for a particular state given an arbitrary transition diagram.
– A graph is defined as G = (V, E)
– V: set of vertices (i.e. states)
– E: set of transitions e_i = (v_j, v_k). Can be directed or undirected.

Traversal

• To traverse means to visit the vertices in some systematic order. You should be familiar with various traversal methods for trees:

preorder: visit each node before its children.

postorder: visit each node after its children.

inorder (for binary trees only): visit left subtree, node, right subtree.

Familiar ideas….

• Blind search: just examine “successive” alternative possibilities.

• Does not exploit knowledge of what states to examine first.
– In what order should we consider the states?

• Sequentially along a path? Leads to Depth First Search (DFS).


Depth First Search

• Key idea: pursue a sequence of successive states as long as possible.

    unmark all vertices
    choose some starting vertex x; mark x
    list L = x; tree T = x
    while L nonempty:
        remove the vertex v from the front of L
        visit v
        for each unmarked neighbour w of v:
            mark w; add w to the front of L (L acts as a stack)
            add the edge (v, w) to T
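As a concrete rendering of the pseudocode above, here is a minimal Python sketch; the graph is assumed to be a dict mapping each vertex to a list of neighbours (a hypothetical representation, not from the slides):

```python
def dfs(graph, start):
    """Depth-first search: the list L from the pseudocode is a stack,
    so newly discovered vertices are explored first."""
    marked = {start}
    stack = [start]          # the list L; new vertices go on the front
    order = []               # vertices in the order they are visited
    while stack:
        v = stack.pop()      # take from the front of L
        order.append(v)
        for w in graph.get(v, []):
            if w not in marked:
                marked.add(w)
                stack.append(w)
    return order
```

On the graph {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []}, starting at 'a', this visits a whole branch before backing up.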

Depth First Search

DFS illustrated

BFS

• Key: explore nodes at the same distance from the start at the same time.

    unmark all vertices
    choose some starting vertex x; mark x
    list L = x; tree T = x
    while L nonempty:
        remove the vertex v from the front of L
        visit v
        for each unmarked neighbour w of v:
            mark w; add w to the end of L (L acts as a queue)
            add the edge (v, w) to T
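The only change from DFS is that the list L becomes a FIFO queue; a minimal Python sketch, again assuming an adjacency-list dict:

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first search: newly discovered vertices go to the END of
    the list L, so vertices are visited in order of distance from start."""
    marked = {start}
    queue = deque([start])   # the list L, used as a queue
    order = []
    while queue:
        v = queue.popleft()  # take from the front of L
        order.append(v)
        for w in graph.get(v, []):
            if w not in marked:
                marked.add(w)
                queue.append(w)   # add to the end of L
    return order
```

On {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []} this visits both of a's neighbours before any vertex two steps away.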

Breadth First Search

BFS illustrated


Key issues in search

• Here’s what to keep in mind.

• Completeness: are we assured of finding a solution (if one exists)?

• Space complexity: how much storage do we need?

• Time complexity: how many operations do we need?

• Solution quality: how good is the solution we find?

Example I : Cryptarithmetic

• Find a substitution of digits for letters such that the resulting sum is arithmetically correct.

• Each letter must stand for a different digit.

      SEND
    + MORE
    ------
     MONEY

Cryptarithmetic, cont.

• States: a (partial) assignment of digits to letters.

• Operators: the act of assigning digits to letters.

• Goal test: all letters have been assigned digits and the sum is correct.

• Path cost: zero. All solutions are equally good.

• Solution method?

• Depth first search:
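One blind way to solve the puzzle is simply to enumerate digit assignments; a brute-force Python sketch (the no-leading-zero rule for S and M is the standard reading of the puzzle, and a real DFS would instead assign one letter at a time):

```python
from itertools import permutations

def solve_send_more_money():
    """Try every assignment of distinct digits to S,E,N,D,M,O,R,Y
    until SEND + MORE == MONEY."""
    for s, e, n, d, m, o, r, y in permutations(range(10), 8):
        if s == 0 or m == 0:        # no leading zeros
            continue
        send  = 1000*s + 100*e + 10*n + d
        more  = 1000*m + 100*o + 10*r + e
        money = 10000*m + 1000*o + 100*n + 10*e + y
        if send + more == money:
            return send, more, money
    return None
```

This finds the classic solution 9567 + 1085 = 10652; the point of the later heuristics material is that smarter search avoids enumerating most of these assignments.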

Search performance

• Key issues that determine the nature of the problem are:
– Branching factor of the search space: how many options do we have at any time?
• Typically summarized by the worst-case branching factor, which may be quite pessimistic.
– Solution depth: how long is the path to the first solution?

Example: knight’s tour

• A tour executed by a chess knight to cover (touch) every square on a chess board.
– Sub-problem: find a path from one position to another.

• States: possible positions on the chess board.

• Operators: the ways a knight moves.

• Goal: the positions of the pawns.

• Path cost: the number of moves used to get there.


Knight’s

• How large is the state space?
– A knight has up to 8 moves per turn.
– Each possible tour must be verified up to the end of the trip. For a board of width N, there are N*N squares.
• Thus, each tour can be up to N*N states in length.
– If the correct solution is found last, we might consider every wrong tour first.

O(8^(N*N)) states to examine!

Example: Knight’s heuristics

• Zero: a board can be considered a dead end if any square has zero paths remaining to it. Any square with no paths to it would be unreachable, so no Knight's Tour would exist.

• Two ones: a board with more than one square with only one path to it is a dead end. A square with one path left is necessarily a dead end, so two of them indicate a dead end position.

• How well will we do if we use “blind” DFS?

• That is, if we consider random tours?

• Three heuristics based on the number of paths remaining to each square were implemented and tested in combination, as well as a representational speedup. The optimal combination of the heuristics is to eliminate boards with either a square with zero paths remaining or a square with two ones remaining; this combination led to a 950-fold speedup on a 6x6 board. The representational speedup led to a 2.5-fold additional speedup on a 6x6 board.

– Michael Bernstein

Heuristics: Knight’s Performance


BFS

• Consider a state space with a uniform branching factor of b.

• At each level we have b nodes for every node we had before:

    1 + b + b^2 + b^3 + b^4 + … + b^d

So, a solution at depth d implies we must expand O(b^d) nodes!

• Internal nodes: (b^d − 1)/(b − 1)

Depth   Nodes     Time            Memory
0       1         1 millisecond   100 bytes
2       111       0.1 seconds     11 kilobytes
4       11,111    11 seconds      1 megabyte
6       10^6      18 minutes      111 megabytes
8       10^8      31 hours        11 gigabytes
10      10^10     128 days        1 terabyte
12      10^12     35 years        111 terabytes
14      10^14     3500 years      11,111 terabytes
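The node counts in the table follow the geometric series; a quick sketch (assuming branching factor b = 10, which matches the table's figures):

```python
def nodes_expanded(b, d):
    """Total nodes in a uniform tree of branching factor b, depth d:
    1 + b + b^2 + ... + b^d = (b^(d+1) - 1) / (b - 1)."""
    return (b**(d + 1) - 1) // (b - 1)
```

For example, nodes_expanded(10, 2) is 111 and nodes_expanded(10, 4) is 11,111, as in the table.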

Lecture 7

• Review

• Blind search

• Chess & search

Depth First Search

• Key idea: pursue a sequence of successive states as long as possible.

    unmark all vertices
    choose some starting vertex x; mark x
    list L = x; tree T = x
    while L nonempty:
        remove the vertex v from the front of L
        visit v
        for each unmarked neighbour w of v:
            mark w; add w to the front of L (L acts as a stack)
            add the edge (v, w) to T

BFS

• Key: explore nodes at the same distance from the start at the same time.

    unmark all vertices
    choose some starting vertex x; mark x
    list L = x; tree T = x
    while L nonempty:
        remove the vertex v from the front of L
        visit v
        for each unmarked neighbour w of v:
            mark w; add w to the end of L (L acts as a queue)
            add the edge (v, w) to T

Key issues in search

• Here’s what to keep in mind.

• Completeness: are we assured of finding a solution (if one exists)?

• Space complexity: how much storage do we need?

• Time complexity: how many operations do we need?

• Solution quality: how good is the solution we find?


Search performance

• Key issues that determine the nature of the problem are:
– Branching factor of the search space: how many options do we have at any time?
• Typically summarized by the worst-case branching factor, which may be quite pessimistic.
– Solution depth: how long is the path to the first solution?

Example: knight’s tour

• A tour executed by a chess knight to cover (touch) every square on a chess board.
– Sub-problem: find a path from one position to another.

• States: possible positions on the chess board.

• Operators: the ways a knight moves.

• Goal: the positions of the pawns.

• Path cost: the number of moves used to get there.

Knight’s

• How large is the state space?
– A knight has up to 8 moves per turn.
– Each possible tour must be verified up to the end of the trip. For a board of width N, there are N*N squares.
• Thus, each tour can be up to N*N states in length.
– If the correct solution is found last, we might consider every wrong tour first.

O(8^(N*N)) states to examine!

Example: Knight’s heuristics

• Zero: a board can be considered a dead end if any square has zero paths remaining to it. Any square with no paths to it would be unreachable, so no Knight's Tour would exist.

• Two ones: a board with more than one square with only one path to it is a dead end. A square with one path left is necessarily a dead end, so two of them indicate a dead end position.

• How well will we do if we use “blind” DFS?

• That is, if we consider random tours?


• Three heuristics based on the number of paths remaining to each square were implemented and tested in combination, as well as a representational speedup. The optimal combination of the heuristics is to eliminate boards with either a square with zero paths remaining or a square with two ones remaining; this combination led to a 950-fold speedup on a 6x6 board. The representational speedup led to a 2.5-fold additional speedup on a 6x6 board.

– Michael Bernstein

Heuristics: Knight’s Performance

BFS

• Consider a state space with a uniform branching factor of b.

• At each level we have b nodes for every node we had before:

    1 + b + b^2 + b^3 + b^4 + … + b^d

So, a solution at depth d implies we must expand O(b^d) nodes!

• Internal nodes: (b^d − 1)/(b − 1)

BFS: solution quality

• How good is the solution quality from BFS?

• Remember that edges in the search tree can have weights.

• BFS will always find the shallowest (shortest depth) solution first.

• This will also be the minimum-cost solution if…

BFS & Memory

• For BFS, we must record all the nodes at the current level.
– BFS can have large (enormous) memory requirements!

• Memory can be a worse constraint than time.

• When the problem is exponential, however, …

Depth   Nodes     Time            Memory
0       1         1 millisecond   100 bytes
2       111       0.1 seconds     11 kilobytes
4       11,111    11 seconds      1 megabyte
6       10^6      18 minutes      111 megabytes
8       10^8      31 hours        11 gigabytes
10      10^10     128 days        1 terabyte
12      10^12     35 years        111 terabytes
14      10^14     3500 years      11,111 terabytes


Comparison to DFS?

• Worst case:
– If the search tree can be arbitrarily deep, DFS may never terminate!
– If the search tree has maximum depth m, then DFS (at worst) visits every node up to depth m.
– Time complexity O(b^m)

• If there are lots of solutions (or you’re lucky), DFS may do better than BFS.
– If it gets a solution on the first try, it only looks at d nodes.

Uniform Cost Search

• Note the edges in the search tree may have costs.

• All nodes at a given depth may not have the same cost, or desirability.

• Uniform cost search: expand nodes in order of increasing cost from the start.

• This is assured to find the cheapest path first, so long as…

Iterative-Deepening

• Combine benefits of DFS (less memory) and BFS (best solution depth/time).

• Repeatedly apply DFS up to a maximum depth “diameter”.

• Incrementally increase the diameter. Unlike BFS, we do not store the leaves.

Iterative-Deepening Performance

Idea: expand the same nodes as BFS, and in the same order as BFS!

Don’t save intermediate results, so nodes must be re-expanded.

How can this be good?!

This is a classic time-space tradeoff.

Because the search tree is exponential, wasted work near the top doesn’t matter much.

Asymptotically optimal, complete.
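A minimal sketch of iterative deepening, assuming the same adjacency-list dicts used for DFS and BFS earlier:

```python
def depth_limited_dfs(graph, node, goal, limit):
    """DFS that refuses to descend below depth `limit`."""
    if node == goal:
        return [node]
    if limit == 0:
        return None
    for child in graph.get(node, []):
        path = depth_limited_dfs(graph, child, goal, limit - 1)
        if path is not None:
            return [node] + path
    return None

def iterative_deepening(graph, start, goal, max_depth=50):
    """Repeat depth-limited DFS with a growing diameter; the first hit
    is therefore a shallowest path, as with BFS, but memory stays O(d)."""
    for limit in range(max_depth + 1):
        path = depth_limited_dfs(graph, start, goal, limit)
        if path is not None:
            return path
    return None
```

The upper levels are re-expanded on every iteration, but since the tree is exponential, that wasted work is a small fraction of the total.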

SEMINAR

Deep Blue: IBM's Massively Parallel Chess Machine

Wednesday, September 23, 1998

TIME: 11:00 A.M.

Strathcona Anatomy and Dentistry Building

3640 University Street

Room M-1

Dr Gabriel M Silberman

Seminar abstract

IBM's premiere chess system, based on an IBM RS/6000 SP scalable parallel processor, made history by defeating world chess champion Garry Kasparov. Deep Blue's chess prowess stems from its capacity to examine over 200 million board positions per second, utilizing the computing resources of a 32-node IBM RS/6000-SP, populated with 512 special purpose chess accelerators.

In this talk we describe some of the technology behind Deep Blue, how chess knowledge was incorporated into its software, as well as the attitude of the media and general public during the match.


Computer Chess & DB

• Background for tomorrow’s seminar

– Presentation on Deep Blue and Chess (not available here)

Not in text

Bidirectional search

• Remember that the bottom of the search tree is where most of the work is.

• Idea:
– keep the tree smaller
– Search from both the initial state and the goal.
• Recall forwards + backwards chaining.
– Each leads to a half-size tree (and with luck, they meet)!

• If the goal is at depth d, time is O(b^(d/2)) (space too)

Not in text

Lecture 8

• Heuristic search

• Motivational video

• Assignment 2

Informed Search

• Search methods so far are based on expanding nodes in the search space according to distance from the start node.
– Obviously, we always know that!

• How about using the estimated distance h’(n) to the goal!
– What’s the problem?

DAA Ch. 4

Heuristic Search

• What if we just have a guess as to the distance to the goal: a heuristic (like “Eureka!”).

Best-First Search

• At any time, expand the most promising node.

• Recall our general search algorithm (from last lecture).

Compare this to uniform-cost search, which is, in some sense, the opposite!

Best-First

• Best-First is like DFS.

• HOW much like DFS depends on the character of the heuristic evaluation function h’(n).
– If it’s zero all the time, we get BFS.

• Best-first is a greedy method.
– Greedy methods maximize short-term advantage.


Example: route planning

• Consider planning a path along a road system.

• The straight-line distance from one place to another is a reasonable heuristic measure.

• Is it always right?
– Clearly not: some roads are very circuitous.

Example: The Road to Bucharest

[Figure: map of Romania showing road distances between cities (e.g. Arad–Sibiu 140, Sibiu–Fagaras 99, Sibiu–Rimnicu Vilcea 80, Fagaras–Bucharest 211, Rimnicu Vilcea–Pitesti 97, Pitesti–Bucharest 101) and a table of straight-line distances to Bucharest (e.g. Arad 366, Sibiu 253, Fagaras 178, Rimnicu Vilcea 193, Pitesti 98, Bucharest 0).]

Problem: Too Greedy

• From Arad to Sibiu to Fagaras --- but going to Rimnicu would have been better.

• Need to consider: the cost of getting from the start node (Arad) to intermediate nodes!

Intermediate nodes

• Desirability of an intermediate node:
– How much it costs to get there
– How much farther one has to go afterwards

• Leads to an evaluation function of the form:

    e(n) = g(n) + h’(n)

– As before, h’(n) estimates the remaining distance to the goal.
– Use g(n) to express the accumulated cost to get to this node.

Details...

• Set L to be the initial node(s).

• Let n be the node on L that minimizes e(n).

• If L is empty, fail.

• If n is a goal node, stop and return it (and the path from the initial node to n).

• Otherwise, remove n from L and add all of n's successors.

Finds Optimal Path

• Now expands Rimnicu (f = (140 + 80) + 193 = 413) over Fagaras (f = (140 + 99) + 178 = 417).

• Q. What if h(Fagaras) = 170 (also an underestimate)?


Admissibility

• An admissible heuristic always finds the best (lowest-cost) solution first.
– Sometimes we refer to the algorithm being admissible -- a minor “abuse of notation”.
– Is BFS admissible?

• If

    h’(n) <= h(n)

then the heuristic is admissible.

A*

• The effect is that it is never overly optimistic (overly adventurous) about exploring paths in a DFS-like way.

• If we use this type of evaluation function with an admissible heuristic, we have algorithm A*
– A* search

Monotonicity

• Let's also assume (true for most admissible heuristics) that e is monotonic, i.e., along any path from the root f never decreases.

• Can often modify a heuristic to become monotonic if it isn’t already.

• E.g. let n be the parent of n’. Suppose that g(n) = 3 and h(n) = 4, so f(n) = 7, and g(n’) = 4 and h(n’) = 2, so f(n’) = 6.

• But both values are lower bounds on the true cost through n’, so we can safely use the larger: take f(n’) = max(f(n), g(n’) + h(n’)) = 7.

A* is good

• If h’(n) = h(n) then we know “everything” and e(n) is exact.

• If e(n) were exact, we could expand only the nodes on the actual optimal path to the goal.

• Note that in practice, e(n) is always an underestimate.

• So, we always expand more nodes than needed to find the optimal path.

• A* finds the optimal path (first)

• A* is complete

• A* is optimally efficient for a given heuristic: if we ever skipped expanding a node, we might make a serious mistake.
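A sketch of A* with a priority queue, using only the fragment of the Romania road map and straight-line distances quoted above (the dict representation is an assumption for illustration):

```python
import heapq

def astar(graph, h, start, goal):
    """A* search: repeatedly expand the frontier node that minimizes
    e(n) = g(n) + h'(n). graph[u] is a list of (v, cost) pairs."""
    frontier = [(h[start], 0, start, [start])]   # (e, g, node, path)
    best_g = {start: 0}
    while frontier:
        e, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nxt, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier,
                               (g2 + h[nxt], g2, nxt, path + [nxt]))
    return None

# Road fragment and straight-line distances from the Romania example
graph = {"Arad": [("Sibiu", 140)],
         "Sibiu": [("Fagaras", 99), ("Rimnicu Vilcea", 80)],
         "Fagaras": [("Bucharest", 211)],
         "Rimnicu Vilcea": [("Pitesti", 97)],
         "Pitesti": [("Bucharest", 101)]}
h = {"Arad": 366, "Sibiu": 253, "Fagaras": 178,
     "Rimnicu Vilcea": 193, "Pitesti": 98, "Bucharest": 0}
```

Running astar(graph, h, "Arad", "Bucharest") returns cost 418 via Rimnicu Vilcea and Pitesti, and expands Rimnicu (e = 413) before Fagaras (e = 417), matching the slides.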

Video

• Nick Roy (former McGill grad student) describes a tour-guide robot on the Discovery Channel.

• Note: such a mobile robot might well use A* search.
– Obstacles are a major source of the mismatch between h’ and h.
– Straight-line distance provides a natural estimate for h’.


Hill-climbing search

• Usually used when we do not have a specific goal state.

• Often used when we have a solution/path/setting and we want to improve it: an iterative improvement algorithm.

• Always move to a successor that increases the desirability (can minimize or maximize cost).

Lecture 9

• Heuristic search, continued

A* revisited

• Reminder: with A* we want to find the best-cost (C) path to the goal first.
– To do this, all we have to do is make sure our cost estimates are less than the actual costs.
– Since our desired path has the lowest actual cost, no other path can be fully expanded first, since at some point its estimated cost will have to be higher than the cost C.

Heuristic Functions: Example

[Figure: an 8-puzzle start state and the goal state, tiles 1–8 with one blank.]

8-puzzle

hC = number of misplaced tiles

hM = Manhattan distance

• Admissible?

• Which one should we use?
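Both heuristics are easy to compute; a sketch with states as flat tuples read row by row, where tile 0 stands for the blank (a representation chosen here for illustration):

```python
def misplaced(state, goal):
    """h_C: count of tiles (ignoring the blank, 0) out of place."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal, width=3):
    """h_M: sum over tiles of |row difference| + |column difference|
    between the tile's position in `state` and in `goal`."""
    total = 0
    for tile in range(1, width * width):
        i, j = state.index(tile), goal.index(tile)
        total += abs(i // width - j // width) + abs(i % width - j % width)
    return total
```

For any state, hC never exceeds hM, since each misplaced tile contributes at least 1 to the Manhattan sum; both are admissible.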


Choosing a good heuristic

hC <= hM <= hopt

• Prefer hM.

Note: expand all nodes with e(n) = g(n) + h(n) < e*.

• So, g(n) < e* − h(n): higher h means fewer n's.

• When one h function is everywhere at least as large as another (and both are admissible), it expands no more nodes: it dominates.

Inventing Heuristics

• Automatically

• A tile can move from sq A to sq B if A is adjacent to B and B is blank.

• (a) A tile can move from sq A to sq B if A is adjacent to B.

• (b) A tile can move from sq A to sq B if B is blank.

• (c) A tile can move from sq A to sq B.

IDA*

• Problem with A*: like BFS, uses too much memory.

• Can combine with iterative deepening to limit the depth of search.

• Only add nodes to search list L if they are within a limit on the e cost.

A Different Approach

• So far, we have considered methods that systematically explore the full search space, possibly using principled pruning (A* etc.).

• The current best such algorithms (IDA* / SMA*) can handle search spaces of up to 10^100 states, or around 500 binary valued variables. (These are ``ballpark'' figures only!)

And if that’s not enough?

• What if we have 10,000 or 100,000 variables / search spaces of up to 10^30,000 states?

• A completely different kind of method is called for:

Local Search Methods
or
Iterative Improvement Methods

When?

Applicable when we're interested in the goal state, not in how to get there.

E.g. N-Queens, VLSI layout, or map coloring.


Iterative Improvement

• Simplest case:
– Start with some (initial) state
– Evaluate its quality
– Apply a perturbation ∂ to move to an adjacent state
– Evaluate the new state
– Accept the new state sometimes, based on a decision D(∂)

∂ can be random.

D(.) can be "always".

Or….

Can we be smarter?

• Can we do something smarter than
A) moving randomly
B) accepting every change?

A) ….. move in the “right” direction

B) accept changes that get us towards a better state

Hill-climbing search

• Often used when we have a solution/path/setting and we want to improve it, and we have a suitable space: an iterative improvement algorithm.

• Always move to a successor that increases the desirability (can minimize or maximize cost).
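A steepest-ascent sketch of this rule; the `neighbours` and `value` callbacks are hypothetical hooks supplied by the caller, not from the slides:

```python
def hill_climb(initial, neighbours, value, max_steps=1000):
    """Steepest-ascent hill climbing: always move to the best successor,
    stopping at a (possibly only local) maximum of `value`."""
    current = initial
    for _ in range(max_steps):
        best = max(neighbours(current), key=value, default=None)
        if best is None or value(best) <= value(current):
            return current          # no uphill move: we are at a peak
        current = best
    return current
```

For example, maximizing value(x) = -(x - 7)**2 over the integers with neighbours x ± 1 climbs from 0 up to 7.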

Hill-climbing problems

• Hill climbing corresponds to function optimization by simply moving uphill along the gradient.
– Why isn’t this a reliable way of getting to a global maximum?

• 1. Local maxima

• 2. Plateaus

• 3. Ridges

Gradient ascent

• In continuous spaces, there are better ways to perform hill climbing.
– Acceleration methods
• Try to avoid sliding along a ridge.
– Conjugate gradient methods
• Take steps that do not counteract one another.

Stochastic search

• Simply moving uphill (or downhill) isn’t good enough.
– Sometimes, we have to go against the obvious trend.

• Idea: randomly make a “non-intuitive” move.
– Key question: How often?

• One answer (simple intuition):

Page 32: Topics in Artificial Intelligence Lecture outline - McGill CIMdudek/424/lectures-all.pdf · Why AI (in it’s broadest ... • Marvin Minsky: Popularized Neural Nets and showed limits

32

Simulated Annealing

• Initially act more randomly.
• As time passes, we assume we are doing better and act less randomly.

• Analogy: associate each state with an energy or “goodness”.

• Specifics:
– Pick a random successor s2 to the current state s1.
– Compute ∂E = energy(s2) − energy(s1).
– If ∂E is good (positive), go to the new state.
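A sketch of the full procedure; the acceptance rule exp(∂E/T) for bad moves and the geometric cooling schedule are conventional choices for annealing, assumed here rather than taken from the slides:

```python
import math, random

def simulated_annealing(initial, neighbour, energy,
                        t0=10.0, cooling=0.99, steps=10000):
    """Simulated annealing (maximizing `energy`): always accept good
    moves; accept bad ones with probability exp(dE / T), where the
    temperature T shrinks over time so behaviour becomes less random."""
    current = initial
    t = t0
    for _ in range(steps):
        s2 = neighbour(current)
        dE = energy(s2) - energy(current)
        if dE > 0 or random.random() < math.exp(dE / t):
            current = s2            # good move, or an accepted bad move
        t *= cooling                # cool down: act less randomly
    return current
```

Early on (large T) the walk is nearly random; by the end (tiny T) it behaves like pure hill climbing.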

GA’s

Genetic Algorithms

Plastic transparencies & blackboard.

Adversary Search

Basic formalism

• 2 players.

• Each has to beat the other (competing objective functions).

• Complete information.

• Alternating moves for the 2 players.

Minimax

• Divide tree into plies

• Propagate information up from terminal nodes.

• Storage needs grow exponentially with depth.

Lecture 10

• Annealing (final comments)

• Adversary Search

• Genetic Algorithms (genetic search)

A comment

• On minimal length paths….

• The Dynamic-Programming Principle [Winston]:
“The best way through a particular, intermediate place is the best way to it from the starting place, followed by the best way from it to the goal. There is no need to look at any other paths to or from the intermediate place.”


Simulated Annealing

• Initially act more randomly.
• As time passes, we assume we are doing better and act less randomly.

• Analogy: associate each state with an energy or “goodness”.

• Specifics:
– Pick a random successor s2 to the current state s1.
– Compute ∂E = energy(s2) − energy(s1).
– If ∂E is good (positive), go to the new state.

Adversary Search

Basic formalism

• 2 players.

• Each has to beat the other (competing objective functions).

• Complete information.

• Alternating moves for the 2 players.

The players

• Players want to achieve “opposite” goals:
– What’s bad for A is good for B, and vice versa.

• With respect to our static evaluation functions, one wants to maximize the function, the other wants to minimize it.

• Use a game tree to describe the state space.

Game Tree

• Nodes represent board configurations.

• Edges represent allowed moves from one configuration to another.

• For most 2-player games, alternating levels in the tree refer to alternating moves by the 2 players.

• The static evaluation function measures the quality of a board configuration.

See overhead

MINIMAX

• If the limit of search has been reached, compute the static value of the current position relative to the appropriate player. Return the result.

• Otherwise, use MINIMAX on the children of the current position.
– If the level is a minimizing level, return the minimum of the results.
– If the level is a maximizing level, return the maximum of the results.
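A recursive sketch of this procedure; `children` and `static_value` are hypothetical callbacks standing in for the game rules and the static evaluation function:

```python
def minimax(node, depth, maximizing, children, static_value):
    """MINIMAX as described above: recurse to the search limit, then
    back values up, maximizing or minimizing at alternating plies."""
    kids = children(node)
    if depth == 0 or not kids:          # search limit or terminal node
        return static_value(node)
    values = [minimax(c, depth - 1, not maximizing, children, static_value)
              for c in kids]
    return max(values) if maximizing else min(values)
```

On a toy tree written as nested lists, [[3, 5], [2, 9]], searched to depth 2 with the maximizer to move, the backed-up value is max(min(3, 5), min(2, 9)) = 3.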


MINIMAX observations

• The static evaluation function is…
– Crucial: all decisions are eventually based on the value it returns.
– Irrelevant: if we search to the end of the game, we always know the outcome.

• In practice, if a game has a large search space then the static evaluation is especially important.

Minimax

• Divide tree into plies

• Propagate information up from terminal nodes.

• Storage needs grow exponentially with depth.

Alpha-Beta Search

• Can we avoid all the search-tree expansion implicit in MINIMAX?

• Yes, with a simple observation closely related to A*:

Once you have found a good path, you only need to consider alternative paths that are better than that one.


The α−β principle.

• “If you have an idea that is surely bad, do not take time to see how truly awful it is.” [Winston]

α−β cutoff: How good is it?

• What is the best-case improvement for alpha-beta?

• What is the worst-case improvement for alpha-beta?

• Best case: only examine the leftmost “pre-terminal” nodes fully.

• Worst case: no cutoffs occur, and we do the same work as plain MINIMAX.
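The same recursion with alpha-beta cutoffs; alpha is the best value the maximizer can already guarantee, beta the minimizer's, and `children`/`static_value` are the same hypothetical callbacks as in the MINIMAX sketch:

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, static_value):
    """Minimax with alpha-beta cutoffs: stop exploring a node once it is
    provably no better than an alternative already available higher up."""
    kids = children(node)
    if depth == 0 or not kids:
        return static_value(node)
    if maximizing:
        value = float("-inf")
        for c in kids:
            value = max(value, alphabeta(c, depth - 1, alpha, beta, False,
                                         children, static_value))
            alpha = max(alpha, value)
            if alpha >= beta:       # beta cutoff: MIN above won't allow this
                break
        return value
    value = float("inf")
    for c in kids:
        value = min(value, alphabeta(c, depth - 1, alpha, beta, True,
                                     children, static_value))
        beta = min(beta, value)
        if alpha >= beta:           # alpha cutoff
            break
    return value
```

On the toy tree [[3, 5], [2, 9]] it returns the same value as MINIMAX (3) while skipping the 9: once the second MIN node sees the 2, it can do no better than 2 < 3, so the branch is cut off.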

Not in text


Progressive deepening

• How do you deal with (potential) time pressure?

• Search progressively deeper: first depth 1, then depth 2, etc.

• As before, the ratio of interior to leaf nodes is

    b^d(b − 1)/(b^d − 1), or roughly (b − 1)

Not in text

Issues

• Horizon effect

• Search-until-quiescent heuristic

• Singular-extension heuristic

Not in text

GA’s

Genetic Algorithms

Plastic transparencies & blackboard.

Lecture 11

• Learning
– What is learning?
– Supervised vs. unsupervised
– Supervised learning
• ALVINN demo (cf. DAA p. 181)

• Inductive inference

Problem Solving as Search: recap

Uninformed search:

• DFS

• BFS

• Uniform cost search

• time / space complexity
– size of search space: up to approx. 10^11 nodes

Better….

• Informed search: use a heuristic function to guide us to the goal
– Greedy search
– A* search: provably optimal
• Search space up to approx. 10^25
– Local search (incomplete)
– Greedy / hillclimbing / GSAT
– Simulated annealing

– Genetic Algorithms / Genetic Programming

– Larger search spaces


Adversary search

Adversary search / game playing
– Minimax
• Up to around 10^10 nodes, 6–7 ply in chess.
– Alpha-beta pruning
• Up to around 10^20 nodes, 14 ply in chess.

Genetic Algorithms

• Example: worked out on blackboard.
– Using a GA to generate a course lecture schedule.

Kinds of Learning (Q&A)

• What do you associate with the term learning?

What is for you the prototypical learning task?

• Memorization and rote learning, like flashcards?

• Skill acquisition, such as learning to ski or learning to do symbolic integration?

• Theory or discovery learning, like discovering a new economics model for predicting market fluctuations?

What is learning?

Definition: “… changes to the content and organization of an agent's knowledge enabling it to improve its performance on a particular task or population of tasks” [Herb Simon].

Types of learning

• Inferential basis
– Inductive learning

– Deductive learning

Inductive

• Inductive learning and the acquisition of new knowledge
– inferring generalities from particulars - note that this type of learning is not sound - e.g., learn what foods served at the cafeteria are digestible


Deductive

• Deductive learning and the organization of existing knowledge
– making explicit deductive consequences of existing axioms - this type of learning generally is sound - e.g., expedite deductive inference by adding new axioms:

  from    forall x, loves(x,x)
  and     forall x,y, loves(x,y) -> (has-money(x) -> pay-bill(x,y))
  we can conclude that
          forall x, has-money(x) -> pay-bill(x,x)

Learning: pedagogical classification

• Pedagogical basis
– How do we teach the learning system?
– Supervised vs. unsupervised.

Supervised

• Supervised learning involves a teacher that provides examples of the form (description, solution).
– I.e. a pair where the solution for a specific problem instance is explicitly indicated.

• The description is a representation of a particular problem instance or situation, and the solution is a representation of the desired solution or response.

Unsupervised

• Unsupervised learning need not involve a teacher at all.

• If it does, the teacher only provides rewards and punishments.

• The learner not only has to figure out what constitutes a problem instance but also has to figure out an appropriate response.

• e.g., learn chess by trial and error, or discovery learning as in learning new concepts in mathematics (perfect numbers) or game playing.

Concepts as functions

• f:X -> Y

Input space

X = {0,1} × {0,1} × ... × {0,1} = {0,1}^n

i.e., the set of all possible assignments to n boolean variables

X = R^n, where R is the real numbers

X = set of all descriptions of some class of objects

f: X -> Y

Output space

• Y = {0,1}: this is called concept learning

• Y = {1,2,...,n}: this is called classification
– think of the integers 1 through n as representing classes, e.g., compact, sporty, subcompact, midsize, full size, too_damn_big

• Y = set of possible actions to take


Supervised learning

• The learning system is given a set of training examples of the form (x, y).

• For example, ((0 1), 0), ((1 0), 0), ((1 1), 1) are training examples for a two-input boolean function: conjunction.
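As a toy illustration (not from the slides), the Python sketch below enumerates all 16 boolean functions of two inputs and keeps the ones consistent with these three training examples; conjunction is among the survivors, which shows why more examples are needed to pin the concept down.

```python
from itertools import product

# Training examples for a two-input boolean function: ((x1, x2), y)
examples = [((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# A two-input boolean function is a truth table: one output bit per input pair.
inputs = list(product([0, 1], repeat=2))       # (0,0), (0,1), (1,0), (1,1)
consistent = []
for outputs in product([0, 1], repeat=4):      # all 16 candidate functions
    table = dict(zip(inputs, outputs))
    if all(table[x] == y for x, y in examples):
        consistent.append(table)

# Only the (0,0) case is unconstrained, so 2 functions remain,
# and conjunction (AND) is one of them.
print(len(consistent))                          # 2
print(all(t[(1, 1)] == 1 for t in consistent))  # True
```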

Example presentation

• Batch problems: all of the examples are given at once.

• Online problems: the learner is given one example at a time and is assumed not to have the storage necessary to keep track of all of the examples seen so far.

• Why is online interesting? Examples?

A learning system

• ALVINN demo video….

• This system does real learning.
– Is it supervised?
– Is it batch?

• For now, consider learning problems in which the training examples are noise free.

• This is not always the case.
– e.g., what if the driver made a bad decision while training ALVINN?

• Generally, some sort of statistical analysis is needed to avoid being misled by mis-classified training examples.

Other such problems?

• Learning web-page preferences.

• Movie selections (grouchy mood).

• Medical diagnosis (erratic symptom / dishonest patient).

Learning: formalism

Come up with some function f such that

• f(x) = y for all training examples (x, y), and

• f (somehow) generalizes to yet-unseen examples.

Inductive bias: intro

• There has to be some structure apparent in the inputs in order to support generalization.

• Consider the following pairs from the phone book.

Inputs            Outputs

Ralph Student     941-2983

Louie Reasoner    456-1935

Harry Coder       247-1993

Fred Flintstone   ???-????

There is not much to go on here.
• Suppose we were to add zip code information.
• Suppose phone numbers were issued based on the spelling of a person's last name.
• Suppose the outputs were user passwords?


Example 2

• Consider the problem of fitting a curve to a set of (x, y) pairs.

| x x

|-x----------x---

| x

|__________x_____

– Should you fit a linear, quadratic, cubic, or piece-wise linear function?

– It would help to have some idea of how smooth the target function is, or to know from what family of functions (e.g., polynomials of bounded degree) it is drawn.

Inductive Bias: definition

• This "some idea of what to choose from" is called an inductive bias.

• Terminology:
H, hypothesis space - a set of functions to choose from
C, concept space - a set of possible functions to learn

• Often in learning we search for a hypothesis f in H that is consistent with the training examples, i.e., f(x) = y for all training examples (x, y).

Lecture 14

• Learning
– Inductive inference

– Probably approximately correct learning

What is learning?

Key point: all learning can be seen as learning the representation of a function.

Will become clearer with more examples!

Example representations:

• propositional if-then rules

• first-order if-then rules

• first-order logic theories

• decision trees

• neural networks


Inductive Learning

Given a collection of examples (x, f(x)), return a function h that approximates f.

h is called the hypothesis and is chosen from the hypothesis space.

• What if f is not in the hypothesis space?


Which hypothesis?

[Figure: four panels (a)-(d) showing alternative candidate curves fit to the same data points.]

Bias explanation

How does a learning algorithm decide which hypothesis to return?

Bias leads it to prefer one hypothesis over another.

Two types of bias:

• preference bias (or search bias): depending on how the hypothesis space is explored, you get different answers

• restriction bias (or language bias): the “language” used (Java, FOL, etc.); the target function may not be expressible in it

Issues in selecting the bias

Tradeoff (similar to the one in reasoning):

the more expressive the language, the harder it is to find (compute) a good hypothesis.

Compare: propositional Horn clauses with first-order logic theories or Java programs.

• Also, often need more examples.


Occam’s Razor

• Most standard and intuitive preference bias:

Occam’s Razor (a.k.a. Ockham’s Razor)

The most likely hypothesis is the simplest one that is consistent with all of the observations.

Implications

• The world is simple.

• The chances of an accidentally correct explanation are low for a simple theory.

Probably Approximately Correct (PAC) Learning

Two important questions that we have yet to address:

• Where do the training examples come from?

• How do we test performance, i.e., are we doing a good job learning?

• PAC learning is one approach to dealing with these questions.

Classifier example

Consider learning the predicate Flies(Z) ∈ {true, false}.

We are assigning objects to one of two categories: recall we call this a classifier.

Suppose that X = {pigeon, dodo, penguin, 747}, Y = {true, false}, and that

Pr(pigeon) = 0.3   Flies(pigeon) = true

Pr(dodo) = 0.1    Flies(dodo) = false

• Note that if we mis-classified dodos but got everything else right, then we would still be doing pretty well, in the sense that 90% of the time we would get the right answer.

• We formalize this as follows.

• The approximate error associated with a hypothesis f is

error(f) = ∑_{x : f(x) ≠ Flies(x)} Pr(x)

• We say that a hypothesis is approximately correct with error at most ε if error(f) ≤ ε.
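The error definition is easy to evaluate numerically. In the sketch below, only Pr(pigeon) and Pr(dodo) come from the slide; the probabilities for penguin and 747 are assumed values added for illustration.

```python
# error(f) = sum of Pr(x) over the instances x that f misclassifies.
# Pr(pigeon) and Pr(dodo) are from the slide; the other two are assumptions.
Pr    = {"pigeon": 0.3, "dodo": 0.1, "penguin": 0.2, "747": 0.4}
flies = {"pigeon": True, "dodo": False, "penguin": False, "747": True}

def error(f):
    """Probability mass of the instances that hypothesis f misclassifies."""
    return sum(p for x, p in Pr.items() if f(x) != flies[x])

# A hypothesis that is wrong only on dodos: right 90% of the time.
h = lambda x: x != "penguin"
print(error(h))   # 0.1
```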


• The chance that a theory is correct increases with the number of consistent examples it predicts.

• Or….

• A badly wrong theory will probably be uncovered after only a few tests.

PAC: definition

Relax this requirement: do not require that the learning program necessarily achieve a small error, only that it keep the error small with high probability.

A program is probably approximately correct (PAC) with parameters δ and ε if, given a set of training examples drawn according to the fixed distribution, it outputs a hypothesis f such that, with probability at least 1 − δ, error(f) ≤ ε.

PAC

• Idea:

• Consider space of hypotheses.

• Divide these into “good” and “bad” sets.

• Want to ensure that we can close in on the set of good hypotheses that are close approximations of the correct theory.

PAC Training examples

Theorem:

If the number of hypotheses |H| is finite, then a program that returns a hypothesis consistent with

m ≥ ln(δ/|H|) / ln(1 − ε)

training examples (drawn according to Pr) is guaranteed to be PAC with parameter δ and error bounded by ε.

PAC theorem: proof

If f is not approximately correct, then error(f) > ε, so the probability of f being correct on one example is < 1 − ε, and the probability of it being correct on m examples is < (1 − ε)^m.

Suppose that H = {f, g}. The probability that f correctly classifies all m examples is < (1 − ε)^m; likewise for g. The probability that one of f or g correctly classifies all m examples is < 2 (1 − ε)^m.

To ensure that any hypothesis consistent with m training examples has error at most ε with probability at least 1 − δ, we must bound the chance that some hypothesis with error > ε survives all m examples.

Generalizing, there are |H| hypotheses in the restricted hypothesis space, and hence the probability that some such hypothesis correctly classifies all m examples is bounded by

|H| (1 − ε)^m.

Solving for m in

|H| (1 − ε)^m < δ

we obtain

m ≥ ln(δ/|H|) / ln(1 − ε).
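The bound is easy to evaluate. A small Python helper (an illustrative sketch, not from the slides):

```python
import math

def pac_sample_size(H_size, eps, delta):
    """Examples needed so any consistent hypothesis is probably (with
    probability >= 1 - delta) approximately (error <= eps) correct:
    m >= ln(delta/|H|) / ln(1 - eps)."""
    return math.ceil(math.log(delta / H_size) / math.log(1 - eps))

# e.g., 1000 hypotheses, 5% error, failure probability 1%
print(pac_sample_size(1000, 0.05, 0.01))   # 225
```

Note that m grows only logarithmically in |H|, so even large hypothesis spaces need modest numbers of examples.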


Stationarity

• Key assumption of PAC learning:

Past examples are drawn randomly from the same distribution as future examples: stationarity.

The number m of examples required is called the sample complexity.

A class of concepts C is said to be PAC learnable for a hypothesis space H if (roughly) there exists a polynomial-time algorithm such that:

for any c in C, distribution Pr, ε, and δ,

if the algorithm is given a number of training examples polynomial in 1/ε and 1/δ, then with probability 1 − δ it returns a hypothesis with error at most ε.

Overfitting

Consider the error of a hypothesis h over:

• the training data: error_train(h)

• the entire distribution D of data: error_D(h)

• Hypothesis h ∈ H overfits the training data if
– there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')

but

error_D(h) > error_D(h').

Lecture 14

• Learning
– Probably approximately correct learning

(cont’d)

– Version spaces

– Decision trees



We want….

• PAC (so far) describes the accuracy of the hypothesis, and the chances of finding such a concept.
– How many examples do we need to rule out the “really bad” hypotheses?

• We also want the process to proceed quickly.

PAC learnable spaces

A class of concepts C is said to be PAC learnable for a hypothesis space H if there exists a polynomial-time algorithm A such that:

for any c ∈ C, distribution Pr, ε, and δ,

if A is given a quantity of training examples polynomial in 1/ε and 1/δ,

then with probability 1 − δ

the algorithm will return a hypothesis f from H such that error(f) ≤ ε.

Observations on PAC

• PAC learnability doesn’t tell us how to find the learning algorithm.

• The number of examples needed grows only slowly (logarithmically) as the hypothesis space grows, and polynomially in the other key parameters.

Example

• Target and learned concepts are conjunctions over up to n predicates. (This is our bias.)
– Each predicate might appear in positive or negated form, or be absent: 3 options.
– This gives 3^n possible conjunctions in the hypothesis space.

Result

• I have such a formula in mind.

• I’ll give you some examples.

• You try to guess what the formula is.

A concept that matches all our examples will be PAC if m is at least

(1/ε) (n ln 3 + ln(1/δ))

(this follows from the earlier bound with |H| = 3^n, using −ln(1 − ε) ≥ ε).
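Plugging in numbers makes the bound concrete; a brief sketch (the parameter values are arbitrary examples):

```python
import math

def conjunction_sample_size(n, eps, delta):
    """Sufficient m for conjunctions over n predicates (|H| = 3^n),
    using m >= (1/eps) * (n ln 3 + ln(1/delta))."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

# 10 predicates, 10% error, 5% failure probability
print(conjunction_sample_size(10, 0.1, 0.05))   # 140
```

Doubling n roughly doubles the number of examples needed: the bound is linear in n even though |H| is exponential in n.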

How

• How can we actually find a suitable concept?

• One key approach: start with the examples themselves, and try to generalize.

• E.g., given f(3,5) and f(5,5):
– We might try replacing the first argument with a variable X: f(X,5).


Version Space [DAA 5.3]

• Deals with conjunctive concepts.

• Consider a concept C as being identified with the set of positive examples it is associated with.
– C: odd-numbered hearts = {3, 5, 7, 9} of hearts.

• A concept C1 is a specialization of concept C2 if the examples associated with C1 are a subset of those associated with C2.

Specialization/Generalization

[Hierarchy: Cards → {Black, Red}; Red → {Odd red, Even red}; Odd red → odd-numbered hearts (red is implied) → {3, 5, 7, 9}.]

Immediate

• Immediate Specialization: no intermediate concept in between.

• Red is not the immediate generalization of 2-of-hearts.

• Red is the immediate generalization of hearts and diamonds.
– Note: this observation depends on knowing the hypothesis space restriction.

Algorithm outline

• Incrementally process training data.

• Keep a list of the most and least specific concepts consistent with the observed data.
– For two concepts A and B that are consistent with the data, the concept C = (A AND B) will also be consistent, yet more specific.

• Tied in a subtle way to conjunctions.
– Disjunctive concepts can be obtained trivially by joining examples, but they’re not interesting.

• 4: no
• 5: yes
• 5♣: no
• 7: yes
• 9♠: ---
• 3: yes

VS example

[Hierarchy: Cards → {Black, Red}; Red → {Even red, Odd red}; Odd red → odd-numbered hearts (red is implied) → {3, 5, 7, 9}.]

Algorithm specifics

• Maintain two bounding concepts:
– The most specialized (DAA: specific boundary)
– The broadest (DAA: general boundary).

• Each example we see is either positive (yes) or negative (no).

• Positive examples (+) tend to make the concept more general (or inclusive). Negative examples (-) are used to make the concept more exclusive (to reject them).


Observations

• It allows you to GENERALIZE from a training set to examples never before seen!
– In contrast, consider table lookup or rote learning.

• Why is that good?
1. It allows you to infer things about new data (the whole point of learning).
2. It allows you to (potentially) remember old data.

Restaurant Selector

Example attributes:

• 1. Alternate

• 2. Bar

• 3. Fri/Sat

• 4. Hungry

• 5. Patrons

• 6. Price

• etc.

e.g., Patrons(Full) AND …

Example 2

Maybe we should have made a reservation? (using a decision tree)

• Restaurant lookup: you’ve heard Joe’s is good.

• Lookup Joe’s
• Lookup Chez Joe
• Lookup Restaurant Joe’s

Decision trees: issues

• Constructing a decision tree is easy… really easy!
– Just add examples in turn.

• Difficulty: how can we extract a simplified decision tree?
– This implies (among other things) establishing a preference order (bias) among alternative trees.

Office size example

Training examples:

1. large ^ cs ^ faculty -> yes

2. large ^ ee ^ faculty -> no

3. large ^ cs ^ student -> yes

4. small ^ cs ^ faculty -> no

5. small ^ cs ^ student -> no

The questions about office size, department, and status each tell us something about an example's classification.


Decision tree #1

size
  large:
    dept
      cs: yes (1, 3)
      ee: no (2)
  small: no (4, 5)

Decision tree #2

status
  faculty:
    dept
      cs: size …
      ee: no
  student:
    dept
      ee: ?
      cs: size …

Making a tree

How can we build a decision tree (that might be good)?

Objective: an algorithm that builds a decision tree from the root down.

Each node in the decision tree is associated with a set of training examples that are split among its children.

Procedure: Buildtree

If all of the training examples are in the same class, then quit;

else:

1. Choose an attribute to split the examples.

2. Create a new child node for each attribute value.

3. Redistribute the examples among the children according to the attribute values.

4. Apply Buildtree to each child node.
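The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the full ID3 algorithm: the attribute to split on is simply the next one in the list rather than the most informative one, and the data is the office-size example from the slides.

```python
from collections import Counter

def buildtree(examples, attributes):
    """Recursive sketch of Buildtree. examples: list of (attrs_dict, label)."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:              # all in the same class: quit
        return labels[0]
    if not attributes:                     # nothing left to split on: majority
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                   # naive choice; ID3 would pick the
                                           # most informative attribute here
    children = {}
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        children[value] = buildtree(subset, attributes[1:])
    return (attr, children)

def classify(tree, x):
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[x[attr]]
    return tree

# Office-size training examples from the slides
data = [({"size": "large", "dept": "cs", "status": "faculty"}, "yes"),
        ({"size": "large", "dept": "ee", "status": "faculty"}, "no"),
        ({"size": "large", "dept": "cs", "status": "student"}, "yes"),
        ({"size": "small", "dept": "cs", "status": "faculty"}, "no"),
        ({"size": "small", "dept": "cs", "status": "student"}, "no")]
tree = buildtree(data, ["size", "dept"])
print(classify(tree, {"size": "large", "dept": "cs"}))   # yes
```

Splitting on size first reproduces decision tree #1: the small branch is pure ("no"), and the large branch needs one more question about the department.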

A Bad tree

• To identify an animal (goat, dog, housecat, tiger):

• Is it a dog?

• Is it a housecat?

• Is it a tiger?

• Is it a goat?

• A good tree?

• Is it a cat? (if yes, what kind?)

• Is it a dog?

• Max depth 2 questions.


Best Property

• Need to select property / feature / attribute

• Goal: find a short tree (Occam's razor)
– Base this on MAXIMUM depth

• Select the most informative feature
– One that best splits (classifies) the examples

Entropy

• Measures the (im)purity in a collection S of examples

• Entropy(S) = − p+ log2(p+) − p− log2(p−)

• p+ is the proportion of positive examples.

• p− is the proportion of negative examples.

• Learning
– Decision trees
  • Building them
  • Building good ones
– Sub-symbolic learning
  • Neural networks




A “Bad” tree

• To identify an animal (goat, dog, housecat, tiger):

[Tree: Is it a wolf? yes → wolf; no → Is it in the cat family? no → dog; yes → Is it a tiger? yes → tiger; no → cat.]

• Max depth 3.

• To get to fish or goat, it takes three questions.

• In general, a bad tree for N categories can take N questions.

• Can’t we do better? A good tree?

• Max depth 2 questions. More generally, about log2(N) questions.

[Good tree: Cat family? yes → Tiger?; no → Dog?]


Best Property

• Need to select property / feature / attribute

• Goal: find a short tree (Occam's razor)
1. Base this on MAXIMUM depth
2. Base this on AVERAGE depth
   A) over all leaves
   B) over all queries

• Select the most informative feature

Optimizing the tree

All based on Buildtree.

To minimize maximum depth, we want to build a balanced tree.

• Put the training set (TS) into any order.

• For each question Q:
– Construct a K-tuple of 0s and 1s.
  • The jth entry in the tuple is
  – 1 if the jth instance in the TS has answer YES to Q
  – 0 if it has answer NO

Min Max Depth

• Minimize max depth:

• At each query, come as close as possible to cutting the number of samples in the subtree in half.

• This suggests the number of questions per subtree is given by the log2 of the number of sample categories to be subdivided.

Entropy

Measures the (im)purity in a collection S of examples:

Entropy(S) = − [ p+ log2(p+) + p− log2(p−) ]

• p+ is the proportion of positive examples.

• p− is the proportion of negative examples.

Example

• S, 14 examples, 9 positive, 5 negative

Entropy([9+,5-]) =

-(9/14) log2(9/14) - (5/14) log2(5/14) =

0.940
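The calculation above is easy to check in code; a minimal sketch:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for p in (pos / total, neg / total):
        if p > 0:                       # 0 * log2(0) is taken to be 0
            h -= p * math.log2(p)
    return h

print(round(entropy(9, 5), 3))   # 0.94, as in the example
print(entropy(7, 7))             # 1.0 for an even split
print(entropy(14, 0))            # 0.0 for a pure collection
```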

Intuition / Extremes

• Entropy of a collection is zero if all examples are in the same class.

• Entropy is 1 if there are equal numbers of positive and negative examples.

Intuition:

If you pick a random example, how many bits do you need to specify what class the example belongs to?


Entropy: definition

• Often referred to as “randomness”.

• How useful is a question:
– How much guessing does knowing the answer save?

• How much “surprise” value is there in a question?

Information Gain

General definition:

Entropy(S) = − ∑_{i=1}^{c} p_i log2(p_i)

• In this lecture we consider some alternative hypothesis spaces based on continuous functions. Consider the following boolean circuit.

[Circuit diagram: inputs x1, x2, x3 wired through NOT, AND, and OR gates to produce f(x1, x2, x3).]

The topology is fixed and the logic elements are fixed, so there is a single Boolean function.

Is there a fixed topology that can be used to represent a family of functions?

Yes! Neural-like networks (a.k.a. artificial neural networks) allow us this flexibility.

The idealized neuron

• Artificial neural networks come in several “flavors”.
– Most are based on a simplified model of a neuron:

• A set of (many) inputs.

• One output.

• The output is a function of the weighted sum of the inputs.


Today’s Lecture

• Tangential questions (warm up)

• Administrative Details

• Learning
– Decision trees: cleanup & details
– Sub-symbolic learning
  • Neural networks

• Why do I use an Apple Macintosh?
– Who cares
– It’s elegant
– I respect Apple’s innovation.

• On learning...

Administrativia

• Please sign up AGAIN using the web page.
– There was a bug in the CGI/HTML connection.

• Note that the normal late policy will not apply to the project.
– You **must** submit the electronic (executable) version on time, or it may not be evaluated!
– It must run on LINUX. Be certain to compile and test it on one of the linux machines in the lab.

ID3

Consider the information implicit in a query about a set of examples.
– This provides the total amount of information implicit in a decision tree.
– Each question along the tree provides some fraction of this total information.

• How much?

• Consider the information gain of an attribute A on a set of examples S:

Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

– the information still needed to complete the tree is the weighted sum over the subtrees.
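Computing the gain for each attribute of the office-size examples shows why tree #1 splits on size first; a small self-contained sketch (the numbers are computed, not from the slides):

```python
import math

def entropy(labels):
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * math.log2(p)
    return h

def gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    where examples is a list of (attrs_dict, label) pairs."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        sub = [y for x, y in examples if x[attr] == v]
        remainder += len(sub) / len(examples) * entropy(sub)
    return entropy(labels) - remainder

# Office-size examples from the earlier slide
data = [({"size": "large", "dept": "cs", "status": "faculty"}, "yes"),
        ({"size": "large", "dept": "ee", "status": "faculty"}, "no"),
        ({"size": "large", "dept": "cs", "status": "student"}, "yes"),
        ({"size": "small", "dept": "cs", "status": "faculty"}, "no"),
        ({"size": "small", "dept": "cs", "status": "student"}, "no")]
for a in ("size", "dept", "status"):
    print(a, round(gain(data, a), 3))
```

Size comes out highest (about 0.42 bits, versus roughly 0.17 for dept and 0.02 for status), so ID3 would choose it as the root.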

Is there a fixed circuit network topology that can be used to represent a family of functions?

Yes! Neural-like networks (a.k.a. artificial neural networks) allow us this flexibility and more; we can represent arbitrary families of continuous functions using a fixed topology.

Neural Networks?

Artificial Neural Nets
a.k.a. Connectionist Nets (connectionist learning)
a.k.a. Sub-symbolic learning
a.k.a. Perceptron learning (a special case)



Why neural nets?

• Motives:
– We wish to create systems with abilities akin to those of the human mind.
  • The mind is usually assumed to be a direct consequence of the structure of the brain.
– Let’s mimic the structure of the brain!
– By using simple computing elements, we obtain a system that might scale up easily.

Real and fake neurons

• Signals in neurons are coded by “spike rate”.

• In ANNs, inputs can be either:
– 0 or 1 (binary)
– [0,1]
– [-1,1]
– R (real)

• Each input Ii has an associated weight.



Inductive bias?

• Where’s the inductive bias?
– In the topology and architecture of the network.

– In the learning rules.

– In the input and output representation.

– In the initial weights.

Simple neural models

• The oldest ANN model is the McCulloch-Pitts neuron [1943].
– Inputs are +1 or -1, with real-valued weights.
– If the sum of the weighted inputs is > 0, then the neuron “fires” and gives +1 as output.
– Showed you can compute logical functions.
– The relation to learning was proposed (later!) by Donald Hebb [1949].

• Perceptron model [Rosenblatt, 1958].
– Single-layer network with the same kind of neuron.

Perceptron nets

[Figure: a perceptron network (input units Ij, weights Wj,i, output units Oi) alongside a single perceptron (inputs Ij, weights Wj, one output O).]

Perceptron learning

• Perceptron learning:
– Have a set of training examples (TS) encoded as input values (i.e., in the form of binary vectors).
– Have a set of desired output values associated with these inputs.
  • This is supervised learning.
– Problem: how to adjust the weights to make the actual outputs match the training examples.
  • NOTE: we do not allow the topology to change! [You should be thinking of a question here.]

Learning algorithm

• Desired output Ti; actual output Oi.

• Weight update formula (weight from unit j to unit i):

Wj,i = Wj,i + k * xj * (Ti - Oi)

where k is the learning rate.

• If the examples can be learned (encoded), then the perceptron learning rule will find the weights.
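The update rule is short enough to run directly. The sketch below (illustrative; the learning rate and epoch count are arbitrary choices) learns AND, a linearly separable function, with a bias weight folded in as a constant input.

```python
# Perceptron learning rule: w_j <- w_j + k * x_j * (T - O),
# with a bias weight treated as weight for a constant input 1.
def train_perceptron(examples, k=0.1, epochs=100):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                       # w[0] is the bias weight
    for _ in range(epochs):
        for x, target in examples:
            xs = [1] + list(x)
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            w = [wi + k * xi * (target - o) for wi, xi in zip(w, xs)]
    return w

def predict(w, x):
    xs = [1] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0

# AND is linearly separable, so the rule converges to correct weights.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_examples)
print([predict(w, x) for x, _ in and_examples])   # [0, 0, 0, 1]
```

Replacing the targets with XOR's would leave the loop cycling forever without converging, which previews the limitation discussed next.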

Perceptrons: what can they learn?

• Only linearly separable functions [Minsky & Papert 1969].

[Figure: I1–I2 plots of (a) AND, (b) OR, (c) XOR; the positive and negative points of AND and OR can be separated by a line, those of XOR cannot.]


More general networks

• Generalize in 3 ways:
– Allow continuous output values [0,1].
– Allow multiple layers.
  • This is key to learning a larger class of functions.
– Allow a more complicated function than thresholded summation. [why??]

Generalize the learning rule to accommodate this: let’s see how it works.

The threshold

• The key variant:
– Change the threshold into a differentiable function.
– Sigmoid, known as a “soft non-linearity” (silly), applied to the weighted sum M = ∑ xi wi.

Today’s Lecture

• Neural networks
– Training
  • Backpropagation of error (backprop)
– Example
– Radial basis functions

Recall: training

For a single input-output layer, we could adjust the weights to get linear classification.
– The perceptron computes a hyperplane over the space defined by the inputs.
  • This is known as a linear classifier.

• By stacking layers, we can compute a wider range of functions.

[Figure: a network with inputs, a hidden layer, and an output; the error derivative is computed at the output.]

• “Train” the weights to correctly classify a set of examples (TS: the training set).

• Started with the perceptron, which used summing and a step function, and binary inputs and outputs.

• Embellished by allowing continuous activations and a more complex “threshold”.

The Gaussian

• Another continuous, differentiable function that is commonly used is the Gaussian function:

Gaussian(x) = e^(−x² / 2σ²)

• where σ is the width of the Gaussian.

• The Gaussian is a continuous, differentiable, localized function: a smooth bump (the sigmoid, by contrast, is a smooth step).


What is learning?

• For a fixed set of weights w1,...,wn,

f(x1,...,xn) = σ(x1 w1 + ... + xn wn)

represents a particular scalar function of n variables.

• If we allow the weights to vary, then we can represent a family of scalar functions of n variables:

F(x1,...,xn,w1,...,wn) = σ(x1 w1 + ... + xn wn)

Basis functions

• Here is another family of functions. In this case, the family is defined by a linear combination of basis functions g1, g2, ..., gn.

The input x could be scalar or vector valued.

F(x, w1,...,wn) = w1 g1(x) + ... + wn gn(x)

Combining basis functions

We can build a network as follows:

g1(x) --- w1 ---\
g2(x) --- w2 ----\
 ...              ∑ --- f(x)
gn(x) --- wn ---/

E.g., from the basis {1, x, x²} we can build quadratics:
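The quadratic case can be made concrete: the sketch below recovers the weights of an assumed target f(x) = 2x² − 3x + 1 from three samples by solving F(xi, w) = yi as a small linear system (the target function and sample points are invented for illustration).

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

# Basis {1, x, x^2}; samples of the assumed target f(x) = 2x^2 - 3x + 1
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs, ys = [0.0, 1.0, 2.0], [1.0, 0.0, 3.0]
A = [[g(x) for g in basis] for x in xs]
w = solve(A, ys)
print([round(wi, 6) for wi in w])   # [1.0, -3.0, 2.0]
```

With noisy data or more samples than weights, one would use least squares instead of exact interpolation, but the weights still enter linearly, which is what makes basis-function networks easy to train.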

Receptive Field

• This can be generalized to an arbitrary vector space (e.g., Rn).

• Often used to model what are called “localized receptive fields” in biological learning theory.
– Such receptive fields are specially designed to represent the output of a learned function on a small portion of the input space.
– How would you approximate an arbitrary continuous function using a sum of Gaussians or other localized bumps?

Backprop

• Consider sigmoid activation functions.

• We can examine the output of the net as a function of the weights.
– How does the output change with changes in the weights?
– Linear analysis: consider the partial derivative of the output with respect to the weight(s).
  • We saw this last lecture.
– If we have multiple layers, consider the effect on each layer as a function of the preceding one.

Backprop observations

• We can do gradient descent in weight space.

• What is the dimensionality of this space?
– Very high: each weight is a free variable.
  • There are as many dimensions as weights.
  • A “typical” net might have hundreds of weights.

• Can we find the minimum?
– It turns out that for multi-layer networks, the error surface (often called the “energy” of the network) is NOT CONVEX. [so?]
– Commonest approach: multiple-restart gradient descent.


Success? Stopping?

• We have a training algorithm (backprop).

• We might like to ask:
1. Have we done enough training (yet)?
2. How good is our network at solving the problem?
3. Should we try again to learn the problem (from the beginning)?

• The first 2 questions have standard answers:
– Can’t just look at the energy. Why not?
  • Because we want to GENERALIZE across examples. “I understand multiplication: I know 3*6=18, 5*4=20.”

What can we learn?

• For any mapping from input to output units,we can learn it if we have enough hiddenunits with the right weights!

• In practice, many weights means difficulty.

• The right representation is critical!

• Generalization depends on bias.
– The hidden units form an internal representation of the problem.

Representation

• Much learning can be equated with selecting a good problem representation.
– If we have the right hidden layer, things become easy.

• Consider the problem of face recognition from photographs. Or fingerprints.
– Digitized photos: a big array (256x256 or 512x512) of intensities.

– How do we match one array to another?

Faces (an example)

• What is an important property to measure for faces?
– Eye distance?

– Average intensity?
• BAD!

– Nose width?

– Forehead height?

• These measurements form the basis functions for describing faces.

BUT NOT NECESSARILY photographs!!!

Radial basis functions

• Use “blobs” summed together to create an arbitrary function.
– A good kind of blob is a Gaussian: circular, variable width, can be easily generalized to 2D, 3D, ....
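A sketch of the idea, assuming nothing beyond the slide: place Gaussian “blobs” at fixed centers and solve (by least squares) for the weights that best reproduce a target function. The target, centers, and widths are illustrative choices:

```python
import numpy as np

def rbf(x, centers, width):
    """Design matrix of Gaussian basis functions evaluated at points x."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * x)          # the function to approximate

centers = np.linspace(0.0, 1.0, 10)     # evenly spaced "receptive fields"
Phi = rbf(x, centers, width=0.1)
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # fit the blob weights

approx = Phi @ w
max_err = float(np.max(np.abs(approx - target)))
```

Each basis function only responds on a small portion of the input space, which is exactly the “localized receptive field” picture.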

Topology changes

• Can we get by with fewer connections?

• When every neuron in one layer is connected to every neuron in the next layer, we call the network fully-connected.

• What if we allow signals to flow backwards to a preceding layer? Recurrent networks.


Today’s Lecture
• Neural networks

– Backprop example

• Clustering & classification: case study
– Sound classification: the tapper

• Recurrent nets
• Nettalk

• Glovetalk
– video

• Radial basis functions

• Unsupervised learning

Backprop demo (not in text)

• Consider the problem of learning to recognize handwritten digits.
– Each digit is a small picture.

– Sample locations in the picture: these are pixels (picture elements).

– A grid of pixels can be used as input to a network.

– Each digit is a training example.

– Use multiple output units, one to classify each possible digit.

Not intext

Clustering & recognition

• How do we recognize things?
– By learning salient features.

• That’s what the hidden layer is doing.

• Another aspect of this is clustering together data that is associated.
1. What do you observe (what features to extract)?

2. How do you measure similarity in “feature space”?

3. What do you measure with respect to?

The tapper: a case study (not in text)

See overheads…..

Not intext

Temporal features?

• Question: how do you deal with time-varying inputs?

• We can represent time explicitly or implicitly.

• The tapper represented time explicitly by recording a signal as a function of time.
– Analysis is then static, on a signal f(t) where t spans a fixed recording window.

Difficulties of explicit time (not in text)

• It is storage intensive (more units; more signal must be stored and manipulated).

• It is “weight intensive”: more units, more weights.

• You must determine a priori how large the temporal sampling window must be!

So what? More weights -> harder training (less bias).

Not intext

Page 59: Topics in Artificial Intelligence Lecture outline - McGill CIMdudek/424/lectures-all.pdf · Why AI (in it’s broadest ... • Marvin Minsky: Popularized Neural Nets and showed limits

59

Topology changes

• Can we get by with fewer connections?

• When every neuron from one layer isconnected to every layer in the next layer,we call the network fully-connected.

• What if we allow signals to flowbackwards to a preceding layer?Recurrent networks

Recurrent nets

• Simplest instance:
– Allow signals to flow in various directions between layers.

• Can now trivially compute

f(t) = f(x) + 0.5 · f(t−1)

Time is implicit in the topology and weights.

Difficult to avoid blurring in time, however.
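Reading the slide’s recurrence as y(t) = x(t) + 0.5·y(t−1) (a single unit with a self-loop; this reading is an assumption, since the slide’s notation is ambiguous), both the implicit memory and the temporal blurring are easy to see in a toy sketch:

```python
def recurrent_unit(xs, feedback=0.5):
    """Single unit with a self-loop: y(t) = x(t) + feedback * y(t-1)."""
    y, out = 0.0, []
    for x in xs:
        y = x + feedback * y
        out.append(y)
    return out

# An impulse at t = 0 echoes through later time steps:
# the unit remembers the input, but also smears it out in time.
response = recurrent_unit([1.0, 0.0, 0.0, 0.0])
# response == [1.0, 0.5, 0.25, 0.125]
```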

Example: nettalk

• Learn to read English: output is audio speech.

[Audio demo: first-grade text, before & after training.]

[Audio demo: dictionary text, getting progressively better.]

Today’s Lecture
• Administrative:

– Midterm

– Assignment 4

• Recurrent nets
• Nettalk (continued)

• Glovetalk– video

• Radial basis functions (RBF’s)

• Unsupervised learning– Kohonen nets

Not intext


NETtalk details

Reference: Sejnowski & Rosenberg, 1986, Cognitive Science, 14, pp.179-211.

Mapped English text to phonemes. Phonemes were then converted to sound by a separate engine.

• 3-layer network trained using backprop
– 203-80-26 MLP (multi-layer perceptron)

• 203 input units

– Encoded 7 consecutive characters.
• A window in time before and after the current sound.


Example: glovetalk

• Input: gestures

• Output: speech

• Akin to reading sign language, but HIGHLY simplified.
– Input is encoded as electrical signals from, essentially, a Nintendo Power Glove.
• Various joint angles as a function of time.

Today’s Lecture

• Learning (wrap-up)
– Reinforcement learning

– Unsupervised learning
• Kohonen nets

• Planning

Not intext

Reinforcement Learning

• So far, we had a well-defined set of training examples.

What if feedback is not so clear?

E.g., when playing a game, the only feedback comes after many actions:

the final result is a win, loss, or draw.

• In general, an agent is exploring its environment.

DAA pp. 231

• Issue: delayed rewards / feedback.
– Exacerbates the credit assignment problem.

ASK NOW if you don’t recall what this is!

• Field: reinforcement learning

• Main success: Tesauro's backgammon player (TD Gammon).

Illustration

Imagine an agent wandering around in an environment.

• How does it learn the utility values of each state?

• (i.e., what are good / bad states? avoid bad ones...)

Reinforcement learning

• Compare: in a backgammon game, states = boards; the only clear feedback is in final states (win/loss).
(We will assume, for now, that we haven’t cooked up a good heuristic evaluation function.)

We want to know the utility of the other states.

• Intuitively: utility = chance of winning.


Policy

• The key thing we want to learn is a policy.

• Jargon for the “table” that associates actions with states.

– It can be non-deterministic.

• Learning the expected utility of a specific state is a closely related problem.

How: strategies

• Three strategies:

• (a) ``Sampling'' (naive updating)

• (b) ``Calculation'' / ``equation solving''
– (adaptive dynamic programming)

• (c) ``in between (a) and (b)''
(temporal difference learning --- TD learning)

Naive updating

• Comes from adaptive control theory.

• (a) ``Sampling'' --- the agent makes random runs through the environment; collect statistics on the final payoff for each state (e.g. when at (2,3), how often do you reach +1 vs. -1?)

• The learning algorithm keeps a running average for each state. Provably converges to the true expected values.
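A sketch of naive updating on a toy problem (a 1-D random walk rather than the grid world of the figures; the states, payoffs, and run counts are illustrative):

```python
import random

def naive_updating(n_runs=5000, seed=1):
    """Random walks on a chain of states 0..4; state 0 pays -1, state 4
    pays +1. Keep a running average of final payoff per visited state."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in range(5)}
    counts = {s: 0 for s in range(5)}
    for _ in range(n_runs):
        s, visited = 2, set()
        while s not in (0, 4):
            visited.add(s)
            s += rng.choice((-1, 1))
        payoff = -1.0 if s == 0 else 1.0
        for v in visited:
            totals[v] += payoff
            counts[v] += 1
    return {s: totals[s] / counts[s] for s in range(5) if counts[s]}

utilities = naive_updating()
# True utilities here are U(1) = -0.5, U(2) = 0.0, U(3) = +0.5.
```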

Example: Reinforcement

[Figure: a 4×3 grid world with a START state and a +1 terminal state; the arcs are labelled with transition probabilities (.5, .33).]



Stochastic transition network

[Figure: the grid world drawn as a stochastic transition network. Panels (a) and (b) label the arcs with transition probabilities (.33, .5, 1.0); panel (c) shows the estimated utilities of the states: 0.0380, 0.0380, 0.0886, 0.2152, 0.1646, 0.2911, 0.4430, 0.5443, 0.7722.]

Issues

• Main drawback: slow convergence.
See next figure.

• In a relatively small world it takes the agent over 1000 sequences to get a reasonably small (< 0.1) root-mean-square error compared with the true expected values.

• Question: Is sampling necessary?

• Can we do something completely different?

Temporal difference learning

• Combine ``sampling'' with ``calculation'', or, stated differently: use a sampling approach to solve the set of equations.

• Consider the transitions observed by a wandering agent.

• Use an observed transition to adjust the utilities of the observed states, to bring them closer together.

Temporal difference learning

• U(i): utility of state i

• R(i): reward in state i

• When observing a transition from i to j, bring the U(i) value closer to that of U(j).

• Use the update rule:

• U(i) <- U(i) + k (R(i) + U(j) - U(i))

• k is the learning-rate parameter.

• The rule is called the temporal difference or TD update rule.
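The update rule drops directly into code. A minimal sketch (the state names and the repeated single transition are illustrative):

```python
def td_update(U, R, i, j, k=0.1):
    """One TD step after observing a transition i -> j:
    U(i) <- U(i) + k * (R(i) + U(j) - U(i))."""
    U[i] += k * (R[i] + U[j] - U[i])

# Repeatedly observing 'a' -> 'goal' pulls U('a') toward R('a') + U('goal').
U = {"a": 0.0, "goal": 1.0}
R = {"a": 0.0, "goal": 0.0}
for _ in range(100):
    td_update(U, R, "a", "goal")
# U["a"] is now very close to 1.0.
```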

Learning by exploration

• Issue raised in text: What if we need to (somehow) move in state space to acquire data?

• Only then can we learn.

We have 3 coupled problems:
– 1. Where are we (in state space)?

– 2. Where should we go next?

– 3. How do we get there?
Skipping #2 & #3 gives “passive learning”, as opposed to “active learning”.

Mobile Robotics Research

• Mobile Robotics exemplifies these issues.

• The canonical problems:

Where am I (position estimation)

How do I get there (path planning)

Mapping and exploration


Where am I (position estimation): pose estimation via vision, sonar, GPS

How do I get there (path planning): dynamic programming, A*

Mapping and exploration (occupancy and uncertainty): geometry, occupancy grids, graphs, uncertainty

Unsupervised learning

• Given a set of input examples, can we categorize them into meaningful groups?

• This appears to be a common aspect of human intelligence.
– A precursor to recognition, hypothesis formation, etc.

• Several mechanisms can be used.

Unsupervised learning by clustering

Two important ideas:

1. Self-organization: let the learning system re-arrange its topology or coverage of the function space.
We saw some of this in the tapper.

We saw some of this with backprop.

2. Competitive learning: aspects of the representation compete to see which gives the best explanation of the data.

Kohonen nets (not in text)

• AKA Kohonen feature maps

• A graph G that encodes learning
– Think of vertices as embedded in a metric space, e.g. the plane.

– Inputs are represented as points in an N-dimensional space

• based on the number of degrees of freedom (number of parameters).

• Every node is connected to all inputs.

Not intext

Objective

• We want the nodes of our graph to distribute themselves to “explain” the data.
– Nodes that are close to each other are associated with related examples.

– Nodes that are distant from one another are related to unrelated examples.

• This leads to an associative memory
– A storage device that allows us to find items that are related to one another.

Kohonen: intuition

• Two key ideas:

1. Let nodes represent points in “example space”.
– Let them learn a way of covering/explaining the data.

2. Let nodes cover the space “suitably”,

using both competition with each other and cooperation with their neighbours.


Kohonen learning

We are given a set of nodes nⱼ and a set of examples sᵢ.

For each training example sᵢ:

Compute the distance d between sᵢ and each node nⱼ, described by weights wⱼ:

d = sqrt( Σ (sᵢ − wⱼ)² )

Find the node n with minimum distance.

For each node close to n, move its weights toward sᵢ.
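A sketch of the whole loop, assuming a 1-D chain of nodes in a 2-D input space (the cluster positions, learning rate, and fixed neighbourhood are illustrative choices):

```python
import math
import random

def kohonen_step(nodes, s, rate=0.2, neighborhood=1):
    """One update: find the winning (closest) node, then move it and its
    chain neighbours (adjacent indices) a fraction of the way toward s."""
    dists = [math.dist(w, s) for w in nodes]
    win = dists.index(min(dists))
    for j, w in enumerate(nodes):
        if abs(j - win) <= neighborhood:        # "close to" the winner
            nodes[j] = [wi + rate * (si - wi) for wi, si in zip(w, s)]
    return win

rng = random.Random(0)
nodes = [[rng.random(), rng.random()] for _ in range(5)]
for _ in range(500):
    # Examples drawn from two tight clusters near (0,0) and (1,1).
    c = rng.choice(([0.0, 0.0], [1.0, 1.0]))
    s = [c[0] + 0.05 * rng.random(), c[1] + 0.05 * rng.random()]
    kohonen_step(nodes, s)
```

After training, some nodes sit near each cluster: the graph has re-arranged itself to “explain” the data.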

Kohonen parameters

We need two functions:

• (Weighted) distance from nodes to inputs

• Distance between nodes

and two constants:
– A threshold K on inter-node distance

– A learning rate c

Kohonen example

• Here’s the demo you’ll use for assignment 4

Kohonen: issues?

• Exploits both the distribution of examples and their frequency to establish the mapping.

• Accomplishes clustering and recognition.

• May serve as a model of biological associative memory.

• Can project a high-dimensional space to a lower-dimensional comparison space: dimensionality reduction.

Planning: the idea [DAA:ch 7]

A systematic approach to achieving goals.

Examples:
– Getting home

– Surviving the term

– Fixing your car

Optimizing vs. Satisficing.

Planning & search

• Many planning problems can be viewed as search:
– 1.
• States are world configurations.

• Operations are actions that can be taken.

• Find a set of intermediate states from the current to the goal world configuration.

– 2.
• States are plans.

• Operations are changes to the sequence of possible actions.


Planning differs from search

• Planning has several attributes that distinguish it from basic search.

• Conditional plans: plans that explicitly indicate observations that can be made, and actions that depend on those observations.
– If the front of MC is locked, go to the Chem building.

– If she takes my rook, then capture the Queen.

• Situated activity: respond to things as they develop.

Planning

Like problem solving (search), but with key differences:

• Can decompose a big problem into potentially independent sub-parts that can be solved separately.

• May have incomplete state descriptions at times.
– E.g. when to tie one’s shoes while walking.

– Tie-shoes should be done whenever they are untied, without regard to other aspects of the state description.

• Direct connection between actions and components of the current state description.

Not intext

Planning: general approach

• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.

• Have a goal state specification and an initial state.

• Use a special-purpose planner to search for a solution.

Basic formalism

• Basic logical formalism derived from STRIPS.

• State variables determine what actions can or should be taken: in this context they are conditions
– Shoe_untied()

– Door_open(MC)

Today’s Lecture

• Planning




Going forwards

• All state variables are true or false, but some may not be defined at a certain point in our plan.

State progression.
A planner based on this is a progression planner.

Idea: in a state S,

we can apply an operator X = (P, A, D): preconditions, additions, deletions.

This leads to a new state T.
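Progression is a one-line set operation. A sketch with a made-up operator in the style of the slides' conditions (the operator and state contents are hypothetical):

```python
def applicable(state, op):
    """An operator X = (P, A, D) applies when its preconditions P hold."""
    pre, add, delete = op
    return pre <= state

def progress(state, op):
    """Progression: T = (S - D) | A, assuming nothing else changes."""
    pre, add, delete = op
    assert pre <= state, "preconditions not satisfied"
    return (state - delete) | add

# Hypothetical operator: going home from the office.
go_home = (frozenset({"At(office)"}),    # P: preconditions
           frozenset({"At(home)"}),      # A: additions
           frozenset({"At(office)"}))    # D: deletions

s = frozenset({"At(office)", "Have(Cash)"})
t = progress(s, go_home)
# t == {"At(home)", "Have(Cash)"}
```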

Constancy

• Important caveat:

• When we go from one state to another,

we assume that the only changes were those that resulted explicitly from the additions and deletions.

Aside: FOL with time

• One approach is a variation of first-order logic called the situation calculus [McCarthy].
– Events take place at specific times.

– Some predicates are fluents and only apply for certain ranges in time.

– A situation is a temporal interval over which all the predicates remain fixed.

– This material from Ch. 6 will largely be skipped. We will cover, or have covered, 6.6 on in class.

Going backwards

• Remember backwards chaining?

• Start at the goal G.

• Assuming the deletions aren’t there for some operator X
– Why?

• We can chain backwards by adding what would have been deleted and removing what would have been added.
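Chaining backwards is the mirror image of progression: restore what the operator deleted, remove what it added, and require its preconditions. A sketch using a hypothetical shoe-tying operator built from the slides' example conditions:

```python
def regress(state, op):
    """Invert one progression step for operator op = (P, A, D):
    S = (T - A) | D | P  (the preconditions must have held before)."""
    pre, add, delete = op
    return (state - add) | delete | pre

tie = (frozenset({"Shoe_untied()"}),   # P: preconditions
       frozenset({"Shoe_tied()"}),     # A: additions
       frozenset({"Shoe_untied()"}))   # D: deletions

after = frozenset({"Shoe_tied()", "At(office)"})
before = regress(after, tie)
# before == {"Shoe_untied()", "At(office)"}
```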

Means/ends analysis

• How can we get from initial to final?
– Assume the states and operators are given.

– What’s the right path? How do we measure distance?

• Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.


STRIPS

• STRIPS is an old planning language
• STanford Research Institute Problem Solver.

– Less expressive than situation calculus

– Initial state:

At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)

– Goal state:
At(Home) & Have(Video) & Have(Cooked-Popcorn)

Schemas

• Basic operators assume a complete specification of the state in which they are applied.

• This can be tedious
– An operator schema is a “generic” operator that has variables in it

• Related to axiom schemas

• Related to unification in logic (e.g. prolog)

Least Commitment Planning

• When we formulate a plan intuitively, we often think of doing things in a specific sequence, even when the sequencing is arbitrary.
– This may not be wise.

• This can lead to re-shuffling actions... which is undesirable.

Partially ordered plan

[Figure: a partial-order plan over steps A through G; only some pairs of steps are ordered.]

Terminology

• Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan.

For a plan:
• Sound
– Plan steps obey constraints on sequencing

– Successful

• Systematic
– Doesn’t “waste” effort

• Complete
– Generates a plan if one exists.

Links & Conflicts

[Figure: a producer step linked to a consumer step; a third step, the “clobberer”, threatens the link.]

A conflict involves a link, & a step that messes it up.


Refinement

Fix conflicts by creating a new plan from an old one.
– Keep old structures (links, producers, consumers, constraints) but add new constraints.

• If there are conflicts, resolve them by adding constraints: move a clobberer before or after the link it’s hitting
– (if you can).

Applications of planning

• Planning for Shakey the robot– Climb boxes

– Push things

– Move around

• Blocks world– Moving blocks

– Piling them onto one another

– Clearing the tops of chosen blocks

Configuration Space Planning Issues

Today’s Lecture
• Computational Vision
– Images

– Image formation in brief

– Image processing: filtering
• Linear filters

• Non-linear operations

• Signatures

• Edges

– Image interpretation
• Edge extraction

• Grouping

– Scene recovery

Items in blue will (may) be covered later.

What’s an image?

• The ideal case:
– We have a continuous world (at macroscopic scales).

– We have continuous images of that world.

– Images are 2-dimensional projections of a three-dimensional world.

• In addition, there are other key factors in the world that determine the image:

– Object reflectance properties (white or gray shirt?)

– Light source position

» Alters intensities (day/night, shading & chiaroscuro)

» Alters shadows


Digital images: synopsis
• A 2D continuous image a(x,y) is divided into N rows and M columns.
– The intersection of a row and a column is termed a pixel.

– The value assigned to integer coordinates [m,n], with m=0,1,2,...,M-1 and n=0,1,2,...,N-1, is a[m,n].
• In fact, in most cases a(x,y) is a function of many variables including depth z, color λ, and time t.
• I.e. we really have a(x, y, z, λ, t).
• Most vision deals with the case of 2D, monochromatic (“black and white”), static images.

Image formation

Key processes:
1. Light comes from a source

2. Hits objects in the scene

3. Is reflected towards
A) other objects [return to step 2]

B) the camera lens

4. Passes through the lens, being focussed on the imaging surface

5. Interacts with transducers on the surface to produce electrical signals

In the eye, the lens is that of the eye, the imaging surface is the retina, and the transducers are rods and cones.

The Human Visual System

• Digression on the biology of the early human visual system.

(On blackboard: no notes available.)

Vision, image processing, and AI

• Image processing:
– Image -> image

– Often characterized as data reduction, filtering

– Signal-level transformation

• Traditional AI:
– predicates -> predicates

– What’s hidden in the data: inference, “data mining”

– Symbol-level transformation
• (“Non-traditional” AI: neural nets, uncertainty, etc.)

Vision problems
A) Images -> scenes

Known as “shape-from” or “shape from X”. Examples:
• Recovery of scene structure from a sequence of pictures: shape-from-motion

• Recovery of scene structure from shading: shape-from-shading

• Recovery of scene structure from how shadows are cast: shape-from-shadows (actually called “shape-from-darkness”)

• Recovery of shape from changes in texture: shape-from-texture

B) Images -> predicates
Several variations, generally less mature.

Object recognition, functional interpretation, support relations.

What we want/need is (B), but (A) seems “easier” or is a natural prerequisite.

What is vision?

In general, vision involves the recovery of all those things that determine the world:
– Material properties, shadows, etc.

as well as the functional and categorical relationships between or pertaining to objects!


Image Processing

• Better understood than vision.

• Produce new images or arrays of data without worrying (much) about “interpretations.”

[Figure: an input image passes through an image-processing “operator” (often a form of filtering) to produce an output image.]

Filtering

• Given an input signal f(t) we can compute a transformed description g(t).
– Key requirement: the dimensionality of the domain and range is the same.

• This transformed signal is derived from f(t) by the application of either linear (multiplication/addition) or non-linear operators. E.g.

– g(t) = f(t) + f(t-1) + f(t+1) [linear]

Filtering in 2D

• Note that filtering applies in essentially the same way in
– 1-D signals,

– 2-D images,

– or even higher-dimensional spaces.

Filters: more flavors

• An additional key characterization is the degree of locality of the filter.
– Does it look

• at a single point?

• at a region... and is the region symmetric?

• at everything?

Convolution

• For vision and image processing, the most important class of filtering operation is convolution.
– This is almost the same as correlation.

– Convolution for 2 signals:

– c(t) = a(t) * b(t)

– c(x,y) = a(x,y) * b(x,y)

For discrete signals the integral becomes a sum: c[n] = Σₖ a[k] b[n−k].

Convolution: specifics

• Typically we convolve a signal with a kernel.

• Note that convolution is distributive, associative, and commutative.
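A minimal discrete 1-D convolution makes those properties easy to check directly (the signals are arbitrary illustrations):

```python
def convolve(signal, kernel):
    """Discrete 1-D convolution: c[n] = sum_k signal[k] * kernel[n - k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

# Convolving with an impulse is the identity.
assert convolve([3.0, 1.0, 2.0], [1.0]) == [3.0, 1.0, 2.0]

# A box kernel blurs: each output mixes neighbouring inputs.
blurred = convolve([0.0, 0.0, 1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3])
```

The same double loop generalizes to 2-D by summing over a 2-D kernel, which is how image blurring and edge operators are applied.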

Impulse


Sample Image

E.g. Blurring

• Convolve the input with a kernel that combines information from a range of spatial locations.

• What is the precise shape of the kernel?

• Why might this be useful?

[Figure: blurred picture]

Edges

• Boundaries are thought to be critical to image interpretation.
– Why do cartoons look as reasonable as they do?

– Idea: detect the boundaries of objects in images.

The Sobel Edge operator

• Filter for horizontal and vertical edges, then combine.

Edges


Edge linking

Image Noise

• Images are usually corrupted by several types of “noise”:

• Digital noise

• Shadows

• Shiny spots (specularities)

• Camera irregularities

• Bad assumptions about what’s being computed (“model noise”).

[Figure: sample noise: Gaussian noise, dots, lines.]

Edge detection: trickier than it seems

Example: Median Filtering

• A classical non-linear filter.

• Over a window, compute the median value of the signal.

• This is the value of the filter.

• This can be considered a non-linear form of averaging.
– Note it never produces values that weren’t already present in the signal.
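A sketch of the filter on a 1-D signal (the signal values are illustrative; window ends are simply clamped):

```python
import statistics

def median_filter(signal, radius=1):
    """Slide a window of size 2*radius + 1 over the signal and output the
    median at each position; the window is clamped at the ends."""
    out = []
    for i in range(len(signal)):
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        out.append(statistics.median(signal[lo:hi]))
    return out

# A lone noise spike is removed, while a genuine step edge is preserved.
spiky = [0, 0, 9, 0, 0, 5, 5, 5]
clean = median_filter(spiky, radius=1)
# clean == [0, 0, 0, 0, 0, 5, 5, 5]
```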

[Figure: median filter, radius 1]


[Figures: median filter at radius 2, radius 4, radius 8, and radius 11.]

Today’s Lecture
• Computational Vision
– Images

– Image formation in brief (+reading)

– Image processing: filtering
• Linear filters

• Non-linear operations

• Signatures

• Edges

– Image interpretation
• Edge extraction

• Grouping

– Scene recovery

Color code:
• Done
• Today
• Next class (or not at all)



Scene “Reconstruction”

• Solve the inverse problem: find the scene that produced the image.

• Things to account for:
– The camera

– The geometry of the scene

– The interaction of light with objects in the scene.

Recovery

• Typically, simplified models of illumination and reflectance are employed.
– One light source.

– All objects have the same matte reflectance.
• Sometimes: don’t worry about occlusion.

• Sometimes: don’t worry about shading.

• Recovery of geometry is known as scene reconstruction.


Edge extraction

• Edges: places where the intensity changes; hence there is a large derivative.

• Each edge has an amplitude and orientation that can be expressed as a combination of orthogonal components in the x and y directions.

Sobel detector specifics

• Sobel edge detector
– Convolve the image with a pair of operators, S and S′:

S*I and S′*I

– The edge map is the Pythagorean sum of the two convolutions:

E = sqrt( (S * I)² + (S′ * I)² )

The Sobel kernel, as an example, can be broken down into two parts:

1. A smoothing operation (discussed last class): reduces the effect of noise and sets the scale (i.e. size).

2. A differencing (derivative, d/dx) operation that responds to intensity changes.
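Putting the two Sobel convolutions and the Pythagorean sum together (interior pixels only; the tiny test image is an arbitrary vertical step edge):

```python
import math

SX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal-gradient kernel
SY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical-gradient kernel

def sobel_magnitude(img):
    """Edge map E = sqrt((S*I)^2 + (S'*I)^2); borders are skipped.
    (Kernels are applied by correlation; for the edge magnitude the
    kernel flip that distinguishes convolution does not matter.)"""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SX[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SY[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out

# Dark left half, bright right half: a vertical step edge.
img = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(img)
# Interior pixels straddling the step get a large response (40.0).
```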

Other edge operators
• Older edge detectors neglected the smoothing step:

– Larry Roberts developed one of the first (the 2x2 Roberts operator).

– The Prewitt operator was based on assuming we could fit a smooth surface to the data and then differentiate.

– Hueckel operator: use least-squares line fitting.

• More recent work involves smoothing with a two-dimensional Gaussian:

G(x, y) = 1/(2πσ²) · e^(−(x² + y²)/(2σ²))

• The Gaussian provides localization in space, as well as frequency.

• Rather than explicitly compute the smoothed image and then differentiate,

∂/∂t (G * I),

one can use

(∂G/∂t) * I,

since the derivative kernel ∂G/∂t can be precomputed.


Edges: modern methods

• Instead of looking for peaks in the derivative, look for zero-crossings in the second derivative.

• A recent operator developed by Canny uses multiple scales (sizes) to improve edge detection, based on optimizing the notion of what an edge is. Wanted:

– Detection: detect an edge iff it’s there

– Uniqueness: detect each edge just once

– Localization: detect it at the right place

Edges

• An UNREALISTICALLY simple example:
– Edges from the Canny operator.

Input image

• Consider: extracting the edge elements and grouping them into contours.

Sobel detector output

Canny operator output

• Note the effect of non-maximum suppression.

Human Edge Detection

• Overheads


Today’s Lecture
• Computational Vision

– Biological vision with emphasis on grouping

– Scene recovery

– Recognition

Mostly not on computer


Shape from Shading

• [on blackboard]

• Intensity i = f(e, g, n)
– Reflected intensity depends on viewing position and light source position (assumed known) AND the surface normal.

– Given the knowns we can estimate the surface normal, although there is usually a “circular” ambiguity that is resolved by assuming something about surface structure.

Review