TDT4171 Artificial Intelligence Methods Lecture 8 – Learning from Observations

Norwegian University of Science and Technology

Helge Langseth IT-VEST 310 [email protected]


Outline

1 Summary from last time
2 Chapter 18: Learning from observations
  Learning agents
  Inductive learning
  Decision tree learning
  Measuring learning performance
  Overfitting
3 Summary


Summary from last time

Sequential decision problems
- Assumptions: stationarity, the Markov assumption, additive rewards, infinite horizon with discounting
- Model classes: Markov decision processes (MDPs) / partially observable Markov decision processes (POMDPs)
- Algorithms: value iteration / policy iteration

Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.


Chapter 18: Learning from observations

Chapter 18 – Learning goals
Being familiar with:
- Motivation for learning
- The decision tree formalism
- Decision tree learning
- Information gain for structuring model learning
- Overfitting – and what to do to avoid it


Learning agents

Learning
This is the second part of the course. So far we have learned about:
- Representations for uncertain knowledge
- Inference in these representations
- Making decisions based on the inferences
Now we will talk about learning the representations:
- Supervised learning: decision trees
- Instance-based learning / case-based reasoning
- Artificial neural networks
- Reinforcement learning


Why do learning?
- Learning modifies the agent's decision mechanisms to improve performance
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
- Interesting link on the webpage to the article The End of Theory (from Wired Magazine). Well worth a read!


Learning agents

[Figure: architecture of a learning agent. Inside the agent, a critic compares feedback from the sensors against a performance standard and passes it to the learning element; the learning element makes changes to the performance element (which maps percepts to actions sent to the effectors), draws on knowledge of the agent, and sets learning goals for a problem generator that suggests exploratory experiments. The agent interacts with the environment through its sensors and effectors.]



Learning element
Design of the learning element is dictated by:
- what type of performance element is used
- which functional component is to be learned
- how that functional component is represented
- what kind of feedback is available

Example scenarios:

Performance element    Component            Representation             Feedback
Alpha-beta search      Eval. fn.            Weighted linear function   Win/loss
Logical agent          Transition model     Successor-state axioms     Outcome
Utility-based agent    Transition model     Dynamic Bayes net          Outcome
Simple reflex agent    Percept-action fn    Neural net                 Correct action

Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards


Inductive learning

Inductive learning
Simplest form: learn a function from examples, where f is the target function.
An example is a pair (x, f(x)), e.g., (a tic-tac-toe board position, +1).

Problem: find a hypothesis h ∈ H such that h ≈ f, given a training set of examples.

This is a highly simplified model of real learning:
- Ignores prior knowledge
- Assumes a deterministic, observable "environment"
- Assumes examples are given


Inductive learning method
Construct/adjust h to agree with f on the training set.
h is consistent if it agrees with f on all examples.

Example – curve fitting:

[Figure: the same data points (x, f(x)) fitted by several candidate curves h, ranging from a straight line through low-degree polynomials to a wiggly curve that passes exactly through every point]

Which curve is better? – and WHY?

Ockham's razor: maximize consistency and simplicity!
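To make the picture concrete, here is a minimal curve-fitting sketch (not from the lecture): the target function sin(2πx), the noise level, and the polynomial degrees are arbitrary illustration choices, and numpy's polyfit stands in for "constructing h".

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)   # noisy samples of f

# A simple hypothesis and a very flexible one, fitted to the same points
simple = np.polyfit(x, y, deg=3)
wiggly = np.polyfit(x, y, deg=9)    # follows the training points (and their noise) closely

# Compare against the true f at points that were NOT in the training set
x_new = np.linspace(0.04, 0.96, 7)
f_new = np.sin(2 * np.pi * x_new)
print("degree 3:", np.abs(np.polyval(simple, x_new) - f_new).mean())
print("degree 9:", np.abs(np.polyval(wiggly, x_new) - f_new).mean())   # often larger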


Attribute-based representations
Examples are described by attribute values (Boolean, discrete, continuous, etc.).
E.g., situations where the authors will/won't wait for a table:

Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some   $$$    F     T    French   0-10   T
X2       T    F    F    T    Full   $      F     F    Thai     30-60  F
X3       F    T    F    F    Some   $      F     F    Burger   0-10   T
X4       T    F    T    T    Full   $      T     F    Thai     10-30  T
X5       T    F    T    F    Full   $$$    F     T    French   >60    F
X6       F    T    F    T    Some   $$     T     T    Italian  0-10   T
X7       F    T    F    F    None   $      T     F    Burger   0-10   F
X8       F    F    F    T    Some   $$     T     T    Thai     0-10   T
X9       F    T    T    F    Full   $      T     F    Burger   >60    F
X10      T    T    T    T    Full   $$$    F     T    Italian  10-30  F
X11      F    F    F    F    None   $      F     F    Thai     0-10   F
X12      T    T    T    T    Full   $      F     F    Burger   30-60  T

Classification of examples is positive (T) or negative (F).
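For the later sketches in these notes it is handy to have the 12 examples in machine-readable form. A minimal Python version (not part of the lecture), keeping the abbreviated column names and the string values exactly as in the table:

COLUMNS = ["Alt", "Bar", "Fri", "Hun", "Pat", "Price", "Rain",
           "Res", "Type", "Est", "WillWait"]
ROWS = [
    ["T", "F", "F", "T", "Some", "$$$", "F", "T", "French",  "0-10",  "T"],   # X1
    ["T", "F", "F", "T", "Full", "$",   "F", "F", "Thai",    "30-60", "F"],   # X2
    ["F", "T", "F", "F", "Some", "$",   "F", "F", "Burger",  "0-10",  "T"],   # X3
    ["T", "F", "T", "T", "Full", "$",   "T", "F", "Thai",    "10-30", "T"],   # X4
    ["T", "F", "T", "F", "Full", "$$$", "F", "T", "French",  ">60",   "F"],   # X5
    ["F", "T", "F", "T", "Some", "$$",  "T", "T", "Italian", "0-10",  "T"],   # X6
    ["F", "T", "F", "F", "None", "$",   "T", "F", "Burger",  "0-10",  "F"],   # X7
    ["F", "F", "F", "T", "Some", "$$",  "T", "T", "Thai",    "0-10",  "T"],   # X8
    ["F", "T", "T", "F", "Full", "$",   "T", "F", "Burger",  ">60",   "F"],   # X9
    ["T", "T", "T", "T", "Full", "$$$", "F", "T", "Italian", "10-30", "F"],   # X10
    ["F", "F", "F", "F", "None", "$",   "F", "F", "Thai",    "0-10",  "F"],   # X11
    ["T", "T", "T", "T", "Full", "$",   "F", "F", "Burger",  "30-60", "T"],   # X12
]
EXAMPLES = [dict(zip(COLUMNS, row)) for row in ROWS]   # one dict per example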


Decision tree learning

Decision trees
One possible representation for hypotheses: decision trees.

Patrons?
  None  → F
  Some  → T
  Full  → WaitEstimate?
            >60   → F
            30-60 → Alternate?
                      No  → Reservation?
                              No  → Bar?
                                      No  → F
                                      Yes → T
                              Yes → T
                      Yes → Fri/Sat?
                              No  → F
                              Yes → T
            10-30 → Hungry?
                      No  → T
                      Yes → Alternate?
                              No  → T
                              Yes → Raining?
                                      No  → F
                                      Yes → T
            0-10  → T

E.g., tracing examples X3 and X12 from the table above through the tree:

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type    Est    WillWait
X3       F    T    F    F    Some  $      F     F    Burger  0-10   T
X12      T    T    T    T    Full  $      F     F    Burger  30-60  T
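One simple way to hold such a tree in code is as nested (attribute, branches) pairs. The sketch below (not from the lecture) encodes the tree above using the abbreviated attribute names from the table (Pat for Patrons?, Est for WaitEstimate?, and so on) and reuses the EXAMPLES list defined earlier:

# Internal nodes are (attribute, {value: subtree}); leaves are the class labels "T"/"F".
TRUE_TREE = ("Pat", {
    "None": "F",
    "Some": "T",
    "Full": ("Est", {
        ">60": "F",
        "30-60": ("Alt", {
            "F": ("Res", {"F": ("Bar", {"F": "F", "T": "T"}), "T": "T"}),
            "T": ("Fri", {"F": "F", "T": "T"}),
        }),
        "10-30": ("Hun", {
            "F": "T",
            "T": ("Alt", {"F": "T", "T": ("Rain", {"F": "F", "T": "T"})}),
        }),
        "0-10": "T",
    }),
})

def classify(tree, example):
    """Follow the branch matching the example's attribute value until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(TRUE_TREE, EXAMPLES[2]))    # X3: Patrons = Some, so "T"
print(classify(TRUE_TREE, EXAMPLES[11]))   # X12: Full, 30-60, Alternate, Fri/Sat, so "T"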


Converting a decision tree to rules
Each root-to-leaf path in the tree above becomes one rule:

IF   (Patrons? = Full) ∧ (WaitEstimate? = 0-10)
THEN Wait? = True

IF   (Patrons? = Full) ∧ (WaitEstimate? = 30-60) ∧ (Alternate? = Yes) ∧ (Fri/Sat? = No)
THEN Wait? = False
...
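The conversion can be automated by enumerating root-to-leaf paths. A small sketch, assuming the (attribute, branches) representation and the TRUE_TREE value from the earlier sketch:

def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one IF ... THEN rule."""
    if not isinstance(tree, tuple):                       # leaf: emit the accumulated rule
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN WillWait = {tree}"]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

for rule in tree_to_rules(TRUE_TREE):
    print(rule)    # e.g. "IF Pat = Full AND Est = 0-10 THEN WillWait = T"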


When to consider decision trees
- Instances are describable by attribute-value pairs
- The target function is discrete-valued
- A disjunctive hypothesis may be required
- Possibly noisy training data

Examples:
- Diagnosis
- Credit risk analysis
- Classifying email as spam or ham


Decision tree representation
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

How can we represent these functions using a decision tree?
- A ∧ B, A ∨ B, A XOR B
- (A ∧ B) ∨ (A ∧ ¬B ∧ C)
- m of n: at least m of A1, A2, ..., An (try n = 3, m = 2)
Discuss with your neighbour for a couple of minutes.



Expressiveness
A decision tree can represent ANY function of the inputs.
E.g., for Boolean functions, each truth table row becomes a path to a leaf:

A  B  |  A xor B
F  F  |  F
F  T  |  T
T  F  |  T
T  T  |  F

A?
  F → B?
        F → F
        T → T
  T → B?
        F → T
        T → F

There is a consistent decision tree for any training set: just add one path to a leaf for each example (works whenever f is deterministic in x).
Note! Such a tree will probably not generalize to new examples
→ prefer to find more compact decision trees.



Hypothesis spaces
How many distinct decision trees are there with n Boolean attributes?
= the number of Boolean functions of n attributes
= the number of distinct truth tables with 2^n rows
= 2^(2^n)
E.g., with 6 Boolean attributes, there are 18 446 744 073 709 551 616 (about 2 · 10^19) trees.

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
E.g., with 6 Boolean attributes, there are 729 rules of this kind.

A more expressive hypothesis space:
- increases the chance that the target function can be expressed
- increases the number of hypotheses consistent with the training set
⇒ may give worse predictions
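Both counts are easy to sanity-check (a two-line sketch):

n = 6
print(2 ** (2 ** n))   # 18446744073709551616 distinct Boolean functions of 6 attributes
print(3 ** n)          # 729 purely conjunctive hypotheses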


Decision tree learning
Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "best" attribute as the root of each (sub)tree.

function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same class then return the class
    else if attributes is empty then return Mode(examples)
    else
        best ← Choose-Attribute(attributes, examples)
        tree ← a new decision tree with root test best
        for each value v_i of best do
            examples_i ← {elements of examples with best = v_i}
            subtree ← DTL(examples_i, attributes − best, Mode(examples))
            add a branch to tree with label v_i and subtree subtree
        return tree
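A direct Python transcription of the pseudocode could look as follows. This is only a sketch, not the course's reference implementation: the target attribute WillWait is hard-coded, branch values are taken from the data rather than from a fixed attribute domain, and Choose-Attribute is passed in as a function.

from collections import Counter

def mode(examples):
    """Most common target value among the examples (Mode in the pseudocode)."""
    return Counter(e["WillWait"] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose_attribute):
    if not examples:
        return default
    if len({e["WillWait"] for e in examples}) == 1:        # all examples have the same class
        return examples[0]["WillWait"]
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    branches = {}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        branches[value] = dtl(subset, [a for a in attributes if a != best],
                              mode(examples), choose_attribute)
    return (best, branches)            # same (attribute, branches) form as the trees above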


Search in the hypothesis space

[Figure: greedy search through the space of decision trees – starting from a single leaf, each step chooses a split attribute (A1, A2, A3, A4, ...) and recursively grows the corresponding subtrees]

The Big Question: how do we choose the "split attribute"?

DEMO: Random selection


Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".

[Figure: the 12 examples split by Patrons? (None / Some / Full) versus by Type? (French / Italian / Thai / Burger)]

Patrons? is the better choice – it gives information about the classification.


Information
Information answers questions! The more clueless we are about the answer initially, the more information is contained in the answer.
Scale: 1 bit = the answer to a Boolean question with prior ⟨0.5, 0.5⟩.
The information in an answer when the prior is ⟨p1, ..., pn⟩ is

    H(⟨p1, ..., pn⟩) = − Σ_{i=1}^{n} p_i log2 p_i

(also called the entropy of the prior ⟨p1, ..., pn⟩)
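As a quick numerical illustration (a sketch, not lecture code):

from math import log2

def entropy(prior):
    """H(<p1, ..., pn>) = -sum_i p_i log2 p_i, taking 0 log2 0 = 0."""
    return -sum(p * log2(p) for p in prior if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a completely open Boolean question
print(entropy([0.99, 0.01]))  # about 0.08 bits: the answer is almost known in advance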



Information contd.
Suppose we have p positive and n negative examples at the root.
Then H(⟨p/(p+n), n/(p+n)⟩) bits are needed to classify a new example.
E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit.

An attribute splits the examples E into subsets E_i; we hope each needs less information to classify.
Let E_i have p_i positive and n_i negative examples: H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩) bits are needed to classify a new example in that branch
⇒ the expected number of bits per example over all branches is

    Σ_i (p_i + n_i)/(p + n) · H(⟨p_i/(p_i + n_i), n_i/(p_i + n_i)⟩)

For Patrons?, this is 0.459 bits; for Type? it is (still) 1 bit.

Heuristic: choose the attribute that minimizes the remaining information needed to classify a new example.
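The 0.459 bit and 1 bit figures can be reproduced directly from the (positive, negative) counts in each branch, read off the table of 12 examples (a self-contained sketch):

from math import log2

def h2(p, n):
    """Entropy of a Boolean variable given p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def remainder(splits):
    """Expected bits per example still needed after the split; splits = [(p_i, n_i), ...]."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * h2(p, n) for p, n in splits)

patrons = [(0, 2), (4, 0), (2, 4)]            # branches None, Some, Full
rest_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # branches French, Italian, Thai, Burger
print(round(remainder(patrons), 3))           # 0.459
print(round(remainder(rest_type), 3))         # 1.0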


Example contd.
Decision tree learned from the 12 examples:

Patrons?
  None → F
  Some → T
  Full → Hungry?
           No  → F
           Yes → Type?
                   French  → T
                   Italian → F
                   Thai    → Fri/Sat?
                               No  → F
                               Yes → T
                   Burger  → T

Substantially simpler than the "true" tree – a more complex hypothesis isn't justified by the small amount of data.

DEMO: Gain selection
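Combining the earlier sketches (EXAMPLES, dtl, and remainder), a gain-based Choose-Attribute is enough to reproduce this kind of tree; note that ties between equally good attributes are broken arbitrarily by min, so the exact tree may differ slightly from the one drawn above.

def choose_attribute(attributes, examples):
    """Pick the attribute whose split leaves the least remaining information."""
    def rem(attr):
        splits = []
        for value in {e[attr] for e in examples}:
            branch = [e for e in examples if e[attr] == value]
            p = sum(e["WillWait"] == "T" for e in branch)
            splits.append((p, len(branch) - p))
        return remainder(splits)
    return min(attributes, key=rem)

attrs = ["Alt", "Bar", "Fri", "Hun", "Pat", "Price", "Rain", "Res", "Type", "Est"]
tree = dtl(EXAMPLES, attrs, default="F", choose_attribute=choose_attribute)
print(tree)   # the root split is ("Pat", ...), i.e. Patrons? is chosen first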


Measuring learning performance

Performance measurement
Question: how do we know that h ≈ f?
Answer: try h on a new test set of examples (use the same distribution over the example space as for the training set).

Learning curve = % correct on the test set as a function of training set size.

[Figure: learning curve for the restaurant data – % correct on the test set rises from roughly 0.4 towards 1 as the training set grows from 0 to 100 examples]
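In practice such a curve is produced by training on ever larger prefixes of the training data and scoring each hypothesis on a fixed test set. A small sketch using scikit-learn and a synthetic dataset (nothing here is specific to the restaurant example):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=200, random_state=0)

for m in range(20, len(X_tr) + 1, 40):
    h = DecisionTreeClassifier(random_state=0).fit(X_tr[:m], y_tr[:m])
    print(m, round(h.score(X_te, y_te), 3))   # fraction correct on the held-out test set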


Performance measurement contd.
The learning curve depends on...
- realizable (the hypothesis space can express the target function) vs. non-realizable
  (non-realizability can be due to missing attributes or a restricted hypothesis class, e.g., a thresholded linear function)
- redundant expressiveness (e.g., loads of irrelevant attributes)

[Figure: % correct as a function of the number of examples for the realizable, redundant, and non-realizable cases]


Overfitting

Overfitting in decision trees
Consider adding "noisy" training examples X13 and X14:

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type    Est   WillWait
X13      F    T    T    T    Some  $$     T     T    French  0-10  F
X14      F    T    T    T    Some  $      T     T    Thai    0-10  F

Reality check: what is the effect on the tree we learned earlier (repeated below)?

Patrons?
  None → F
  Some → T
  Full → Hungry?
           No  → F
           Yes → Type?
                   French  → T
                   Italian → F
                   Thai    → Fri/Sat?
                               No  → F
                               Yes → T
                   Burger  → T


Overfitting
Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data (often approximated by measurement on a test set): error_D(h)

Overfitting: a hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
error_train(h) < error_train(h′) and error_D(h) > error_D(h′).


Overfitting (cont'd)

[Figure: accuracy on the training data and on the test data as a function of tree size (number of nodes, 0 to 100); training accuracy keeps increasing as the tree grows, while test accuracy levels off and eventually drops]


Avoiding overfitting
How can we avoid overfitting?
- Stop growing the tree when a data split is not statistically significant, or
- grow the full tree, then post-prune.

How to select the "best" tree:
- Measure performance over the training data (statistical tests needed)
- Measure performance over a separate validation data set


Reduced-error pruning
Split the data into a training set and a validation set.
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation set accuracy
⇒ Produces the smallest version of the most accurate subtree
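A sketch of how this could look on the (attribute, branches) tree representation used earlier, reusing classify from the earlier sketch. As a simplification, a pruned node is replaced by the majority label among its leaves rather than among the training examples that reach it, so this mainly illustrates the control flow of the algorithm:

import copy

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["WillWait"] for e in examples) / len(examples)

def internal_paths(tree, path=()):
    """Branch-value paths from the root to every internal (non-leaf) node."""
    if not isinstance(tree, tuple):
        return []
    paths = [path]
    for value, subtree in tree[1].items():
        paths += internal_paths(subtree, path + (value,))
    return paths

def majority_leaf(tree):
    """Most common leaf label in a subtree (ties broken arbitrarily)."""
    if not isinstance(tree, tuple):
        return tree
    leaves = [majority_leaf(sub) for sub in tree[1].values()]
    return max(set(leaves), key=leaves.count)

def collapse(tree, path):
    """Copy of tree with the internal node at `path` replaced by a single leaf."""
    tree = copy.deepcopy(tree)
    if not path:
        return majority_leaf(tree)
    node = tree
    for value in path[:-1]:
        node = node[1][value]
    node[1][path[-1]] = majority_leaf(node[1][path[-1]])
    return tree

def reduced_error_prune(tree, validation):
    best = accuracy(tree, validation)
    while isinstance(tree, tuple):
        pruned = max((collapse(tree, p) for p in internal_paths(tree)),
                     key=lambda t: accuracy(t, validation))
        if accuracy(pruned, validation) < best:   # further pruning is harmful: stop
            return tree
        best, tree = accuracy(pruned, validation), pruned
    return tree

With only 12 (or 14) restaurant examples there is no room for a meaningful validation set, so on this data the sketch illustrates the loop rather than the benefit of pruning.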


Effect of reduced-error pruning

[Figure: accuracy as a function of tree size (number of nodes, 0 to 100), showing accuracy on the training data, on the test data, and on the test data during pruning]

Summary

- Learning is needed for unknown environments (and for lazy designers)
- Learning agent = performance element + learning element
- The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation
- For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
- Decision tree learning using information gain
- Learning performance = prediction accuracy measured on a test set

Next week: Agnar Aamodt talks about case-based reasoning. (His paper is on the webpage and is part of the curriculum.)
