TDT4171 Artificial Intelligence Methods Lecture 8 – Learning from Observations
Norwegian University of Science and Technology
Helge Langseth IT-VEST 310
[email protected]
Outline
1 Summary from last time
2 Chapter 18: Learning from observations
  - Learning agents
  - Inductive learning
  - Decision tree learning
  - Measuring learning performance
  - Overfitting
Summary
Summary from last time
- Sequential decision problems
- Assumptions: stationarity, Markov assumption, additive rewards, infinite horizon with discounting
- Model classes: Markov decision processes / partially observable Markov decision processes
- Algorithms: value iteration / policy iteration

Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.
Chapter 18: Learning from observations
Chapter 18 – Learning goals
Being familiar with:
- Motivation for learning
- The decision tree formalism
- Decision tree learning
- Information gain for structuring model learning
- Overfitting – and what to do to avoid it
Learning agents
Learning
This is the second part of the course. So far we have learned about:
- Representations for uncertain knowledge
- Inference in these representations
- Making decisions based on the inferences

Now we will talk about learning the representations:
- Supervised learning: decision trees
- Instance-based learning / case-based reasoning
- Artificial neural networks
- Reinforcement learning
Why do learning?
- Learning modifies the agent's decision mechanisms to improve performance.
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it all down.
- Interesting link on the webpage to the article "The End of Theory" (from Wired Magazine). Well worth a read!
Learning agents
[Figure: architecture of a learning agent. A critic compares sensor feedback against a performance standard and tells the learning element how well the agent is doing; the learning element makes changes to the performance element (which maps percepts to actions via the effectors), drawing on shared knowledge, and sets learning goals for a problem generator, which proposes exploratory experiments in the environment.]
Learning element
Design of the learning element is dictated by. . .
- what type of performance element is used
- which functional component is to be learned
- how that functional component is represented
- what kind of feedback is available

Example scenarios:

Performance element | Component         | Representation           | Feedback
Alpha-beta search   | Eval. fn.         | Weighted linear function | Win/loss
Logical agent       | Transition model  | Successor-state axioms   | Outcome
Utility-based agent | Transition model  | Dynamic Bayes net        | Outcome
Simple reflex agent | Percept-action fn | Neural net               | Correct action

Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards
Inductive learning
Inductive learning
Simplest form: learn a function from examples. f is the target function; an example is a pair (x, f(x)) – e.g., a game-board position x together with the label +1.
Problem: find a hypothesis h ∈ H s.t. h ≈ f, given a training set of examples.
This is a highly simplified model of real learning:
- Ignores prior knowledge
- Assumes a deterministic, observable "environment"
- Assumes examples are given
Inductive learning method
Construct/adjust h to agree with f on the training set. h is consistent if it agrees with f on all examples.
Example – curve fitting:
[Figure: the same data points (x, f(x)) fitted by a sequence of hypotheses – a straight line, higher-degree polynomials, and finally a curve that passes exactly through every point.]
Which curve is better? – and WHY??
Ockham's razor: maximize consistency and simplicity!
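The trade-off can be made concrete in a few lines of code. This is a hypothetical illustration (the data points are made up, not from the lecture): a least-squares line is simple but not consistent with the examples, while a Lagrange interpolating polynomial is consistent with every example at the price of much higher complexity.

```python
# Simple-but-inconsistent vs. consistent-but-complex hypotheses for curve fitting.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]  # made-up data, roughly linear with noise

def line_fit(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def lagrange(xs, ys):
    """Interpolating polynomial passing exactly through every (x, y) pair."""
    def h(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            w = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    w *= (x - xj) / (xi - xj)
            total += yi * w
        return total
    return h

h_simple = line_fit(xs, ys)        # small training error, not consistent
h_consistent = lagrange(xs, ys)    # zero training error on every example
train_err = max(abs(h_consistent(x) - y) for x, y in zip(xs, ys))
print(round(train_err, 6))  # 0.0 (the interpolant is consistent)
```

The interpolant "wins" on the training set, yet between the points it can oscillate wildly – which is exactly why Ockham's razor asks for simplicity as well as consistency.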
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.). E.g., situations where the authors will/won't wait for a table:

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some  $$$    F     T    French   0–10   T
X2       T    F    F    T    Full  $      F     F    Thai     30–60  F
X3       F    T    F    F    Some  $      F     F    Burger   0–10   T
X4       T    F    T    T    Full  $      T     F    Thai     10–30  T
X5       T    F    T    F    Full  $$$    F     T    French   >60    F
X6       F    T    F    T    Some  $$     T     T    Italian  0–10   T
X7       F    T    F    F    None  $      T     F    Burger   0–10   F
X8       F    F    F    T    Some  $$     T     T    Thai     0–10   T
X9       F    T    T    F    Full  $      T     F    Burger   >60    F
X10      T    T    T    T    Full  $$$    F     T    Italian  10–30  F
X11      F    F    F    F    None  $      F     F    Thai     0–10   F
X12      T    T    T    T    Full  $      F     F    Burger   30–60  T

Classification of examples is positive (T) or negative (F)
Decision tree learning
Decision trees
One possible representation for hypotheses: decision trees. The "true" tree for the restaurant domain:

Patrons?
├─ None → F
├─ Some → T
└─ Full → WaitEstimate?
   ├─ >60 → F
   ├─ 30–60 → Alternate?
   │   ├─ No → Reservation?
   │   │   ├─ No → Bar? (No → F, Yes → T)
   │   │   └─ Yes → T
   │   └─ Yes → Fri/Sat? (No → F, Yes → T)
   ├─ 10–30 → Hungry?
   │   ├─ No → T
   │   └─ Yes → Alternate?
   │       ├─ No → T
   │       └─ Yes → Raining? (No → F, Yes → T)
   └─ 0–10 → T

E.g., the tree classifies both of these examples as WillWait = T:

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type    Est    WillWait
X3       F    T    F    F    Some  $      F     F    Burger  0–10   T
X12      T    T    T    T    Full  $      F     F    Burger  30–60  T
Converting a decision tree to rules
[Figure: the restaurant decision tree from the previous slide.] Each root-to-leaf path in the tree becomes one rule, e.g.:

IF (Patrons?=Full) ∧ (WaitEstimate?=0–10)
THEN Wait? = True

IF (Patrons?=Full) ∧ (WaitEstimate?=30–60) ∧ (Alternate?=Yes) ∧ (Fri/Sat?=No)
THEN Wait? = False
...
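The path-to-rule conversion is mechanical, so it is easy to sketch in code. This is not the lecture's own code: the tree below is a deliberately simplified fragment of the restaurant tree (one branch is collapsed, as noted in a comment), encoded as nested dicts.

```python
# Turning a decision tree into IF-THEN rules, one rule per root-to-leaf path.
# Internal node = {attribute: {value: subtree}}, leaf = "T" or "F".
tree = {"Patrons?": {
    "None": "F",
    "Some": "T",
    "Full": {"WaitEstimate?": {
        ">60": "F",
        "0-10": "T",
        "30-60": {"Alternate?": {
            "No": "T",  # simplified; the full tree tests Reservation? here
            "Yes": {"Fri/Sat?": {"No": "F", "Yes": "T"}},
        }},
    }},
}}

def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path, collecting the tests along the way."""
    if isinstance(node, str):  # leaf: emit one rule for this path
        lhs = " AND ".join(f"({a}={v})" for a, v in conditions)
        return [f"IF {lhs} THEN Wait?={node}"]
    (attr, branches), = node.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
```

Running it yields, among others, the two rules shown on the slide (with AND in place of ∧).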
When to consider decision trees
- Instances describable by attribute–value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data
Examples: diagnosis, credit risk analysis, classifying email as spam or ham
Decision tree representation
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

How can we represent these functions using a decision tree?
- A ∧ B, A ∨ B, A XOR B
- (A ∧ B) ∨ (A ∧ ¬B ∧ C)
- m of n: at least m of A1, A2, . . . , An (try n = 3, m = 2)
Discuss with your neighbour for a couple of minutes.
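To check your answers, the three simplest functions above can be encoded directly. This is a minimal sketch (not part of the lecture material): a tree is either a Boolean leaf or a triple (attribute, subtree-if-false, subtree-if-true).

```python
# Encoding A AND B, A OR B, and A XOR B as decision trees.
def classify(tree, example):
    """Walk from the root to a leaf, following the tested attribute's value."""
    while not isinstance(tree, bool):
        attr, if_false, if_true = tree
        tree = if_true if example[attr] else if_false
    return tree

AND_TREE = ("A", False, ("B", False, True))               # needs 2 tests on one path
OR_TREE  = ("A", ("B", False, True), True)                # ditto
XOR_TREE = ("A", ("B", False, True), ("B", True, False))  # must test B on BOTH branches

for a in (False, True):
    for b in (False, True):
        ex = {"A": a, "B": b}
        assert classify(AND_TREE, ex) == (a and b)
        assert classify(OR_TREE, ex) == (a or b)
        assert classify(XOR_TREE, ex) == (a != b)
print("all three trees match their truth tables")
```

Note that XOR forces the tree to test B under both values of A – a first hint that some functions need exponentially large trees.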
Expressiveness
A decision tree can represent ANY function of the inputs. E.g., for Boolean functions, each truth table row maps to a path to a leaf:

A  B  | A xor B
F  F  | F
F  T  | T
T  F  | T
T  T  | F

[Figure: the corresponding tree – the root tests A, each branch tests B, and the four leaves give F, T, T, F.]

There is a consistent decision tree for any training set: just add one path to a leaf for each example (works whenever f is deterministic in x).
Note! The tree will probably not generalize to new examples → prefer to find more compact decision trees.
Hypothesis spaces
How many distinct decision trees with n Boolean attributes??
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)
E.g., with 6 Boolean attributes, there are 18 446 744 073 709 551 616 (≈ 2 · 10^19) trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)??
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
E.g., with 6 Boolean attributes, there are 729 rules of this kind

A more expressive hypothesis space:
- increases the chance that the target function can be expressed
- increases the number of hypotheses consistent with the training set
⇒ may get worse predictions
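The two counts above are easy to reproduce, which makes the gap between the hypothesis spaces vivid:

```python
# With n Boolean attributes: 2**(2**n) distinct Boolean functions (one per
# truth table), but only 3**n purely conjunctive hypotheses (each attribute
# appears positive, negated, or not at all).
n = 6
num_trees = 2 ** (2 ** n)      # distinct Boolean functions of 6 attributes
num_conjunctions = 3 ** n      # purely conjunctive hypotheses
print(num_trees)               # 18446744073709551616
print(num_conjunctions)        # 729
```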
Decision tree learning
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose "best" attribute as root of (sub)tree

function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same class then return the class
    else if attributes is empty then return Mode(examples)
    else
        best ← Choose-Attribute(attributes, examples)
        tree ← a new decision tree with root test best
        for each value v_i of best do
            ex_i ← {elements of examples with best = v_i}
            subtree ← DTL(ex_i, attributes − best, Mode(examples))
            add a branch to tree with label v_i and subtree subtree
        return tree
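The DTL pseudocode above translates almost line for line into Python. In this sketch (my transcription, not the lecture's code) Choose-Attribute is a plug-in; here it naively picks the first remaining attribute, since attribute selection by information gain is the topic of the following slides.

```python
# A direct transcription of DTL, exercised on XOR's complete truth table.
from collections import Counter

def mode(examples):
    """Most common class label among (attributes, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, values,
        choose=lambda attrs, exs: attrs[0]):   # placeholder Choose-Attribute
    if not examples:
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                       # all examples have same class
        return labels.pop()
    if not attributes:
        return mode(examples)
    best = choose(attributes, examples)
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for v in values[best]:                     # one branch per attribute value
        exs_v = [(x, y) for x, y in examples if x[best] == v]
        tree[best][v] = dtl(exs_v, rest, mode(examples), values, choose)
    return tree

def predict(tree, x):
    while isinstance(tree, dict):
        (attr, branches), = tree.items()
        tree = branches[x[attr]]
    return tree

values = {"A": [0, 1], "B": [0, 1]}
examples = [({"A": a, "B": b}, a ^ b) for a in (0, 1) for b in (0, 1)]
tree = dtl(examples, ["A", "B"], default=0, values=values)
assert all(predict(tree, x) == y for x, y in examples)
```

With the full truth table as training set, the learned tree is consistent; on the restaurant data one would instead pass the 12 examples and the attribute value sets from the earlier table.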
Search in the hypothesis space
[Figure: the space of candidate trees, explored by choosing which attribute (A1, A2, A3, A4, . . . ) to test next; each choice splits the +/– examples differently.]
The Big Question: How to choose the "split attribute"?
DEMO: Random selection
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
[Figure: splitting the 12 examples on Patrons? (None/Some/Full) yields nearly pure subsets, while splitting on Type? (French/Italian/Thai/Burger) leaves each subset half positive, half negative.]
Patrons? is a better choice – it gives information about the classification.
Information
Information answers questions! The more clueless we are about the answer initially, the more information is contained in the answer.
Scale: 1 bit = answer to a Boolean question with prior ⟨0.5, 0.5⟩.
The information in an answer when the prior is ⟨p1, . . . , pn⟩ is

    H(⟨p1, . . . , pn⟩) = − Σ_{i=1}^{n} p_i log2 p_i

(also called the entropy of the prior ⟨p1, . . . , pn⟩)
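The entropy formula is a one-liner, shown here as a small helper (my sketch, not lecture code) together with the two extremes of the scale:

```python
# H(<p1,...,pn>) = -sum_i p_i * log2(p_i), in bits.
from math import log2

def entropy(dist):
    """Information (in bits) in an answer whose prior distribution is dist."""
    return -sum(p * log2(p) for p in dist if p > 0)  # 0*log 0 taken as 0

print(entropy([0.5, 0.5]))    # 1.0 – a fair Boolean question is worth 1 bit
print(entropy([0.99, 0.01]))  # ~0.081 – an almost-certain answer tells us little
```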
Information contd.
Suppose we have p positive and n negative examples at the root: H(⟨p/(p + n), n/(p + n)⟩) bits are needed to classify a new example. E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit.

An attribute splits the examples E into subsets E_i; we hope each needs less information to classify. Let E_i have p_i positive and n_i negative examples: H(⟨p_i/(p_i + n_i), n_i/(p_i + n_i)⟩) bits are needed to classify a new example there
⇒ the expected number of bits per example over all branches is

    Σ_i ((p_i + n_i)/(p + n)) · H(⟨p_i/(p_i + n_i), n_i/(p_i + n_i)⟩)

For Patrons?, this is 0.459 bits; for Type? this is (still) 1 bit.
Heuristic: choose the attribute that minimizes the remaining information needed to classify a new example.
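The two numbers above can be checked directly. In this sketch the per-branch counts are read off the 12-example table: Patrons? gives None: 0+/2−, Some: 4+/0−, Full: 2+/4−; Type? gives French: 1+/1−, Italian: 1+/1−, Thai: 2+/2−, Burger: 2+/2−.

```python
# Remaining information (in bits per example) after splitting on an attribute.
from math import log2

def H(p, n):
    """Entropy of <p/(p+n), n/(p+n)>; 0 for a pure subset."""
    total = p + n
    bits = 0.0
    for k in (p, n):
        if k:
            bits -= (k / total) * log2(k / total)
    return bits

def remainder(splits, total=12):
    """Expected bits per example; splits = [(p_i, n_i), ...] per branch."""
    return sum((p + n) / total * H(p, n) for p, n in splits)

patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger
print(round(remainder(patrons), 3))  # 0.459
print(round(remainder(type_), 3))    # 1.0
```

So splitting on Patrons? leaves 0.459 bits of the original 1 bit to resolve, while Type? leaves all of it – exactly why the heuristic prefers Patrons?.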
Example contd.
Decision tree learned from the 12 examples:

Patrons?
├─ None → F
├─ Some → T
└─ Full → Hungry?
   ├─ No → F
   └─ Yes → Type?
      ├─ French → T
      ├─ Italian → F
      ├─ Thai → Fri/Sat? (No → F, Yes → T)
      └─ Burger → T

Substantially simpler than the "true" tree – a more complex hypothesis isn't justified by the small amount of data.
DEMO: Gain selection
Measuring learning performance
Performance measurement
Question: How do we know that h ≈ f?
Answer: Try h on a new test set of examples (use the same distribution over the example space as for the training set).
Learning curve = % correct on the test set as a function of training set size.
[Figure: learning curve for the restaurant data – % correct on the test set climbs from roughly 0.4 towards 1 as the training set grows from 0 to 100 examples.]
Performance measurement contd.
The learning curve depends on. . .
- realizable (can express target function) vs. non-realizable; non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., thresholded linear function)
- redundant expressiveness (e.g., loads of irrelevant attributes)
[Figure: % correct vs. number of examples – the realizable curve approaches 1, the redundant curve climbs more slowly, and the non-realizable curve levels off below 1.]
Overfitting
Overfitting in decision trees
Consider adding "noisy" training examples X13 and X14:

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type    Est   WillWait
X13      F    T    T    T    Some  $$     T     T    French  0–10  F
X14      F    T    T    T    Some  $      T     T    Thai    0–10  F

Reality check: What is the effect on the tree we learned earlier?
[Figure: the decision tree learned from the original 12 examples – Patrons? at the root with None → F, Some → T, and Full → Hungry?, followed by Type? and Fri/Sat? tests.]
Overfitting
Consider the error of hypothesis h over
- the training data: error_t(h)
- the entire distribution D of data (often approximated by measurement on a test set): error_D(h)

Overfitting: hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
    error_t(h) < error_t(h′)   and   error_D(h) > error_D(h′)
Overfitting (cont'd)
[Figure: accuracy vs. size of tree (number of nodes), on a scale from about 0.5 to 0.9. Accuracy on the training data keeps increasing as the tree grows, while accuracy on the test data peaks and then falls off – the signature of overfitting.]
Avoiding Overfitting
How can we avoid overfitting?
- Stop growing when a data split is not statistically significant
- Grow the full tree, then post-prune
How to select the "best" tree:
- Measure performance over the training data (statistical tests needed)
- Measure performance over a separate validation data set
Reduced-Error Pruning
Split the data into a training set and a validation set.
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation set accuracy
⇒ Produces the smallest version of the most accurate subtree
Effect of Reduced-Error Pruning
[Figure: accuracy vs. size of tree (number of nodes), with curves for the training data, the test data, and the test data during pruning – pruning shrinks the tree while restoring test-set accuracy.]
Summary
- Learning is needed for unknown environments, and for lazy designers
- Learning agent = performance element + learning element
- The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation
- For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
- Decision tree learning using information gain
- Learning performance = prediction accuracy measured on a test set

Next week: Agnar Aamodt talks about case-based reasoning. (His paper is on the webpage, and is part of the curriculum.)