Bayesian Classification and Regression Tree Analysis (CART) Teresa Jacobson Department of Applied Mathematics and Statistics Jack Baskin School of Engineering UC Santa Cruz

March 11, 2010

Jacobson

Bayesian CART

Outline Introduction Bayesian Model Example Extensions and future work Bibliography

Introduction Bayesian Model Chipman, George, and McCulloch (CGM) Wu, Tjelmeland, and West (WTW) Example Extensions and future work Bibliography


What is CART?

The general aim of classification and regression tree analysis: given a set of observations $y_i$ and associated variables $x_{ij}$, $i = 1:n$ and $j = 1:p$, find a way of using $x$ to partition the observations into homogeneously distributed groups, then use the group to predict $y$. Use binary trees to recursively split observations with yes/no questions about variables in $x$. Assume each end or terminal node has a homogeneous distribution.


How do we do this?

Seminal work by Breiman et al. [1] was surprisingly Bayesian, involving the elicitation of priors and risk/utility functions on misclassification. However, the actual tree generation methods were still very ad hoc. After this work was published, a large number of different ad hoc methods appeared, as well as attempts to combine them to produce better inferential strategies. These methods are largely deterministic in nature and produce one tree per method.


Going Bayesian: The Problem

[Figure: p(image) = ?]

Image courtesy of Diesel-stock, Diesel-stock.deviantart.com.


Chipman, George, and McCulloch (CGM) Wu, Tjelmeland, and West (WTW)

Notation

Notation follows that of Wu, Tjelmeland, and West (WTW) [7].

Observations $y_i$ with "regressors" $x_i = (x_{i1}, \dots, x_{ik})$, $i \in I = \{1:n\}$, $j \in 1:k$. We wish to predict $y \in Y$ based on the associated $x \in X = X_1 \times \cdots \times X_k$.

Nodes are denoted $u$, with the root node denoted as node 0 and each non-terminal node $u$ having child nodes $2u + 1$ (left) and $2u + 2$ (right). Trees are then defined as appropriate subsets of the set $N = \{0, 1, 2, \dots\}$. Write the number of terminal nodes of a tree $T$ as $m(T)$.

Splitting: for each non-terminal node $u$, choose a predictor variable index $k_T(u)$ and a splitting threshold $\tau_T(u) \in X_{k_T(u)}$. We then assign $y$ to the left child of $u$ if $x_{k_T(u)} \le \tau_T(u)$, and to the right child otherwise.
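The node-indexing convention above can be made concrete with a minimal Python sketch. The names `Split`, `splits`, and `leaf_for` are hypothetical, introduced only for illustration, and predictor indices are 0-based here rather than the 1-based $j \in 1:k$ of the notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    k: int      # predictor index k_T(u), 0-based in this sketch
    tau: float  # splitting threshold tau_T(u)


def leaf_for(x, splits):
    """Route x from the root (node 0) down to a terminal node.

    `splits` maps each non-terminal node u to its Split; any node that is
    not a key of `splits` is treated as terminal.
    """
    u = 0
    while u in splits:
        s = splits[u]
        # Left child is 2u + 1, right child is 2u + 2.
        u = 2 * u + 1 if x[s.k] <= s.tau else 2 * u + 2
    return u


# A small illustrative tree: node 0 splits on predictor 1, node 1 on predictor 0.
splits = {0: Split(k=1, tau=1.5), 1: Split(k=0, tau=6.2)}
print(leaf_for((5.0, 0.4), splits))
```

Storing only the non-terminal nodes is one possible design: the terminal nodes are then implied as the children that do not themselves appear as keys.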


Example tree

[Figure: example tree from iris data; height = 4, log(p) = 134.866. Splits on Petal.Width (1.5 and 0.6) and Sepal.Length (6.2 and 5.9). Five terminal nodes, numbered 1 to 5, holding 30, 17, 13, 11, and 19 observations.]


Likelihood

Each terminal node (leaf) is viewed as a random sample from some distribution with density $\phi(\cdot \mid \theta_u)$, where $\theta_u$ depends only on the leaf $u$. Usually $\phi$ is either multinomial (categorical outcomes) or normal (continuous outcomes).
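Written out, and assuming that observations are conditionally independent given the tree and the leaf parameters (which the "random sample" wording suggests), the likelihood factorizes over leaves. The notation $\mathrm{leaves}(T)$ is introduced here only for illustration:

```latex
p(y \mid x, T, \Theta)
  \;=\; \prod_{u \in \mathrm{leaves}(T)} \;
        \prod_{i \,:\, x_i \text{ reaches } u} \phi(y_i \mid \theta_u)
```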


Tree prior

Simplify by using a prior of the form $p(\Theta, T) = p(\Theta \mid T)\,p(T)$. Chipman, George, and McCulloch (CGM) specify $p(T)$ implicitly by using a tree-generating process:

1. Begin by setting $T$ to be the trivial one-node tree.
2. Split a node with probability $p_{\text{split}}(u, T)$.
3. If a node splits, assign a splitting rule $\tau_T(u)$ according to some distribution $p(\tau_T(u) \mid u, T)$.

Update $T$ to reflect the new tree, and repeat steps 2 and 3.


Tree prior (cont.)

Consider

$p_{\text{split}}(u, T) = \alpha (1 + d_u)^{-\beta}$, $\beta \ge 0$, $0 \le \alpha \le 1$,

where $d_u$ is the depth of node $u$.

Consider finite sets of splitting values. Suggestion: choose $k$ uniformly from the available predictors, and then $\tau$ from the set of observed values if $x_k$ is quantitative or from the available subsets if it is qualitative.

For $\Theta \mid T$, use iid normal-inverse-gamma priors if constructing a regression tree and Dirichlet priors if constructing a classification tree. CGM suggest choosing hyperparameters based on fitting a greedy tree model.
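The tree-generating process with this splitting probability can be sketched in Python as follows. This is a simplified illustration, not CGM's implementation: the function names, the `max_depth` safety cap, and the uniform draw of the threshold are assumptions, and the sketch ignores which predictors and values are still "available" at a node given the data.

```python
import random


def p_split(depth, alpha, beta):
    """Splitting probability alpha * (1 + d_u)^(-beta)."""
    return alpha * (1 + depth) ** (-beta)


def depth(u):
    """Depth of node u under the 2u+1 / 2u+2 indexing; the root has depth 0."""
    d = 0
    while u > 0:
        u = (u - 1) // 2  # move to the parent
        d += 1
    return d


def sample_tree(alpha, beta, thresholds, rng, max_depth=20):
    """Draw a tree from the generating process.

    thresholds[k] lists the candidate split values for predictor k.
    Returns a dict mapping each non-terminal node to its (k, tau) rule.
    max_depth is a hypothetical safety cap, not part of the prior.
    """
    splits = {}
    frontier = [0]  # start from the trivial one-node tree
    while frontier:
        u = frontier.pop()
        d = depth(u)
        if d < max_depth and rng.random() < p_split(d, alpha, beta):
            k = rng.randrange(len(thresholds))  # predictor chosen uniformly
            tau = rng.choice(thresholds[k])     # threshold: uniform here (assumption)
            splits[u] = (k, tau)
            frontier.extend([2 * u + 1, 2 * u + 2])
    return splits


tree = sample_tree(0.95, 1.0, [[5.9, 6.2], [0.6, 1.5]], random.Random(1))
print(len(tree), "non-terminal nodes")
```

Larger $\beta$ makes deep splits less likely, so the prior mass concentrates on shallower trees; $\alpha$ controls how likely the root is to split at all.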


Fitting procedure

Proceed through MCMC. Interest focuses on the steps for sampling the tree structure. CGM use a Metropolis-Hastings step with a transition kernel choosing randomly among four moves:

- Grow: pick a terminal node and split it into two child nodes.
- Prune: pick a parent of two terminal nodes and collapse it.
- Change: pick an internal node and reassign its splitting rule.
- Swap: pick a parent-child pair of internal nodes and swap their splitting rules, unless the other child of the parent has the same splitting rule, in which case both children receive the splitting rule of the parent.

All moves are reversible, so the Markov chain is reversible.
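The grow and prune proposals, and the way each undoes the other, can be illustrated with a short Python sketch. It covers only the proposal step, not the Metropolis-Hastings acceptance ratio, and the function names and the dict-of-rules tree representation are assumptions made for the example.

```python
import random


def leaves(splits):
    """Terminal nodes of the tree whose non-terminal nodes are the keys of `splits`."""
    if not splits:
        return [0]  # trivial one-node tree
    out = []
    for u in splits:
        for child in (2 * u + 1, 2 * u + 2):
            if child not in splits:
                out.append(child)
    return sorted(out)


def prunable(splits):
    """Non-terminal nodes whose two children are both terminal."""
    return sorted(
        u for u in splits
        if 2 * u + 1 not in splits and 2 * u + 2 not in splits
    )


def grow(splits, rule, rng):
    """Grow proposal: pick a terminal node at random and give it a splitting rule."""
    u = rng.choice(leaves(splits))
    new = dict(splits)
    new[u] = rule
    return new


def prune(splits, rng):
    """Prune proposal: collapse a random parent of two terminal nodes."""
    candidates = prunable(splits)
    if not candidates:
        return dict(splits)  # nothing to prune in the trivial tree
    new = dict(splits)
    del new[rng.choice(candidates)]
    return new


rng = random.Random(0)
t1 = grow({}, (0, 1.5), rng)   # root becomes a non-terminal node
t2 = grow(t1, (1, 6.2), rng)   # one of its children is split
print(prune(t2, rng) == t1)
```

In this small case the only prunable node in `t2` is the one just grown, so the prune proposal returns exactly `t1`, which is the sense in which grow and prune are counterparts.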


Limitations

- Relatively slow mixing: tendency to stay in a local area.
- Tendency to get "stuck" in a local mode: CGM suggest repeated restarting, either from the trivial tree or from trees found by other methods such as bootstrap bumping.
- No single tree output; no good way of picking one "good" tree from the sample.


WTW propose two significant improvements to CGM's method:

- An improved prior on tree structure: the "pinball prior".
- A new M-H move: the "tree restructure" move.

They also allow for infinitely many splitting values, via a prior on the space of splitting values. A prior consisting of finitely many point masses recovers that of CGM as a special case.


Pinball prior

Idea: generate some number of terminal nodes $m(T)$, then "cascade" these nodes down the tree, randomly splitting left/right with some probability until nodes define individual leaves.

- Specify a prior density for tree size, $m(T) \sim \alpha(m(T))$. A natural choice is Poisson: $m(T) = 1 + \text{Pois}(\lambda)$ for some specified $\lambda$.
- Construct a prior density for splitting, $\beta(m_{l(u)}(T) \mid m_u(T))$, where $m_{l(u)}(T)$ is the number sent left from the $m_u(T)$ that have cascaded down to node $u$. There are a number of choices for $\beta$, e.g. uniform or binomial.
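The cascade can be sketched in Python under stated assumptions: tree size is drawn as $1 + \text{Pois}(\lambda)$, and $\beta$ is taken to be uniform on $\{1, \dots, m_u - 1\}$, so that every node receiving two or more leaves sends at least one each way. That support for $\beta$, the function names, and the hand-rolled Poisson sampler are choices made for this illustration, not details taken from WTW.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def cascade(u, m, rng, out):
    """Send m leaves down from node u, appending terminal node indices to `out`."""
    if m == 1:
        out.append(u)  # a single leaf stops here: u is terminal
        return
    m_left = rng.randint(1, m - 1)  # uniform beta on {1, ..., m - 1} (assumption)
    cascade(2 * u + 1, m_left, rng, out)
    cascade(2 * u + 2, m - m_left, rng, out)


def sample_pinball_tree(lam, rng):
    """Draw the terminal nodes of a tree from the sketched pinball prior."""
    m = 1 + sample_poisson(lam, rng)
    out = []
    cascade(0, m, rng, out)
    return sorted(out)


print(sample_pinball_tree(3.0, random.Random(2)))
```

Because the number of leaves is drawn first, the prior on tree size is controlled directly through $\lambda$, while the choice of $\beta$ governs how balanced the resulting trees tend to be.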


Tree restructure move

Idea: restructure the tree branches without changing the terminal categories.

- Begin at node 0.
- Recursively identify possible splitting rules that leave terminal categories unchanged.
- Choose some splitting rule; repeat until terminal nodes are fully specified.

This move radically restructures the tree without affecting categorization and eliminates the tendency to get stuck near local maxima: effective exploration of the posterior → better mixing, better posterior inference.


Example

Iris data: we wish to use sepal length and petal width to predict petal length. Divide the data into two sets: 30 of each species for tree creation, 20 for evaluation.

> iris.subsample.index iris.train iris.test bcart.iris