Advances in Decision Tree Construction

Johannes Gehrke, Cornell University
[email protected], http://www.cs.cornell.edu/johannes

Wei-Yin Loh, University of Wisconsin-Madison
[email protected], http://www.stat.wisc.edu/~loh

KDD 2001 Tutorial: Advances in Decision Trees
Gehrke and Loh
Tutorial Overview

Part I: Classification Trees
- Introduction
- Classification tree construction schema
- Split selection
- Pruning
- Data access
- Missing values
- Evaluation
- Bias in split selection

(Short Break)

Part II: Regression Trees
Classification

Goal: Learn a function that assigns a record to one of several predefined classes.
Classification Example

Example training database:
- Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
- Age is ordered; Car-type is a categorical attribute
- The class label indicates whether the person bought the product
- The dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Types of Variables

- Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
- Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
- Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
Definitions

- Random variables X1, ..., Xk (predictor variables) and Y (dependent variable)
- Xi has domain dom(Xi), Y has domain dom(Y)
- P is a probability distribution on dom(X1) × ... × dom(Xk) × dom(Y)
- The training database D is a random sample from P
- A predictor d is a function d: dom(X1) × ... × dom(Xk) → dom(Y)
Classification Problem

- C is called the class label; d is called a classifier.
- Let r be a record randomly drawn from P. Define the misclassification rate of d: RT(d,P) = P(d(r.X1, ..., r.Xk) ≠ r.C)
- Problem definition: Given a dataset D that is a random sample from probability distribution P, find a classifier d such that RT(d,P) is minimized.

(More on regression problems in the second part of the tutorial.)
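As a concrete illustration of the misclassification rate, the sketch below estimates RT(d, D) empirically on the example training database from earlier in the tutorial. The classifier d here is a deliberately trivial majority-class baseline (my own stand-in for illustration, not a method from the tutorial):

```python
# Empirical misclassification rate RT(d, D) on the example training database.
# Records are (Age, Car-type, Class); M = Minivan, S = Sport, T = Truck.
D = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]

def d(age, car_type):
    """Trivial baseline classifier: always predict the majority class."""
    return "Yes"

# Fraction of records r with d(r.X1, ..., r.Xk) != r.C
errors = sum(1 for age, car, cls in D if d(age, car) != cls)
rate = errors / len(D)
print(rate)  # 0.3 -- three of the ten records have class "No"
```

Any classifier that actually uses Age and Car-type should beat this 0.3 baseline on the training data; the decision trees introduced next do exactly that.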
Goals and Requirements

Goals:
- To produce an accurate classifier/regression function
- To understand the structure of the problem

Requirements on the model:
- High accuracy
- Understandable by humans, interpretable
- Fast construction for very large training databases
What are Decision Trees?

[Figure: an example decision tree and the partition of the predictor space it induces. The root splits on Age (threshold around 30); younger records are split further on Car Type (Minivan → YES; Sports, Truck → NO), while older records are classified YES. The companion plot shows the Age axis from 0 to 60 divided into the corresponding regions.]
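The example tree can be written out as nested if/else rules. A minimal sketch in Python, with the threshold and leaf labels read off the figure and the example training database (the exact threshold is my reconstruction, so treat it as illustrative):

```python
# The example decision tree as nested if/else rules.
# Root split on Age; younger records are split further on Car Type
# (Minivan -> YES; Sports, Truck -> NO), older records -> YES.
def classify(age, car_type):
    if age < 30:                 # root splitting predicate on Age
        if car_type == "M":      # internal split on Car Type
            return "Yes"         # Minivan leaf
        return "No"              # Sports / Truck leaf
    return "Yes"                 # older-buyers leaf

# Check against the example training database (Age, Car, Class):
D = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]
assert all(classify(age, car) == cls for age, car, cls in D)
```

With this threshold the tree classifies all ten training records correctly, which is what makes it a useful running example.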
Decision Trees

- A decision tree T encodes d (a classifier or regression function) in the form of a tree.
- A node t in T without children is called a leaf node; otherwise t is called an internal node.
- Each internal node has an associated splitting predicate. Most common are binary predicates. Example splitting predicates (as in the figure): Age <= 30; Car Type = Minivan
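A splitting predicate routes each record at an internal node to one of the node's children. A minimal sketch of a binary univariate split on the numerical attribute Age, applied to the example training database (the predicate Age <= 30 is illustrative):

```python
# A binary univariate split: the predicate sends each record to one child.
# Records are (Age, Car-type, Class).
D = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]

def predicate(r):
    return r[0] <= 30            # splitting predicate on Age

left = [r for r in D if predicate(r)]        # child for Age <= 30
right = [r for r in D if not predicate(r)]   # child for Age > 30
print(len(left), len(right))  # 8 2
```

Split selection, covered later in the tutorial, is the problem of choosing which such predicate to place at each internal node.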
Internal and Leaf Nodes

Internal nodes:
- Binary univariate splits:
  - Numerical or ordered X: X
- Binary multivariate splits: