Components of Learning
• Suppose that a bank wants to automate the process of evaluating credit card applications.
  – Input x (customer information that is used in the credit application).
  – Target function f: X → Y (ideal formula for credit approval), where X and Y are the input and output space, respectively.
  – Dataset D of input–output examples (x_1, y_1), …, (x_n, y_n).
  – Hypothesis (skill) with hopefully good performance: g: X → Y (the "learned" formula to be used).


Components of Learning
• Unknown target function f: X → Y (ideal credit approval formula)
• Training examples (x_1, y_1), …, (x_n, y_n) (historical records of credit customers)
• Hypothesis set H (set of candidate formulas)
• Learning algorithm A
• Final hypothesis g ≈ f (learned credit approval formula)

Use the data to compute a hypothesis g that approximates the target f.

Simple Learning Model: The Perceptron

For x = (x_1, …, x_d) (the features of the customer), compute a weighted score and:

Approve credit if  Σ_{i=1}^{d} w_i x_i > threshold,
Deny credit if     Σ_{i=1}^{d} w_i x_i < threshold.

Perceptron: A Mathematical Description

This formula can be written more compactly as

h(x) = sign( Σ_{i=1}^{d} w_i x_i − threshold ),

where h(x) = +1 means 'approve credit' and h(x) = −1 means 'deny credit'; sign(s) = +1 if s > 0 and sign(s) = −1 if s < 0. This model is called a perceptron.
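As a concrete illustration, the hypothesis h can be written in a few lines of Python (a minimal sketch; the function name, feature values, weights, and threshold below are hypothetical, not from the slides):

```python
def perceptron(x, w, threshold):
    """h(x) = sign(sum_i w_i*x_i - threshold): +1 means approve credit, -1 means deny."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) - threshold
    return +1 if score > 0 else -1

# Example: three customer features with hypothetical weights and threshold.
print(perceptron(x=[0.9, 0.2, 0.5], w=[0.4, 0.3, 0.3], threshold=0.5))   # +1 (approve)
```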


Perceptron: A Visual Description


π‘₯1

π‘₯2

π‘₯3

𝑦

1

0

0

-1

1

0

1

1

1

1

0

1

1

1

1

1

0

0

1

-1

0

1

0

-1

0

1

1

1

0

0

0

-1

Input Nodes

π‘₯1 π‘₯2

π‘₯3

0.3

0.3

Output Node

Ξ£

0.3

𝑑 = 0.4

𝑦
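The correspondence between the table and the diagram can be checked directly: with every weight equal to 0.3 and threshold 0.4, the perceptron reproduces each label in the table. A minimal sketch (the function and variable names are ours):

```python
# Verify that y = sign(0.3*x1 + 0.3*x2 + 0.3*x3 - 0.4) reproduces the table above.
rows = [
    (1, 0, 0, -1), (1, 0, 1, +1), (1, 1, 0, +1), (1, 1, 1, +1),
    (0, 0, 1, -1), (0, 1, 0, -1), (0, 1, 1, +1), (0, 0, 0, -1),
]

def predict(x1, x2, x3, w=0.3, t=0.4):
    score = w * x1 + w * x2 + w * x3 - t
    return +1 if score > 0 else -1

print(all(predict(x1, x2, x3) == y for x1, x2, x3, y in rows))   # True
```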


Perceptron Limitations

Data: the XOR function y = x_1 XOR x_2.

 x_1  x_2 | y = x_1 XOR x_2
  1    1  |       0
  1    0  |       1
  0    1  |       1
  0    0  |       0

Model: a perceptron with inputs x_1, x_2, weights w_1, w_2, threshold t, and output y.

[Figure: the four input points (0,0), (0,1), (1,0), (1,1) plotted in the plane; no single line separates the points labeled 1 from the points labeled 0.]

The following cannot all be true:
 w_1·1 + w_2·1 < t
 w_1·1 + w_2·0 > t
 w_1·0 + w_2·1 > t
 w_1·0 + w_2·0 < t
(The second and third inequalities give w_1 > t and w_2 > t, and the fourth gives t > 0, so w_1 + w_2 > t, contradicting the first.)
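Not a proof, but a quick empirical illustration of the same point: a brute-force search over a grid of weights and thresholds finds no single-layer perceptron that realizes XOR (a hypothetical sketch; the grid range and resolution are arbitrary):

```python
import itertools
import numpy as np

# XOR truth table: ((x1, x2), desired output).
cases = [((1, 1), 0), ((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]

def classify(w1, w2, t, x1, x2):
    """Single-layer perceptron: output 1 iff the weighted sum exceeds the threshold."""
    return 1 if w1 * x1 + w2 * x2 > t else 0

# Try every (w1, w2, t) combination on a coarse grid.
grid = np.linspace(-2, 2, 41)
solutions = [
    (w1, w2, t)
    for w1, w2, t in itertools.product(grid, grid, grid)
    if all(classify(w1, w2, t, *x) == y for x, y in cases)
]
print("parameter settings that realize XOR:", len(solutions))   # 0
```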

Perceptrons (in ℝ²)

h(x) = sign( w_0 + w_1 x_1 + w_2 x_2 )

The perceptron is a linear (binary) classifier:
• Customer features x: points in the plane
• Labels y: +1 or −1
• Hypothesis h: a line dividing the positive points from the negative points

Perceptron Learning (in ℝ²)

The algorithm determines w = (w_0, w_1, w_2) from the training data. At each iteration t, w has a current value w(t). The algorithm picks an instance (x(t), y(t)) that is misclassified by w(t) and uses it to update the weights:

w(t + 1) = w(t) + y(t) x(t)
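A minimal Python sketch of this perceptron learning update (the function name, stopping rule, and toy data are ours, added for illustration):

```python
import numpy as np

def perceptron_learning(X, y, max_iter=1000):
    """Perceptron learning on data X (n x d) with labels y in {-1, +1}.
    Returns a weight vector w whose first entry is the bias w_0."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend x_0 = 1 so w_0 acts as the bias
    w = np.zeros(X.shape[1])                   # start from the zero weight vector
    for _ in range(max_iter):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:            # every example classified correctly
            break
        i = misclassified[0]                   # pick one misclassified example
        w = w + y[i] * X[i]                    # update rule: w(t+1) = w(t) + y(t) x(t)
    return w

# Toy linearly separable data (hypothetical): two positive and two negative points.
X = np.array([[2.0, 3.0], [3.0, 4.0], [1.0, 1.5], [0.5, 0.5]])
y = np.array([+1, +1, -1, -1])
print(perceptron_learning(X, y))
```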

Summary of Perceptrons
• Simple and intuitive mathematical formulation.
• Fast.
• A linear classifier or discriminator.
  – Only capable of learning linearly separable patterns.
• Models a binary output variable.

Bayesian Learning
• In many applications, the relationship between the feature set and the class variable is non-deterministic.

Conditional Probability

Conditional probability is useful for understanding the dependencies among random variables. The conditional probability of the random variable Y given the random variable X, denoted P(Y | X), is defined as:

P(Y | X) = P(X, Y) / P(X)

If X and Y are independent (the value of one variable has no impact on the other), then P(Y | X) = P(Y).
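A small numerical sketch of the definition (the joint probabilities below are made up for illustration):

```python
# A tiny joint distribution P(X, Y) over binary X and Y (hypothetical numbers).
joint = {
    (0, 0): 0.30, (0, 1): 0.10,   # keys are (x, y); values are P(X = x, Y = y)
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_x(x):
    """Marginal P(X = x), obtained by summing the joint over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y_given_x(y, x):
    """Conditional P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    return joint[(x, y)] / p_x(x)

print(p_y_given_x(1, 1))   # 0.40 / 0.60 ≈ 0.667, so here X and Y are not independent
```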


Conditional Probability

[Figure: the joint distribution p(X, Y) for Y = 0 and Y = 1, the marginals p(X) and p(Y), and the conditional p(X | Y = 0), each plotted against X or Y.]

Bayes' Theorem

The conditional probabilities P(Y | X) and P(X | Y), where X and Y are random variables, can be expressed in terms of one another using a formula known as Bayes' theorem:

P(Y | X) = P(X | Y) P(Y) / P(X)

Law of Total Probability

If Y_1, Y_2, …, Y_k is the set of mutually exclusive and exhaustive outcomes of a random variable Y, then the denominator of Bayes' theorem can be expressed as:

P(X) = Σ_{i=1}^{k} P(X, Y_i) = Σ_{i=1}^{k} P(X | Y_i) P(Y_i)

This is called the law of total probability.

Bayes' Theorem for Classification

Using Bayes' theorem for classification problems, we get

P(Class | Feature Set) = P(Feature Set | Class) P(Class) / P(Feature Set)

Posterior Probability ∝ Class-Conditional Probability × Prior


Bayesian Decisions

[Figure: the joint distribution p(X, Y) for Y = 0 and Y = 1, and the class-conditional distributions p(X | Y = 0) and p(X | Y = 1), plotted against X.]

• Notice that feature values for class 1 (Y = 1) tend to be larger than those for class 0 (Y = 0). The dashed line denotes the decision boundary between classes 0 and 1 based on feature X.

Example Application of Bayes' Theorem

Consider a football game between two rival teams: Team 0 and Team 1. Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the games won by Team 0, only 30% of them come from playing on Team 1's football field. On the other hand, 75% of the victories for Team 1 are obtained while playing at home. If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner?

Example Application of Bayes' Theorem

Let X be the team hosting the match and Y be the winner of the match. Both X and Y can take on values from the set {0, 1}. Then:
• Probability Team 0 wins: P(Y = 0) = 0.65.
• Probability Team 1 wins: P(Y = 1) = 1 − 0.65 = 0.35.
• Probability Team 1 hosted the match it won: P(X = 1 | Y = 1) = 0.75.
• Probability Team 1 hosted the match won by Team 0: P(X = 1 | Y = 0) = 0.3.

Example Application of Bayes' Theorem

P(Y = 1 | X = 1) = P(X = 1 | Y = 1) P(Y = 1) / P(X = 1)
                 = P(X = 1 | Y = 1) P(Y = 1) / (P(X = 1, Y = 1) + P(X = 1, Y = 0))
                 = P(X = 1 | Y = 1) P(Y = 1) / (P(X = 1 | Y = 1) P(Y = 1) + P(X = 1 | Y = 0) P(Y = 0))
                 = (0.75 × 0.35) / (0.75 × 0.35 + 0.3 × 0.65)
                 = 0.5738

Since P(Y = 1 | X = 1) ≈ 0.5738 > 0.5, Team 1 is the more likely winner when it hosts the match.
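The same calculation in a few lines of Python (a minimal sketch; the variable names are ours):

```python
# Bayes' theorem for the football example, using the numbers from the slide.
p_y0 = 0.65                 # P(Y = 0): Team 0 wins
p_y1 = 1 - p_y0             # P(Y = 1): Team 1 wins
p_x1_given_y1 = 0.75        # P(X = 1 | Y = 1): Team 1 hosted the matches it won
p_x1_given_y0 = 0.30        # P(X = 1 | Y = 0): Team 1 hosted the matches Team 0 won

# Denominator P(X = 1) via the law of total probability.
p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0

# Posterior: probability that Team 1 wins given that Team 1 hosts.
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1
print(round(p_y1_given_x1, 4))   # 0.5738
```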


Classification Paradigms

• Discriminative: determine P(Class_k | x) directly.
• Generative: model P(x | Class_k) and P(Class_k) separately and use Bayes' theorem to find P(Class_k | x).

[Figure: coins of denominations 1, 5, 10, and 25 plotted by mass versus size, shown once for each paradigm.]
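A hypothetical contrast between the two paradigms on synthetic "coin" data, using scikit-learn's logistic regression as a discriminative model and Gaussian naive Bayes as a generative one (the data, class means, and measurements below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two made-up coin classes, each a Gaussian cloud in (mass, size) space.
class0 = rng.normal(loc=[2.5, 19.0], scale=0.2, size=(50, 2))   # e.g. a small coin
class1 = rng.normal(loc=[5.7, 24.0], scale=0.2, size=(50, 2))   # e.g. a large coin
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

# Discriminative: models P(class | x) directly.
discriminative = LogisticRegression().fit(X, y)
# Generative: models P(x | class) and P(class), then applies Bayes' theorem.
generative = GaussianNB().fit(X, y)

coin = np.array([[5.5, 23.8]])   # an unseen coin described by (mass, size)
print(discriminative.predict_proba(coin))
print(generative.predict_proba(coin))
```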

Classification Paradigms: Discriminative

• Consider learning coin classification.

[Figure: coins of denominations 1, 5, 10, and 25 plotted by mass versus size.]

Classification Paradigms: Generative

• Consider learning coin classification.

[Figure: coins of denominations 1, 5, 10, and 25 plotted by mass versus size.]