Components of Learning
• Suppose that a bank wants to automate the process of evaluating credit card applications.
– Input $\mathbf{x}$ (customer information that is used to make a credit application).
– Target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal formula for credit approval), where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output spaces, respectively.
– Dataset $D$ of input-output examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$.
– Hypothesis (skill) with hopefully good performance: $g: \mathcal{X} \to \mathcal{Y}$ (the "learned" formula to be used).
Components of Learning
• Unknown target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
• Training examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ (historical records of credit customers)
• Hypothesis set $\mathcal{H}$ (set of candidate formulas)
• Learning algorithm $A$
• Final hypothesis $g \approx f$ (learned credit approval formula)

Use the data to compute a hypothesis $g$ that approximates the target $f$.
Simple Learning Model: The Perceptron
For $\mathbf{x} = (x_1, \dots, x_d)$ ("features of the customer"), compute a weighted score and:

Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$,

Deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$.
Perceptron: A Mathematical Description
This formula can be written more compactly as

$$h(\mathbf{x}) = \operatorname{sign}\!\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right),$$

where $h(\mathbf{x}) = +1$ means "approve credit" and $h(\mathbf{x}) = -1$ means "deny credit"; $\operatorname{sign}(s) = +1$ if $s > 0$ and $\operatorname{sign}(s) = -1$ if $s < 0$. This model is called a perceptron.
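A minimal sketch of this decision rule in Python; the weights, threshold, and feature values below are illustrative, not taken from the slides:

```python
import numpy as np

def perceptron_predict(x, w, threshold):
    """Return +1 (approve credit) if the weighted score exceeds the threshold, else -1 (deny)."""
    score = np.dot(w, x)
    return 1 if score > threshold else -1

# Hypothetical customer with three features and made-up weights.
x = np.array([0.5, 1.0, 0.2])
w = np.array([0.4, 0.3, 0.8])
print(perceptron_predict(x, w, threshold=0.6))  # -> 1 (score 0.66 > 0.6)
```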
Perceptron: A Visual Description
x1  x2  x3 |  y
 1   0   0 | -1
 1   0   1 | +1
 1   1   0 | +1
 1   1   1 | +1
 0   0   1 | -1
 0   1   0 | -1
 0   1   1 | +1
 0   0   0 | -1
[Figure: a perceptron with three input nodes $x_1, x_2, x_3$, each connected by weight 0.3 to an output node $\Sigma$ with threshold $t = 0.4$, which produces the output $y$.]
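As a sanity check, here is a short sketch that evaluates the pictured perceptron (all weights 0.3, threshold $t = 0.4$) on the eight rows of the table above:

```python
# Each tuple is (x1, x2, x3, y) from the table above.
rows = [
    (1, 0, 0, -1), (1, 0, 1, +1), (1, 1, 0, +1), (1, 1, 1, +1),
    (0, 0, 1, -1), (0, 1, 0, -1), (0, 1, 1, +1), (0, 0, 0, -1),
]
w, t = (0.3, 0.3, 0.3), 0.4
for x1, x2, x3, y in rows:
    score = w[0] * x1 + w[1] * x2 + w[2] * x3
    prediction = 1 if score > t else -1   # sign(score - t)
    assert prediction == y
print("The perceptron reproduces all 8 labels.")
```

The output is $+1$ exactly when at least two of the three inputs are 1, since the score is then at least $0.6 > 0.4$.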
Perceptron Limitations

x1  x2 | y = x1 XOR x2
 1   1 |  0
 1   0 |  1
 0   1 |  1
 0   0 |  0
[Figure: a two-input perceptron (inputs $x_1, x_2$ with weights $w_1, w_2$, threshold $t$, output $y$) alongside the four XOR points $(0,0), (0,1), (1,0), (1,1)$ plotted in the plane.]
The following cannot all be true:

$w_1 \cdot 1 + w_2 \cdot 1 < t$
$w_1 \cdot 1 + w_2 \cdot 0 > t$
$w_1 \cdot 0 + w_2 \cdot 1 > t$
$w_1 \cdot 0 + w_2 \cdot 0 < t$

The second and third inequalities give $w_1 > t$ and $w_2 > t$, and the fourth gives $t > 0$, so $w_1 + w_2 > 2t > t$, contradicting the first. Hence no perceptron computes XOR.
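A brute-force sketch that searches a grid of $(w_1, w_2, t)$ values and confirms that no combination satisfies all four XOR constraints (the grid bounds are arbitrary):

```python
import itertools
import numpy as np

grid = np.linspace(-2.0, 2.0, 41)  # candidate values for w1, w2, and t
solutions = [
    (w1, w2, t)
    for w1, w2, t in itertools.product(grid, repeat=3)
    if w1 + w2 < t and w1 > t and w2 > t and 0 < t
]
print(len(solutions))  # -> 0: the four XOR constraints are jointly unsatisfiable
```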
Perceptrons (in $\mathbb{R}^2$)

$$h(\mathbf{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$$

The perceptron is a linear (binary) classifier:
• Customer features $\mathbf{x}$: points in the plane
• Labels $y$: $+1$ or $-1$
• Hypothesis $h$: a line (dividing positive from negative points)
Perceptron Learning (in $\mathbb{R}^2$)
The algorithm starts from an initial $\mathbf{w} = (w_0, w_1, w_2)$ and adjusts it using the training data. At each iteration $t$, $\mathbf{w}$ has a current value $\mathbf{w}(t)$. The algorithm picks a misclassified instance, say $(\mathbf{x}(t), y(t))$, and uses it to update $\mathbf{w}(t)$:

$$\mathbf{w}(t+1) = \mathbf{w}(t) + y(t)\,\mathbf{x}(t)$$
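A minimal sketch of this learning rule in Python, using an augmented input ($x_0 = 1$, so $w_0$ plays the role of the bias); the training set below is an illustrative, linearly separable toy example:

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm.
    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    Returns w = (w0, w1, ..., wd), with w0 the bias weight."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1
    w = np.zeros(Xa.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(Xa @ w) != y)[0]
        if misclassified.size == 0:
            return w                    # all training points classified correctly
        i = misclassified[0]            # pick a misclassified instance
        w = w + y[i] * Xa[i]            # update: w(t+1) = w(t) + y(t) x(t)
    return w

# Toy data: label +1 exactly when x1 + x2 > 1.
X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
y = np.array([-1, +1, +1, +1])
print(pla(X, y))
```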
Summary of Perceptrons
• Simple and intuitive mathematical formulation.
• Fast.
• A linear classifier (discriminator).
– Only capable of learning linearly separable patterns.
• Models a binary output variable.
Bayesian Learning
• In many applications, the relationship between the feature set and the class variable is non-deterministic.
Conditional Probability
Conditional probability is useful for understanding the dependencies among random variables. The conditional probability of the random variable $Y$ given the random variable $X$, denoted $P(Y \mid X)$, is defined as:

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}.$$

If $X$ and $Y$ are independent (the value of one variable has no impact on the other), then $P(Y \mid X) = P(Y)$.
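A small numeric illustration of the definition, using a made-up joint distribution over two binary variables:

```python
import numpy as np

# Hypothetical joint P(X, Y): rows index X in {0, 1}, columns index Y in {0, 1}.
P_XY = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_X = P_XY.sum(axis=1)           # marginal P(X)
P_Y_given_X1 = P_XY[1] / P_X[1]  # P(Y | X=1) = P(X=1, Y) / P(X=1)
print(P_Y_given_X1)              # -> [0.333..., 0.666...]
```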
Conditional Probability
[Figure: plots of the joint distribution $P(X, Y)$ for $X = 1$ and $X = 0$, and of the marginal $P(Y)$ next to the conditional $P(Y \mid X = 0)$, over the value of $Y$.]
Bayes' Theorem
The conditional probabilities $P(Y \mid X)$ and $P(X \mid Y)$, where $X$ and $Y$ are random variables, can be expressed in terms of one another using a formula known as Bayes' theorem:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}.$$
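Continuing the made-up joint table from above, a quick sanity check that Bayes' theorem recovers the same conditional as the direct definition:

```python
import numpy as np

P_XY = np.array([[0.30, 0.10],
                 [0.20, 0.40]])   # same hypothetical joint as before
P_X = P_XY.sum(axis=1)            # marginal P(X)
P_Y = P_XY.sum(axis=0)            # marginal P(Y)

direct = P_XY[1, 1] / P_X[1]                 # P(Y=1 | X=1) from the definition
p_x1_given_y1 = P_XY[1, 1] / P_Y[1]          # P(X=1 | Y=1)
via_bayes = p_x1_given_y1 * P_Y[1] / P_X[1]  # P(Y=1 | X=1) via Bayes' theorem
print(direct, via_bayes)                     # both -> 0.666...
```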
Law of Total Probability
If $Y_1, Y_2, \dots, Y_k$ is the set of mutually exclusive and exhaustive outcomes of a random variable $Y$, then the denominator of Bayes' theorem can be expressed as:

$$P(X) = \sum_{i=1}^{k} P(X, Y_i) = \sum_{i=1}^{k} P(X \mid Y_i)\, P(Y_i).$$

This is called the law of total probability.
Bayes' Theorem for Classification
Using Bayes' theorem for classification problems, we get

$$P(\text{Class} \mid \text{Feature set}) = \frac{P(\text{Feature set} \mid \text{Class})\, P(\text{Class})}{P(\text{Feature set})}$$

Posterior probability ∝ class-conditional probability × prior.
Bayesian Decisions
[Figure: the class-conditional densities $P(X \mid Y = 0)$ and $P(X \mid Y = 1)$ plotted over the feature $X$, with a dashed vertical decision boundary between them.]
• Notice that feature values for class 1 ($Y = 1$) tend to be larger than those for class 0 ($Y = 0$). The dashed line denotes the decision boundary between classes 0 and 1 based on feature $X$.
Example Application of Bayes' Theorem
Consider a football game between two rival teams: Team 0 and Team 1. Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the games won by Team 0, only 30% of them come from playing on Team 1's football field. On the other hand, 75% of the victories for Team 1 are obtained while playing at home. If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner?
Example Application of Bayes' Theorem
Let $X$ be the team hosting the match and $Y$ be the winner of the match. Both $X$ and $Y$ can take on values from the set $\{0, 1\}$. Then:
• Probability Team 0 wins: $P(Y = 0) = 0.65$.
• Probability Team 1 wins: $P(Y = 1) = 1 - 0.65 = 0.35$.
• Probability Team 1 hosted the match it won: $P(X = 1 \mid Y = 1) = 0.75$.
• Probability Team 1 hosted the match won by Team 0: $P(X = 1 \mid Y = 0) = 0.3$.
Example Application of Bayes' Theorem

$$P(Y{=}1 \mid X{=}1) = \frac{P(X{=}1 \mid Y{=}1)\, P(Y{=}1)}{P(X{=}1)} = \frac{P(X{=}1 \mid Y{=}1)\, P(Y{=}1)}{P(X{=}1, Y{=}1) + P(X{=}1, Y{=}0)}$$

$$= \frac{P(X{=}1 \mid Y{=}1)\, P(Y{=}1)}{P(X{=}1 \mid Y{=}1)\, P(Y{=}1) + P(X{=}1 \mid Y{=}0)\, P(Y{=}0)} = \frac{0.75 \times 0.35}{0.75 \times 0.35 + 0.3 \times 0.65} \approx 0.5738$$

Since $P(Y = 1 \mid X = 1) > 0.5$, Team 1 is the more likely winner of the hosted match.
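The same computation in a few lines of Python:

```python
# Bayes' theorem for the football example.
p_y1 = 0.35            # P(Y=1): Team 1 wins overall
p_y0 = 0.65            # P(Y=0): Team 0 wins overall
p_x1_given_y1 = 0.75   # P(X=1 | Y=1): Team 1 hosted the matches it won
p_x1_given_y0 = 0.30   # P(X=1 | Y=0): Team 1 hosted the matches Team 0 won

p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0  # law of total probability
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1
print(round(p_y1_given_x1, 4))  # -> 0.5738
```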
Classification Paradigms
• Discriminative: determine $P(\text{Class} \mid \mathbf{x})$ directly.
• Generative: model $P(\mathbf{x} \mid \text{Class})$ and $P(\text{Class})$ separately, and use Bayes' theorem to find $P(\text{Class} \mid \mathbf{x})$.
[Figure: two scatter plots of coins (denominations 1, 5, 10, 25) by mass and size.]
Classification Paradigms: Discriminative
• Consider learning coin classification.
[Figure: coins (denominations 1, 5, 10, 25) plotted by mass and size; the discriminative approach separates the classes directly in this feature space.]
Classification Paradigms: Generative
• Consider learning coin classification.
[Figure: coins (denominations 1, 5, 10, 25) plotted by mass and size; the generative approach models each denomination's distribution over mass and size.]
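A minimal generative-classifier sketch in the spirit of the coin example: estimate a Gaussian class-conditional $P(\mathbf{x} \mid \text{Class})$ per class (diagonal covariance, naive-Bayes style) plus a prior $P(\text{Class})$, then combine them with Bayes' theorem. The (mass, size) data here is fabricated for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Fabricated (mass, size) measurements for two coin classes.
X0 = np.array([[5.0, 21.0], [5.2, 20.6], [4.8, 21.3]])  # class 0 coins
X1 = np.array([[2.3, 17.9], [2.5, 18.1], [2.4, 18.2]])  # class 1 coins

def fit_class_conditional(X):
    """Diagonal-covariance Gaussian estimate of P(x | Class) from samples."""
    return multivariate_normal(mean=X.mean(axis=0),
                               cov=np.diag(X.var(axis=0) + 1e-3))

g0, g1 = fit_class_conditional(X0), fit_class_conditional(X1)
prior0 = len(X0) / (len(X0) + len(X1))   # P(Class=0) from class frequencies
prior1 = 1.0 - prior0                    # P(Class=1)

x = np.array([2.4, 18.0])                           # a new coin to classify
evidence = g0.pdf(x) * prior0 + g1.pdf(x) * prior1  # P(x), total probability
posterior1 = g1.pdf(x) * prior1 / evidence          # Bayes: P(Class=1 | x)
print(posterior1)                                   # close to 1 for this point
```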