07. Probability and Confirmation

Topics: I. Probability Axioms II. Interpretations of Probability III. Bayesian Confirmation Theory

Motivation: Confirmation is an epistemic notion; it can't be described in purely syntactic (formal logical) terms.

Fundamental Question: How does evidence contribute to the credibility of a theory?
• HD describes the confirmation relation as deductive entailment (a syntactic relation).
• Instance Confirmation describes the confirmation relation in terms of the formal notion of an instance of a universal sentence (a syntactic notion).
• The Bayesian Account describes the confirmation relation in terms of the notions of probability and degrees of belief.

Probability (Pr) function: A function that assigns a number between 0 and 1 to events. "Pr(B/A)" means "the probability of event B given event A".

I. Probability Axioms

A1: 0 ≤ Pr(B/A) ≤ 1.
    Every probability is a unique real number between 0 and 1.

A2: If A ⊨ B ("A entails B"), then Pr(B/A) = 1.

A3: Special Addition Rule. If B and C are mutually exclusive (can't both occur simultaneously), then Pr(B or C/A) = Pr(B/A) + Pr(C/A).

A4: General Multiplication Rule. Pr(B & C/A) = Pr(B/A)Pr(C/A & B).

3 Theorems (Consequences of Axioms):

(1) Negation Rule. Pr(∼B/A) = 1 − Pr(B/A)

    Proof: Pr(B or ∼B/A) = 1 = Pr(B/A) + Pr(∼B/A)   (A2, A3)

(2) Rule of Total Probability. Pr(C/A) = Pr(B/A)Pr(C/A & B) + Pr(∼B/A)Pr(C/A & ∼B)

    Proof: Pr(C/A) = Pr((B & C) or (∼B & C)/A)                       (C is logically equivalent to (B & C) or (∼B & C))
                   = Pr(B & C/A) + Pr(∼B & C/A)                      (A3)
                   = Pr(B/A)Pr(C/A & B) + Pr(∼B/A)Pr(C/A & ∼B)       (A4)


Example: Frisbee Factory, Part 1.
Machine #1 (M1): produces 800 frisbees/day, 1% of which are defective.
Machine #2 (M2): produces 200 frisbees/day, 2% of which are defective.

A = getting a frisbee produced on May Day
B = getting a frisbee produced by M1
∼B = getting a frisbee produced by M2
C = getting a defective frisbee

Question: What is the probability of getting a defective May Day frisbee? What is Pr(C/A)?

Note: Pr(C/A) = Pr(B/A) × Pr(C/A & B) + Pr(∼B/A) × Pr(C/A & ∼B)

In words: the probability of getting a defective frisbee, given it's a May Day frisbee, equals (the probability of getting an M1 frisbee, given it's a May Day frisbee) × (the probability of getting a defective frisbee, given it's an M1 May Day frisbee), plus (the probability of getting an M2 frisbee, given it's a May Day frisbee) × (the probability of getting a defective frisbee, given it's an M2 May Day frisbee).

So: Pr(C/A) = 0.8 × 0.01 + 0.2 × 0.02 = 0.012

Prediction: Given a cause, what is the effect?
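The arithmetic of the Rule of Total Probability here can be checked in a few lines of Python (the variable names are my own, not from the notes):

```python
# Rule of Total Probability applied to the frisbee factory (Part 1).

p_B = 800 / 1000        # Pr(B/A): frisbee came from M1 (800 of the 1000 made per day)
p_notB = 200 / 1000     # Pr(~B/A): frisbee came from M2
p_C_given_B = 0.01      # Pr(C/A & B): defect rate for M1
p_C_given_notB = 0.02   # Pr(C/A & ~B): defect rate for M2

# Pr(C/A) = Pr(B/A)Pr(C/A & B) + Pr(~B/A)Pr(C/A & ~B)
p_C = p_B * p_C_given_B + p_notB * p_C_given_notB
print(round(p_C, 3))  # 0.012
```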

(3) Bayes' Theorem

    Pr(B/A & C) = Pr(B/A)Pr(C/A & B) / Pr(C/A),   provided Pr(C/A) ≠ 0

                = Pr(B/A)Pr(C/A & B) / [Pr(B/A)Pr(C/A & B) + Pr(∼B/A)Pr(C/A & ∼B)]

    Proof: Pr(B & C/A) = Pr(B/A)Pr(C/A & B)   (A4)
           Pr(C & B/A) = Pr(C/A)Pr(B/A & C)   (A4)
           Since B & C is logically equivalent to C & B, the left-hand sides are equal; setting the right-hand sides equal and dividing by Pr(C/A) gives the first form.

    Note: The second form follows by expanding the denominator Pr(C/A) via the Rule of Total Probability.

Bayes' Theorem lets us calculate "inverse" probabilities; i.e., probabilities of past events, based on present events.

Example: Frisbee Factory, Part 2.
Question: Given a defective frisbee, what is the probability that it's an M1 frisbee? What is Pr(B/A & C)?

Pr(B/A & C) = Pr(B/A)Pr(C/A & B) / [Pr(B/A)Pr(C/A & B) + Pr(∼B/A)Pr(C/A & ∼B)]
            = (0.8 × 0.01) / (0.8 × 0.01 + 0.2 × 0.02)
            = 2/3

"Retrodiction": Given an effect, what is the cause?
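The "inverse" calculation above, sketched in Python with the Part 1 numbers (names are my own):

```python
# Bayes' Theorem applied to the frisbee factory (Part 2).

p_B, p_notB = 0.8, 0.2   # Pr(B/A), Pr(~B/A): frisbee came from M1, M2
p_C_given_B = 0.01       # Pr(C/A & B): defect rate for M1
p_C_given_notB = 0.02    # Pr(C/A & ~B): defect rate for M2

numerator = p_B * p_C_given_B
denominator = numerator + p_notB * p_C_given_notB   # Pr(C/A) by total probability
p_B_given_C = numerator / denominator               # Pr(B/A & C)
print(round(p_B_given_C, 4))  # 0.6667, i.e. 2/3
```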


II. Interpretations of Probability

1. Classical Interpretation (Laplace 1814, A Philosophical Essay on Probabilities)

Pr(A) = (# of favorable cases of A) / (# of equally possible cases of A's type)

Ex1. A = getting an even number on a roll of a standard die
     favorable cases of A = {2, 4, 6}
     equally possible cases of A's type = {1, 2, 3, 4, 5, 6}
     Pr(A) = 3/6

Why "equally" possible?

Ex2. A = getting two heads on the flip of two non-biased coins
     favorable cases of A = {HH}
     possible cases of A's type = {HH, HT, TT}
     equally possible cases of A's type = {HH, HT, TH, TT}
     Pr(A) = 1/4
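The distinction between "possible" and "equally possible" cases in Ex2 can be made concrete by enumerating the ordered outcomes (a minimal sketch):

```python
from itertools import product
from fractions import Fraction

# Classical probability: count favorable cases over equally possible cases.
# The ORDERED outcomes (HH, HT, TH, TT) are the equally possible cases;
# treating the unordered outcomes {HH, HT, TT} as equally possible
# would wrongly give Pr(A) = 1/3.

equally_possible = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
favorable = [o for o in equally_possible if o == ("H", "H")]
print(Fraction(len(favorable), len(equally_possible)))  # 1/4
```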

Principle of Indifference (PI): Two outcomes are equally possible if we have no reason to prefer one to the other.

Problem: PI may lead to violations of the Probability Axioms!

Ex3. Joe the Sloppy Bartender
Joe's sloppy mix for a 3:1 martini (3 parts gin, 1 part vermouth): anywhere from 2:1 to 4:1.
Two properties associated with Joe's sloppy martini:
(1) Ratio of gin to vermouth: from 2:1 to 4:1
(2) Proportion of vermouth: from 1/3 = 20/60 to 1/5 = 12/60
These are 2 non-linearly related properties!

Now: Consider the two following outcomes for Property (1):
• Next martini will have a ratio of gin to vermouth of 2:1 to 3:1.
• Next martini will have a ratio of gin to vermouth of 3:1 to 4:1.
PI: Both outcomes are equally possible!

Now: Consider two outcomes for Property (2):
• Next martini will have a proportion of vermouth of 20/60 to 16/60.
• Next martini will have a proportion of vermouth of 16/60 to 12/60.
PI: Both outcomes are equally possible!

But! The PI has now given us contradictory predictions:
(a) According to the PI applied to Property (1), there's a 50% chance that Joe's next martini will have a proportion of vermouth between 20/60 and 15/60 (= 1/4, the proportion in a 3:1 martini).
(b) According to the PI applied to Property (2), there's a 50% chance that Joe's next martini will have a proportion of vermouth between 20/60 and 16/60.

Two different probabilities for the same outcome: Violation of Axiom 1 (every probability is a unique real number).
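The inconsistency can be seen numerically with a Monte Carlo sketch (the sampling set-up is my illustration, not from the notes): the same event, vermouth proportion above 16/60, gets different probabilities depending on which property PI treats as uniform.

```python
import random

random.seed(0)
N = 100_000

# PI on Property (1): gin-to-vermouth ratio uniform on [2, 4].
# Proportion of vermouth = 1/(1 + ratio).
hits1 = sum(1 / (1 + random.uniform(2, 4)) > 16 / 60 for _ in range(N))

# PI on Property (2): proportion of vermouth uniform on [12/60, 20/60].
hits2 = sum(random.uniform(12 / 60, 20 / 60) > 16 / 60 for _ in range(N))

# Exact values are 0.375 under (1) and 0.5 under (2): two different
# probabilities for one and the same outcome.
print(hits1 / N, hits2 / N)
```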


2. Frequency Interpretation

Pr(A) = the limit of the sequence of relative frequencies of A.

Relative frequency of A = (# of actual occurrences of A) / (# of actual occurrences of events of A's type)

So: A sequence of relative frequencies = a sequence of real numbers!

Def. A sequence of real numbers has a limit L just when there is a member of the sequence after which all other members are very close to L. If there is no such member, then the sequence has no limit.

Ex1. A = getting a head on a flip of a non-biased coin
     Actual flips: H T H T T T H H T T H T T T H ...
     Relative frequencies of A: 1/1, 1/2, 2/3, 2/4, 2/5, 2/6, 3/7, ...
     Pr(A) = limit of the sequence {1/1, 1/2, 2/3, 2/4, 2/5, 2/6, 3/7, ...} = 1/2.

What this means: As the members of the sequence go to infinity, if there is a member after which all other members are very close to some number L, then Pr(A) = L. If not, then Pr(A) is not defined. (And we believe, in this case, that the limit exists and is 1/2.)
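The sequence of relative frequencies in Ex1 can be generated directly (the flip sequence is the one in the notes):

```python
from fractions import Fraction

# Relative frequencies of heads after each flip of the sequence in Ex1.
flips = "H T H T T T H H T T H T T T H".split()

heads = 0
rel_freqs = []
for n, flip in enumerate(flips, start=1):
    heads += flip == "H"                 # running count of heads
    rel_freqs.append(Fraction(heads, n)) # relative frequency after n flips

# -> ['1', '1/2', '2/3', '1/2', '2/5', '1/3', '3/7']
# (Fraction reduces 2/4 to 1/2 and 2/6 to 1/3.)
print([str(f) for f in rel_freqs[:7]])
```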

Two Problems

(i) How are limiting frequencies determined?

    finite data (behavior of finite initial sequence)  --extrapolate-->  limit (behavior of infinite sequence)

    How? And what justifies this process?

ASIDE: Mathematically, the problem is to determine a unique formula that describes a given sequence of relative frequencies. Once we have such a formula, determining whether it has a limit, and what that limit is, is not difficult. But an infinite number of formulas are compatible with any finite initial sequence of real numbers!

Claim: Any limit L is compatible with any finite initial data sequence.


Ex. Coin flips

Let {m/n} = {# heads / # total flips} be a finite sequence of relative frequencies.

Suppose the limiting frequency = 1/2. Then the sequence {(m + a)/(n + b)} also has limiting frequency = 1/2, for any fixed a ≤ b.

What this means:
(a) Add any sequence of b flips, a of which are heads, to the front of {m/n} and the limiting frequency remains unchanged.
(b) Chop off any sequence of b flips, a of which are heads, from {(m + a)/(n + b)} and the limiting frequency remains unchanged.

So: Strong Claim: The observed relative frequencies in any finite sample are irrelevant to whether a limiting frequency exists and what it is.
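A numerical sketch of claims (a)/(b): prepend a finite block of b flips, a of them heads, so every relative frequency m/n becomes (m + a)/(n + b), and watch the tail of the sequence stay put. The choice a = 0, b = 1000 (a block of 1000 straight tails) is my illustration.

```python
import random

random.seed(1)
N = 100_000
a, b = 0, 1000   # prepended block: 1000 flips, none of them heads
m = 0            # running count of heads
orig, shifted = [], []
for n in range(1, N + 1):
    m += random.random() < 0.5
    orig.append(m / n)                  # relative frequency m/n
    shifted.append((m + a) / (n + b))   # same data with the block prepended

# Both tails sit near 1/2: the finite prefix is washed out in the limit.
print(round(orig[-1], 3), round(shifted[-1], 3))
```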

(ii) How are Single Case Probabilities Explained? How does the Frequentist explain the probability of a single occurrence? Ex. Outcome of the 2008 election. What does it mean to say "The probability of a Democrat winning is 85%"?

3. Propensity Interpretation

Pr(A) = a measure of the causal tendency (propensity) to produce A.

Ex. A = getting a 6 on a roll of a standard die
    Pr(A) = 1/6 = measure of the propensity in the die to produce a 6 upon one roll

This explains single case probability.

Problem: How are "Inverse" Probabilities Explained?

Note: The Propensity Interpretation explains the probability of an outcome in terms of its cause.
But: Bayes' Theorem lets us calculate the probability of a cause in terms of its outcome/effect!
So: The Propensity Interpretation cannot explain all types of probabilities.

Important Distinction

Propensity Interp: Probabilities measure the tendency of a mechanism to produce a single outcome.
Frequency Interp:  Probabilities are properties of sequences of outcomes.

Both agree that probabilities are objective features of physical systems.

4. Personalist Interpretation

Pr(A) = the degree of belief in A of an ideal rational person whose set of beliefs conforms to the Probability Axioms.

Def 1. An incoherent set of beliefs is a set that does not conform to the Probability Axioms.

Def 2. A Dutch Book is a set of bets such that the subject loses no matter what the outcome of the event wagered on.

Claim: A person is subject to a Dutch Book if and only if that person holds an incoherent set of beliefs.

Example
Let A = getting heads on the next flip of a coin.
    ∼A = getting tails on the next flip of a coin.
Suppose: Joe's degree of belief in A = 2/3.   ⇒ 2:1 odds for heads
         Joe's degree of belief in ∼A = 2/3.  ⇒ 2:1 odds for tails
("2:1 odds" means "Joe is willing to risk $2 to gain $1.")

Only 2 possible outcomes:

              (A) Heads           (B) Tails
Heads bet:    pays off: +$1       fails: −$2
Tails bet:    fails: −$2          pays off: +$1
Net:          −$1                 −$1

Joe loses no matter what! Joe has succumbed to a Dutch Book!

Why Did Joe Succumb To a Dutch Book? Because his set of beliefs did not conform to the Probability Axioms. In particular, he failed to abide by the Negation Rule: Pr(A/K) + Pr(∼A/K) = 1, where K = "background knowledge". (Joe's degrees of belief sum to 4/3.)
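Joe's Dutch Book reduced to arithmetic (a minimal sketch; the function name is my own):

```python
# With degree of belief 2/3 in each outcome, Joe accepts 2:1 odds on heads
# AND 2:1 odds on tails: on each bet he risks $2 to win $1.

def net_payoff(outcome):
    """Joe's total winnings given the flip's outcome ('H' or 'T')."""
    heads_bet = +1 if outcome == "H" else -2   # 2:1 bet on heads
    tails_bet = +1 if outcome == "T" else -2   # 2:1 bet on tails
    return heads_bet + tails_bet

print(net_payoff("H"), net_payoff("T"))  # -1 -1: Joe loses either way
```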


Problem: Too "subjective"? Any set of beliefs will be judged rational so long as it conforms to the Probability Axioms. (Intuition: You can subscribe to really strange, WEIRD beliefs, so long as your belief system is consistent!) This means that the Personalist Interpretation does not explain or prescribe how the values of probabilities are to be set/determined (a problem similar to one faced by the Frequency Interpretation).

5. Logical Interpretation

Pr(A) = a weighted sum of all state descriptions of the universe in which A is true.

state description = a description of a possible way the universe could be ordered.

Example: Simple universe with 3 individuals and one property
Individuals: a, b, c
Property: F

State Descriptions:
1. Fa & Fb & Fc        5. Fa & ∼Fb & ∼Fc
2. Fa & Fb & ∼Fc       6. ∼Fa & Fb & ∼Fc
3. Fa & ∼Fb & Fc       7. ∼Fa & ∼Fb & Fc
4. ∼Fa & Fb & Fc       8. ∼Fa & ∼Fb & ∼Fc

How to assign weights? Carnap's Method:
(a) Group state descriptions according to similar structure.
(b) Assign equal weights to structure descriptions.
(c) Assign equal weights to all state descriptions with the same structure.

State Description        Weight     Structural Description   Weight
1. Fa & Fb & Fc          1/4        All F                    1/4
2. Fa & Fb & ∼Fc         1/12
3. Fa & ∼Fb & Fc         1/12       2 F, 1 ∼F                1/4
4. ∼Fa & Fb & Fc         1/12
5. Fa & ∼Fb & ∼Fc        1/12
6. ∼Fa & Fb & ∼Fc        1/12       1 F, 2 ∼F                1/4
7. ∼Fa & ∼Fb & Fc        1/12
8. ∼Fa & ∼Fb & ∼Fc       1/4        No F                     1/4


Now:  Suppose A is Fa (i.e., "Individual a has property F").
Then: A is true in state descriptions 1, 2, 3, 5, which have weights 1/4, 1/12, 1/12, 1/12.
So:   Pr(A) = 1/4 + 1/12 + 1/12 + 1/12 = 1/2

Note: Logical probabilities are not unique! They depend on how weights are assigned to state descriptions. Call Carnap's method of assignment m*. (So m*(A) = 1/2.)

But: Why pick Carnap's method m*?

Claim: It makes possible a confirmation function c* that "learns from experience".

Def. c*(H/E) = m*(H & E)/m*(E), where c*(H/E) means "the degree of confirmation of H given E".

Example:
Let:  H = Fc, E = Fa
Then: m*(E) = 1/2
      m*(H & E) = 1/4 + 1/12 = 1/3     (Fa & Fc are both true in state descriptions 1 and 3)
So:   c*(H/E) = (1/3)/(1/2) = 2/3

Now:  Suppose the evidence was E' = Fa & Fb.
Then: m*(E') = 1/4 + 1/12 = 1/3       (Fa & Fb are both true in state descriptions 1 and 2)
      m*(H & E') = 1/4                (Fa & Fb & Fc is true only in state description 1)
So:   c*(H/E') = (1/4)/(1/3) = 3/4

Thus: c*(H/E') > c*(H/E)!

What this means: H is confirmed by E' to a greater extent than it is confirmed by E. This makes sense: the claim that c has property F gains more support from the knowledge that a and b have property F than it does from the knowledge that just a has property F.
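The weights m* and the values of c* can be computed by brute-force enumeration of the 8 state descriptions (a sketch; the representation of sentences as predicates on states is my own):

```python
from fractions import Fraction
from itertools import product

# Carnap's m* and c* for the 3-individual universe.
# A state description is an assignment of F/not-F to (a, b, c);
# a "sentence" is a predicate saying whether it's true in a given state.

states = list(product([True, False], repeat=3))   # (Fa, Fb, Fc)

# m*: weight 1/4 per structure description, split equally within a structure.
structure_size = {k: sum(1 for s in states if sum(s) == k) for k in range(4)}
weight = {s: Fraction(1, 4) / structure_size[sum(s)] for s in states}

def m_star(sentence):
    return sum(weight[s] for s in states if sentence(s))

def c_star(h, e):
    return m_star(lambda s: h(s) and e(s)) / m_star(e)

H = lambda s: s[2]            # Fc
E = lambda s: s[0]            # Fa
E2 = lambda s: s[0] and s[1]  # E' = Fa & Fb

print(c_star(H, E), c_star(H, E2))  # 2/3 3/4
```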

Now: Suppose we had chosen a method of assigning weights that spreads them equally over state descriptions, instead of structure descriptions (Wittgenstein's proposal).

Claim: This particular method, call it m†, produces a confirmation function that does not learn from experience!


Wittgenstein's Method m†

State Description        Weight
1. Fa & Fb & Fc          1/8
2. Fa & Fb & ∼Fc         1/8
3. Fa & ∼Fb & Fc         1/8
4. ∼Fa & Fb & Fc         1/8
5. Fa & ∼Fb & ∼Fc        1/8
6. ∼Fa & Fb & ∼Fc        1/8
7. ∼Fa & ∼Fb & Fc        1/8
8. ∼Fa & ∼Fb & ∼Fc       1/8

Let c†(H/E) = m†(H & E)/m†(E)

Now:
If:   H = Fc, E = Fa
Then: m†(E) = 4 × 1/8 = 1/2
      m†(H & E) = 2 × 1/8 = 1/4     (Fa & Fc are both true in state descriptions 1 and 3)
So:   c†(H/E) = (1/4)/(1/2) = 1/2

Now:  Suppose the evidence was E' = Fa & Fb.
Then: m†(E') = 2 × 1/8 = 1/4        (Fa & Fb are both true in state descriptions 1 and 2)
      m†(H & E') = 1/8
So:   c†(H/E') = (1/8)/(1/4) = 1/2

Thus: c†(H/E') = c†(H/E)!

What this means: H is confirmed by E' to the same extent as it is confirmed by E. This does not make sense: the claim that c has property F should gain more support from the knowledge that a and b have property F than it should from the knowledge that just a has property F.
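The same brute-force check for Wittgenstein's m† (equal weight 1/8 on each state description) shows the confirmation values refusing to budge:

```python
from fractions import Fraction
from itertools import product

states = list(product([True, False], repeat=3))   # (Fa, Fb, Fc)

def m_dagger(sentence):
    # Wittgenstein's method: every state description gets weight 1/8.
    return sum(Fraction(1, 8) for s in states if sentence(s))

def c_dagger(h, e):
    return m_dagger(lambda s: h(s) and e(s)) / m_dagger(e)

H = lambda s: s[2]            # Fc
E = lambda s: s[0]            # Fa
E2 = lambda s: s[0] and s[1]  # E' = Fa & Fb

print(c_dagger(H, E), c_dagger(H, E2))  # 1/2 1/2: c† does not learn from experience
```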

General Problem for Logical Interpretation: Values for logical probabilities are not uniquely fixed!

Constraints on weight assignments:
(a) Sum of weights for all state descriptions must equal 1.
(b) Weight for any state description must be greater than 0.

So: Besides Carnap's method m*, there are in principle an infinite number of ways to assign weights. (And really, why should "learning from experience" be a criterion when it comes to defining the very meaning of probability?) But! If probabilities really are measures of a rational agent's degree of belief, then "learning from experience" may be a desirable characteristic...


III. Bayesian Confirmation Theory

Recall: We want a quantitative account of how evidence affects our belief in a theory.
Claim: Under a Personalist Interpretation, Bayes' Theorem gives us exactly such an account!

Let H = hypothesis, E = evidence, K = background knowledge. Bayes' Theorem becomes:

Pr(H/K & E) = Pr(H/K)Pr(E/K & H) / Pr(E/K),   provided Pr(E/K) ≠ 0

            = Pr(H/K)Pr(E/K & H) / [Pr(H/K)Pr(E/K & H) + Pr(∼H/K)Pr(E/K & ∼H)]

Personalist Interpretation of Bayes' Theorem:
Pr(H/K) = prior probability of H given K = degree of belief in H before E is obtained.
Pr(E/K & H) = likelihood of E given K and H = degree of belief in E, given H is true.
Pr(E/K) = expectedness of E given K = degree of belief in E regardless of whether H is true or false.
Pr(H/K & E) = posterior probability of H given K and E = degree of belief in H given background knowledge and evidence.

Exactly what we are looking for in the context of confirmation theory! It provides a quantitative measure of how belief in a theory is affected by evidence and background knowledge.

Bayesian Confirmation Theory Claims:
1. Belief in a theory is rational if it conforms to the Probability Axioms (Personalism) and, in the light of evidence, it is updated by Bayes' Theorem (Bayesianism).
2. "E confirms H relative to K" means "Pr(H/K & E) > Pr(H/K)".


Ex: Coin flips
H = Penny p is 2-headed
K = Penny p is either 2-headed or fair

Suppose John's prior is Pr(H/K) = 0.01 (John's initial degree of belief that p is 2-headed).

(a) Let E1 = 1 head observed after 1 flip. Then
    (i)   likelihood Pr(E1/K & H) = 1          (A2, since (K & H) ⊨ E1)
    (ii)  Pr(∼H/K) = 1 − 0.01 = 0.99           (Negation Rule)
    (iii) Pr(E1/K & ∼H) = 0.5                  (Pr(head on 1 flip/fair coin))

    So for John, Pr(H/K & E1) = (0.01)(1) / [(0.01)(1) + (0.99)(0.5)] ≈ 0.02

(b) Let E2 = 2 heads observed after 2 flips. Then
    (i)  Pr(E2/K & H) = 1
    (ii) Pr(E2/K & ∼H) = Pr(heads on 2 flips/fair coin)
                       = Pr(heads on flip 2 and heads on flip 1/fair coin)
                       = Pr(heads on 1 flip/fair coin) × Pr(heads on 1 flip/fair coin)
                       = (0.5)(0.5) = 0.25

    So for John, Pr(H/K & E2) = (0.01)(1) / [(0.01)(1) + (0.99)(0.25)] ≈ 0.04

(c) Similarly, Pr(H/K & E10) ≈ 0.91, since Pr(E10/K & ∼H) = (0.5)^10 = 1/1024.

SO: As more heads are reported, John's belief in H converges to 1 (certainty).

Now: Suppose Wes' prior probability in H given K is Pr(H/K) = 0.5. Then for Wes,
    Pr(H/K & E1) ≈ 0.67
    Pr(H/K & E2) ≈ 0.80
    Pr(H/K & E10) ≈ 0.99

Wes' belief in H is strengthened as more evidence accumulates.

Important Fact: John and Wes start out with different priors, but eventually their beliefs about H converge. This is the "washing out" of priors.
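The whole coin-flip example is one small function; John's column reproduces the values worked out above (the function name is my own):

```python
# Posterior Pr(H/K & En) after n heads in n flips, where H = "p is 2-headed"
# and K says p is either 2-headed or fair. The likelihood of n straight heads
# is 1 under H and 0.5**n under ~H.

def posterior(prior, n_heads):
    num = prior * 1.0
    den = num + (1 - prior) * 0.5 ** n_heads
    return num / den

# n, John's posterior (prior 0.01), Wes' posterior (prior 0.5):
# the two columns converge as n grows -- the "washing out" of priors.
for n in (1, 2, 10):
    print(n, round(posterior(0.01, n), 2), round(posterior(0.5, n), 3))
```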

Advantages of Bayesian Confirmation
(1) Models convergence of beliefs to certainty on positive evidence.
(2) "Wash-out" of priors: the convergence in (1) occurs regardless of initial prior probabilities.
(3) Provides a quantitative description of confirmation. (It gives us a numerical value for the degree of belief in a hypothesis given evidence.)

3 Problems

(1) Problem of Subjectivity

Note: A belief in a theory will be deemed rational just when it obeys the Probability Axioms (Personalism) and is updated in the light of evidence and background knowledge via Bayes' Theorem (Bayesianism).
But: This consistency constraint does not prescribe what the values of the priors, likelihoods, and expectedness must be.
So:  Belief in intelligent design, for instance, will be just as rational as belief in evolution, given that the creationist makes appropriate assignments to his priors and likelihoods.

(2) Problem of Old Evidence

Suppose: Pr(E/K) = 1             (E is known prior to H.)
And:     Pr(E/K & H) = 1         (H entails E.)
Then:    Pr(H/K & E) = Pr(H/K)   (Bayes' Theorem)
So:      E plays no role in how beliefs get assigned to H.
So:      Can't say that E confirms H. E is "old evidence".

But: This is problematic, since there are cases in science in which old evidence did affect belief in a theory. The perihelion advance in Mercury's orbit was known before Einstein constructed his general theory of relativity (GR); it constituted old evidence with respect to GR. However, the perihelion advance was taken as confirming evidence for GR; it provided additional support that most physicists at the time used to condition their beliefs in the theory. The upshot is that the Bayesian account of confirmation cannot describe how such old evidence conditions belief.


(3) Problem of Non-Zero Priors

Note: For Bayes' Theorem to work, we can't assign Pr(H/K) = 0. There must be some initial degree of belief in H to begin with.

Popper's Claim: Pr(H/K) = 0 for all non-tautological universal H's (H's that make universal claims and that are not trivially true).

Proof: Let H = (i)Pai    (a generic universal H: "Every individual ai has property P", where i ranges over all individuals)

Note: H entails (Pa1 & Pa2 & ... & Pan), for any n.

So:  Pr(H/K) ≤ lim[n→∞] Pr(Pa1 & Pa2 & ... & Pan/K)    (∗)

Now assume:
(I) Pr(Pa1 & Pa2 & ... & Pan/K) = Pr(Pa1/K)Pr(Pa2/K)...Pr(Pan/K), for all n    (independence)
(E) Pr(Pam/K) = Pr(Pan/K), for any m, n                                        (exchangeability)

Then: From (I),

     lim[n→∞] Pr(Pa1 & ... & Pan/K) = lim[n→∞] Pr(Pa1/K)...Pr(Pan/K)    (∗∗)

Now: From (A1), any particular factor Pr(Pai/K) in the limit on the RHS of (∗∗) satisfies 0 ≤ Pr(Pai/K) ≤ 1.

So:  Either Pr(Pai/K) = 0, or Pr(Pai/K) = 1/r for some number r > 1. (The case r = 1, i.e. Pr(Pai/K) = 1, represents a trivially true (i.e., tautologous) universal claim.)

Now: From (E), all factors in the limit on the RHS of (∗∗) are equal.

So:  Either they are all 0, hence the limit is zero, or they are all equal to 1/r for some r > 1. In the latter case, the RHS becomes lim[n→∞] (1/r)^n = 0.

So:  Either way, the RHS of (∗∗) vanishes, and we have

     Pr(H/K) ≤ lim[n→∞] Pr(Pa1 & Pa2 & ... & Pan/K) = 0

Thus: By (A1), Pr(H/K) = 0.
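The last step of the proof is just the fact that (1/r)^n → 0 for r > 1. A quick numerical illustration (p = 0.99 is my choice; under (I) and (E) the conjunction's probability is p^n for the common factor p = Pr(Pai/K)):

```python
# Even a near-certain common factor drives the infinite conjunction to 0.
p = 0.99
for n in (10, 100, 1000, 10_000):
    print(n, p ** n)
```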


Response: To block the conclusion, we can challenge either of assumptions (I) or (E).

(1) (E) claims that instances of a property P are exchangeable. This assumption guarantees "Weak Hume Projectability", which might be appealing to a Bayesian.

Def. A property P is Weakly Hume Projectible relative to background knowledge K just when, for the sequence ... a−2, a−1, a0, a1, a2, ... and any n,

     lim[k→∞] Pr(Pan/Pan−1 & Pan−2 & ... & Pan−k & K) = 1

"If all a's have been P's for as far back as we care to consider, then the next a will be a P with certainty."

Claim. Exchangeability entails Weak Hume Projectability.

(2) (I) claims that all instances of a property P are probabilistically independent of each other. Popper says: If (I) doesn't hold, this implies some form of causal glue between events. A Bayesian may respond: The "glue" need not be causal, but merely probabilistic!
