Formal Probability Basics

We like to think of "probability" formally as a function that assigns a real number to an event. We denote by H the basic experimental context in which events will arise. Very often H will be a hypothesis; its complement is denoted by H^c or H̄.

Let E and F be any events that might occur under H. Then a probability function P(E|H) (spoken as "E given H") is defined by:

P1. 0 ≤ P(E|H) ≤ 1 for all E, H.
P2. P(H|H) = 1 and P(H̄|H) = 0.
P3. P(E ∪ F|H) = P(E|H) + P(F|H) whenever E ∩ F ∩ H = ∅, i.e., whenever it is impossible for E and F to occur together under H. Usually we consider E ∩ F = ∅ and say that the events are mutually exclusive.

Overview of Bayesian Statistics – p. 2/15

Formal Probability Basics (contd.)

If E is an event, then we denote its complement (not E) by Ē or E^c. Since E ∩ Ē = ∅, P2 and P3 give: P(Ē|H) = 1 − P(E|H).

Conditional probability of E given F:

P(E|F ∩ H) = P(E ∩ F|H) / P(F|H)

We will often write EF for E ∩ F . Compound probability rule: write the above as P (E|F H)P (F |H) = P (EF |H).
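The compound probability rule can be checked by direct enumeration. A minimal sketch (the die-roll events here are illustrative, not from the slides), with H taken as the whole sample space:

```python
from fractions import Fraction
from itertools import product  # not needed here, kept minimal

# Sample space: one roll of a fair die; H is the whole space.
H = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}          # "even"
F = {4, 5, 6}          # "greater than 3"

def prob(A, given):
    # P(A | given) = |A ∩ given| / |given| under a uniform distribution
    return Fraction(len(A & given), len(given))

lhs = prob(E, F & H) * prob(F, H)   # P(E|FH) P(F|H)
rhs = prob(E & F, H)                # P(EF|H)
assert lhs == rhs == Fraction(1, 3)
```

Exact fractions avoid floating-point noise when verifying identities like this.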


Formal Probability Basics

Independent Events: E and F are said to be independent (we will write E ⊥ F) if the occurrence of one does not change the probability of the other. Then P(E|FH) = P(E|H), and we have the following multiplication rule: P(EF|H) = P(E|H) P(F|H).
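The multiplication rule can be verified on a product sample space, where independence holds by construction. A small sketch (the coin-flip events are illustrative, not from the slides):

```python
from itertools import product
from fractions import Fraction

# Two fair coin flips; the uniform product space makes the flips independent.
H = set(product("HT", repeat=2))
E = {w for w in H if w[0] == "H"}   # first flip is heads
F = {w for w in H if w[1] == "H"}   # second flip is heads

def P(A):
    return Fraction(len(A & H), len(H))

# Multiplication rule for independent events: P(EF) = P(E) P(F)
assert P(E & F) == P(E) * P(F) == Fraction(1, 4)
```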

Homework: Can you formally show that if P(E|FH) = P(E|H) then P(F|EH) = P(F|H)?

Marginalization: We can express P(E|H) by "marginalizing" over the event F:

P(E|H) = P(EF|H) + P(EF̄|H) = P(F|H) P(E|FH) + P(F̄|H) P(E|F̄H).
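The marginalization identity is easy to check numerically from a joint table. A sketch, with made-up joint probabilities (not from the slides) that sum to 1:

```python
# Joint probabilities P(·|H) over the four combinations of E and F.
# The numbers are illustrative only.
joint = {("E", "F"): 0.30, ("E", "Fbar"): 0.25,
         ("Ebar", "F"): 0.20, ("Ebar", "Fbar"): 0.25}

P_E = joint[("E", "F")] + joint[("E", "Fbar")]      # direct marginal P(E|H)
P_F = joint[("E", "F")] + joint[("Ebar", "F")]      # P(F|H)
P_Fbar = 1 - P_F
P_E_given_F = joint[("E", "F")] / P_F               # P(E|FH)
P_E_given_Fbar = joint[("E", "Fbar")] / P_Fbar      # P(E|F̄H)

# P(E|H) = P(F|H) P(E|FH) + P(F̄|H) P(E|F̄H)
assert abs(P_E - (P_F * P_E_given_F + P_Fbar * P_E_given_Fbar)) < 1e-12
```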

Application - prognosis example

Joint and conditional distributions:

Table 1: Survival and Stage

                E = Early   Ē = Late   Marginals
F = survive       0.72        0.02       0.74
F̄ = dead          0.18        0.08       0.26
Marginals         0.90        0.10       1.00

Can you identify the different numbers as joint, conditional and marginal probabilities?

Odds and log-odds: Any probability p can be expressed as odds O, where O = p/(1 − p). The natural logarithm of the odds is called the logit: logit(p) = log[p/(1 − p)].
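The quantities in Table 1 can be recovered in a few lines; this sketch computes a marginal, a conditional, and the logit from the joint probabilities:

```python
import math

# Table 1 entries as joint probabilities P(stage, outcome)
joint = {("early", "survive"): 0.72, ("late", "survive"): 0.02,
         ("early", "dead"): 0.18, ("late", "dead"): 0.08}

P_early = joint[("early", "survive")] + joint[("early", "dead")]      # 0.90
P_survive = joint[("early", "survive")] + joint[("late", "survive")]  # 0.74
P_survive_given_early = joint[("early", "survive")] / P_early         # conditional

def logit(p):
    # log-odds of a probability p
    return math.log(p / (1 - p))

print(P_survive_given_early, logit(P_survive))
```

Here the conditional probability of surviving given an early stage is 0.72/0.90 = 0.80, while the marginal log-odds of survival is log(0.74/0.26).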

Bayes Theorem

Observe that:

P(EF|H) = P(E|FH) P(F|H) = P(F|EH) P(E|H)

⇒ P(F|EH) = P(F|H) P(E|FH) / P(E|H).

This is Bayes’ Theorem, named after Reverend Thomas Bayes – an English clergyman with a passion for gambling! Often this is written as:

P(F|EH) = P(F|H) P(E|FH) / [P(F|H) P(E|FH) + P(F̄|H) P(E|F̄H)]
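Bayes' Theorem with the expanded denominator takes only a few lines. A sketch using a hypothetical screening test (the numbers are illustrative, not from the slides):

```python
# Hypothetical screening test: F = "has disease", E = "test positive".
P_F = 0.01             # prior P(F|H): disease prevalence
P_E_given_F = 0.95     # P(E|FH): sensitivity
P_E_given_Fbar = 0.05  # P(E|F̄H): false-positive rate

# Denominator via marginalization: P(E|H) = P(F)P(E|F) + P(F̄)P(E|F̄)
P_E = P_F * P_E_given_F + (1 - P_F) * P_E_given_Fbar

# Bayes' Theorem
P_F_given_E = P_F * P_E_given_F / P_E
print(round(P_F_given_E, 3))
```

Despite the accurate test, the posterior probability of disease given a positive result is only about 0.16, because the prior is so small.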


Principles of Bayesian Statistics

Two hypotheses: H0: the excess relative risk for thrombosis for women taking the pill exceeds 2; H1: it is under 2. Data x collected from a controlled trial show a relative risk of 3.6. The probability, or likelihood, of the data under each hypothesis is P(x|H), where H is H0 or H1.


Bayes Theorem updates the probability of each hypothesis as

P(H|x) = P(H) P(x|H) / P(x)

Marginal probability: P (x) = P (H0 )P (x|H0 ) + P (H1 )P (x|H1 )

Reexpress: P (H|x) ∝ P (H)P (x|H)
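The update for two competing hypotheses is a one-pass computation. A sketch with hypothetical stand-ins for P(H) and P(x|H) in the thrombosis example (the numeric values are assumptions, not from the slides):

```python
# Hypothetical priors P(H) and likelihoods P(x|H) for two hypotheses.
prior = {"H0": 0.5, "H1": 0.5}
lik   = {"H0": 0.9, "H1": 0.3}   # P(x|H) for the observed relative risk

# Marginal probability P(x) = Σ_H P(H) P(x|H)
P_x = sum(prior[h] * lik[h] for h in prior)

# Posterior P(H|x) ∝ P(H) P(x|H), normalized by P(x)
posterior = {h: prior[h] * lik[h] / P_x for h in prior}
print(posterior)
```

With equal priors the posterior ratio equals the likelihood ratio: here P(H0|x) = 0.45/0.60 = 0.75.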


Famous Game Show Example! Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car; behind the others, goats. You pick a door, say number 1, and the host, who knows what is behind the doors, opens another door, say number 3, which has a goat. He says to you, “Do you want to pick door number 2?” Is it to your advantage to switch your choice of doors?

Homework: Provide a formal solution to the above problem using Bayes Theorem.
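A quick Monte Carlo simulation confirms the answer (this is a numerical check only; the formal Bayes-Theorem derivation is what the homework asks for):

```python
import random

def play(switch, trials=100_000, seed=1):
    # Simulate the game-show problem: returns the win rate for a strategy.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens a door that is neither the contestant's pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=True), play(switch=False))   # ≈ 2/3 vs ≈ 1/3
```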


Likelihood and Prior

Bayes theorem in English:

Posterior distribution = (prior × likelihood) / Σ (prior × likelihood)

The denominator sums prior × likelihood over all possible parameter values.

It is a fixed normalizing factor that is (usually) extremely difficult to evaluate.
Curse of dimensionality.
Markov Chain Monte Carlo to the rescue!
WinBUGS software: www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml


Prior Elicitation

A clinician is interested in π: the proportion of children between ages 5–9 in a particular population having asthma symptoms. The clinician has prior beliefs about π, summarized as a "prior support" and "prior weights". Data: a random sample of 15 children shows 2 having asthma symptoms. The likelihood is obtained from the Binomial distribution:

(15 choose 2) π^2 (1 − π)^13

Note: (15 choose 2) is a "constant" and *can* be ignored in the computations, though it is accounted for in the next Table.
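The Binomial likelihood for this data is a one-liner; a sketch (the function name is mine):

```python
from math import comb

def likelihood(pi, y=2, n=15):
    # Binomial likelihood for y successes in n trials at parameter pi.
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

# comb(15, 2) = 105 is constant in pi, so pi**2 * (1 - pi)**13 carries
# the same information about pi; the constant cancels on normalization.
print(round(likelihood(0.10), 3))
```

Evaluating at π = 0.10 gives 0.267, the first likelihood entry in the computation table below.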


Computation Table

Prior Support   Prior weight   Likelihood   Prior × Likelihood   Posterior
0.10            0.10           0.267        0.027                0.098
0.12            0.15           0.287        0.043                0.157
0.14            0.25           0.290        0.072                0.265
0.16            0.25           0.279        0.070                0.255
0.18            0.15           0.258        0.039                0.141
0.20            0.10           0.231        0.023                0.084
Total           1.00                        0.274                1.000

Posterior: obtained by dividing each Prior × Likelihood entry by the normalizing constant 0.274.

HOMEWORK: Redo the posterior computations with prior weights of 1/6 for each of the above six support values; this is an uninformative prior. Redo (again!) the computations with a more informative prior assigning weight 0.4 to support values 0.14 and 0.16, and weight 0.05 to each of the remaining four values. Comment on the sensitivity of the posteriors to the priors.
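The whole computation table can be reproduced in a few lines, which also makes the homework reruns with different prior weights trivial:

```python
from math import comb

# Discrete prior on the support values, updated by the Binomial likelihood.
support = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
prior   = [0.10, 0.15, 0.25, 0.25, 0.15, 0.10]

lik = [comb(15, 2) * p**2 * (1 - p)**13 for p in support]
joint = [w * l for w, l in zip(prior, lik)]          # prior × likelihood
norm = sum(joint)                                    # normalizing constant
posterior = [j / norm for j in joint]

for p, w, l, j, q in zip(support, prior, lik, joint, posterior):
    print(f"{p:.2f}  {w:.2f}  {l:.3f}  {j:.3f}  {q:.3f}")
```

To redo the homework, replace the `prior` list with `[1/6] * 6` or with `[0.05, 0.05, 0.4, 0.4, 0.05, 0.05]` and rerun.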


The Sampling Perspective

The previous example used direct evaluation of the posterior probabilities, which is feasible for simpler discrete problems. Modern Bayesian analysis derives complete posterior densities, say p(θ | y), by drawing samples from that density. Samples are of the parameters themselves, or of functions of them. If θ1 , . . . , θM are samples from p(θ | y), then density estimates are created by feeding them into a density plotter. Similarly, samples of f (θ), for some function f , are obtained by simply feeding the θi ’s to f (·). In principle M can be arbitrarily large – it comes from the computer and depends only upon the time we have for analysis. Do not confuse this with the data sample size n, which is limited by experimental constraints.
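A sketch of this sampling perspective, assuming (my assumption, not from the slides) a Beta(3, 14) posterior for π, i.e. a uniform prior updated with 2 successes in 15 Binomial trials:

```python
import random
import statistics

# Draw M samples from an assumed Beta(3, 14) posterior p(θ|y).
rng = random.Random(42)
M = 50_000
theta = [rng.betavariate(3, 14) for _ in range(M)]

# Any function f(θ) is summarized by feeding the draws through f,
# e.g. the posterior distribution of the odds θ/(1-θ):
odds = [t / (1 - t) for t in theta]

print(round(statistics.mean(theta), 3), round(statistics.median(odds), 3))
```

The draws can be passed to a density plotter for the full posterior curve; here we just report a posterior mean (close to the exact Beta mean 3/17 ≈ 0.176) and the median of the odds.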


Issues in Sampling-based Analysis

Direct Monte Carlo: Some algorithms can be designed to generate independent samples exactly from the posterior distribution. In these situations there are NO convergence problems or issues; sampling is called exact.

Markov Chain Monte Carlo (MCMC): In general, exact sampling may not be possible or feasible. MCMC is a far more versatile set of algorithms that can be invoked to fit more general models. Note: wherever direct Monte Carlo applies, MCMC will provide excellent results too.

Convergence issues: There is no free lunch! The power of MCMC comes at a cost: the initial samples do not necessarily come from the desired posterior distribution; rather, they need to converge to it. Therefore one needs to assess convergence, discard output before convergence and retain only post-convergence samples. The period before convergence is called burn-in.

Diagnosing convergence: Usually a few parallel chains are run from rather different starting points. The sample values are plotted (trace-plots) for each of the chains. The time for the chains to “mix” together is taken as the time to convergence.

Good news! All this is automated in WinBUGS. So, as users, we need only learn how to specify good Bayesian models and implement them in WinBUGS. This will be the focus of the course.
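To make burn-in concrete, here is a minimal random-walk Metropolis sampler (an illustration of MCMC, not what WinBUGS runs). The target, assumed for illustration, is a Beta(3, 14)-shaped density known only up to its normalizing constant, which is exactly the situation MCMC handles:

```python
import math
import random

def log_target(t):
    # Unnormalized log-density proportional to t^2 (1-t)^13 on (0, 1).
    return 2 * math.log(t) + 13 * math.log(1 - t) if 0 < t < 1 else -math.inf

rng = random.Random(0)
chain, t = [], 0.5                 # deliberately poor starting value
for _ in range(20_000):
    prop = t + rng.gauss(0, 0.1)   # random-walk proposal
    # Metropolis acceptance step
    if rng.random() < math.exp(min(0.0, log_target(prop) - log_target(t))):
        t = prop
    chain.append(t)

post = chain[2_000:]               # discard burn-in samples
print(round(sum(post) / len(post), 3))   # close to the exact mean 3/17
```

In practice convergence would be diagnosed with trace-plots of several chains, as the slide describes, rather than by fiat as here.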

Principle of Predictions

Classical: impute a point estimate of θ into the model. In Bayesian analysis we summarize θ by its entire posterior distribution p(θ | y). In that spirit we obtain complete predictive distributions by averaging the likelihood over the full posterior distribution. In the sampling approach, the posterior density p(θ | y) is represented by samples θ1 , . . . , θM . The out-of-sample prediction of z is:

p(z | y) = Σθ p(z | θ) p(θ | y)

Implementation: We sample from p(z | y). For each θi from p(θ | y), this amounts to drawing zi from the data likelihood p(z | θ) with θ = θi .
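A sketch of this two-step predictive sampling, again assuming (my assumption) a Beta(3, 14) posterior and a Binomial(15, θ) likelihood for a future count z:

```python
import random

# Posterior predictive draws: θ_i ~ p(θ|y), then z_i ~ p(z|θ_i).
rng = random.Random(7)
M = 20_000
z = []
for _ in range(M):
    theta_i = rng.betavariate(3, 14)                      # posterior draw
    z_i = sum(rng.random() < theta_i for _ in range(15))  # Binomial(15, θ_i)
    z.append(z_i)

pred_mean = sum(z) / M   # close to 15 × E[θ|y] = 15 × 3/17 ≈ 2.6
print(round(pred_mean, 2))
```

Note the predictive distribution is wider than any single Binomial(15, θ̂): it averages the likelihood over posterior uncertainty in θ rather than plugging in a point estimate.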
