Bayesian inference for logistic models using Polya-Gamma latent variables

Nicholas G. Polson∗

arXiv:1205.0310v1 [stat.ME] 2 May 2012

University of Chicago

James G. Scott† Jesse Windle‡ University of Texas at Austin

First Draft: August 2011 This Draft: April 2012

Abstract

We propose a new data-augmentation strategy for fully Bayesian inference in models with logistic likelihoods. To illustrate the method we focus on four examples: binary logistic regression; contingency tables with fixed margins; multinomial logistic regression; and negative binomial models for count data. In each case, we show how our data-augmentation strategy leads to simple, effective methods for posterior inference that: (1) entirely circumvent the need for analytic approximations, numerical integration, or Metropolis–Hastings; and (2) outperform the existing state of the art, both in ease of use and in computational efficiency. We also describe how the method may be extended to other logit models, including topic models and nonlinear mixed-effects models. Our approach appeals heavily to the novel class of Polya-Gamma distributions; much of the paper is devoted to constructing this class in detail, and demonstrating its relevance for Bayesian inference. The methods described here are implemented in the R package BayesLogit.



∗[email protected]  †[email protected]  ‡[email protected]


1 Introduction

1.1 A latent-variable representation of logistic models

Many common statistical models involve a likelihood that decomposes into a product of terms of the form
$$ L_i = \frac{(e^{\psi_i})^{a_i}}{(1 + e^{\psi_i})^{b_i}} \, , \qquad (1) $$
where ψ_i is a linear function of parameters, and where a_i and b_i involve the response for subject i. One familiar case arises in binary logistic regression, where we observe outcomes y_i ∈ {0, 1}, i = 1, . . . , n, and assume that
$$ \mathrm{pr}(y_i = 1) = \frac{e^{\psi_i}}{1 + e^{\psi_i}} \, . \qquad (2) $$
Here ψ_i = x_i^T β for a known p-vector of predictors x_i and a common, unknown set of regression coefficients β = (β_1, . . . , β_p)^T.

In this paper, we propose a new latent-variable representation of likelihoods involving terms like (1). This leads directly to efficient Gibbs-sampling algorithms for a very wide class of Bayesian models that have previously eluded simple treatment. This paper focuses on four common models: binary logistic regression; contingency tables with logistic-normal priors; multinomial logistic regression; and negative-binomial models for count data. But the basic approach also applies to a much wider class of models, including topic models, discrete mixtures of logits, random-effects and mixed-effects models, non-Gaussian factor models, and hazard models.

Our representation is based upon a new family of Polya-Gamma distributions, which we introduce here, and describe thoroughly in Section 2.

Definition 1. A random variable X has a Polya-Gamma distribution with parameters b > 0 and c > 0, denoted X ∼ PG(b, c), if
$$ X \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)} \, , \qquad (3) $$
where each g_k ∼ Ga(b, 1) is an independent gamma random variable.

With this definition in place, we proceed directly to our main result, which we prove in Section 2.3.

Theorem 1. Let p(ω) denote the density of the random variable ω ∼ PG(b, 0), b > 0. Then the following integral identity holds for all a ∈ R:
$$ \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b} e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^2/2}\, p(\omega)\, d\omega \, , \qquad (4) $$
where κ = a − b/2. Moreover, the conditional distribution
$$ p(\omega \mid \psi) = \frac{e^{-\omega \psi^2/2}\, p(\omega)}{\int_0^{\infty} e^{-\omega \psi^2/2}\, p(\omega)\, d\omega} \, , $$
which arises in treating the integrand in (4) as an unnormalized joint density in (ψ, ω), is also in the Polya-Gamma class: (ω | ψ) ∼ PG(b, ψ).

Equation (4) is significant because it allows us to write (1) in a conditionally Gaussian form, given the latent precision ω_i. Thus in the important case where ψ_i = x_i^T β is a linear function of predictors, the full likelihood in β will also be conditionally Gaussian. Moreover, the second half of Theorem 1 shows that the full conditional distribution for each ω_i, given ψ_i, is in the same class as the prior p(ω_i). Theorem 1 thus enables full Bayesian inference for a wide class of models involving logistic likelihoods and conditionally Gaussian priors. All of these models can now be fit with a Gibbs sampler having only two basic steps: multivariate normal draws for the main parameters affecting the ψ_i terms, and Polya-Gamma draws for the latent variables ω_i. (We will describe this second step in detail.)
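To make the conditionally Gaussian form explicit, it may help to record the completing-the-square step implied by (4); this small derivation is added here for clarity, and the regression case is worked out in Section 3.1. Conditional on ω, a term of the form (1) contributes
$$ e^{\kappa \psi}\, e^{-\omega \psi^2/2} = \exp\left\{ -\frac{\omega}{2}\left(\psi - \frac{\kappa}{\omega}\right)^2 + \frac{\kappa^2}{2\omega} \right\} \propto \exp\left\{ -\frac{\omega}{2}\left(\psi - \frac{\kappa}{\omega}\right)^2 \right\} \, , $$
so that, given ω, each such term acts like a Gaussian pseudo-observation κ/ω for ψ with precision ω.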

1.2 Existing work

As we will show, our approach has significant advantages over existing approaches for Bayesian inference in logistic models. While traditional approaches have involved numerical integration, normal approximations to the likelihood (Carlin, 1992; Gelman et al., 2004), or the Metropolis–Hastings algorithm (Dobra et al., 2006), the most direct comparison here is with the more recent methods of Holmes and Held (2006) and Frühwirth-Schnatter and Frühwirth (2007). Both of these papers rely upon a random-utility construction of the logistic model, where the outcomes y_i are assumed to be thresholded versions of an underlying continuous quantity z_i. For the purposes of exposition, assume a linear model where
$$ y_i = \begin{cases} 1, & z_i \ge 0 \\ -1, & z_i < 0 \end{cases} \qquad z_i = x_i^T \beta + \epsilon_i \, , \qquad \epsilon_i \sim \mathrm{Lo}(1) \, , \qquad (5) $$
where ε_i ∼ Lo(1) has a standard logistic distribution. Upon marginalizing over z_i, the likelihood in (2) is recovered. A useful analogy is with the work of Albert and Chib (1993), who represent the probit regression model in the same way, subject to the modification that ε_i has a standard normal distribution.

The difficulty with this latent-utility representation is that the logistic distribution does not lead to an easy method for sampling from the posterior distribution of β. This is very different from the probit case, where one may exploit the conditional normality of the likelihood in β, given z_i.

Figure 1: Directed acyclic graphs depicting two latent-variable constructions for the logistic-regression model: the random-utility model of Holmes and Held (2006) and Frühwirth-Schnatter and Frühwirth (2007), on the left; versus our direct data-augmentation scheme, on the right.

To respond to this difficulty, Holmes and Held (2006) introduce a second layer of latent variables for the logit model, writing the errors ε_i in (5) as
$$ (\epsilon_i \mid \phi_i) \sim \mathrm{N}(0, \phi_i) \, , \qquad \phi_i = (2\lambda_i)^2 \, , \qquad \lambda_i \sim \mathrm{KS} \, , $$
where λ_i has a Kolmogorov–Smirnov distribution. Marginalizing over λ_i recovers the original latent-utility representation of the likelihood for y_i. This exploits the fact that the logistic distribution is a scale mixture of normals (Andrews and Mallows, 1974). Posterior sampling proceeds by iterating three steps: sample z_i from a truncated normal distribution, given λ_i, y_i, and β; sample β from a multivariate normal distribution, given z_i and λ_i; and sample λ_i, given z_i and β. This latter update can be accomplished via adaptive rejection sampling, using the alternating-series method of Devroye (1986).

Frühwirth-Schnatter and Frühwirth (2007) take a different approach based upon auxiliary mixture sampling. They approximate p(ε_i) in (5) using a discrete mixture of normals, rather than a scale mixture:
$$ (\epsilon_i \mid \phi_i) \sim \mathrm{N}(0, \phi_i) \, , \qquad \phi_i \sim \sum_{k=1}^{K} w_k\, \delta_{\phi^{(k)}} \, , $$
where δ_φ indicates a Dirac measure at φ. The weights w_k and the points φ^{(k)} in the discrete mixture are fixed for a given choice of K so that the Kullback–Leibler divergence from the true distribution of the random utilities is minimized. Frühwirth-Schnatter and Frühwirth (2007) argue that the choice of K = 10 leads to a good approximation, and list the optimal weights and variances for this choice. This results in a Gibbs-sampling scheme that also requires two levels of latent variables, and that has a very similar structure to that of Holmes and Held (2006): sample z_i, given φ_i, y_i, and β; sample β, given z_i and φ_i; and sample φ_i, given z_i and β, from its discrete conditional distribution. The difficulties of dealing with the Kolmogorov–Smirnov distribution for λ_i are avoided, at the cost of approximating the true model for ε_i with a thinner-tailed Gaussian mixture.

A similar approach to ours is that of Gramacy and Polson (2012), who concentrate on a latent-variable representation of a powered-up version of the logit likelihood. This representation is very useful for obtaining classical penalized-likelihood estimates via simulation. But the difficulty with performing fully Bayesian inference using this representation is that, when applied to the likelihood in (2), it leads to an improper mixing distribution for the latent variable. This requires modifications that make simulation very challenging in the general logit case. In contrast, our Polya-Gamma representation sidesteps this issue entirely, and results in a proper mixing distribution for all common choices of a_i, b_i in (1).

1.3 Outline

In Section 2, we construct the class of Polya–Gamma variables, and prove our key result (Theorem 1). Then in Section 3 we illustrate the power of this result by showing how it leads to a simple, efficient sampling algorithm for posterior inference in binary logistic regression. We also benchmark our method extensively against those of Holmes and Held (2006) and Frühwirth-Schnatter and Frühwirth (2007). In Sections 4–6 we illustrate the method in three further settings: contingency tables, multinomial logistic regression, and negative-binomial regression. Then in Section 7, we address the key requirement for Theorem 1 to be practically useful: namely, an efficient method for simulating Polya-Gamma random variables. The identity in (3) suggests a naïve way to approximate this distribution using a large number of independent gamma draws. But in Section 7, we describe a far more efficient and stable way to sample exactly from the Polya-Gamma distribution, which avoids the difficulties that can result from truncating an infinite sum. To do so, we use the alternating-series method from Devroye (1986) to construct a rejection sampler for the Polya-Gamma distribution. As we will prove, this sampler is highly efficient: on average, it requires 1.000803 proposals for every accepted draw, regardless of the parameters of the underlying distribution. The proposal distribution itself is easy to sample from, and involves only exponential and inverse-Gaussian draws, together with a handful of simple auxiliary computations. We conclude in Section 8 with some final remarks about how the method can be generalized to other settings.


2 Polya-Gamma random variables

2.1 The case PG(b, 0)

The key step in our approach is the construction of a novel class of Polya-Gamma random variables. Together with the sampling method described in Section 7, this distributional theory greatly simplifies Bayesian inference in models with logistic likelihoods.

Following Devroye (2009), a random variable J* has a Jacobi distribution if
$$ J^* \overset{D}{=} \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{e_k}{(k - 1/2)^2} \, , \qquad (6) $$

where the e_k are independent, standard exponential random variables. The moment-generating function of this distribution is
$$ E(e^{-t J^*}) = \frac{1}{\cosh(\sqrt{2t})} \, . \qquad (7) $$
The density of this distribution is expressible as a multi-scale mixture of inverse-Gaussians; all moments are finite and expressible in terms of Riemann zeta functions. The Jacobi is related to the Polya distribution (Barndorff-Nielsen et al., 1982), in that if J* has a Jacobi distribution, and ω = J*/4, then ω ∼ Pol(1/2, 1/2).

Let ω_k ∼ Pol(1/2, 1/2) for k = 1, . . . , n be a set of independent Polya-distributed random variables. A PG(n, 0) random variable is then defined by the sum ω⋆ = Σ_{k=1}^n ω_k. Its moment generating function follows from that of a Jacobi distribution:
$$ E\{\exp(-\omega_k t)\} = \frac{1}{\cosh(\sqrt{t/2})} \, . $$
Therefore, for the Polya-Gamma with parameters (n, 0), we have
$$ E\{\exp(-\omega^{\star} t)\} = \frac{1}{\cosh^{n}(\sqrt{t/2})} \, . \qquad (8) $$
The name “Polya-Gamma” arises from the following observation. From (6),
$$ \omega^{\star} \overset{D}{=} \frac{1}{2\pi^2} \sum_{l=1}^{n} \left( \sum_{k=1}^{\infty} \frac{e_{l,k}}{(k - 1/2)^2} \right) \, , $$
where the e_{l,k} are independent exponential random variables. Rearranging terms,
$$ \omega^{\star} \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\sum_{l=1}^{n} e_{l,k}}{(k - 1/2)^2} \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2} \, , $$
where the g_k are independent Gamma(n, 1) random variables. More generally we may replace n with any positive real number b.

2.2 The general PG(b, c) class

The general PG(b, c) class arises through an exponential tilting of the PG(b, 0) density. Specifically, we define the density of a PG(b, c) random variable as
$$ p(\omega \mid b, c) = \frac{\exp\!\left(-\frac{c^2}{2}\omega\right) p(\omega \mid b, 0)}{E_{\omega}\!\left\{\exp\!\left(-\frac{c^2}{2}\omega\right)\right\}} \, , \qquad (9) $$
where p(ω | b, 0) is the density of a PG(b, 0) random variable. The expectation in the denominator is taken with respect to the PG(b, 0) distribution, ensuring that p(ω | b, c) integrates to 1.

Using the Weierstrass factorization theorem, write the moment-generating function of the PG(b, c) distribution as
$$ E_{\omega}\left\{\exp\left(-\frac{1}{2}\omega t\right)\right\} = \frac{\cosh^{b}\!\left(\frac{c}{2}\right)}{\cosh^{b}\!\left(\frac{\sqrt{c^2 + t}}{2}\right)} = \prod_{k=1}^{\infty} \left( \frac{1 + \frac{c^2}{4(k-1/2)^2 \pi^2}}{1 + \frac{c^2 + t}{4(k-1/2)^2 \pi^2}} \right)^{b} = \prod_{k=1}^{\infty} (1 + d_k t)^{-b} \, , \quad \text{where } d_k = \left\{ 4\left(k - \tfrac{1}{2}\right)^2 \pi^2 + c^2 \right\}^{-1} . $$
Each term in the product is recognizable as the moment-generating function of the gamma distribution. We can therefore write a PG(b, c) random variable as
$$ \omega \overset{D}{=} 2 \sum_{k=1}^{\infty} d_k\, \mathrm{Ga}(b, 1) = \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\mathrm{Ga}(b, 1)}{(k - \tfrac{1}{2})^2 + c^2/(4\pi^2)} \, . $$
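This sum representation suggests a simple, if inefficient, way to generate approximate PG(b, c) draws by truncating the infinite sum at K terms. The R sketch below is purely illustrative (Section 7 develops an exact and far more efficient sampler) and checks a Monte Carlo estimate of the Laplace transform against the closed form just derived.

    # Naive, approximate PG(b, c) sampler: truncate the sum representation at K terms.
    rpg_naive <- function(n, b, c, K = 200) {
      k <- seq_len(K)
      replicate(n, sum(rgamma(K, shape = b, rate = 1) /
                         ((k - 1/2)^2 + c^2 / (4 * pi^2))) / (2 * pi^2))
    }

    set.seed(1)
    b <- 2; cc <- 1.5; tval <- 3
    omega <- rpg_naive(20000, b, cc)
    c(monte.carlo = mean(exp(-omega * tval / 2)),
      closed.form = cosh(cc / 2)^b / cosh(sqrt(cc^2 + tval) / 2)^b)
    # The two numbers should agree up to Monte Carlo and truncation error.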

2.3 Proof of Theorem 1

The key step in the proof of Theorem 1 is simply the construction of the Polya-Gamma class itself. With this theory in place, the proof proceeds very straightforwardly.

Proof. Appealing to the moment-generating function in (8), we may write the likelihood in (4) as
$$ \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = \frac{2^{-b} \exp\{\kappa \psi\}}{\cosh^{b}(\psi/2)} = 2^{-b} e^{\kappa \psi}\, E_{\omega}\{\exp(-\omega \psi^2/2)\} \, , $$
where the expectation is taken with respect to ω ∼ PG(b, 0), and where κ = a − b/2.

Turn now to the conditional distribution
$$ p(\omega \mid \psi) = \frac{e^{-\omega \psi^2/2}\, p(\omega)}{\int_0^{\infty} e^{-\omega \psi^2/2}\, p(\omega)\, d\omega} \, , $$
where p(ω) is the density of the prior, PG(b, 0). This is of the same form as (9), with ψ = c. Therefore (ω | ψ) ∼ PG(b, ψ).

3 Binary logistic regression

3.1 Overview of approach

Several examples will serve to illustrate the utility of Theorem 1, beginning with the case of binary logistic regression. Let y_i ∈ {0, 1} be a binary outcome for unit i, with corresponding predictors x_i = (x_{i1}, . . . , x_{ip}). Suppose that the logistic model of (2) holds, and that we model the log odds of success as ψ_i = x_i^T β. Appealing to Theorem 1, the contribution of y_i to the likelihood in β may be written as
$$ L_i(\beta) = \frac{\{\exp(x_i^T \beta)\}^{y_i}}{1 + \exp(x_i^T \beta)} \propto \exp(\kappa_i x_i^T \beta) \int_0^{\infty} \exp\{-\omega_i (x_i^T \beta)^2/2\}\, p(\omega_i \mid 1, 0)\, d\omega_i \, , $$
where κ_i = y_i − 1/2, and where p(ω_i | 1, 0) is the density of a Polya-Gamma random variable with parameters (1, 0).

Combining all n terms gives the following expression for the conditional likelihood in β, given ω = (ω_1, . . . , ω_n):
$$ L(\beta \mid \omega) = \prod_{i=1}^{n} L_i(\beta \mid \omega_i) \propto \prod_{i=1}^{n} \exp\left\{ \kappa_i x_i^T \beta - \omega_i (x_i^T \beta)^2/2 \right\} \propto \prod_{i=1}^{n} \exp\left\{ -\frac{\omega_i}{2} \left( x_i^T \beta - \kappa_i/\omega_i \right)^2 \right\} \propto \exp\left\{ -\frac{1}{2} (z - X\beta)^T \Omega (z - X\beta) \right\} \, , $$
where z = (κ_1/ω_1, . . . , κ_n/ω_n), and where Ω = diag(ω_1, . . . , ω_n). Given all ω_i terms, we therefore have a conditionally Gaussian likelihood in β, with pseudo-response vector z, design matrix X, and covariance matrix Ω^{-1}.

Suppose that β has a conditionally normal prior β ∼ N(b, B). We may then sample from the joint posterior distribution for β by iterating two simple Gibbs steps:
$$ (\omega_i \mid \beta) \sim \mathrm{PG}(1, x_i^T \beta) \, , \qquad (\beta \mid y, \omega) \sim \mathrm{N}(m, V) \, , $$
where
$$ V = (X^T \Omega X + B^{-1})^{-1} \, , \qquad m = V(X^T \Omega z + B^{-1} b) \, , $$
recalling that z_i = ω_i^{-1}(y_i − 1/2) itself depends upon the augmentation variable ω_i, along with the binary outcome y_i. If the number of regressors p exceeds the number of observations, then these matrix computations can be done faster by exploiting the Sherman–Morrison–Woodbury identity. Moreover, the outer products x_i x_i^T can be pre-evaluated, which will speed computation of X^T Ω X at each step of the Gibbs sampler.
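As a concrete illustration, the two Gibbs steps above can be sketched in a few lines of R. The sketch assumes a Polya-Gamma sampler rpg(n, b, z) that returns n PG(b, z) draws (the BayesLogit package mentioned in the abstract provides one; Section 7 describes the algorithm behind it), and is meant only as a minimal, readable implementation rather than production code.

    gibbs_logit <- function(y, X, b0, B0, n.iter = 2000) {
      p <- ncol(X)
      kappa <- y - 1/2
      B0.inv <- solve(B0)
      beta <- rep(0, p)
      out <- matrix(NA, n.iter, p)
      for (s in seq_len(n.iter)) {
        psi   <- drop(X %*% beta)
        omega <- rpg(length(y), 1, psi)              # omega_i | beta ~ PG(1, x_i' beta)
        V <- solve(t(X) %*% (omega * X) + B0.inv)    # (X' Omega X + B^{-1})^{-1}
        m <- V %*% (t(X) %*% kappa + B0.inv %*% b0)  # X' Omega z = X' kappa, since z_i = kappa_i / omega_i
        beta <- drop(m + t(chol(V)) %*% rnorm(p))    # beta | omega ~ N(m, V)
        out[s, ] <- beta
      }
      out
    }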

3.2 Example: spam classification

To demonstrate the approach, we fit a logistic-regression model to the data set on spam classification from the online machine-learning repository at the University of California, Irvine. The data set has n = 4601 observations and p = 57 predictors, all of which correspond to the frequencies with which certain words appear in the body of an e-mail. In fitting the model, we omitted the frequencies of the words “George” and “cs.” These are the e-mail recipient’s first name and home department, respectively, and therefore not generalizable to spam filters for other e-mail recipients.


Figure 2: Posterior distributions for the regression coefficients on the spam-classification data. The black dots are the posterior means; the light- and dark-grey lines indicate 90% and 50% posterior credible intervals, respectively; and the crosses are the corresponding maximum-likelihood estimates.

As a prior for β, we used a logistic distribution, following Gelman et al. (2008):
$$ p(\beta) \propto \frac{\{\exp(x_0^T \beta)\}^{1/2}}{1 + \exp(x_0^T \beta)} \, , $$
where each entry of the prior design point x_0 is the sample mean of the corresponding column of the design matrix X. This encodes the prior belief that the probability of a success is 1/2 at the global average value of the predictors, and can be interpreted as a pseudo-observation of y_0 = 1/2 with a prior sample size of 1. This prior is easily accommodated by the Polya-Gamma data-augmentation framework. In principle, any prior which is a scale mixture of normals may be used instead; see, e.g., Polson and Scott (2012).

Results of the model fit are shown in Figure 2. For the most part, the Bayesian model yields results very similar to the maximum-likelihood solution. This is unsurprising, given the large amount of data and the weakly informative prior.
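To spell out how this prior folds into the scheme of Section 3.1 (a short derivation added for clarity, using only Theorem 1): the prior is itself of the form (1) with a = 1/2 and b = 1, so κ_0 = a − b/2 = 0 and
$$ p(\beta) \propto \int_0^{\infty} \exp\{-\omega_0 (x_0^T \beta)^2/2\}\, p(\omega_0 \mid 1, 0)\, d\omega_0 \, . $$
Conditional on ω_0, the prior therefore contributes one extra Gaussian pseudo-observation with design row x_0, pseudo-response 0, and precision ω_0, and ω_0 gets its own update (ω_0 | β) ∼ PG(1, x_0^T β) alongside the ω_i's.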

3.3 Comparison with existing methods

To demonstrate the merits of our method, we benchmark it against both those of Frühwirth-Schnatter and Frühwirth (2007) and Holmes and Held (2006) — specifically, their joint updating technique, as it has the largest effective sample size of those they consider. We use the same metrics, datasets, and sample sizes described in Holmes and Held (2006). The efficiency of each sampler is measured in two ways: by the average step size between successive draws of the regression coefficients, and by the effective sample size of each component averaged across all components of β. The average step size is
$$ \mathrm{Dist.} = \frac{1}{N - 1} \sum_{i=1}^{N-1} \| \beta^{(i)} - \beta^{(i+1)} \| $$
and the average effective sample size is
$$ \mathrm{ESS} = \frac{1}{P} \sum_{i=1}^{P} \mathrm{ESS}_i \, , \qquad \text{where } \mathrm{ESS}_i = M \Big/ \Big(1 + 2 \sum_{j=1}^{k} \rho_i(j)\Big) \, , $$
M is the number of post-burn-in samples, and ρ_i(j) is the jth autocorrelation of β_i as estimated by the initial monotone sequence estimator of Geyer (1992). The four datasets considered are the Pima Indians diabetes dataset used in Ripley (1996), and the Australian Credit, German Credit, and Heart datasets found in the STATLOG project (Michie et al., 1994). For each dataset and each method we ran 10 simulations of 10,000 samples each, discarding the first 1,000 samples of each simulation as burn-in. In each case we assume a diffuse normal prior, β ∼ N(0, 100I).

Table 1 presents the results of these comparisons. For the examples considered here, the Polya-Gamma method is at least 2.8 times as efficient as that of Holmes and Held, and at least 35 times as efficient as that of Frühwirth-Schnatter and Frühwirth. These draw-by-draw comparisons of efficiency are useful, but do not take into consideration the relative speed of individual draws from each method. We compare the running times of the Polya-Gamma (PG) method versus Holmes and Held (HH) to measure the relative effective sample sizes per unit time. These results were produced using R. In general, R is a poor language in which to benchmark execution time, as the number of for loops, while loops, if statements, function calls, and vectorized operations can have a large impact on how long a script runs. Nonetheless, the comparison is still appropriate in this case, as the PG and HH algorithms have a similar structure. Furthermore, in both cases we use an external routine for the most time-consuming steps: drawing from the Polya-Gamma distribution, and drawing from p(λ | z, β) in the case of the HH method.

As seen in Table 2, the gains in efficiency are even greater for the PG method when accounting for per-draw execution time. The PG method finished roughly twice as fast on average, producing at least a five-fold improvement in effective sample size per unit time. While the direct comparison was performed in R, the BayesLogit package has implemented the Polya-Gamma procedure for binary logistic regression in C++. The approximate running times in BayesLogit are also listed in Table 2, and are much faster.

These results can be explained by the fact that our approach requires only a single layer of latent variables.
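The two diagnostics defined above are easy to compute from sampler output. The following R sketch is illustrative only; it uses coda's spectral effective-sample-size estimate rather than Geyer's initial monotone sequence estimator, so the numbers will be comparable, but not identical, to those reported in Table 1.

    library(coda)

    # Average Euclidean step size between successive draws of beta.
    avg_step_size <- function(draws) {
      diffs <- diff(draws)              # (M-1) x p matrix of successive differences
      mean(sqrt(rowSums(diffs^2)))
    }

    # Effective sample size averaged over the components of beta.
    avg_ess <- function(draws) {
      mean(effectiveSize(as.mcmc(draws)))
    }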

                              PG             HH            FSF         ESS Ratio
    Dataset      n    p   p*   ESS   Dist.    ESS   Dist.   ESS   Dist.   HH   FSF
    Diabetes    768    8    9  4797   0.70   1695   0.42    114   0.11   2.8    42
    Heart       270   13   19  3184   4.86    891   2.70     85   0.78   3.6    37
    Australia   690   14   35  3173  10.93    724   5.37     84   1.64   4.4    38
    Germany    1000   20   49  4981   3.95   1723   2.40    115   0.68   2.9    43

Table 1: The effective sample size and average distance between samples of β for the four data sets considered. The ESS ratio is the ratio of the effective sample sizes of the Polya-Gamma technique to those of Holmes and Held (2006) and Frühwirth-Schnatter and Frühwirth (2007), respectively. For each data set there are n observations, p predictors, and p* components of β including interactions. The Australian credit dataset possesses perfectly separable data that is captured by a diverging entry in β. This infects the measure of distance between β's resulting from each method, and is affected strongly by the choice of prior for β. Consequently, one should view the distance between β's for the Australian credit dataset as a measure of how quickly each model captures this pathological coefficient.

                 BayesLogit        PG (R)                HH (R)          ESS/sec.
    Dataset        Time        Time   ESS/sec.       Time   ESS/sec.       Ratio
    Diabetes        5.3         167      29           334      5.1          5.6
    Heart           2.6          61      53           100      9.0          5.9
    Australia       9.6         158      20           334      2.1          9.3
    Germany        18.0         234      21           553      3.1          6.9

Table 2: Comparisons of actual run times in seconds for the Polya-Gamma and Holmes–Held methods implemented in R. All computations were performed on a Linux machine with an Intel Xeon 3.30 GHz CPU and 4GB of RAM. The average running time (sec.) was calculated using the proc.time command in R. Both the KS mixture distribution and the Polya-Gamma distribution were sampled from R by calls to an external compiled C routine, making these times comparable. The BayesLogit column is the average time it took to run 10,000 samples using the BayesLogit package, which calls a compiled C++ routine to perform posterior sampling.

It also maintains conditional conjugacy, and does not require an analytic approximation to p(ε_i). In fact, we dispense with the latent-utility approach altogether in favor of a direct mixture representation of the logit likelihood, which also avoids the step of sampling a truncated normal variable. Moreover, the latent precision ω_i in our construction is faster to sample than the latent variance φ_i in the Kolmogorov–Smirnov mixture construction, leading to the further gains in speed shown in Table 2.

Another important gain in efficiency arises when the data contain repeated observations. This can happen in logistic regression when there are repeated design points, but is most relevant for the analysis of contingency tables with fixed margins. Suppose that we cross-tabulate n observations into k distinct nonzero cells. If we parametrize the cell probabilities by their log-odds and fit a logistic-normal model, the latent-utility approach will require 2n latent variables. By contrast, our approach will require only k latent variables; the cell counts affect the distribution of these latent variables via the a and b terms in (4), but not their number. Our latent-variable construction will therefore scale to larger data sets much more efficiently.

4 Contingency tables with fixed margins

4.1 2 × 2 × N tables

Next, consider a simple example of a binary-response clinical trial conducted in each of N different centers. Let n_ij be the number of patients assigned to treatment regime j in center i; and let Y = {y_ij} be the corresponding number of successes for i = 1, . . . , N. Table 3 presents a data set along these lines, from Skene and Wakefield (1990). These data arise from a multi-center trial comparing the efficacy of two different topical cream preparations, labeled the treatment and the control.

Let p_ij denote the underlying success probability in center i for treatment j, and ψ_ij the corresponding log-odds. If ψ_i = (ψ_{i1}, ψ_{i2})^T is assigned a bivariate normal prior ψ_i ∼ N(µ, Σ), then the posterior for Ψ = {ψ_ij} is
$$ p(\Psi \mid Y) \propto \prod_{i=1}^{N} \left\{ \frac{e^{y_{i1} \psi_{i1}}}{(1 + e^{\psi_{i1}})^{n_{i1}}} \frac{e^{y_{i2} \psi_{i2}}}{(1 + e^{\psi_{i2}})^{n_{i2}}}\, p(\psi_{i1}, \psi_{i2} \mid \mu, \Sigma) \right\} \, . $$
We apply Theorem 1 to each term in the posterior, thereby introducing augmentation variables Ω_i = diag(ω_{i1}, ω_{i2}) for each center. This yields, after some algebra, a simple Gibbs sampler that iterates between two sets of conditional distributions:
$$ (\psi_i \mid Y, \Omega_i, \mu, \Sigma) \sim \mathrm{N}(m_i, V_{\Omega_i}) \, , \qquad (\omega_{ij} \mid \psi_{ij}) \sim \mathrm{PG}(n_{ij}, \psi_{ij}) \, , \qquad (10) $$
where
$$ V_{\Omega_i}^{-1} = \Omega_i + \Sigma^{-1} \, , \qquad m_i = V_{\Omega_i}(\kappa_i + \Sigma^{-1}\mu) \, , \qquad \kappa_i = (y_{i1} - n_{i1}/2,\; y_{i2} - n_{i2}/2)^T \, . $$
Figure 3 shows the results of applying this Gibbs sampler to the data from Skene and Wakefield (1990). Notice that our method requires 16 latent variables for a data set with 273 observations. By comparison, a Gibbs sampler based on the random-utility construction in (5) would require 546 latent variables: a (z_i, φ_i) pair for every individual in the study.

In this analysis, we used a normal-Wishart prior for (µ, Σ^{-1}).
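As with the binary-regression case, the per-center update in (10) is short to express in code. The R sketch below is illustrative only and assumes a Polya-Gamma sampler rpg(n, b, z) returning n PG(b, z) draws (for example the one supplied by the BayesLogit package); y and n.trials are the length-2 vectors of successes and totals for a single center, and (mu, Sigma) are the current values of the hierarchical parameters.

    # One Gibbs update of (omega_i, psi_i) for a single center, following (10).
    update_center <- function(psi, y, n.trials, mu, Sigma) {
      omega <- c(rpg(1, n.trials[1], psi[1]),
                 rpg(1, n.trials[2], psi[2]))
      kappa <- y - n.trials / 2
      Sigma.inv <- solve(Sigma)
      V <- solve(diag(omega) + Sigma.inv)        # V_Omega in (10)
      m <- V %*% (kappa + Sigma.inv %*% mu)
      drop(m + t(chol(V)) %*% rnorm(2))          # new draw psi_i ~ N(m_i, V_Omega)
    }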


Table 3: Data from a multi-center, binary-response study on topical cream effectiveness (Skene and Wakefield, 1990).

               Treatment            Control
    Center   Success   Total    Success   Total
       1        11       36        10       37
       2        16       20        22       32
       3        14       19         7       19
       4         2       16         1       17
       5         6       17         0       12
       6         1       11         0       10
       7         1        5         1        9
       8         4        6         6        7


Figure 3: Posterior distributions for the log-odds ratio for each of the 8 centers in the topical-cream study from Skene and Wakefield (1990). The vertical lines are central 95% posterior credible intervals; the dots are the posterior means; and the X’s are the maximum-likelihood estimates of the log-odds ratios, with no shrinkage among the treatment centers. Note that the maximum-likelihood estimate is ψi2 = −∞ for the control group in centers 5 and 6, as no successes were observed.


Hyperparameters were chosen to match Table II from Skene and Wakefield (1990), who parameterize the model in terms of the prior expected values for ρ, σ²_{ψ1}, and σ²_{ψ2}, where
$$ \Sigma = \begin{pmatrix} \sigma^2_{\psi_1} & \rho \\ \rho & \sigma^2_{\psi_2} \end{pmatrix} \, . $$
To match their choices, we use the following identity that codifies a relationship between the hyperparameters B and d, and the prior moments for marginal variances and the correlation coefficient. If Σ ∼ IW(d, B), then
$$ B = (d - 3) \begin{pmatrix} E(\sigma^2_{\psi_1}) + E(\sigma^2_{\psi_2}) + 2\,E(\rho)\sqrt{E(\sigma^2_{\psi_1})\,E(\sigma^2_{\psi_2})} & \; E(\sigma^2_{\psi_2}) + E(\rho)\sqrt{E(\sigma^2_{\psi_1})\,E(\sigma^2_{\psi_2})} \\[4pt] E(\sigma^2_{\psi_2}) + E(\rho)\sqrt{E(\sigma^2_{\psi_1})\,E(\sigma^2_{\psi_2})} & \; E(\sigma^2_{\psi_2}) \end{pmatrix} \, . $$
In this way we are able to map from pre-specified moments to hyperparameters, ending up with d = 4 and
$$ B = \begin{pmatrix} 0.754 & 0.857 \\ 0.857 & 1.480 \end{pmatrix} \, . $$

4.2 Higher-order tables

Now consider a multi-center, multinomial-response study with more than two treatment arms. This can be modeled using a hierarchy of N different two-way tables, each having the same J treatment regimes and K possible outcomes. The data D consist of triply indexed outcomes y_ijk, each indicating the number of observations in center i and treatment j with outcome k. We let n_ij = Σ_k y_ijk indicate the number of subjects assigned to have treatment j at center i.

Let P = {p_ijk} denote the set of probabilities that a subject in center i with treatment j experiences outcome k, such that Σ_k p_ijk = 1 for all i, j. Given these probabilities, the full likelihood is
$$ L(P) = \prod_{i=1}^{N} \prod_{j=1}^{J} \prod_{k=1}^{K} p_{ijk}^{y_{ijk}} \, . $$
Following Leonard (1975), we can model these probabilities using a logistic transformation. Let
$$ p_{ijk} = \frac{\exp(\psi_{ijk})}{\sum_{l=1}^{K} \exp(\psi_{ijl})} \, . $$
Many common prior structures will maintain conditional conjugacy using the Polya-Gamma framework outlined thus far. For example, we may assume an exchangeable matrix-normal prior at the level of treatment centers:
$$ \psi_i \sim \mathrm{N}(M, \Sigma_R, \Sigma_C) \, , $$


where ψ_i is the matrix whose (j, k) entry is ψ_ijk; M is the mean matrix; and Σ_R and Σ_C are row- and column-specific covariance matrices, respectively. See Dawid (1981) for further details on matrix-normal theory. Note that, for identifiability, we set ψ_ijK = 0, implying that Σ_C is of dimension K − 1. This leads to a posterior of the form
$$ p(\Psi \mid D) = \prod_{i=1}^{N} \left[ p(\psi_i) \cdot \prod_{j=1}^{J} \prod_{k=1}^{K} \left( \frac{\exp(\psi_{ijk})}{\sum_{l=1}^{K} \exp(\psi_{ijl})} \right)^{y_{ijk}} \right] \, , $$
suppressing any dependence on (M, Σ_R, Σ_C) for notational ease.

To show that this fits within the Polya-Gamma framework, we use a similar approach to Holmes and Held (2006), rewriting each probability as
$$ p_{ijk} = \frac{\exp(\psi_{ijk})}{\sum_{l \neq k} \exp(\psi_{ijl}) + \exp(\psi_{ijk})} = \frac{e^{\psi_{ijk} - c_{ijk}}}{1 + e^{\psi_{ijk} - c_{ijk}}} \, , $$
where c_ijk = log{Σ_{l ≠ k} exp(ψ_ijl)} is implicitly a function of the other ψ_ijl's for l ≠ k. We now fix values of i and k and examine the conditional posterior distribution for ψ_{i·k} = (ψ_{i1k}, . . . , ψ_{iJk})′, given ψ_{i·l} for l ≠ k:
$$ p(\psi_{i \cdot k} \mid D, \psi_{i \cdot (-k)}) \propto p(\psi_{i \cdot k} \mid \psi_{i \cdot (-k)}) \cdot \prod_{j=1}^{J} \left( \frac{e^{\psi_{ijk} - c_{ijk}}}{1 + e^{\psi_{ijk} - c_{ijk}}} \right)^{y_{ijk}} \left( \frac{1}{1 + e^{\psi_{ijk} - c_{ijk}}} \right)^{n_{ij} - y_{ijk}} = p(\psi_{i \cdot k} \mid \psi_{i \cdot (-k)}) \cdot \prod_{j=1}^{J} \frac{e^{y_{ijk}(\psi_{ijk} - c_{ijk})}}{(1 + e^{\psi_{ijk} - c_{ijk}})^{n_{ij}}} \, . $$
This is simply a multivariate version of the same bivariate form that arises in a 2 × 2 table. Appealing to the theory of Polya-Gamma random variables outlined above, we may express this as
$$ p(\psi_{i \cdot k} \mid D, \psi_{i \cdot (-k)}) \propto p(\psi_{i \cdot k} \mid \psi_{i \cdot (-k)}) \cdot \prod_{j=1}^{J} \frac{e^{\kappa_{ijk}[\psi_{ijk} - c_{ijk}]}}{\cosh^{n_{ij}}([\psi_{ijk} - c_{ijk}]/2)} = p(\psi_{i \cdot k} \mid \psi_{i \cdot (-k)}) \cdot \prod_{j=1}^{J} \left[ e^{\kappa_{ijk}[\psi_{ijk} - c_{ijk}]} \cdot E\left\{ e^{-\omega_{ijk}[\psi_{ijk} - c_{ijk}]^2/2} \right\} \right] \, , $$
where ω_ijk ∼ PG(n_ij, 0), j = 1, . . . , J; and κ_ijk = y_ijk − n_ij/2. Given {ω_ijk} for j = 1, . . . , J, all of these terms will combine in a single normal kernel, whose mean and covariance structure will depend heavily upon the particular choices of hyperparameters in the matrix-normal prior for ψ_i. Each ω_ijk term can be updated as
$$ (\omega_{ijk} \mid \psi_{ijk}) \sim \mathrm{PG}(n_{ij}, \psi_{ijk} - c_{ijk}) \, , $$
leading to a simple MCMC that loops over centers and responses, drawing each vector of parameters ψ_{i·k} (that is, for all treatments at once) conditional on the other ψ_{i·(−k)}'s.

5 Multinomial logistic regression

One may extend the Polya-Gamma method used for binary logistic regression to multinomial logistic regression. Consider the multinomial sample y_i = {y_ij}_{j=1}^J that records the number of responses in each category j = 1, . . . , J and the total number of responses n_i. The logistic link function for polychotomous regression stipulates that the probability of randomly drawing a single response from the jth category in the ith sample is
$$ p_{ij} = \frac{\exp \psi_{ij}}{\sum_{k=1}^{J} \exp \psi_{ik}} \, , $$
where the log odds ψ_ij is modeled by x_i^T β_j, and β_J has been constrained to zero for purposes of identification. Following Holmes and Held (2006), the likelihood for β_j conditional upon β_{−j}, the matrix with column vector β_j removed, is
$$ \ell(\beta_j \mid \beta_{-j}, y) = \prod_{i=1}^{N} \left( \frac{e^{\eta_{ij}}}{1 + e^{\eta_{ij}}} \right)^{y_{ij}} \left( \frac{1}{1 + e^{\eta_{ij}}} \right)^{n_i - y_{ij}} \, , $$
where η_ij = x_i^T β_j − C_ij with
$$ C_{ij} = \log \sum_{k \neq j} \exp x_i^T \beta_k \, , $$
which looks like the binary logistic likelihood previously discussed. Incorporating the Polya-Gamma auxiliary variable, the likelihood becomes
$$ \prod_{i=1}^{N} e^{\kappa_{ij} \eta_{ij}}\, e^{-\frac{\eta_{ij}^2}{2} \omega_{ij}}\, \mathrm{PG}(\omega_{ij} \mid n_i, 0) \, , $$
where κ_ij = y_ij − n_i/2. Employing the conditionally conjugate prior β_j ∼ N(m_{0j}, V_{0j}) yields a two-part update:
$$ (\beta_j \mid \Omega_j) \sim \mathrm{N}(m_j, V_j) \, , \qquad (\omega_{ij} \mid \beta_j) \sim \mathrm{PG}(n_i, \eta_{ij}) \quad \text{for } i = 1, \ldots, N \, , $$
where
$$ V_j^{-1} = X' \Omega_j X + V_{0j}^{-1} \, , \qquad m_j = V_j \left\{ X'(\kappa_j - \Omega_j C_j) + V_{0j}^{-1} m_{0j} \right\} \, , $$
and Ω_j = diag({ω_ij}_{i=1}^N). One may sample the posterior of (β | y) via Gibbs sampling by repeatedly iterating over the above steps for j = 1, . . . , J − 1.

The Polya-Gamma method generates samples from the joint posterior distribution without appealing to analytic approximations to the posterior. This offers an important advantage when the number of observations is not significantly larger than the number of parameters. To see this, consider sampling the joint posterior for β using a Metropolis–Hastings algorithm with an independence proposal. The likelihood in β is approximately normal, centered at the posterior mode m, and with variance V equal to the inverse of the Hessian matrix evaluated at the mode. (Both of these may be found using standard numerical routines.) Thus a natural proposal for (vec(β^{(t)}) | y) is vec(b) ∼ N(m, aV) for some a ≈ 1. When data are plentiful, this method is both simple and highly efficient, and is implemented in many standard software packages (e.g. Martin et al., 2011). But when vec(β) is high-dimensional relative to the number of observations, the Hessian matrix H may be ill-conditioned, making it impossible or impractical to generate normal proposals. Multinomial logistic regression succumbs to this problem more quickly than binary logistic regression, as the number of parameters scales like the product of the number of categories and the number of predictors.

To illustrate this phenomenon, we consider glass-identification data from German (1987). This data set has J = 6 categories of glass and nine predictors describing the chemical and optical properties of the glass that one may measure in a forensics lab and use in a criminal investigation. This generates up to 50 = 10 × 5 parameters, including the intercepts and the constraint that β_J = 0. These must be estimated using n = 214 observations. In this case, the Hessian H at the posterior mode is poorly conditioned when employing a vague prior, incapacitating the independent Metropolis–Hastings algorithm. Numerical experiments confirm that even when a vague prior is strong enough to produce a numerically invertible Hessian, rejection rates are prohibitively high. In contrast, the multinomial Polya-Gamma method still produces reasonable posterior distributions in a fully automatic fashion, even with a weakly informative normal prior for each β_j. Table 4, which shows the in-sample performance of the multinomial logit model, demonstrates the problem with the joint proposal distribution: category 6 is perfectly separable into cases and non-cases, even though the other categories are not. This is a well-known problem with maximum-likelihood estimation of logistic models. The same problem also forecloses the option of posterior sampling using methods that require a unique MLE to exist.
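Returning to the two-part update above, the following R sketch performs one sweep over the categories j = 1, . . . , J − 1. It is illustrative only: it assumes a Polya-Gamma sampler rpg(n, b, z) returning n PG(b, z) draws and accepting vector-valued b and z (for example from the BayesLogit package), and for brevity it uses a common prior N(m0, V0) for every β_j rather than category-specific (m_{0j}, V_{0j}).

    # One Gibbs sweep for multinomial logistic regression.
    # Beta is p x J with Beta[, J] held at zero; Y is the N x J matrix of counts;
    # n.total is the length-N vector of multinomial totals.
    update_multinomial <- function(Beta, Y, X, n.total, m0, V0) {
      J <- ncol(Beta)
      V0.inv <- solve(V0)
      for (j in seq_len(J - 1)) {
        C.j   <- log(rowSums(exp(X %*% Beta[, -j, drop = FALSE])))   # C_ij
        eta   <- drop(X %*% Beta[, j]) - C.j                         # eta_ij
        omega <- rpg(nrow(X), n.total, eta)                          # omega_ij ~ PG(n_i, eta_ij)
        kappa <- Y[, j] - n.total / 2
        V <- solve(t(X) %*% (omega * X) + V0.inv)
        m <- V %*% (t(X) %*% (kappa - omega * C.j) + V0.inv %*% m0)
        Beta[, j] <- drop(m + t(chol(V)) %*% rnorm(ncol(X)))
      }
      Beta
    }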


    Class      1    2    3    5    6    7
    Total     70   76   17   13    9   29
    Correct   50   55    0    9    9   27

Table 4: “Correct” refers to the number of glass fragments for each category that were correctly identified by the Bayesian multinomial logit model. The glass identification dataset includes a type of glass, class 4, for which there are no observations.

6 Negative-binomial models for count data

6.1 With regressors

Suppose that we have a Poisson model with gamma overdispersion:
$$ (y_i \mid \lambda_i) \sim \mathrm{Pois}(\lambda_i) \, , \qquad (\lambda_i \mid h, p_i) \sim \mathrm{Ga}\left(h, \frac{p_i}{1 - p_i}\right) \, . $$
If we marginalize over λ_i, we get a negative binomial marginal distribution for y_i:
$$ p(y_i \mid h, p_i) \propto (1 - p_i)^{h} p_i^{y_i} \, , $$
ignoring constants of proportionality. Now use a logit transform to represent p_i as
$$ p_i = \frac{e^{\psi_i}}{1 + e^{\psi_i}} \, . $$
Since e^{ψ_i} = p_i/(1 − p_i), the ψ_i term is analogous to a re-scaled version of the typical linear predictor in a Poisson generalized-linear model using the canonical log link. Appealing to Theorem 1, we can rewrite the negative binomial likelihood in terms of ψ_i as
$$ (1 - p_i)^{h} p_i^{y_i} = \frac{\{\exp(\psi_i)\}^{y_i}}{\{1 + \exp(\psi_i)\}^{h + y_i}} \propto e^{\kappa_i \psi_i} \int_0^{\infty} e^{-\omega_i \psi_i^2/2}\, p(\omega_i \mid h + y_i, 0)\, d\omega_i \, , $$
where κ_i = (y_i − h)/2, and where the mixing distribution is Polya-Gamma. Conditional upon ω_i, we have a likelihood proportional to e^{−Q(ψ_i)} for some quadratic form Q. Therefore this will be conditionally conjugate to any Gaussian prior, or any prior that can be made conditionally Gaussian.

In the case where regressors are present, we have that ψ_i = x_i^T β for some p-vector x_i. Then, conditional upon ω_i, the contribution of the ith observation to the likelihood is
$$ L_i(\beta) \propto \exp\{\kappa_i x_i^T \beta - \omega_i (x_i^T \beta)^2/2\} \propto \exp\left\{ -\frac{\omega_i}{2} \left( \frac{y_i - h}{2\omega_i} - x_i^T \beta \right)^2 \right\} \, . $$
Let Ω = diag(ω_1, . . . , ω_n); let z_i = (y_i − h)/(2ω_i); and let z denote the stacked vector of z_i terms. Then putting all the terms in the likelihood together, this is equivalent to the sampling model
$$ (z \mid \beta, \Omega) \sim \mathrm{N}(X\beta, \Omega^{-1}) \, , $$
or a simple Gaussian regression model with (known) covariance matrix Ω^{-1} describing the error structure. Suppose that we assume a conditionally Gaussian prior, β ∼ N(b, B). Then the Gibbs sampling proceeds in two simple steps:
$$ (\omega_i \mid h, \beta) \sim \mathrm{PG}(h + y_i, x_i^T \beta) \, , \qquad (\beta \mid \Omega, z) \sim \mathrm{N}(m, V) \, , $$
where
$$ V = (X^T \Omega X + B^{-1})^{-1} \, , \qquad m = V(X^T \Omega z + B^{-1} b) \, , $$
recalling the definition of z above. A related paper is that of Zhou et al. (2012), who use our Polya-Gamma construction to arrive at a similar MCMC method for log-normal mixtures of negative binomial models.
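For completeness, here is an R sketch of this sampler with h held fixed. It is illustrative only, assumes the same hypothetical rpg(n, b, z) Polya-Gamma sampler as in the earlier sketches, and differs from the binary-logistic sampler of Section 3.1 only through κ_i, z_i, and the PG shape parameter h + y_i.

    gibbs_negbin <- function(y, X, h, b0, B0, n.iter = 2000) {
      p <- ncol(X)
      B0.inv <- solve(B0)
      kappa <- (y - h) / 2
      beta <- rep(0, p)
      out <- matrix(NA, n.iter, p)
      for (s in seq_len(n.iter)) {
        psi   <- drop(X %*% beta)
        omega <- rpg(length(y), h + y, psi)          # omega_i ~ PG(h + y_i, x_i' beta)
        V <- solve(t(X) %*% (omega * X) + B0.inv)
        m <- V %*% (t(X) %*% kappa + B0.inv %*% b0)  # X' Omega z = X' kappa, with z_i = (y_i - h)/(2 omega_i)
        beta <- drop(m + t(chol(V)) %*% rnorm(p))
        out[s, ] <- beta
      }
      out
    }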

6.2 Example: an AR(1) model for the incidence of flu

To illustrate the further elaborations that are possible to the basic framework, we fit a negative-binomial AR(1) model to four years (2008–11) of weekly data on flu incidence in Texas. This data was collected from the Texas Department of State Health Services. We emphasize that there are many ways to handle the overdispersion present in this data set, and that we do not intend our model to be taken as a definitive analysis. We merely intend it as a “proof of concept” example showing how various aspects of Bayesian time-series modeling—in this case, a simple AR(1) model—can now be incorporated seamlessly into models with non-Gaussian likelihoods, such as those that arise in the analysis of binary, multinomial, and count data. Let yt denote the number of reported cases of influenza-like illness (ILI) in week t. We assume that these counts follow a negative-binomial model, or equivalently a

20

200 Week

100 0

1000

2000

3000

4000

5000

0

50

Weeks since Start of 2008

150

Maximum Likelihood Poisson Regression With Lag Term

200 150 1000

2000

3000

4000

5000

0

50

Estimated rate 95% Predictive Interval

100

Bayesian Negative Binomial AR(1) model

0

Cases

Cases

Figure 4: Incidence of influenza-like illness in Texas, 2008–11, together with the estimated mean λt from the negative-binomial AR(1) model (left) and Poisson regression model incorporating a lagged value of the response variable as a predictor (right). The blanks in weeks 21-41 correspond to missing data. In each frame the grey lines depict the upper and lower bounds of a 95% predictive interval.

21

Gamma-overdispersed Poisson model: yt ∼ NB(h, pt ) eψt pt = 1 + eψt ψt = α + γt γt = φγt−1 + t t ∼ N (0, σ 2 ) , assuming that t indexes time. It is trivial to sample from this model by combining the results of the previous section with standard results on AR(1) models. In the following analysis, we assumed an improper uniform prior on the dispersion parameter h, and fixed φ and σ 2 to 0.98 and 1, respectively. But it would be equally straightforward to place hyper-priors upon each parameter, and to sample them in a hierarchical fashion. It would also be straightforward to incorporate fixed effects in the form of regressors. Figure 4 shows the results of the fit, along with a lag-1 Poisson regression model incorporating xt = log(yt−1 ) as a regressor for the rate parameter at time t. In each frame the grey lines depict the upper and lower bounds of a 95% predictive interval. In the case of the Poisson model, this predictive interval is of negligible width, and fails entirely to capture the right degree of dispersion. (The grey lines sit almost directly over the black line; the predictive interval has a half-width essentially equal to twice the square root of the estimated conditional rate parameter.)
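Combining Section 6.1 with the AR(1) structure makes clear why sampling is straightforward; the display below is a sketch of the implied conditionally Gaussian state-space form, added here for clarity and using the same z_t transformation as in Section 6.1. Given ω_t ∼ PG(h + y_t, ψ_t) and z_t = (y_t − h)/(2ω_t),
$$ (z_t \mid \gamma_t, \omega_t) \sim \mathrm{N}(\alpha + \gamma_t,\; \omega_t^{-1}) \, , \qquad \gamma_t = \phi\, \gamma_{t-1} + \epsilon_t \, , \quad \epsilon_t \sim \mathrm{N}(0, \sigma^2) \, , $$
so that (α, γ_1, . . . , γ_T) can be updated with standard forward-filtering, backward-sampling for linear Gaussian state-space models, and each ω_t with a Polya-Gamma draw.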

7 Simulating Polya-Gamma random variables

7.1 Overview of approach

All our developments thus far require an efficient method for sampling Polya-Gamma random variables. In this section, we derive such a method; it is implemented in a separate sampling routine in the R package BayesLogit.

First, observe that one may sample Polya-Gamma random variables naïvely (and approximately) using the sum-of-gammas representation in Equation (3). But this is slow, and involves the potentially dangerous step of truncating an infinite sum. We therefore construct an alternate, exact method by extending the approach taken by Devroye (2009) for simulating distributions related to the Jacobi theta function, which henceforth we will simply call Jacobi distributions. Of particular interest is J*, a Jacobi random variable having the moment-generating function given by (7).

The PG(1, z) distribution is related to an exponentially tilted Jacobi distribution J*(1, z), defined by
$$ f(x \mid z) = \cosh(z)\, e^{-x z^2/2} f(x) \, , \qquad (11) $$
where f(x) denotes the density function for J*, through the rescaling
$$ \mathrm{PG}(1, z) = \tfrac{1}{4}\, J^*(1, z/2) \, . \qquad (12) $$
We have appealed to (7) to compute the normalizing constant for f(x | z). Since PG(b, z) is the sum of b independent PG(1, z) random variables, and since for our purposes b is generally small, it will suffice to generate draws from PG(1, z), or equivalently J*(1, z/2). (Though it should be noted that the algorithm we propose is fast enough that it remains easy to simulate from PG(b, z) even for relatively large b.) Thus the task at hand is to quickly simulate from the J*(1, z/2) distribution. One easily sums and scales b such draws to generate a PG(b, z) random variate.

7.2 The J*(1, z) Sampler

Devroye (2009) develops an efficient J* sampler. Following this work, we develop an efficient sampler for the exponentially tilted J* random variate, J*(1, z). In both cases, the density of interest can be written as an infinite, alternating sum, and is amenable to von Neumann's alternating-sum method, which says that when the density of interest f sits between the partial sums of the series, one may draw from a proposal g and then accept or reject that proposal using the partial sums to sample f (Devroye, 2009). In particular, suppose we want to sample from the density f using the proposal density g, where ||f/g||_∞ ≤ c. Employing the accept-reject algorithm, one proposes X ∼ g and accepts this proposal if Y ≤ f(X), where Y ∼ U(0, c g(X)). If f is an infinite series and its partial sums satisfy
$$ S_0(x) \ge S_2(x) \ge \cdots \ge f(x) \ge \cdots \ge S_3(x) \ge S_1(x) \, , \qquad (13) $$
then Y ≤ f(X) is equivalent to Y ≤ S_n(X) for some odd n, and Y > f(X) is equivalent to Y > S_n(X) for some even n. Thus to sample f using accept-reject, one may propose X ∼ g, draw Y ∼ U(0, c g(X)), and accept the proposal X if Y ≤ S_n(X) for some odd n, rejecting the proposal if Y > S_n(X) for some even n. These partial sums may be calculated and checked on the fly.

The Jacobi density has two infinite-sum representations that, when spliced together, yield f(x) = Σ_{n=0}^∞ (−1)^n a_n(x) with
$$ a_n(x) = \begin{cases} \pi (n + 1/2) \left( \dfrac{2}{\pi x} \right)^{3/2} \exp\left\{ -\dfrac{2(n + 1/2)^2}{x} \right\} \, , & 0 < x \le t, \qquad (14) \\[8pt] \pi (n + 1/2) \exp\left\{ -\dfrac{(n + 1/2)^2 \pi^2}{2}\, x \right\} \, , & x > t, \qquad (15) \end{cases} $$
which satisfies the partial-sum criterion (13) for x > 0, as shown by Devroye. It is necessary to use both representations to satisfy the partial-sum criterion for all x.

The J*(1, z) density can be written as an alternating sum f(x | z) = Σ_{n=0}^∞ (−1)^n a_n(x | z) that satisfies the partial-sum criterion by setting
$$ a_n(x \mid z) = \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_n(x) $$
in the manner of (11). The first term in the series provides a natural proposal, as a_0(x | z) ≥ f(x | z), suggesting that
$$ c(z)\, g(x \mid z) = \cosh(z) \begin{cases} \left( \dfrac{2}{\pi x} \right)^{3/2} \exp\left( -\dfrac{z^2 x}{2} - \dfrac{1}{2x} \right) \, , & 0 < x \le t, \\[8pt] \exp\left\{ -\left( \dfrac{z^2}{2} + \dfrac{\pi^2}{8} \right) x \right\} \, , & x > t. \end{cases} \qquad (16) $$
Examining the piecewise kernels, one finds that g(x | z) can be written as the mixture
$$ g(x \mid z) = \begin{cases} \mathrm{IG}(|z|^{-1}, 1)\, \mathbf{1}_{(0, t]} & \text{with prob. } p/(p + q), \\ \mathrm{Ex}(z^2/2 + \pi^2/8)\, \mathbf{1}_{(t, \infty)} & \text{with prob. } q/(p + q), \end{cases} $$
where p = ∫_0^t c g(x | z) dx and q = ∫_t^∞ c g(x | z) dx. With this proposal in hand, sampling a J*(1, z) draw can be summarized as follows.

1. Generate a proposal X ∼ g(x | z).
2. Generate Y ∼ U(0, c g(X | z)).
3. Iteratively calculate S_n(X) until Y ≤ S_n(X) for an odd n, or until Y > S_n(X) for an even n.
4. Accept X if n is odd; return to step 1 if n is even.

The details of the implementation, along with pseudocode, are described in the Appendix.

7.3 Analysis of acceptance rate

The J*(1, z) sampler is highly efficient. The parameter c found in (16), which depends on both the tilting parameter z and the truncation point t, describes on average how many proposals we expect to make before accepting. Devroye shows that in the case of z = 0, one can pick t so that c is near unity. The following extends this result to non-zero tilting parameters so that, on average, the J*(1, z) sampler rejects no more than 8 out of every 10,000 draws, regardless of z.

We proceed in two steps to show that one may choose t so that c(z, t) is near unity for all z ≥ 0. To begin, we show that there is some t* so that c(z, t*) ≤ c(z, t) for all z ≥ 0 and for all t for which the alternating-sum criterion holds. Let a_n^L denote the left coefficient, a_n 1_{(0,t]}, and a_n^R denote the right coefficient, a_n 1_{(t,∞)}, from (14) and (15) respectively. Take z to be fixed. We are interested in picking t to minimize c(t) = p(t) + q(t), where
$$ p = \int_0^t \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_0^L(x)\, dx \qquad \text{and} \qquad q = \int_t^{\infty} \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_0^R(x)\, dx \, . $$
The truncation point t must be in the interval (log 3)/π² ≤ t ≤ 4/log 3 for the alternating-sum criterion to hold. Therefore c will attain a minimum (for fixed z) at some t* in that interval, as c is continuous in t. Differentiating with respect to t, we find that any critical point t* will satisfy
$$ \cosh(z) \exp\left( -\frac{z^2 t^*}{2} \right) \left\{ a_0^L(t^*) - a_0^R(t^*) \right\} = 0 \, , $$
or will be at the boundary of the interval. Thus any minimum on the interior of this interval will be independent of z. Devroye suggests that the best choice of t is indeed on the interior of the aforementioned interval, and is t* = 0.64.

Having found that the best t* is independent of z, we now show that the maximum of c(z, t*) is achieved over z > 0. Indeed, differentiating under the integral we find that c′(z) = {tanh(z) − z²/2} c(z). Hence a critical point z* must solve
$$ \tanh(z) - \frac{z^2}{2} = 0 \, , $$
as c(z) > 0. Solving for z and checking the second-order conditions, one can easily show that the maximum is achieved at z* = 1.378293, for which c(z*, t*) = 1.0008. In other words, one expects to reject at most 8 out of every 10,000 draws for any z > 0.

8 Discussion

We have shown that Bayesian inference for logistic models can be implemented using a data augmentation scheme based on Polya-Gamma distributions. This leads to simple Gibbs-sampling algorithms for posterior computation that exploit standard normal linear-model theory. It also opens the door for exact Bayesian treatments of many modern-day machine-learning classification methods based on mixtures of logits. Indeed, many likelihood functions long thought to be intractable resemble the sum-of-exponentials form in the multinomial logit model; two prominent examples are restricted Boltzmann machines (Salakhutdinov et al., 2007) and logistic-normal topic models (Blei and Lafferty, 2007). Applying the Polya-Gamma mixture framework to such problems is currently an active area of research.

A further useful fact is that the expected value of a Polya-Gamma random variable is available in closed form. If ω ∼ PG(b, c), then
$$ E(\omega) = \frac{b}{2c} \tanh(c/2) \, . $$
We arrive at this result by appealing to the moment-generating function for the PG(b, 0) density, evaluated at c²/2:
$$ \cosh^{-b}\left(\frac{c}{2}\right) = E\left( e^{-\frac{1}{2}\omega c^2} \right) = \int_0^{\infty} e^{-\frac{1}{2}\omega c^2}\, p(\omega \mid b, 0)\, d\omega \, . $$
Taking logs and differentiating under the integral sign with respect to c then gives the moment identity
$$ E(\omega) = \frac{1}{c} \frac{\partial}{\partial c} \log \cosh^{b}\left(\frac{c}{2}\right) \, . $$
Simple algebra reduces this down to the form above. This allows the same data-augmentation scheme to be used in EM algorithms, where the latent ω's will form a set of complete-data sufficient statistics for the logistic likelihood. We are actively studying how this fact can be exploited in large-scale problems where Gibbs sampling becomes intractable. Some early results along these lines are described in Polson and Scott (2011).

A number of technical details of our latent-variable representation are worth further comment. First, the dimensionality of the set of latent ω_i's does not depend on the sample size n_i corresponding to each unique design point. Rather, the sample size only affects the distribution of these latent variables. Therefore, our MCMC algorithms are more parsimonious than traditional approaches that require one latent variable for each observation. This is a major source of efficiency in the analysis of contingency tables. Second, posterior updating via exponential tilting is a quite general situation that arises in Bayesian inference incorporating latent variables. For example, the posterior distribution of ω that arises under normal data with precision ω and a PG(b, 0) prior is precisely an exponentially tilted PG(b, 0) random variable. This led to our characterization of the general PG(b, c) class. Notice, moreover, that one may identify the conditional posterior for ω_ij strictly using its moment-generating function, without ever appealing to Bayes' rule for density functions. This follows the Lévy-penalty framework of Polson and Scott (2012) and relates to work by Ciesielski and Taylor (1962), who use a similar argument to characterize sojourn times of Brownian motion. Doubtless there are many other modeling situations where the basic idea is also applicable, or will lead to new insights.
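As a quick numerical sanity check of this moment identity (an illustrative sketch, not part of the original text), the mean of the sum representation in Section 2.2 can be accumulated term by term and compared with the closed form:

    b <- 3; cc <- 2; k <- 1:1e6
    sum.mean    <- b / (2 * pi^2) * sum(1 / ((k - 1/2)^2 + cc^2 / (4 * pi^2)))
    closed.form <- b / (2 * cc) * tanh(cc / 2)
    c(sum.mean, closed.form)   # the two values agree up to the truncation of the sum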


A Details of sampling algorithm

Algorithm 1 shows how to simulate a PG(1, z) random variate. Recall that for the PG(b, z) case, where b is an integer, we add up b independent copies of PG(1, z).

We expand upon the simulation of truncated inverse-Gaussian random variables, denoted IN(µ, λ). We break the IN draw up into two separate scenarios. When µ = 1/Z is large, the inverse-Gaussian distribution is approximately inverse-χ²₁, motivating an accept-reject algorithm. When µ is small, a simple rejection algorithm suffices, as the truncated inverse-Gaussian will have most of its mass below the truncation point t.

Thus when µ > t we generate a truncated inverse-Gaussian random variate using accept-reject sampling. The proposal is an inverse-χ²₁ variate restricted to (0, t). Following Devroye (2009), this variate may be generated by drawing a standard normal truncated to [1/√t, ∞), squaring it, and then taking the reciprocal. Devroye (1986) (p. 382) suggests sampling the truncated normal by generating independent standard-exponential pairs (E, E′) until E² ≤ 2E′/t and returning (1 + tE)/√t. The ratio of the truncated inverse-Gaussian kernel to the inverse-χ²₁ proposal kernel is
$$ \frac{x^{-3/2} \exp\left( -\frac{1}{2x} - \frac{z^2}{2}x \right)}{x^{-3/2} \exp\left( -\frac{1}{2x} \right)} = \exp\left( -\frac{z^2}{2} x \right) \, , $$
whose supremum is unity. This implies an acceptance probability of exp{−z²x/2}. Since Z < 1/t and X < t, we may compute a lower bound on the average rate of acceptance using
$$ E\left\{ \exp\left( -\frac{Z^2}{2} X \right) \right\} \ge \exp\left( \frac{-1}{2t} \right) = 0.61 \, . $$
See Algorithm 2 for pseudocode.

When µ < t, we generate a truncated inverse-Gaussian random variate using rejection sampling. Devroye (1986) (p. 149) describes how to sample from an inverse-Gaussian distribution using a many-to-one transformation. Sampling X in this fashion until X < t yields an acceptance rate bounded below by
$$ \int_0^t \mathrm{IG}(x \mid \mu = 1/Z, \lambda = 1)\, dx \ge \int_0^t \mathrm{IG}(x \mid \mu = t, \lambda = 1)\, dx = 0.67 $$
for all µ < t. See Algorithm 3 for pseudocode.

A final note applies to the negative-binomial case, where the first parameter of the Polya-Gamma distribution (b) may be a non-integer real number. Here we exploit the additivity of the Polya-Gamma class, writing b = ⌊b⌋ + e, where ⌊b⌋ is the integer part of b and e is the fractional remainder. We then sample from PG(⌊b⌋, z) exactly, and from PG(e, z) using the finite sum-of-gammas approximation. With 200 terms in the sum, we find that the approximation is quite accurate for such small values of the first parameter, as each Ga(e, 1) term in the sum tends to be small, and the weights in the sum decay like 1/k². (This approximation is never necessary in the case of the logit models considered.)

Algorithm 1: Sampling from PG(1, z)

    Input: z, a positive real number
    Define: pigauss(t | µ, λ), the CDF of the inverse-Gaussian distribution
    Define: a_n(x), the piecewise-defined coefficients in Equations (14) and (15)

    z ← |z|/2,  t ← 0.64,  K ← π²/8 + z²/2
    p ← (π / (2K)) exp(−K t)
    q ← 2 exp(−|z|) pigauss(t | µ = 1/z, λ = 1.0)
    repeat
        Generate U, V ∼ U(0, 1)
        if U < p/(p + q) then
            (Truncated exponential)  X ← t + E/K, where E ∼ E(1)
        else
            (Truncated inverse-Gaussian)  µ ← 1/z
            if µ > t then
                repeat
                    Generate 1/X ∼ χ²₁ 1_(t,∞)
                until U(0, 1) < exp(−(z²/2) X)
            else
                repeat
                    Generate X ∼ IN(µ, 1.0)
                until X < t
            end if
        end if
        S ← a₀(X),  Y ← V S,  n ← 0
        repeat
            n ← n + 1
            if n is odd then
                S ← S − a_n(X);  if Y < S, then return X/4
            else
                S ← S + a_n(X);  if Y > S, then break
            end if
        until FALSE
    until FALSE


Algorithm 2: Algorithm used to generate IG(µ = 1/Z, λ = 1) 1_(0,t) when µ > t. Truncation point: t. Z = 1/µ.

    repeat
        repeat
            Generate E, E′ ∼ E(1)
        until E² ≤ 2E′/t
        X ← t/(1 + tE)²
        α ← exp(−(1/2) Z² X)
        U ∼ U(0, 1)
    until U ≤ α

Algorithm 3: Algorithm used to generate IG(µ = 1/Z, λ = 1) 1_(0,t) when µ ≤ t.

    repeat
        Y ∼ N(0, 1)²
        X ← µ + 0.5 µ² Y − 0.5 µ √(4µY + (µY)²)
        U ∼ U(0, 1)
        if U > µ/(µ + X), then X ← µ²/X
    until X ≤ t

References

J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–79, 1993.

D. Andrews and C. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99–102, 1974.

O. E. Barndorff-Nielsen, J. Kent, and M. Sorensen. Normal variance-mean mixtures and z distributions. International Statistical Review, 50:145–59, 1982.

D. M. Blei and J. Lafferty. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35, 2007.

J. Carlin. Meta-analysis for 2 × 2 tables: a Bayesian approach. Statistics in Medicine, 11(2):141–58, 1992.

Z. Ciesielski and S. J. Taylor. First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Transactions of the American Mathematical Society, 103(3):434–50, 1962.


A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981.

L. Devroye. Non-uniform random variate generation. Springer, 1986.

L. Devroye. On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics & Probability Letters, 79(21):2251–9, 2009.

A. Dobra, C. Tebaldi, and M. West. Data augmentation in multi-way contingency tables with fixed marginal totals. Journal of Statistical Planning and Inference, 136(2):355–72, 2006.

S. Frühwirth-Schnatter and R. Frühwirth. Auxiliary mixture sampling with applications to logistic models. Computational Statistics and Data Analysis, 51:3509–28, 2007.

A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.

A. Gelman, A. Jakulin, M. Pittau, and Y. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–83, 2008.

B. German. Glass identification dataset, 1987. URL http://archive.ics.uci.edu/ml/datasets/Glass+Identification.

C. Geyer. Practical Markov chain Monte Carlo. Statistical Science, 7:473–511, 1992.

R. B. Gramacy and N. G. Polson. Simulation-based regularized logistic regression. Bayesian Analysis, 2012.

C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–68, 2006.

T. Leonard. Bayesian estimation methods for two-way contingency tables. Journal of the Royal Statistical Society (Series B), 37(1):23–37, 1975.

A. D. Martin, K. M. Quinn, and J. H. Park. MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9):1–21, 2011.

D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited, 1994.

N. G. Polson and J. G. Scott. Data augmentation for non-Gaussian regression models using variance-mean mixtures. Technical report, University of Texas at Austin, http://arxiv.org/abs/1103.5407v3, 2011.

N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 2012. (to appear).

B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th Annual International Conference on Machine Learning, pages 791–8, 2007.

A. Skene and J. C. Wakefield. Hierarchical models for multi-centre binary response studies. Statistics in Medicine, 9:919–29, 1990.

M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In International Conference on Machine Learning (ICML), 2012.
