Submitted 12/11; Published 12/11

t-Logistic Regression

Nan Ding                                                      DING@PURDUE.EDU
Department of Computer Science, Purdue University, West Lafayette, IN 47907-2066, USA

S.V.N. Vishwanathan                                           VISHY@STAT.PURDUE.EDU
Departments of Statistics and Computer Science, Purdue University, West Lafayette, IN 47907-2066, USA

Manfred Warmuth                                               MANFRED@CSE.UCSC.EDU
Department of Computer Science and Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA

Vasil Denchev                                                 VDENCHEV@PURDUE.EDU
Department of Computer Science, Purdue University, West Lafayette, IN 47907-2066, USA

Editor: U.N. Known

Abstract

We extend logistic regression by using t-exponential families, which were recently introduced in statistical physics. We examine our algorithm for both binary and multiclass classification, with both L1 and L2 regularizers. The objective function of our algorithm is non-convex, and an efficient block coordinate descent optimization scheme is derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. We examine our algorithm on a number of synthetic as well as real datasets.

1. Introduction

Many machine learning algorithms minimize a regularized risk (Teo et al., 2010):

    J(θ) = Ω(θ) + R_emp(θ),  where  R_emp(θ) = (1/m) Σ_{i=1}^m l(x_i, y_i, θ).    (1)

Here, Ω is a regularizer which penalizes complex θ, and R_emp, the empirical risk, is obtained by averaging the loss l over the training dataset {(x_1, y_1), . . . , (x_m, y_m)}. The features of a data point x are extracted via a feature map Φ. In binary classification, the label is usually predicted via sign(⟨Φ(x), θ⟩), while in multiclass classification, the label is predicted via argmax_c {⟨Φ(x), θ_c⟩}, where θ_c denotes the subvector of θ corresponding to the parameters of class c. For a long time, convex losses have been strongly favored, mainly because convexity ensures that the regularized risk minimization problem has a unique global optimum (Boyd and Vandenberghe, 2004) and the rate of convergence of the algorithm can be analytically obtained. However,

©2011 Nan Ding, S.V.N. Vishwanathan, Manfred Warmuth and Vasil Denchev.


Figure 1: Some commonly used loss functions for binary classification. The 0-1 loss is non-convex. The hinge, exponential, and logistic losses are convex upper bounds of the 0-1 loss.

as was recently shown by Long and Servedio (2010), learning algorithms based on convex loss functions for binary classification are not robust¹ to noise². In binary classification, if we define the margin of a training example (x, y) as u(x, y, θ) := y⟨Φ(x), θ⟩, then many popular loss functions for binary classification can be written as functions of the margin. Examples include³

    l(u) = max(0, 1 − u)            (Hinge Loss)          (2)
    l(u) = exp(−u)                  (Exponential Loss)    (3)
    l(u) = log(1 + exp(−u))         (Logistic Loss)       (4)

Intuitively, a convex loss function grows at least linearly with slope |l′(0)| for u ∈ (−∞, 0), so that data points with u ≪ 0 have an overwhelming impact. There has been some recent and some not-so-recent work on using non-convex loss functions (Freund, 2009) to alleviate this problem. In this paper, we continue this line of inquiry and propose a non-convex loss function which is firmly grounded in probability theory. By extending logistic regression from the exponential family to the t-exponential family, a natural extension of the exponential family of distributions studied in statistical physics (Naudts, 2002, 2004a,b,c; Tsallis, 1988), we obtain the t-logistic regression algorithm. Furthermore, binary t-logistic regression can be extended to multiclass t-logistic regression. We also show that the L2 and L1 regularizers in t-logistic regression correspond to the Student's t-distribution and a newly proposed t-Laplace distribution, respectively, in the t-exponential family. We

1. There is no unique definition of robustness. For example, one definition is through outlier-proneness (O'Hagan, 1979): p(θ | X, Y, x_{n+1}, y_{n+1}) → p(θ | X, Y) as x_{n+1} → ∞.
2. Although the analysis of Long and Servedio (2010) is carried out in the context of boosting, we believe the results hold for a larger class of algorithms which minimize a regularized risk with a convex loss function.
3. We slightly abuse notation and use l(u) to denote l(u(x, y, θ)).
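To make these concrete, (1)–(4) can be sketched in a few lines of Python; the quadratic L2 regularizer used for Ω here is an illustrative choice (the regularizers are discussed in Section 3):

```python
import numpy as np

def hinge_loss(u):        # (2)
    return np.maximum(0.0, 1.0 - u)

def exponential_loss(u):  # (3)
    return np.exp(-u)

def logistic_loss(u):     # (4)
    return np.logaddexp(0.0, -u)   # log(1 + exp(-u)), numerically stable

def regularized_risk(theta, X, y, loss=logistic_loss, lam=0.01):
    """J(theta) of (1) for a linear binary classifier with margins
    u_i = y_i <x_i, theta>; Omega is an illustrative L2 regularizer."""
    margins = y * (X @ theta)
    return 0.5 * lam * theta @ theta + np.mean(loss(margins))

X = np.array([[1.0, 2.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1.0, -1.0, -1.0])
# with theta = 0 every margin is 0, so the logistic empirical risk is log 2
print(regularized_risk(np.zeros(2), X, y))
```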


give a simple block coordinate descent scheme that can be used to solve the resultant regularized risk minimization problem. Analysis of this procedure also intuitively explains why t-logistic regression is able to handle label noise.

Our paper is structured as follows: In section 2 we briefly review the generalized exponential family, which includes the Student's t-distribution and the t-Laplace distribution. In section 3, we review logistic regression in the exponential family for both binary and multiclass classification. We then propose the t-logistic regression algorithm in section 6. In section ?? we utilize ideas from convex multiplicative programming to design an optimization strategy and give its convergence analysis. Experiments that compare our new approach to existing algorithms on a number of publicly available datasets are reported in section 9. This is followed by a discussion of related work in section ?? and an outlook in section 11.

2. Generalized exponential family of distributions

2.1 φ-exponential family of distributions

To define the generalized exponential family of distributions, we first review the generalizations of the log and exp functions introduced in statistical physics (Naudts, 2002, 2004a,b,c). Some extensions and machine learning applications were presented in Sears (2008). The φ-logarithm function is defined on the domain R+ as

    log_φ(x) = ∫_1^x 1/φ(u) du,    (5)

where φ(u) > 0 is an increasing function defined on R+. The φ-exponential function exp_φ is defined as the inverse of the φ-logarithm. Both log_φ and exp_φ generalize the usual log and exp functions, which are recovered when φ(u) = u. Many familiar properties of log and exp are therefore preserved: exp_φ is a non-negative, convex, monotonically increasing function passing through (0, 1), and log_φ is a concave, monotonically increasing function passing through (1, 0). Besides, it is easy to verify that the first derivatives of log_φ and exp_φ are

    d/dx log_φ(x) = 1/φ(x),    d/dx exp_φ(x) = φ(exp_φ(x)).    (6)

However, one key property is lost: log_φ(ab) ≠ log_φ(a) + log_φ(b) and exp_φ(a + b) ≠ exp_φ(a) exp_φ(b) when φ(u) ≠ u.

Analogous to the exponential family of distributions, the φ-exponential family of distributions is defined as (Naudts, 2004c; Sears, 2008):

    p(x; θ) := exp_φ(⟨φ(x), θ⟩ − g_φ(θ)).    (7)

Here g_φ is the log-partition function which ensures that p(x; θ) is normalized. In general, unlike g(θ) in the exponential family, no closed form solution exists for computing g_φ(θ) exactly. However, the following theorem, first proved by Naudts (2004c), shows that g_φ(θ) still preserves a few important properties.

Theorem 2.1 g_φ(θ) is a strictly convex function. In addition, if the following regularity condition

    ∫ ∇_θ p(x; θ) dx = ∇_θ ∫ p(x; θ) dx = ∇_θ 1 = 0    (8)


holds, then

    ∇_θ g_φ(θ) = E_{q_φ(x;θ)}[φ(x)],    (9)

where q_φ(x; θ) is the escort distribution

    q_φ(x; θ) := φ(p(x; θ))/Z(θ),  with  Z(θ) = ∫ φ(p(x; θ)) dx.    (10)

Proof  To prove convexity, we rely on elementary arguments. Recall that exp_φ is an increasing and strictly convex function. Choose θ_1 and θ_2 such that g_φ(θ_i) < ∞ for i = 1, 2, and let α ∈ (0, 1). Set θ_α = αθ_1 + (1 − α)θ_2, and observe that

    ∫ exp_φ(⟨φ(x), θ_α⟩ − αg_φ(θ_1) − (1 − α)g_φ(θ_2)) dx
      < α ∫ exp_φ(⟨φ(x), θ_1⟩ − g_φ(θ_1)) dx + (1 − α) ∫ exp_φ(⟨φ(x), θ_2⟩ − g_φ(θ_2)) dx = 1.

On the other hand, we also have

    ∫ exp_φ(⟨φ(x), θ_α⟩ − g_φ(θ_α)) dx = 1.

Again using the fact that exp_φ is an increasing function, we can conclude from the above two equations that g_φ(θ_α) < αg_φ(θ_1) + (1 − α)g_φ(θ_2). This shows that g_φ is a strictly convex function.

To show (9), use (8) combined with the fact that (d/du) exp_φ(u) = φ(exp_φ(u)) to write

    0 = ∫ ∇_θ p(x; θ) dx = ∫ ∇_θ exp_φ(⟨φ(x), θ⟩ − g_φ(θ)) dx
      = ∫ φ(exp_φ(⟨φ(x), θ⟩ − g_φ(θ))) (φ(x) − ∇_θ g_φ(θ)) dx
      ∝ ∫ q_φ(x; θ)(φ(x) − ∇_θ g_φ(θ)) dx = 0.

Rearranging terms and using ∫ q_φ(x; θ) dx = 1 directly yields (9).

Therefore, the main difference from ∇_θ g(θ) of the exponential family is that ∇_θ g_φ(θ) equals the expectation of φ(x) under the escort distribution q_φ(x; θ), instead of under p(x; θ).

2.2 t-exponential family of distributions

One example of the φ-exponential family which draws particular attention is the t-exponential family. The t-exponential/logarithm functions as well as the t-exponential family were first proposed in the 1980s by Tsallis (who called it the q-exponential family; we use t instead of q to avoid confusion with the escort distribution q).


The exp_t and log_t functions are the special case of exp_φ and log_φ with φ(u) = u^t for t > 0:

    exp_t(x) := exp(x)                        if t = 1,
                [1 + (1 − t)x]_+^{1/(1−t)}    otherwise,    (11)

where (·)_+ = max(·, 0). Some examples are shown in Figure 2. The inverse of exp_t, namely log_t, is

    log_t(x) := log(x)                        if t = 1,
                (x^{1−t} − 1)/(1 − t)         otherwise.    (12)
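Definitions (11) and (12) translate directly into code; a small sketch:

```python
import numpy as np

def exp_t(x, t):
    """exp_t from (11); reduces to exp at t = 1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    """log_t from (12), the inverse of exp_t on its range."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

x = np.linspace(-0.4, 1.8, 9)
assert np.allclose(log_t(exp_t(x, 1.5), 1.5), x)   # inverse pair
assert np.allclose(exp_t(x, 1.0), np.exp(x))       # t = 1 recovers exp
# exp_t decays towards 0 more slowly than exp for t > 1 (heavy tails):
assert exp_t(-5.0, 1.5) > np.exp(-5.0)
```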

Figure 2: Left: exp_t and middle: log_t for various values of t. The right panel depicts the t-logistic loss functions for different values of t; when t = 1, we recover the logistic loss.

From Figure 2, one can see that exp_t decays towards 0 more slowly than the exp function for t > 1. This important property leads to a family of heavy-tailed distributions.

3. Logistic Regression

Logistic regression is a discriminative model which is mainly used for classification. In particular, we are given a labeled dataset (X, y) = {(x_1, y_1), . . . , (x_m, y_m)}. The features x_i are drawn from some d-dimensional domain X, and the labels y_i are categorical, with C the number of classes. Given a family of conditional distributions parametrized by θ, making a standard i.i.d. assumption about the data allows us to write

    p(y | X, θ) = ∏_{i=1}^m p(y_i | x_i, θ).    (13)


Logistic regression computes a maximum likelihood (ML) estimate for θ by minimizing

    − log p(y | X, θ) = − Σ_{i=1}^m log p(y_i | x_i; θ)    (14)

as a function of θ. To avoid overfitting to the data, it is common to add a regularizer on the parameter θ, which yields the maximum a-posteriori (MAP) estimate of θ:

    − log p(θ | y, X) = − Σ_{i=1}^m log p(y_i | x_i; θ) − log p(θ) + const.,    (15)

where p(θ | y, X) = p(y | X, θ)p(θ)/p(y | X) by Bayes rule, and p(y | X) is constant in θ. The prior is usually assumed to be a zero-mean isotropic Gaussian, which yields

    − log p(θ) = (λ/2)‖θ‖₂² + const.    (16)

This is also called an L2 regularizer. On the other hand, when the feature space is huge, an L1 regularizer is favored because it enforces sparsity on the components of θ. To this end, a Laplace prior is normally chosen, which yields

    − log p(θ) = λ‖θ‖₁ + const.    (17)

In logistic regression, p(y_i | x_i; θ) is modeled as a conditional exponential family distribution

    p(y_i | x_i; θ) = exp(⟨Φ(x_i, y_i), θ⟩ − g(θ | x_i)).    (18)

We will concentrate on the linear classifier; this allows us to simplify

    Φ(x, y) = (0, . . . , 0, x, 0, . . . , 0),    (19)

where x occupies the y-th of the C blocks and 0 is the d-dimensional all-zero vector. Therefore ⟨Φ(x_i, y_i), θ⟩ = θ_{y_i}^T x_i, where θ is a (d × C)-dimensional vector

    θ = (θ_1, . . . , θ_C)    (20)

and θ_c is the d-dimensional c-th segment of θ. Hence

    p(y_i | x_i; θ) = exp(θ_{y_i}^T x_i − g(θ | x_i)).    (21)

The log-partition function is

    g(θ | x_i) = log Σ_{c=1}^C exp(θ_c^T x_i).    (22)
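The log-partition function (22) is a log-sum-exp over the class activations; a numerically stable sketch (the max-shift is a standard implementation trick, not part of the model):

```python
import numpy as np

def log_partition(theta, x):
    """g(theta | x) = log sum_c exp(theta_c^T x) of (22), computed stably.

    theta: (C, d) array whose rows are the per-class parameters theta_c.
    """
    scores = theta @ x                 # activations u_c = theta_c^T x
    m = np.max(scores)                 # shift by the max for stability
    return m + np.log(np.sum(np.exp(scores - m)))

def class_probs(theta, x):
    """p(y = c | x; theta) = exp(theta_c^T x - g(theta | x)), cf. (21)."""
    return np.exp(theta @ x - log_partition(theta, x))

theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # C = 3, d = 2
x = np.array([0.5, -0.5])
p = class_probs(theta, x)
assert abs(p.sum() - 1.0) < 1e-12
```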


4. Logistic Loss for Binary Classification

For a binary classification problem, we have

    p(y_i | x_i; θ) = exp(θ_{y_i}^T x_i − g(θ | x_i)),    (23)

where

    g(θ | x_i) = log(exp(θ_1^T x_i) + exp(θ_2^T x_i)).    (24)

If we rewrite θ̃ = θ_1 − θ_2, and slightly abuse notation by using y_i ∈ {+1, −1} to represent y_i ∈ {1, 2}, then through a simple modification we can equivalently write

    p(y_i | x_i; θ̃) = exp(½ y_i θ̃^T x_i − g̃(θ̃ | x_i)),    (25)

where

    g̃(θ̃ | x_i) = log(exp(½ θ̃^T x_i) + exp(−½ θ̃^T x_i)),    (26)

and

    − log p(y_i | x_i; θ̃) = −½ y_i θ̃^T x_i + log(exp(½ θ̃^T x_i) + exp(−½ θ̃^T x_i))    (27)
                          = log(1 + exp(−y_i θ̃^T x_i)).    (28)

The negative log-likelihood of binary logistic regression, also called the logistic loss, has been widely used as a convex surrogate loss for binary classification. For a long time, convex losses l(x, y, θ̃) have been strongly favored, mainly because convexity ensures that the empirical risk over {(x_i, y_i)} with i = 1, . . . , m,

    Σ_{i=1}^m l(x_i, y_i, θ̃),

has a unique global optimum (Boyd and Vandenberghe, 2004) and the rate of convergence of the algorithm can be analytically obtained. If we define the margin of a training example (x, y) as u(x, y, θ̃) := y⟨θ̃, x⟩, then many popular loss functions for binary classification can be written as functions of the margin. Examples include:

    l(u) = max(0, 1 − u)            (Hinge Loss)          (29)
    l(u) = exp(−u)                  (Exponential Loss)    (30)
    l(u) = log(1 + exp(−u))         (Logistic Loss)       (31)


Figure 3: Some commonly used loss functions for binary classification. The 0-1 loss is non-convex. The hinge, exponential, and logistic losses are convex upper bounds of the 0-1 loss.

4.1 Shortcomings of Convex Losses

However, as was recently shown by Long and Servedio (2010), learning algorithms based on convex loss functions for binary classification are not robust to noise⁴. Long and Servedio (2010) constructed the following dataset to show that minimizing a convex loss is not tolerant to label noise (label noise is added by flipping a portion of the labels of the training data). Each data point has a 21-dimensional feature vector and plays one of three possible roles: large margin examples (25%, x_{1,...,21} = y); pullers (25%, x_{1,...,11} = y, x_{12,...,21} = −y); and penalizers (50%: randomly select 5 of the first 11 coordinates and 6 of the last 10 coordinates and set them to y, and set the remaining coordinates to −y). They show that although convex losses can classify the clean data perfectly, adding 10% label noise to the dataset can fool the convex classifiers. This phenomenon is illustrated in Figure 4: the black double arrow is the true classifier, but after adding 10% label noise, the convex classifier changes to the red double arrow, which is no longer able to distinguish the penalizers and leads to around 25% error. The reason is intuitively shown in Figure 3: a convex loss function grows at least linearly with slope |l′(0)| for u ∈ (−∞, 0), so data points with u ≪ 0 have an overwhelming impact. Therefore, the true black classifier suffers large penalties from the flipped large margin data and is beaten by the red classifier even though the red one misclassifies more points.

4. Although the analysis of Long and Servedio (2010) is carried out in the context of boosting, we believe the results hold for a larger class of algorithms which minimize a regularized risk with a convex loss function.


Figure 4: The Long-Servedio dataset
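The construction above can be sketched in code (a hypothetical generator following the stated 25/25/50% role proportions; the function name and signature are our own):

```python
import numpy as np

def long_servedio(n, noise=0.0, rng=None):
    """Sample n points of the Long-Servedio construction described above.

    Returns X of shape (n, 21) and labels y in {-1, +1}; `noise` is the
    fraction of labels flipped at random.
    """
    rng = np.random.default_rng(rng)
    X = np.empty((n, 21))
    y = rng.choice([-1.0, 1.0], size=n)
    roles = rng.choice(3, size=n, p=[0.25, 0.25, 0.5])
    for i in range(n):
        if roles[i] == 0:        # large margin example: x_1..21 = y
            X[i] = y[i]
        elif roles[i] == 1:      # puller: x_1..11 = y, x_12..21 = -y
            X[i, :11], X[i, 11:] = y[i], -y[i]
        else:                    # penalizer: 5 of the first 11 and 6 of the
            X[i] = -y[i]         # last 10 coordinates equal y, the rest -y
            X[i, rng.choice(11, size=5, replace=False)] = y[i]
            X[i, 11 + rng.choice(10, size=6, replace=False)] = y[i]
    flip = rng.random(n) < noise
    y[flip] = -y[flip]
    return X, y

# On clean data the all-ones direction classifies every point correctly:
X, y = long_servedio(500, rng=1)
assert np.all(np.sign(X.sum(axis=1)) == y)
```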

5. Optimal Condition and Robustness

We have seen in the last section that logistic regression is not robust against label noise. In this section, we discuss the condition of robustness for discriminative models. For the dataset (x_1, y_1), . . . , (x_m, y_m), the maximum likelihood estimate (MLE) θ* of a discriminative model has to satisfy

    0 = −(1/m) ∂/∂θ log p(y | X, θ*) = −(1/m) Σ_{i=1}^m ∂/∂θ log p(y_i | x_i, θ*).

We consider a model to be robust if, given that θ* is the MLE of the dataset (x_1, y_1), . . . , (x_m, y_m) with m → ∞, after adding any new data point (x_{m+1}, y_{m+1}) the optimal condition still holds for θ*, that is,

    lim_{m→∞} [ (1/(m+1)) Σ_{i=1}^m ∂/∂θ log p(y_i | x_i, θ*) + (1/(m+1)) ∂/∂θ log p(y_{m+1} | x_{m+1}, θ*) ] = 0.

Since

    lim_{m→∞} (1/(m+1)) Σ_{i=1}^m ∂/∂θ log p(y_i | x_i, θ*) = 0,

this is equivalent to requiring that, for any data point (x_{m+1}, y_{m+1}),

    lim_{m→∞} (1/(m+1)) ∂/∂θ log p(y_{m+1} | x_{m+1}, θ*) = 0.


Furthermore, because the dataset (x_1, y_1), . . . , (x_m, y_m) is chosen arbitrarily, θ* can be any parameter matrix except the trivial all-zero matrix; besides, the additional data point (x_{m+1}, y_{m+1}) can also be chosen arbitrarily. Because of the existence of the regularizer, we assume that θ* is bounded by a constant: |θ*|_∞ ≤ M. Overall, we define robustness as follows.

Definition 5.1 A model is robust if and only if, for all (x_i, y_i) and all θ ≠ 0_{C×d} such that |θ|_∞ ≤ M,

    −∂/∂θ log p(y_i | x_i, θ)

is bounded.

Additionally, the following lemma, first shown in Liu (2004), is a straightforward consequence of Definition 5.1.

Lemma 5.2 A model is robust if and only if, for all (x_i, y_i) and all θ ≠ 0_{C×d} such that |θ|_∞ ≤ M,

    I(x_i, y_i, θ) = ⟨θ, −∂/∂θ log p(y_i | x_i, θ)⟩

is bounded.

5.1 Optimal Condition and Robustness of Logistic Regression

For logistic regression, the optimal condition is

    0 = −(1/m) ∂/∂θ log p(y | X, θ*)
      = −(1/m) Σ_{i=1}^m ∂/∂θ (⟨Φ(x_i, y_i), θ⟩ − g(θ | x_i))
      = −(1/m) Σ_{i=1}^m (Φ(x_i, y_i) − E_{p_i*}[Φ(x_i, y)]),

which yields

    Σ_{i=1}^m Φ(x_i, y_i) = Σ_{i=1}^m E_{p_i*}[Φ(x_i, y)],    (32)

and for any class c,

    Σ_{i=1}^m Φ(x_i)δ(y_i = c) = Σ_{i=1}^m Φ(x_i)p_i*(c),    (33)

where p_i* denotes p(y | x_i, θ*) for brevity. This optimal condition is sometimes referred to as moment matching in the exponential family. Now we examine the robustness of logistic regression:

    I(x_i, y_i, θ) = ⟨θ, −∂/∂θ log p(y_i | x_i, θ)⟩
                   = (Φ(x_i, y_i) − E_{p_i}[Φ(x_i, y)])^T θ,


where p_i denotes p(y | x_i, θ) for simplicity. Apparently, I(x_i, y_i, θ) is unbounded under the following conditions:

• p(c | x_i, θ) = max_y {p(y | x_i, θ)},
• y_i ≠ c, which implies θ_c^T x_i ≥ θ_{y_i}^T x_i,
• u_i = θ_c^T x_i → ∞.

Therefore, logistic regression is not robust.

6. t-logistic Regression

From the last section, we observe that designing a robust discriminative model comes down to the following two rules:

• Given a dataset (x_1, y_1), . . . , (x_m, y_m), θ* satisfies the optimal condition

    −(1/m) Σ_{i=1}^m ∂/∂θ log p(y_i | x_i, θ*) = 0.    (34)

• Given any data point (x_i, y_i) and any θ ≠ 0_{C×d} such that |θ|_∞ ≤ M,

    ⟨θ, −∂/∂θ log p(y_i | x_i, θ)⟩    (35)

is bounded.

We first introduce our discriminative model, which satisfies the above two rules, and then justify this in the following subsections. Our solution is surprisingly simple: we generalize logistic regression by replacing the conditional exponential family distribution with a conditional t-exponential family distribution (t > 1):

    p(y | x, θ) = exp_t(⟨Φ(x, y), θ⟩ − g_t(θ | x))    (36)
                = exp_t(θ_y^T x − g_t(θ | x)),    (37)

where the log-partition function g_t satisfies

    Σ_{c=1}^C exp_t(θ_c^T x − g_t(θ | x)) = 1.    (38)

Since in general no closed form solution for g_t exists, the efficiency of t-logistic regression depends heavily on how efficiently g_t(θ | x) can be computed. In the following, we show how g_t(θ | x) can be computed numerically.


6.1 Numerical Estimation of g_t

The basic idea is to solve for g_t via an iterative scheme. First, let us denote u_c = θ_c^T x, so that (38) simplifies to

    Σ_{c=1}^C exp_t(u_c − g_t(u)) = 1.    (39)

It is clear that there exist a ũ and a function

    Z(ũ) = Σ_{c=1}^C exp_t(ũ_c)    (40)

which satisfy

    exp_t(u_c − g_t(u)) = (1/Z(ũ)) exp_t(ũ_c)
                        = exp_t(Z(ũ)^{t−1} ũ_c + log_t(1/Z(ũ)))    ∀c ∈ {1, . . . , C},

where Z(ũ)^{t−1} ũ_c = u_c and log_t(1/Z(ũ)) = −g_t(u); the last equality uses the definition of the exp_t function. The iterative algorithm for computing g_t(u) goes as follows:

1. Compute u* = max{u}.
2. Offset v = u − u*.
3. Initialize ũ = v.
4. Compute Z = Σ_{c=1}^C exp_t(ũ_c).
5. Update ũ = Z^{1−t} v.
6. If not converged, go to step 4; otherwise go to step 7.
7. Output g_t = u* − log_t(1/Z).

Our experiments show that g_t(u) can be obtained with high accuracy in fewer than 10 iterations. To illustrate, we let C ∈ {10, 20, . . . , 100}, randomly generate u ∈ [−10, 10]^C, and compute the corresponding g_t(u). We compare the time spent estimating g_t(u) by the iterative scheme and by calling the Matlab fsolve function, averaged over 100 runs using Matlab 7.1 on a 2.93 GHz Dual-Core CPU. The results, presented in Table 1, show that our iterative method scales well with the number of classes C, making it efficient enough for problems involving a large number of classes.
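The seven steps above translate directly into code (a sketch reusing exp_t and log_t from (11)–(12); `max_iter` and `tol` are illustrative choices):

```python
import numpy as np

def exp_t(x, t):
    return np.exp(x) if t == 1.0 else np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def g_t(u, t, max_iter=100, tol=1e-14):
    """Numerically solve sum_c exp_t(u_c - g_t(u)) = 1 by steps 1-7 above."""
    u_star = np.max(u)                      # step 1
    v = u - u_star                          # step 2
    v_tilde = v.copy()                      # step 3
    for _ in range(max_iter):
        Z = np.sum(exp_t(v_tilde, t))       # step 4
        v_next = Z ** (1.0 - t) * v         # step 5
        done = np.max(np.abs(v_next - v_tilde)) < tol
        v_tilde = v_next                    # step 6: iterate to convergence
        if done:
            break
    Z = np.sum(exp_t(v_tilde, t))
    return u_star - log_t(1.0 / Z, t)       # step 7

rng = np.random.default_rng(0)
u = rng.uniform(-10.0, 10.0, size=20)
for t in (1.3, 1.6, 1.9):
    g = g_t(u, t)
    # the defining normalization (39) holds at the solution:
    assert abs(np.sum(exp_t(u - g, t)) - 1.0) < 1e-6
```

At t = 1 the procedure converges in one step and returns the familiar log-sum-exp.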


Table 1: Average time (in milliseconds) spent by our iterative scheme and fsolve on solving g_t(u).

    C          10    20    30    40    50    60    70    80    90    100
    fsolve     8.1   8.3   8.1   8.7   9.6   9.8   10.0  10.2  10.3  10.7
    iterative  0.3   0.3   0.3   0.4   0.4   0.4   0.3   0.3   0.4   0.5

6.2 Optimal Condition and Robustness of t-logistic Regression

The MLE θ* of t-logistic regression satisfies

    0 = −(1/m) ∂/∂θ log p(y | X, θ*)
      = −(1/m) Σ_{i=1}^m ∂/∂θ log exp_t(θ*_{y_i}^T x_i − g_t(θ* | x_i))
      = −(1/m) Σ_{i=1}^m (Φ(x_i, y_i) − E_{q_i*}[Φ(x_i, y)]) p(y_i | x_i, θ*)^{t−1},

and the optimal condition for t-logistic regression is

    Σ_{i=1}^m ξ_i Φ(x_i, y_i) = Σ_{i=1}^m ξ_i E_{q_i*}[Φ(x_i, y)];    (41)

for any class c,

    Σ_{i=1}^m ξ_i Φ(x_i)δ(y_i = c) = Σ_{i=1}^m ξ_i Φ(x_i) q_i*(c),    (42)

where q_i*(y) ∝ p_i*(y)^t is the escort distribution of p_i*(y) = p(y | x_i, θ*), and

    ξ_i = p(y_i | x_i, θ*)^{t−1}    (43)

is the weight of each data point. There are two remarks about this optimal condition. First, it may seem odd to see q_i* instead of p_i*. As a matter of fact, q_i* simply sharpens the distribution p_i*, making it closer to e_{c_i*}, where c_i* = argmax_c ⟨θ_c*, x_i⟩.

Note that if we use e_{c_i*} in place of p_i* in the optimal condition of logistic regression, it yields the perceptron method, where

    Σ_{i=1}^m Φ(x_i, y_i) = Σ_{i=1}^m Φ(x_i, c_i*).


Second, when t > 1, the impact of the i-th data point is controlled by its weight ξ_i. Intuitively, these weights dampen the contribution of any outliers, since outliers always have small ξ_i. But do these extra weights make t-logistic regression robust? We verify this as follows. According to (36) and Lemma 5.2, since log p(y_i | x_i, θ) = log(exp_t(⟨Φ(x_i, y_i), θ⟩ − g_t(θ | x_i))), we have

    I(x_i, y_i, θ) = − exp_t^{t−1}(⟨Φ(x_i, y_i), θ⟩ − g_t(θ | x_i)) ⟨θ, ∂/∂θ (⟨Φ(x_i, y_i), θ⟩ − g_t(θ | x_i))⟩
                   = − exp_t^{t−1}(⟨Φ(x_i, y_i), θ⟩ − g_t(θ | x_i)) ⟨Φ(x_i, y_i) − E_{q_i}[Φ(x_i, y)], θ⟩
                   = (⟨E_{q_i}[Φ(x_i, y)], θ⟩ − ⟨Φ(x_i, y_i), θ⟩) / (1 + (t − 1)(g_t(θ | x_i) − ⟨Φ(x_i, y_i), θ⟩)).

For simplicity, let us use the vector u_i, where each element u_{iy} = θ_y^T x_i, and write p(y_i | x_i, θ) = p(y_i | u_i) = exp_t(u_{iy_i} − g_t(u_i)), where

    Σ_{c=1}^C exp_t(u_{ic} − g_t(u_i)) = 1

and

    ∂/∂u_i g_t(u_i) = E_{q_i}[e_y] = [q_{i1}; . . . ; q_{iC}]  with  q_{iy} ∝ p(y | u_i)^t.

Then we have

    I(x_i, y_i, θ) = I(u_i, y_i) = − exp_t^{t−1}(u_{iy_i} − g_t(u_i))(u_{iy_i} − E_{q_i}[e_y]^T u_i)
                   = (E_{q_i}[e_y]^T u_i − u_{iy_i}) / (1 + (t − 1)(g_t(u_i) − u_{iy_i})).

We now show that as u_{ic} → ∞ (∀c), I(u_i, y_i) remains bounded. First, if |E_{q_i}[e_y]^T u_i − u_{iy_i}| is bounded as u_{ic} → ∞ (∀c), then because t > 1 and g_t(u_i) − u_{iy_i} ≥ 0,

    lim_{u_i→∞} |I(u_i, y_i)| ≤ |E_{q_i}[e_y]^T u_i − u_{iy_i}|

is also bounded. On the other hand, if E_{q_i}[e_y]^T u_i − u_{iy_i} → ∞, then we apply L'Hôpital's rule:

    lim_{u_{ic}→∞} I(u_i, y_i) = lim_{u_{ic}→∞} (E_{q_i}[e_y]^T u_i − u_{iy_i}) / (1 + (t − 1)(g_t(u_i) − u_{iy_i}))
                              = lim_{u_{ic}→∞} (∂/∂u_{ic} (E_{q_i}[e_y]^T u_i − u_{iy_i})) / (∂/∂u_{ic} (1 + (t − 1)(g_t(u_i) − u_{iy_i})))
                              = (q_{ic} − δ(y_i = c)) / ((t − 1)(q_{ic} − δ(y_i = c)))
                              = 1/(t − 1),


where the third equality follows from the property that ∂/∂u_{ic} g_t(u_i) = q_{ic}. Therefore, we conclude that t-logistic regression is robust, because I(x_i, y_i, θ) is bounded for any (x_i, y_i) and θ.
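The 1/(t − 1) limit can be checked numerically in the binary case; the following sketch computes I(u_i, y_i) from its definition, with g_t obtained by the iterative scheme of Section 6.1:

```python
import numpy as np

def exp_t(x, t):  # (11), for t > 1
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):  # (12), for t > 1
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def g_t(u, t, max_iter=100):
    """Iterative scheme of Section 6.1 for the log-partition g_t(u)."""
    u_star = np.max(u)
    v = u - u_star
    v_tilde = v.copy()
    for _ in range(max_iter):
        Z = np.sum(exp_t(v_tilde, t))
        v_tilde = Z ** (1.0 - t) * v
    return u_star - log_t(1.0 / np.sum(exp_t(v_tilde, t)), t)

def influence(u, y, t):
    """I(u_i, y_i) = (E_q[e_y]^T u - u_y) / (1 + (t-1)(g_t(u) - u_y))."""
    g = g_t(u, t)
    p = exp_t(u - g, t)
    q = p ** t / np.sum(p ** t)          # escort distribution q_iy
    return (q @ u - u[y]) / (1.0 + (t - 1.0) * (g - u[y]))

# a data point of true class 0 with an extremely negative activation:
t = 2.0
u = np.array([-1e4, 0.0])
assert abs(influence(u, 0, t) - 1.0 / (t - 1.0)) < 1e-3
```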

7. t-logistic Loss for Binary Classification

After a general discussion of t-logistic regression, we now closely investigate t-logistic regression for the binary classification problem. Binary t-logistic regression is interesting because we are able to visualize its loss function and establish its Bayes consistency. T-logistic regression for binary classification is simply the two-class case:

    p(y_i | x_i; θ) = exp_t(θ_{y_i}^T x_i − g_t(θ | x_i)),    (44)

where

    exp_t(θ_1^T x − g_t(θ | x)) + exp_t(θ_2^T x − g_t(θ | x)) = 1.    (45)

Just like logistic regression, binary t-logistic regression can also be rewritten with θ̃ = ½(θ_1 − θ_2) and y ∈ {+1, −1}:

    p(y_i | x_i; θ̃) = exp_t(y_i θ̃^T x_i − g̃_t(θ̃ | x_i)),

where

    exp_t(θ̃^T x − g̃_t(θ̃ | x)) + exp_t(−θ̃^T x − g̃_t(θ̃ | x)) = 1.

We call the negative log-likelihood of t-logistic regression the t-logistic loss. For a data point with margin u, the t-logistic loss is

    l(u) = − log exp_t(u − g̃_t(u)),    (t-logistic loss)

where exp_t(u − g̃_t(u)) + exp_t(−u − g̃_t(u)) = 1. When t = 2, we obtain an analytical form for the binary t-logistic loss. To see this, note that when t = 2, exp_t(x) = 1/(1 − x); therefore

    1/(1 − u + g̃_t(u)) + 1/(1 + u + g̃_t(u)) = 1,

which leads to

    g̃_t(u) = √(1 + u²)  and  l(u) = log(1 − u + √(1 + u²)).

Notice that we no longer obtain a convex loss function. The loss function bends downward as the margin of a data point becomes very negative, which is usually the indication of an outlier. The benefit of this nonconvexity is that the losses incurred by outliers are not as large as under convex losses, so that the model is more robust against outliers.
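The t = 2 closed form can be sanity-checked against a g̃_t(u) obtained by bisection on the normalization exp_t(u − g̃_t(u)) + exp_t(−u − g̃_t(u)) = 1 (a small sketch):

```python
import math

def exp2(x):
    """exp_t at t = 2: 1/(1 - x) for x < 1 (and +inf otherwise), cf. (11)."""
    return 1.0 / (1.0 - x) if x < 1.0 else float("inf")

def g_tilde_numeric(u, lo=-50.0, hi=50.0):
    """Solve exp2(u - g) + exp2(-u - g) = 1 for g by bisection
    (the left side is decreasing in g)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if exp2(u - mid) + exp2(-u - mid) > 1.0:
            lo = mid   # normalization still too large: increase g
        else:
            hi = mid
    return 0.5 * (lo + hi)

for u in (-3.0, -0.5, 0.0, 1.2, 4.0):
    g_closed = math.sqrt(1.0 + u * u)
    assert abs(g_tilde_numeric(u) - g_closed) < 1e-8
    # the resulting loss matches log(1 - u + sqrt(1 + u^2)):
    assert abs(-math.log(exp2(u - g_tilde_numeric(u)))
               - math.log(1.0 - u + g_closed)) < 1e-8
```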


[Figure: the t-logistic loss as a function of the margin for t = 1 (logistic), 1.3, 1.6, and 1.9, together with the 0-1 loss; cf. the right panel of Figure 2.]

7.1 Bayes Consistency

It is well known that all the convex losses mentioned earlier for binary classification are Bayes consistent. In this subsection, we establish the Bayes consistency of the t-logistic loss. We first review some concepts from Bartlett et al. (2006). Binary classification algorithms essentially minimize the risk

    R_L(f) = E_{x,y}[L(yf(x))].

For a given x, if we denote η(x) = p(y = 1 | x), then the associated conditional risk given x can be written as

    C_L(η, f) = E_{y|x}[L(yf(x))] = ηL(f) + (1 − η)L(−f).

The 0-1 loss can be written as

    L_{0/1}[yf(x)] = 1 if yf(x) < 0, and 0 otherwise;

minimizing the conditional 0-1 risk

    E_{y|x}[L_{0/1}[yf(x)]] = ηL_{0/1}[f(x)] + (1 − η)L_{0/1}[−f(x)]

yields the Bayes-optimal classifier

    f*(x) = sign[2η(x) − 1].

Definition 7.1 A loss function L is Bayes consistent if f_L*(η), the minimizer of C_L(η, f), satisfies

    sign[f_L*(η)] = sign[2η − 1].    (46)


For t-logistic regression, we have

    L(f) = − log exp_t(½ f − g̃_t(f)),    (47)

and

    C_L(η, f) = ηL(f) + (1 − η)L(−f)    (48)
              = −η log exp_t(½ f − g̃_t(f)) − (1 − η) log exp_t(−½ f − g̃_t(f))    (49)
              = −η log exp_t(½ f − g̃_t(f)) − (1 − η) log(1 − exp_t(½ f − g̃_t(f))).    (50)

Minimizing over f results in the f_L* which satisfies η = exp_t(½ f* − g̃_t(f*)), and therefore

    f_L*(η) = log_t η − log_t(1 − η),    (51)
    C_L*(η) = −η log η − (1 − η) log(1 − η).    (52)

Since log_t is monotonically increasing, f_L*(η) has the same sign as 2η − 1. It is then clear that the t-logistic loss is Bayes consistent, since it satisfies (46).
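For t = 2, the minimizer (51) can be verified numerically (a sketch; with the ½-margin convention used above, the same calculation as in Section 7 gives g̃_t(f) = √(1 + f²/4) at t = 2):

```python
import math

def exp2(x):
    return 1.0 / (1.0 - x)        # exp_t at t = 2, valid for x < 1

def log_t2(x):
    return 1.0 - 1.0 / x          # log_t at t = 2, cf. (12)

def conditional_risk(eta, f):
    """C_L(eta, f) = eta*L(f) + (1-eta)*L(-f) for the t = 2 loss."""
    g = math.sqrt(1.0 + 0.25 * f * f)     # closed-form g_tilde(f) at t = 2
    L = lambda v: -math.log(exp2(0.5 * v - g))
    return eta * L(f) + (1.0 - eta) * L(-f)

eta = 0.7
f_grid = [i * 1e-3 for i in range(-10000, 10001)]
f_num = min(f_grid, key=lambda f: conditional_risk(eta, f))
f_closed = log_t2(eta) - log_t2(1.0 - eta)     # (51)
assert abs(f_num - f_closed) < 1e-2
assert (f_num > 0) == (2.0 * eta - 1.0 > 0)    # the sign condition (46)
```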

8. Generalization of the Logistic Regression by Trust Functions

As we have shown in the last section, t-logistic regression provides an interesting generalization of logistic regression. Meanwhile, there are also other kinds of generalizations of logistic regression. Recall that logistic regression takes

    p(y | x; θ) = exp(⟨Φ(x, y), θ⟩ − g(θ | x)).    (53)

Denoting the margin vector by u, where u_c = ⟨Φ(x, c), θ⟩, we propose a different generalization of logistic regression, based on the energy-based learning framework, which replaces the margin u with s(u), where s : R → R. This modifies logistic regression as follows:

    p(y | x; θ) = exp(s(⟨Φ(x, y), θ⟩) − g̃(θ | x))    (54)
                = exp(s(u_y) − g̃(u)),    (55)

where

    g̃(θ | x) = log Σ_{c=1}^C exp(s(⟨Φ(x, c), θ⟩))    (56)
              = log Σ_{c=1}^C exp(s(u_c)).    (57)

We call s an s-function, and its derivative s′ a trust function, if they have the following properties:

1. s is antisymmetric: s(−u) = −s(u), s(0) = 0.
2. s(∞) = ∞, which assures that the loss remains unbounded.


3. s′(u) ∈ [0, 1] and s′(0) = 1, because we want the trust s′(u) of the linear activation u to be a probability.
4. s′ is symmetric: s′(−u) = s′(u).
5. The trust s′(u) is decreasing on [0, ∞), i.e., as we go away from zero.
6. u s′(u) is increasing in u for u ≥ 0 and asymptotes at a constant as u → ∞.

Our main family of trust functions will be scaled versions of the arcsinh(u) function⁵ (see Figure ??):

    s_q(u) := (1/q) arcsinh(q u),    s′_q(u) = 1/√(q²u² + 1),    lim_{u→∞} u s′_q(u) = 1/q,

where q > 0 is a positive scaling parameter. Note that lim_{q→0} s_q(u) = u, and therefore we define s_0(u) as the identity function u. The function s_0(u) is the straight line in Figure ??; as q increases, s_q(u) bends to the right above zero and to the left below zero. The corresponding trust functions s′_q(u) are depicted in Figure ??. Note that s′_0(u) = 1, i.e., full trust everywhere. However, the larger q, the less trust is given to linear activations with large absolute value.

8.1 Binary Classification

For binary classification, using labels y ∈ {+1, −1} and denoting the margin u = y θ̃^T x, one can generalize the logistic loss to obtain the trust loss

    l(u) = log(1 + exp(s(−u))).

In particular, when s(u) = arcsinh(u), we have

    l(u) = log(1 − u + √(1 + u²)).

It is clear that this trust loss is equal to the t-logistic loss with t = 2.

8.2 Optimal Condition and Robustness

For trust regression, the optimal condition is

    0 = −(1/m) ∂/∂θ log p(y | X, θ*)
      = −(1/m) Σ_{i=1}^m ∂/∂θ (s(⟨Φ(x_i, y_i), θ⟩) − g̃(θ | x_i))
      = −(1/m) Σ_{i=1}^m (s′(⟨Φ(x_i, y_i), θ⟩)Φ(x_i, y_i) − E_{p_i*}[s′(⟨Φ(x_i, y), θ⟩)Φ(x_i, y)]).

5. arcsinh(u) = ln(u + √(1 + u²))
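The equivalence between the arcsinh trust loss and the t = 2 t-logistic loss noted in Section 8.1 can be checked numerically (a quick sketch, using the identity of footnote 5):

```python
import math

def trust_loss(u):
    """l(u) = log(1 + exp(s(-u))) with s(u) = arcsinh(u)."""
    return math.log1p(math.exp(math.asinh(-u)))

def t2_logistic_loss(u):
    """The t = 2 t-logistic loss of Section 7: log(1 - u + sqrt(1 + u^2))."""
    return math.log(1.0 - u + math.sqrt(1.0 + u * u))

# exp(asinh(-u)) = -u + sqrt(1 + u^2), so the two losses agree exactly:
for u in (-5.0, -1.0, 0.0, 0.7, 3.0):
    assert abs(trust_loss(u) - t2_logistic_loss(u)) < 1e-12
```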


In particular, for any class c, we have

    0 = −(1/m) Σ_{i=1}^m (s′(⟨Φ(x_i, y_i), θ⟩)Φ(x_i)δ(y_i = c) − s′(⟨Φ(x_i, c), θ⟩)Φ(x_i)p_i*(c))
      = −(1/m) Σ_{i=1}^m (s′(⟨Φ(x_i, c), θ⟩)Φ(x_i)δ(y_i = c) − s′(⟨Φ(x_i, c), θ⟩)Φ(x_i)p_i*(c))
      = −(1/m) Σ_{i=1}^m ξ̃_{ic} Φ(x_i)(δ(y_i = c) − p_i*(c)),

which immediately yields the optimal condition

    Σ_{i=1}^m ξ̃_{ic} Φ(x_i)δ(y_i = c) = Σ_{i=1}^m ξ̃_{ic} Φ(x_i)p_i*(c),    (58)

where p_i*(y) denotes p(y | x_i, θ*) for brevity, and

    ξ̃_{ic} = s′(⟨Φ(x_i, c), θ⟩)    (59)

is the weight of each data point. The second equality holds because whenever y_i ≠ c, the first term is always 0. From the above gradient, it can be verified that trust regression is robust:

    I(x_i, y_i, θ) = ⟨θ, −∂/∂θ log p(y_i | x_i, θ)⟩
                   = (s′(⟨Φ(x_i, y_i), θ⟩)Φ(x_i, y_i) − E_{p_i*}[s′(⟨Φ(x_i, y), θ⟩)Φ(x_i, y)])^T θ
                   = s′(⟨Φ(x_i, y_i), θ⟩)⟨Φ(x_i, y_i), θ⟩ − E_{p_i*}[s′(⟨Φ(x_i, y), θ⟩)⟨Φ(x_i, y), θ⟩].

Denoting u_{iy} = ⟨Φ(x_i, y), θ⟩, and making use of the sixth property of the s-function, namely that s′(u)u is upper bounded by a constant, we have that

    I(x_i, y_i, θ) = s′(u_{iy_i})u_{iy_i} − E_{p_i*}[s′(u_{iy})u_{iy}]

is bounded.

8.3 t-logistic Regression vs. Trust Regression

We have already seen that t-logistic regression and trust regression are closely related in the binary classification setting. However, it is easy to verify that the two algorithms are quite different in general. Since both t-logistic regression and trust regression are generalizations of logistic regression, a natural question to ask is: which one is better? To answer this question, let us first look at the gradients of the log-likelihood functions. For any class c, the gradient of logistic regression is

    −∂/∂θ_c log p(y | X, θ) = −Σ_{i=1}^m Φ(x_i)(δ(y_i = c) − p_i(c)).


The gradient of t-logistic regression is

-\frac{\partial}{\partial \theta_c} \log p(y \,|\, X, \theta) = -\sum_{i=1}^{m} \xi_i \, \Phi(x_i) \left( \delta(y_i = c) - q_i(c) \right),

where \xi_i = p_i(y_i)^{t-1}. The gradient of trust regression is

-\frac{\partial}{\partial \theta_c} \log p(y \,|\, X, \theta) = -\sum_{i=1}^{m} \tilde{\xi}_{ic} \, \Phi(x_i) \left( \delta(y_i = c) - p_i(c) \right),

where \tilde{\xi}_{ic} = s'(\langle \Phi(x_i, c), \theta \rangle). The contribution of each data point to the gradient can thus be decomposed into the product of two terms. The first is a weight term, equal to 1, \xi_i, and \tilde{\xi}_{ic} for logistic regression, t-logistic regression, and trust regression respectively; the second is a sign term, which is positive when c = y_i and negative otherwise. The sign terms of the three algorithms are very similar, so the main difference lies in the weight assigned to each data point. The weight term essentially measures the goodness of a data point. In t-logistic regression, \xi_i depends on both x_i and y_i; in trust regression, on the other hand, \tilde{\xi}_{ic} is independent of the label y_i. In supervised learning, the goodness of a data point (x_i, y_i) depends not only on its feature x_i but also on its label y_i. In other words, in t-logistic regression, an outlier with low p_i(y_i) has low impact on all the \theta_c's; in trust regression, since the weights differ across classes c, an outlier may still have impact on some \theta_c's. We compare logistic regression, t-logistic regression, and trust regression in the experiments.
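For intuition, the weight terms can be computed directly from the class-conditional probabilities. The sketch below (our own illustration; the helper name is not from the paper) contrasts the constant weight of logistic regression with the t-logistic weight \xi_i = p_i(y_i)^{t-1}, which shrinks toward 0 for badly fit (likely mislabeled) points:

```python
import numpy as np

def tlogistic_weights(p_true, t=1.9):
    # t-logistic weight of each data point: xi_i = p_i(y_i)^(t-1).
    # Logistic regression corresponds to a constant weight of 1.
    return p_true ** (t - 1.0)

# p_i(y_i): probability the current model assigns to the *observed* label.
# Clean points are fit well (p close to 1); a label-noise point is not.
p_true = np.array([0.95, 0.90, 0.80, 0.05])  # last point looks mislabeled
print(tlogistic_weights(p_true, t=1.9))
```

The last point's weight is an order of magnitude smaller than the others, so its pull on every \theta_c is capped, matching the robustness argument above; at t = 1 all weights reduce to 1 and logistic regression is recovered.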

9. Experimental Evaluation

In the experiments, we evaluate t-logistic regression on synthetic datasets, real datasets for binary classification, and real datasets for multiclass classification.

9.1 Synthetic Datasets

We first examine t-logistic regression on two benchmark synthetic datasets: Long-Servedio and Mease-Wyner. The Long-Servedio dataset is artificially constructed to show that algorithms which minimize a differentiable convex loss are not tolerant to label noise (Long and Servedio, 2010). The Mease-Wyner dataset is another synthetic dataset for testing the effect of label noise. The input x is a 20-dimensional vector where each coordinate is uniformly distributed on [0, 1]. The label y is +1 if \sum_{j=1}^{5} x_j \geq 2.5 and -1 otherwise (Mease and Wyner, 2008). It is also a natural example for validating the ability of an algorithm to learn in a sparse feature space. In our experiments, we use 800 samples for training, 200 for validation, and 1000 for testing. We use the identity feature map \Phi(x) = x in all our experiments, and set t \in \{1.3, 1.6, 1.9\} for t-logistic regression. Our main comparator for t-logistic regression is logistic regression. The parameter \lambda is selected from \{0.1, 0.01, \ldots, 10^{-8}\}. Label noise is added by randomly choosing 10% or 20% of the labels in the training set and flipping them; each dataset is tested with and without label noise. The convergence criterion is to stop when the change in the objective function value is less than 10^{-8} or a maximum of 3000 function evaluations has been reached.
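For concreteness, the Mease-Wyner data with flipped labels can be generated as follows (a minimal sketch; the function name and signature are our own):

```python
import numpy as np

def make_mease_wyner(n, d=20, noise=0.0, seed=None):
    """x ~ Uniform[0,1]^d; y = +1 iff the first five coordinates sum to >= 2.5.
    A `noise` fraction of the labels is flipped uniformly at random."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    y = np.where(X[:, :5].sum(axis=1) >= 2.5, 1, -1)
    if noise > 0.0:
        flip = rng.random(n) < noise  # mark a random fraction for flipping
        y = np.where(flip, -y, y)
    return X, y

X_train, y_train = make_mease_wyner(800, noise=0.1, seed=0)
```

Only the first five coordinates carry signal, so the remaining fifteen act as the sparse-feature distractors described above.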


Figure 5: Test results on Long-Servedio and Mease-Wyner datasets. (Two panels, test accuracy (%) versus label-noise fraction 0-0.2, with curves Log(L1), tLog(L1), Log(L2), and tLog(L2).)

We plot the test accuracy in Figure 5. The label noise in the Long-Servedio dataset fools logistic regression, but it is unable to fool t-logistic regression. On the Mease-Wyner dataset, with its sparse feature space, t-logistic regression with the L2 regularizer partially benefits from its Student's t-prior and outperforms L2 logistic regression with its Gaussian prior, while the two algorithms perform more similarly with the L1 regularizer.

To obtain Figure 6 we used the datasets with 10% label noise, chose the optimal parameter \lambda from the previous experiment, and plotted the distribution of 1/z \propto \xi obtained after training with t = 1.9 and the L2 regularizer. To distinguish the points with noisy labels we plot them in cyan, while the other points are plotted in red. Recall that \xi denotes the influence of a point. One can clearly observe that the \xi of the noisy data is much smaller than that of the clean data, which indicates that the algorithm is able to effectively identify these points and cap their influence. In particular, on the Long-Servedio dataset observe the four distinct spikes. From left to right, the first spike corresponds to the noisy large-margin examples, the second to the noisy pullers, the third to the clean pullers, and the rightmost to the clean large-margin examples. Clearly, the noisy large-margin examples and the noisy pullers are assigned a low value of \xi, capping their influence and leading to perfect classification of the test set. Logistic regression, on the other hand, is unable to discriminate between clean and noisy training samples, which leads to bad performance on noisy datasets.

9.2 Binary Classification

Datasets  Table 2 summarizes the datasets used in our experiments. adult9, astro-ph, news20, real-sim, reuters-c11, and reuters-ccat are from the same source as in Hsieh et al. (2008).
aut-avn is from Andrew McCallum’s home page6 , covertype is from the UCI repository (Merz and Murphy, 1998), worm is from Franc and Sonnenburg (2008), kdd99 is from KDD

6. http://www.cs.umass.edu/˜mccallum/data/sraa.tar.gz.


Figure 6: The distribution of \xi obtained after training t-logistic regression with t = 1.9. Left: Long-Servedio; Right: Mease-Wyner. (Histograms of \xi \in [0, 1], with flipped and unflipped examples shown separately.)

Cup 1999^7, while web8, webspam-u, webspam-t^8, as well as kdda and kddb^9, are from the LibSVM binary data collection^10. The alpha, delta, fd, gamma, and zeta datasets were all obtained from the Pascal Large Scale Learning Workshop website (Sonnenburg et al., 2008). For the datasets which were also used by Teo et al. (2010) (indicated by an asterisk in Table 2) we used the training/test split provided by them, and for the remaining datasets we used 80% of the labeled data for training and the remaining 20% for testing. In order to learn the parameters \lambda and t, we further partition the training set obtained above into two parts, in which 80% is used for training and the remaining 20% for validation. In all cases, we added a constant feature as a bias.

Results  We use the same experimental methodology as for the synthetic datasets, and list the test accuracy results for the four comparators, L1/L2 logistic regression and L1/L2 t-logistic regression, in the tables. We also plot the \xi graph for each of these datasets; the black line denotes the \xi value of the decision boundary (p(y \,|\, x, \theta) = 0.5). Because \xi \propto p(y \,|\, x, \theta)^{t-1}, different values of t give different \xi values at the decision boundary. Points to the right of the boundary have p(y \,|\, x, \theta) > 0.5; points to the left have p(y \,|\, x, \theta) < 0.5. When label noise is added, t-logistic regression caps the influence of the data with flipped labels (red), and in most of these cases its test accuracy is significantly better than that of logistic regression. Notable examples are astro-ph, aut-avn, fd, real-sim, and worm. For those cases where the test accuracy of

7. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
8. webspam-u is the webspam-unigram and webspam-t is the webspam-trigram dataset. The original dataset can be found at http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html.
9. These datasets were derived from KDD CUP 2010. kdda is the first problem, algebra 2008 2009, and kddb is the second problem, bridge to algebra 2008 2009.
10. http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/binary.html.


Table 2: Summary of the datasets used in our experiments. n is the total # of examples, d is the # of features, s is the feature density (% of features that are non-zero), and n+ : n− is the ratio of the number of positive vs negative examples. M denotes a million.

dataset       n        d        s(%)   n+ : n−
adult9        48,842   123      11.3   0.32
astro-ph      94,856   99,757   0.08   0.31
beta          500,000  500      100    1.00
delta         500,000  500      100    1.00
fd            625,880  900      100    0.09
kdd99         5.21 M   127      12.86  4.04
kddb          20.01 M  29.89 M  1e-4   6.18
real-sim      72,201   2.97 M   0.25   0.44
reuters-ccat  804,414  1.76 M   0.16   0.90
webspam-t     350,000  16.61 M  0.022  1.54
worm          1.03 M   804      25     0.06
alpha         500,000  500      100    1.00
aut-avn       71,066   20,707   0.25   1.84
covertype     581,012  54       22.22  0.57
epsilon       500,000  2000     100    1.00
gamma         500,000  500      100    1.00
kdda          8.92 M   20.22 M  2e-4   5.80
news20        19,954   7.26 M   0.033  1.00
reuters-c11   804,414  1.76 M   0.16   0.03
web8          59,245   300      4.24   0.03
webspam-u     350,000  254      33.8   1.54
zeta          500,000  800.4 M  100    1.00

t-logistic regression is comparable to that of logistic regression, the \xi of the data with label noise is still smaller than that of the clean data most of the time, which indicates that the flipped data has less influence. When no label noise is added, the performance of t-logistic regression and logistic regression is comparable. Additionally, on datasets such as astro-ph, aut-avn, fd, kdd99, news20, real-sim, reuters-c11, web8, webspam-trigram, and worm, the \xi values of the training data are almost all concentrated in the region where p(y \,|\, x, \theta) \approx 1, indicating that the weight vector fits the training data nearly perfectly.

9.3 Multiclass Classification

Datasets  Table 5 summarizes the datasets used in our multiclass classification experiments. For the datasets which were also used by Teo et al. (2010) (indicated by an asterisk in Table 5) we used the training/test split provided by them, and for the remaining datasets we used 80% of the labeled data for training and the remaining 20% for testing. In order to learn the parameters \lambda and t, we further partition the training set obtained above into two parts, in which 80% is used for training and the remaining 20% for validation. In all cases, we added a constant feature as a bias.

Results  We use the same experimental methodology as before, and list the test errors below. To see how t-logistic regression works, we also plot the \xi variables of the results; the black line denotes the \xi value of the decision boundary (p(y \,|\, x, \theta) = (nc)^{-1}). When label noise is added, we observe that almost all the test results of t-logistic regression are better, and the \xi plots show that the data with added label noise (red) are identified by the algorithm and given far less influence than the majority of the clean data (blue). When no label noise is added, t-logistic regression still works better than or as well as logistic regression on most of the datasets. It is observable that where t-logistic regression is better, especially on the letter and mnist datasets, a portion of the data receives small \xi-values. It is very likely that these are outliers present in the original dataset, and the ability of t-logistic regression to remove or cap the influence of such data clearly improves the test performance.
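The \xi value at the decision boundary (the black line in the plots) follows directly from \xi \propto p(y \,|\, x, \theta)^{t-1}: at the boundary, p equals 0.5 in the binary case and (nc)^{-1} with nc classes. A small sketch (our own helper name) computes this threshold:

```python
def xi_at_boundary(t, n_classes=2):
    # xi is proportional to p^(t-1); at the decision boundary p = 1/n_classes
    # (0.5 for binary problems, (nc)^-1 for nc-class problems).
    return (1.0 / n_classes) ** (t - 1.0)

# Points whose xi exceeds this threshold satisfy p(y|x, theta) > 1/n_classes,
# i.e., they lie on the correct side of the decision boundary.
print(xi_at_boundary(1.9))               # binary case, t = 1.9
print(xi_at_boundary(1.9, n_classes=10)) # e.g. a 10-class problem
```

Larger t (or more classes) pushes the boundary threshold lower, which is why the black line sits at a different position in each plot.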


Table 3: Test Error on Binary Datasets

Dataset         Log(L1)  tLog(L1)  Forget(L1)  Log(L2)  tLog(L2)  Forget(L2)
adult9(0.0)     14.97    15.03     14.99       14.97    14.92     15.11
adult9(0.1)     15.24    15.16     15.20       15.28    15.17     15.27
adult9(0.2)     15.47    15.21     15.16       15.40    15.28     15.28
alpha(0.0)      21.75    21.77     21.75       21.75    21.77     21.80
alpha(0.1)      21.81    21.86     21.86       21.84    21.86     21.86
alpha(0.2)      21.87    21.91     21.86       21.83    21.85     21.84
astro-ph(0.0)   0.75     1.47      1.24        1.02     1.49      1.19
astro-ph(0.1)   3.23     2.37      2.97        2.77     2.35      2.62
astro-ph(0.2)   4.77     3.33      3.32        4.22     3.17      3.74
aut-avn(0.0)    1.88     1.94      1.97        1.92     1.93      1.99
aut-avn(0.1)    4.03     2.71      2.90        3.68     2.45      2.68
aut-avn(0.2)    6.77     4.02      3.56        4.29     3.59      3.97
beta(0.0)       49.88    49.99     49.84       49.99    49.91     49.99
beta(0.1)       49.95    49.85     49.97       49.90    49.94     49.85
beta(0.2)       50.01    49.93     49.98       50.11    49.95     50.11
covertype(0.0)  23.00    22.63     22.47       23.03    22.62     22.49
covertype(0.1)  23.20    22.81     22.56       23.15    22.74     22.50
covertype(0.2)  23.32    23.08     22.76       23.24    22.95     22.59
delta(0.0)      21.54    21.55     21.54       21.54    21.54     21.54
delta(0.1)      21.51    21.56     21.55       21.52    21.55     21.53
delta(0.2)      21.68    21.66     21.65       21.68    21.67     21.66
epsilon(0.0)    10.21    10.24     10.19       10.23    10.23     10.23
epsilon(0.1)    10.23    10.33     10.25       10.45    10.38     10.33
epsilon(0.2)    10.59    10.53     10.50       10.85    10.70     10.67
fd(0.0)         2.96     3.00      2.92        2.94     2.86      2.86
fd(0.1)         3.85     3.22      3.07        3.84     2.99      2.88
fd(0.2)         4.30     3.59      3.18        4.29     3.32      2.98
gamma(0.0)      19.99    19.98     20.00       20.01    19.98     19.98
gamma(0.1)      20.12    20.09     20.08       20.11    20.08     20.11
gamma(0.2)      20.14    20.14     20.11       20.21    20.10     20.12
kdd99(0.0)      8.00     8.15      8.11        8.00     8.28      8.10
kdd99(0.1)      8.16     8.16      8.22        8.04     8.12      8.20
kdd99(0.2)      8.05     8.19      8.11        8.07     8.07      8.09
kdda(0.0)       10.50    10.58     10.65       11.56    10.50     10.49
kdda(0.1)       11.32    10.70     10.56       10.73    10.61     10.80
kdda(0.2)       10.70    10.73     10.67       10.81    10.77     10.80
kddb(0.0)       10.97    10.08     10.08       10.33    10.16     10.28
kddb(0.1)       10.38    10.31     10.32       10.25    10.34     10.22
kddb(0.2)       11.01    10.55     10.53       10.61    10.49     10.59


Table 4: Test Error on Binary Datasets (Continued)

Dataset              Log(L1)  tLog(L1)  Forget(L1)  Log(L2)  tLog(L2)  Forget(L2)
news20(0.0)          3.25     3.48      3.25        3.78     3.88      3.58
news20(0.1)          7.41     5.53      6.18        5.98     5.58      6.03
news20(0.2)          14.45    9.16      12.34       7.84     7.71      7.84
real-sim(0.0)        2.85     2.70      2.69        2.83     2.85      2.83
real-sim(0.1)        4.85     3.57      4.23        4.23     3.35      3.46
real-sim(0.2)        7.60     4.63      5.08        5.44     4.31      4.97
reuters-c11(0.0)     2.83     2.89      2.95        2.95     2.84      2.86
reuters-c11(0.1)     2.86     2.94      2.89        3.03     2.84      2.93
reuters-c11(0.2)     2.88     2.84      2.89        3.00     2.92      3.00
reuters-ccat(0.0)    7.95     7.18      7.33        7.68     7.41      7.56
reuters-ccat(0.1)    8.78     7.77      8.20        7.64     7.53      7.64
reuters-ccat(0.2)    9.52     8.50      9.53        8.30     7.95      7.93
web8(0.0)            1.12     1.05      0.87        1.11     1.01      0.99
web8(0.1)            1.29     1.15      1.05        1.29     1.21      1.12
web8(0.2)            1.30     1.23      1.23        1.30     1.21      1.23
webspamtrigram(0.0)  0.61     0.69      0.53        0.54     0.66      0.54
webspamtrigram(0.1)  1.06     0.76      0.70        0.97     0.81      0.66
webspamtrigram(0.2)  1.45     1.17      1.07        1.41     1.21      1.22
webspamunigram(0.0)  7.20     6.70      6.53        7.21     6.83      6.54
webspamunigram(0.1)  7.49     6.88      6.52        7.48     6.91      6.56
webspamunigram(0.2)  7.74     7.16      6.70        7.74     7.16      6.69
worm(0.0)            1.51     1.49      1.50        1.51     1.50      1.50
worm(0.1)            2.63     1.55      1.56        2.63     1.58      1.57
worm(0.2)            3.68     1.63      1.63        3.68     1.66      1.64
zeta(0.0)            5.69     5.82      5.61        11.76    13.33     11.61
zeta(0.1)            6.13     5.97      5.73        11.22    13.20     11.28
zeta(0.2)            6.61     6.15      5.94        10.94    12.78     11.18


Table 5: Summary of the datasets used in our experiments. n is the total # of examples, d is the # of features, nc is the # of classes, and s is the feature density (% of features that are non-zero). M denotes a million.

dataset         n        d       nc  s(%)
dna             2,586    182     3   100
mnist           70,000   782     10  100
rcv1            534,130  47,238  52  22.22
sensitcombined  98,528   102     3   100
usps            9298     258     10  100
letter          15,500   18      26  100
protein         21,516   359     3   100
sensitacoustic  98,528   52      3   100
sensitseismic   98,528   52      3   100

10. Discussion

10.1 Another Way to Optimize t-Logistic Regression

In the discussion of the optimality condition, we obtained the MLE by minimizing -\log p(y \,|\, X, \theta). For t-logistic regression, it is also convenient to minimize p(y \,|\, X, \theta)^{1-t}:

P(\theta) := p(y \,|\, X, \theta)^{1-t} = \prod_{i=1}^{m} p(y_i \,|\, x_i; \theta)^{1-t}   (60)
          = \prod_{i=1}^{m} \underbrace{1 + (1-t)\left( \theta_{y_i}^{\top} x_i - g_t(\theta \,|\, x_i) \right)}_{l_i(\theta)}   (61)

Since t > 1 and g_t(\theta \,|\, x_i) is convex, it is easy to see that each component l_i(\theta) is positive and convex. Therefore, P(\theta) is the product of a series of positive convex functions l_i(\theta). The optimal solutions to problem (61) can be obtained by solving the following parametric problem (see Theorem 2.1 of Kuno et al. (1993)):

\min_{\xi} \min_{\theta} MP(\theta, \xi) := \sum_{i=1}^{m} \xi_i \, l_i(\theta)  s.t.  \xi > 0,  \prod_{i=1}^{m} \xi_i \geq 1.   (62)

\xi-Step: Assume that \theta is fixed, and denote \tilde{l}_i = l_i(\theta) to rewrite (62) as:

\min_{\xi} MP(\theta, \xi) = \min_{\xi} \sum_{i=1}^{m} \xi_i \tilde{l}_i  s.t.  \xi > 0,  \prod_{i=1}^{m} \xi_i \geq 1.   (63)

Since the objective function is linear in \xi and the feasible region is a convex set, (63) is a convex optimization problem. By introducing a non-negative Lagrange multiplier \gamma \geq 0, the partial Lagrangian and its gradient with respect to \xi_{i'} can be written as

L(\xi, \gamma) = \sum_{i=1}^{m} \xi_i \tilde{l}_i + \gamma \left( 1 - \prod_{i=1}^{m} \xi_i \right)   (64)

\frac{\partial}{\partial \xi_{i'}} L(\xi, \gamma) = \tilde{l}_{i'} - \gamma \prod_{i \neq i'} \xi_i.   (65)


Table 6: Test Error on Multiclass Datasets

Dataset              Log(L1)       tLog(L1)      Forget(L1)    Log(L2)       tLog(L2)      Forget(L2)
dna(0.0)             5.36 ± 0.45   4.91 ± 0.89   5.85 ± 1.00   5.96 ± 1.08   6.07 ± 0.46   6.68 ± 1.03
dna(0.1)             6.79 ± 0.86   5.13 ± 0.69   6.24 ± 1.18   8.12 ± 1.13   6.90 ± 0.65   8.17 ± 1.13
dna(0.2)             8.01 ± 0.81   5.74 ± 1.28   6.79 ± 1.57   8.06 ± 0.88   6.74 ± 0.67   8.12 ± 0.93
letter(0.0)          23.07 ± 0.77  19.84 ± 1.33  27.39 ± 0.96  23.04 ± 0.79  19.78 ± 1.48  28.80 ± 1.24
letter(0.1)          24.97 ± 0.60  20.11 ± 1.06  27.73 ± 1.21  24.94 ± 0.58  20.11 ± 1.13  28.58 ± 1.49
letter(0.2)          26.71 ± 0.90  20.36 ± 1.36  28.05 ± 1.06  26.65 ± 0.88  20.29 ± 1.21  28.79 ± 1.00
mnist(0.0)           8.05 ± 0.30   7.61 ± 0.24   10.66 ± 0.16  8.00 ± 0.27   7.83 ± 0.26   11.75 ± 0.19
mnist(0.1)           9.41 ± 0.13   7.79 ± 0.13   10.78 ± 0.43  9.40 ± 0.21   7.97 ± 0.19   11.87 ± 0.26
mnist(0.2)           10.24 ± 0.25  7.73 ± 0.27   11.13 ± 0.41  10.23 ± 0.29  8.13 ± 0.27   12.10 ± 0.22
protein(0.0)         31.69 ± 0.90  31.71 ± 0.70  31.70 ± 0.80  31.64 ± 0.70  31.56 ± 0.82  31.59 ± 0.77
protein(0.1)         32.16 ± 0.77  31.89 ± 0.82  32.05 ± 0.82  32.02 ± 0.91  31.97 ± 0.97  32.24 ± 0.95
protein(0.2)         32.87 ± 0.99  32.08 ± 0.82  32.36 ± 1.02  32.32 ± 1.02  31.99 ± 0.87  32.51 ± 0.77
rcv1(0.0)            7.44 ± 0.10   7.54 ± 0.10   10.80 ± 0.17  7.12 ± 0.13   7.43 ± 0.11   32.88 ± 1.43
rcv1(0.1)            8.70 ± 0.06   7.46 ± 0.09   10.42 ± 0.43  7.87 ± 0.09   7.59 ± 0.12   32.65 ± 1.41
rcv1(0.2)            9.41 ± 0.14   7.59 ± 0.12   10.77 ± 0.51  8.43 ± 0.13   7.61 ± 0.13   30.07 ± 0.33
sensitacoustic(0.0)  31.68 ± 0.14  28.46 ± 0.14  31.72 ± 0.19  31.68 ± 0.15  28.76 ± 0.13  32.52 ± 0.17
sensitacoustic(0.1)  31.91 ± 0.17  28.85 ± 0.11  31.90 ± 0.15  31.91 ± 0.16  28.97 ± 0.08  32.32 ± 0.17
sensitacoustic(0.2)  32.12 ± 0.14  29.29 ± 0.05  32.18 ± 0.18  32.14 ± 0.13  29.33 ± 0.15  32.36 ± 0.28
sensitcombined(0.0)  19.69 ± 0.28  18.68 ± 0.19  20.40 ± 0.26  19.70 ± 0.31  18.81 ± 0.24  21.98 ± 0.40
sensitcombined(0.1)  20.37 ± 0.30  18.74 ± 0.25  20.60 ± 0.31  20.37 ± 0.29  18.83 ± 0.26  21.58 ± 0.32
sensitcombined(0.2)  20.67 ± 0.40  18.92 ± 0.24  20.76 ± 0.50  20.69 ± 0.40  19.05 ± 0.27  21.16 ± 0.35
sensitseismic(0.0)   28.54 ± 0.48  26.70 ± 0.44  29.00 ± 0.55  28.52 ± 0.48  26.92 ± 0.35  29.28 ± 0.46
sensitseismic(0.1)   30.15 ± 0.46  27.03 ± 0.39  30.05 ± 0.49  30.12 ± 0.48  27.23 ± 0.33  30.21 ± 0.48
sensitseismic(0.2)   30.65 ± 0.53  27.35 ± 0.40  30.65 ± 0.57  30.66 ± 0.52  27.66 ± 0.36  30.64 ± 0.53
usps(0.0)            5.95 ± 0.40   5.78 ± 0.48   7.62 ± 0.76   5.52 ± 0.32   5.75 ± 0.53   8.19 ± 0.63
usps(0.1)            7.13 ± 0.58   6.51 ± 0.74   7.45 ± 0.49   6.79 ± 0.53   5.92 ± 0.52   7.99 ± 0.52
usps(0.2)            7.45 ± 0.34   6.28 ± 0.50   7.60 ± 0.47   7.44 ± 0.32   5.90 ± 0.35   8.53 ± 0.58


Setting the gradient to 0 gives \gamma = \tilde{l}_{i'} \prod_{i \neq i'} \xi_i. Since \tilde{l}_{i'} > 0, it follows that \gamma cannot be 0. By the K.K.T. conditions (Boyd and Vandenberghe, 2004), we can conclude that \prod_{i=1}^{m} \xi_i = 1. This in turn implies that \gamma = \tilde{l}_{i'} \xi_{i'}, or

(\xi_1, \ldots, \xi_m) = (\gamma / \tilde{l}_1, \ldots, \gamma / \tilde{l}_m),  with  \gamma = \prod_{i=1}^{m} \tilde{l}_i^{1/m},   (66)
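As a numeric sanity check of the closed form in (66), the sketch below (our own naming) computes the \xi-step update and verifies that the product constraint is tight and that the resulting objective equals m\gamma, which by the arithmetic-geometric mean inequality is the minimum of (63):

```python
import math

def xi_step(l_tilde):
    """Closed-form xi-step of (66): xi_i = gamma / l_i with gamma = (prod l_i)^(1/m)."""
    m = len(l_tilde)
    # geometric mean computed in log-space for numerical stability
    gamma = math.exp(sum(math.log(l) for l in l_tilde) / m)
    return [gamma / l for l in l_tilde], gamma

l_tilde = [0.5, 2.0, 1.5, 4.0]                 # positive components l_i(theta)
xi, gamma = xi_step(l_tilde)

prod = math.prod(xi)                           # constraint prod xi_i >= 1 is tight: = 1
obj = sum(x * l for x, l in zip(xi, l_tilde))  # equals m * gamma at the optimum
```

Any other feasible \xi (e.g. all ones) gives an objective of \sum_i \tilde{l}_i \geq m\gamma, confirming that the closed form attains the minimum.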

where \xi_i \propto 1/\tilde{l}_i = p(y_i \,|\, x_i, \theta)^{t-1}.

\theta-Step: In this step we fix \xi > 0 and solve for the optimal \theta. This step is essentially the same as logistic regression, except that each component now carries a weight \xi_i:

\min_{\theta} MP(\theta, \xi) = \min_{\theta} \sum_{i=1}^{m} \xi_i \, l_i(\theta)   (67)

and the gradient is

\frac{\partial}{\partial \theta} MP(\theta, \xi) = (1 - t) \sum_{i=1}^{m} \xi_i \left( y_i - E_{q_i}[y] \right) x_i^{\top},

where q_i \propto p_i^t and p_i = p(y_i \,|\, x_i, \theta).

10.2 Multimodality

One of the key disadvantages of non-convex losses is that they can introduce multiple local minima, which makes it difficult to find the global optimum. Even when the non-convex losses are quasi-convex, it is not immediately clear whether the empirical risk of a dataset (x_1, y_1), \ldots, (x_m, y_m), which is the sum of the losses over these samples, is multimodal or not. In the following, we show that for any non-convex loss function satisfying some mild conditions, one can always construct a dataset whose empirical risk is multimodal. The following theorem is proved for a one-dimensional feature space; the generalization to multiple dimensions is straightforward.

Theorem 10.1  Assume that (x, y) \in (\mathbb{R}, \pm 1), define the margin u = \theta \cdot x \cdot y, and consider a loss function L(\theta; x, y) = L(u) which is smooth at u = 0. If dL(u)/du |_{u=0} = L'(0) = -z_0 < 0, and there exist u_1 < 0 and u_2 > 0 where L'(u_i) > -z_0 for i = 1, 2, then there exists a set of data points x = \{x_1, \ldots, x_{n+1}\} such that the empirical risk \sum_{i=1}^{n+1} L(\theta; x_i, y_i), as a function of \theta, has at least two local minima.

Proof  First, there exists z such that

L'(u_1) \geq -z  and  L'(u_2) \geq -z,  where z < z_0.

(68)

Because the loss function is smooth, there exists \delta such that

L'(u) < -(z + z_0)/2  for  u \in (-\delta, \delta).   (69)

Define

U = \max \{-u_1, u_2\}.   (70)


We construct a set of data points consisting of x_{1,\ldots,n} = 1 and x_{n+1} = x, where

x = -U/\delta   (71)
n = -(z + z_0) x / (2 z_0).   (72)

The gradient of the empirical risk is

H(\theta) = \frac{d}{d\theta} \left( \sum_{i=1}^{n} L(\theta x_i) + L(\theta x_{n+1}) \right)   (73)
          = n L'(\theta) + x L'(\theta x)
          = n \left( L'(\theta) + \frac{x}{n} L'(\theta x) \right)
          = n \left( L'(\theta) - \frac{2 z_0}{z + z_0} L'(\theta x) \right).

When \theta = 0,

H(\theta) = n \left( L'(0) - \frac{2 z_0}{z + z_0} L'(0) \right) = \frac{z - z_0}{z + z_0} \, n L'(0) > 0;

when \theta = \delta u_1 / U,

H(\theta) = n \left( L'(\delta u_1 / U) - \frac{2 z_0}{z + z_0} L'(u_1) \right) < n \left( -\frac{z + z_0}{2} + \frac{2 z_0 z}{z + z_0} \right) = -\frac{n (z - z_0)^2}{2 (z + z_0)} < 0.
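To make the sign argument concrete, the following sketch (our own construction, not from the paper) instantiates the proof with the smooth non-convex loss L(u) = 1 - tanh(u), for which L'(0) = -1 (so z_0 = 1), and checks numerically that the empirical-risk gradient H(\theta) is positive at \theta = 0 and negative at \theta = \delta u_1 / U (treating n as a real-valued multiplicity for simplicity):

```python
import math

def L_prime(u):
    # derivative of the loss L(u) = 1 - tanh(u): L'(u) = -sech(u)^2
    return -1.0 / math.cosh(u) ** 2

z0 = 1.0                      # -L'(0)
u1, u2 = -2.0, 2.0            # points where L'(u_i) > -z0
z = 0.1                       # L'(u1), L'(u2) >= -z, and z < z0
delta = 0.8                   # L'(u) < -(z + z0)/2 for |u| < delta
U = max(-u1, u2)
x = -U / delta                # construction (71)
n = -(z + z0) * x / (2 * z0)  # construction (72); real-valued multiplicity

def H(theta):
    # gradient of the empirical risk: n points at 1 plus one point at x
    return n * L_prime(theta) + x * L_prime(theta * x)

print(H(0.0), H(delta * u1 / U))  # positive, then negative
```

The sign change of H between \delta u_1 / U and 0 confirms a stationary point of the empirical risk between them, as the proof asserts.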