Ranking observations with latent information and binary feedback

Nicolas Vayatis
École Normale Supérieure de Cachan

Workshop on The Mathematics of Ranking at AIM - Palo Alto, August 2010

Outline

1. Statistical Issues in Machine Learning
2. Prediction of Preferences
3. Other Criteria for Ranking Error

Statistical issues in Machine Learning

Generalization ability of decision rules

- Class G of candidate decision rules
- Risk functional L, the "objective" criterion
- Past data D_n with sample size n
- Method/algorithm outputs an empirical estimate \hat{g}_n ∈ G

Main questions:

- Strong Bayes-risk consistency: L(\hat{g}_n) \to L^* = \inf_g L(g) almost surely as n \to \infty ?
- Rate of this convergence?

An example - Binary classification with i.i.d. data

Data D_n = {(X_i, Y_i) : i = 1, ..., n}, i.i.d. copies of (X, Y) ∈ X × {−1, +1}.

Empirical Risk Minimization principle:

    \hat{L}_n(g) := \frac{1}{n} \sum_{i=1}^n I\{g(X_i) \neq Y_i\}, \qquad \hat{g}_n = \arg\min_{g \in G} \hat{L}_n(g)

First-order analysis: with probability at least 1 − δ,

    L(\hat{g}_n) - \inf_{g \in G} L(g) \le 2\, E\Big\{ \sup_{g \in G} |\hat{L}_n(g) - L(g)| \Big\} + c \sqrt{\frac{\log(1/\delta)}{n}}

Tools: empirical process techniques, concentration inequalities.
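Not part of the original slides: a minimal numerical sketch of the ERM principle above, assuming a toy class G of one-dimensional threshold classifiers g_t(x) = sign(x − t); the data and the class are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = np.where(X + 0.5 * rng.normal(size=n) > 0, 1, -1)   # noisy labels in {-1, +1}

thresholds = np.linspace(-2.0, 2.0, 81)                  # candidate class G (toy)

def empirical_risk(t):
    """L_n(g_t) = (1/n) sum_i I{ g_t(X_i) != Y_i } for g_t(x) = sign(x - t)."""
    pred = np.where(X > t, 1, -1)
    return np.mean(pred != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]                     # empirical risk minimizer
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```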

Complexity Control

Vapnik-Chervonenkis inequality:

    E\Big\{ \sup_{g \in G} |\hat{L}_n(g) - L(g)| \Big\} \le c \sqrt{\frac{V}{n}}

where V is the VC dimension of the class G.

Rademacher average:

    R_n(G) = \frac{1}{n}\, E\Big\{ \sup_{g \in G} \sum_{i=1}^n \epsilon_i\, I\{Y_i \neq g(X_i)\} \Big\}

where \epsilon_1, ..., \epsilon_n are i.i.d. sign variables.
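Below, an illustrative Monte Carlo estimate of the empirical Rademacher average R_n(G) for a small assumed finite class of threshold classifiers; this sketch is not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=n)
Y = np.where(X > 0, 1, -1)
thresholds = np.linspace(-2.0, 2.0, 81)

# Loss indicators I{Y_i != g_t(X_i)} for every g_t in the finite class, shape (|G|, n)
losses = np.stack([(np.where(X > t, 1, -1) != Y).astype(float) for t in thresholds])

def rademacher_average(losses, n_mc=2000):
    """Monte Carlo estimate of R_n(G) = (1/n) E sup_g sum_i eps_i I{Y_i != g(X_i)}."""
    n = losses.shape[1]
    total = 0.0
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)   # i.i.d. sign (Rademacher) variables
        total += np.max(losses @ eps) / n       # sup over the finite class
    return total / n_mc

print(f"Estimated R_n(G): {rademacher_average(losses):.4f}")
```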

Variance control

Second-order analysis: Talagrand's inequality

    \sup_{f \in F} \big( P(f) - P_n(f) \big) \le 2\, E\Big\{ \sup_{f \in F} \big( P(f) - P_n(f) \big) \Big\} + \sqrt{\frac{2 \big( \sup_{f \in F} \mathrm{Var}(f) \big) \log(1/\delta)}{n}} + c\, \frac{\log(1/\delta)}{n}

Variance control assumption:

    \mathrm{Var}(f) \le C\, \big( L(g) - L^* \big)^{\alpha} \quad \forall g, \text{ with } \alpha \in (0, 1].

Fast rates of convergence: excess risk in n^{-1/(2-\alpha)}.

Prediction of Preferences

Joint work with Stéphan Clémençon (Telecom ParisTech) and Gábor Lugosi (Pompeu Fabra)

Setup

- (X, Y) random pair with unknown distribution P over X × R
- (X, Y), (X', Y') i.i.d.; Y, Y' may not be observed
- Preference label R = R(Y, Y') ∈ R, with R(Y, Y') = −R(Y', Y)
- R > 0 means "X is better than X'"
- Decision rule: r : X × X → {−1, 0, 1}
- Prediction error = classification error on pairs of observations:

    L(r) = P\{ R \cdot r(X, X') < 0 \}

Same as before?

Empirical Ranking Risk Minimization

Latent data D_n = {(X_i, Y_i) : i = 1, ..., n} i.i.d.
Observed data: {(X_i, X_j, R_{i,j}) : i, j = 1, ..., n}, with R_{i,j} = R(Y_i, Y_j).

Empirical criterion for ranking:

    L_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} I\{ R_{i,j} \cdot r(X_i, X_j) < 0 \}

General definition of a U-statistic (fixed f):

    U_n(f) = \frac{1}{n(n-1)} \sum_{i \neq j} f(Z_i, Z_j)

where Z_1, ..., Z_n are i.i.d.
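As an illustration (not from the slides), a sketch of the empirical ranking risk L_n(r) for the pairwise rule r(x, x') = sign(s(x) − s(x')) induced by an assumed scoring function s; the data are simulated placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=n)
Y = X + 0.3 * rng.normal(size=n)              # latent "utilities" Y_i

def empirical_ranking_risk(X, Y, s):
    """L_n(r) = (1/(n(n-1))) sum_{i != j} I{ R_ij * r(X_i, X_j) < 0 }, R_ij = sign(Y_i - Y_j)."""
    n = len(X)
    S = s(X)
    R = np.sign(Y[:, None] - Y[None, :])      # observed preference labels R_ij
    r = np.sign(S[:, None] - S[None, :])      # ranking rule induced by the scores
    mistakes = (R * r) < 0
    np.fill_diagonal(mistakes, False)         # exclude i == j
    return mistakes.sum() / (n * (n - 1))

print(f"Empirical ranking risk: {empirical_ranking_risk(X, Y, lambda x: x):.3f}")
```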

Structure of U-Statistics - First representation

Assume f symmetric. Average of "sums of i.i.d." blocks:

    U_n(f) = \frac{1}{n!} \sum_{\pi} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} f\big( Z_{\pi(i)}, Z_{\pi(\lfloor n/2 \rfloor + i)} \big)

where the sum is over all permutations π of {1, ..., n}.

Lemma. Let ψ be convex increasing and F a class of functions. Then:

    E\,\psi\Big( \sup_{f \in F} U_n(f) \Big) \le E\,\psi\Big( \sup_{f \in F} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} f\big( Z_{\pi(i)}, Z_{\pi(\lfloor n/2 \rfloor + i)} \big) \Big)

Consequences of the first representation

- Back to classification with ⌊n/2⌋ i.i.d. pairs
- Enough for first-order analysis (including ERM and CRM)
- Overestimates the variance: noise assumption too restrictive!
- No fast rates in the general case!

Structure of U-Statistics - Second representation

Hoeffding's decomposition:

    U_n(f) = E(U_n(f)) + 2\, T_n(f) + W_n(f)

with

- T_n(f) = \frac{1}{n} \sum_{i=1}^n h(Z_i)  (empirical average of i.i.d. terms), where h(z) = E f(Z_1, z) - E(U_n(f))
- W_n(f) = degenerate U-statistic (remainder term)

A degenerate U-statistic W_n with kernel \tilde{h} is such that:

    E\big( \tilde{h}(Z_1, Z_2) \mid Z_1 \big) = 0 \quad \text{a.s.}

Remark: here we need to observe the individual labels Y, Y'!
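To make the decomposition concrete, here is a small numeric sketch (not in the original slides) for the toy kernel f(z, z') = I{z + z' > 1} with Z ~ Uniform(0, 1), for which θ = E U_n(f) = 1/2 and h(z) = z − 1/2 are known in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
Z = rng.uniform(size=n)

# Full U-statistic U_n(f) with kernel f(z, z') = I{z + z' > 1}
F = (Z[:, None] + Z[None, :] > 1).astype(float)
np.fill_diagonal(F, 0.0)
U_n = F.sum() / (n * (n - 1))

theta = 0.5                          # E U_n(f) = P{Z + Z' > 1}
T_n = np.mean(Z - 0.5)               # empirical average of h(Z_i) = Z_i - 1/2
W_n = U_n - theta - 2 * T_n          # degenerate remainder term

print(f"U_n = {U_n:.5f}, 2*T_n = {2 * T_n:.5f}, W_n = {W_n:.6f}")
# Typically |W_n| (order 1/n) is much smaller than |T_n| (order 1/sqrt(n)).
```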

Insights for rates-of-convergence results

Leading term T_n is an empirical process:
- handled by Talagrand's concentration inequality
- involves "standard" complexity measures
⇒ variance control involves the function h

Exponential inequality for degenerate U-processes:
- VC classes: exponential inequality by Arcones and Giné (AoP, 1993)
- general case: a new moment inequality
⇒ additional complexity measures

Fast Rates - Notations

Kernel:

    q_r((x, y), (x', y')) = I\{(y - y') \cdot r(x, x') < 0\} - I\{(y - y') \cdot r^*(x, x') < 0\}

U-process indexed by ranking rules r ∈ R:

    \Lambda_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} q_r\big( (X_i, Y_i), (X_j, Y_j) \big)

Excess risk: \Lambda(r) = L(r) - L^* = E\{ q_r((X, Y), (X', Y')) \}

Key quantity: h_r(x, y) = E\{ q_r((x, y), (X', Y')) \} - \Lambda(r)

Result on Fast Rates - VC Case

Assume:
- the class R of ranking rules has finite VC dimension V
- for all r ∈ R,

    \mathrm{Var}(h_r(X, Y)) \le c\, \big( L(r) - L^* \big)^{\alpha}    (V)

with some constants c > 0 and α ∈ [0, 1].

Then, with probability larger than 1 − δ:

    L(r_n) - L^* \le 2 \Big( \inf_{r \in R} L(r) - L^* \Big) + C \left( \frac{V \log(n/\delta)}{n} \right)^{1/(2-\alpha)}

Comments

Question: what is a sufficient condition for Assumption (V),

    \mathrm{Var}(h_r(X, Y)) \le c\, \big( L(r) - L^* \big)^{\alpha} \quad \forall r \in R ?

Goal: formulate noise assumptions on the regression function E\{Y \mid X = x\}.

Example 1 - Bipartite Ranking

Binary labels Y, Y' ∈ {−1, +1}. Posterior probability: η(x) = P{Y = +1 | X = x}.

Noise Assumption (NA): there exist constants c > 0 and α ∈ [0, 1] such that

    E\big( |\eta(x) - \eta(X)|^{-\alpha} \big) \le c \quad \forall x \in X.

Sufficient condition for (NA) with α < 1: η(X) is absolutely continuous on [0, 1] with bounded density.

Example 2 - Regression Data

    Y = m(X) + \sigma(X) \cdot N, \quad \text{where } N \sim \mathcal{N}(0, 1),\ E(N \mid X) = 0

Key quantity:

    \Delta(X, X') = \frac{m(X) - m(X')}{\sqrt{\sigma^2(X) + \sigma^2(X')}}

Noise Assumption (NA): there exist constants c > 0 and α ∈ [0, 1] such that

    E\big( |\Delta(x, X)|^{-\alpha} \big) \le c \quad \forall x \in X.

Sufficient condition for (NA) with α < 1: m(X) has a bounded density and σ(X) is bounded over X.

Remainder Term

Degenerate U-process: consider a class F of degenerate kernels, and

    \tilde{W}_n = \sup_{f \in F} \Big| \sum_{i,j} f(Z_i, Z_j) \Big|

Additional complexity measures, with \epsilon_1, ..., \epsilon_n i.i.d. Rademacher random variables:

    Z = \sup_{f \in F} \Big| \sum_{i,j} \epsilon_i \epsilon_j\, f(Z_i, Z_j) \Big|

    U = \sup_{f \in F} \sup_{\alpha : \|\alpha\|_2 \le 1} \sum_{i,j} \epsilon_i\, \alpha_j\, f(Z_i, Z_j)

    M = \sup_{f \in F} \max_{k=1,\dots,n} \Big| \sum_{i=1}^n \epsilon_i\, f(Z_i, Z_k) \Big|

Moment Inequality

Theorem. If \tilde{W}_n is a degenerate U-process, then there exists a universal constant C > 0 such that for all n and q ≥ 2,

    \big( E \tilde{W}_n^q \big)^{1/q} \le C \Big( E Z + q^{1/2} E U + q\,(E M + n) + q^{3/2} n^{1/2} + q^2 \Big)

Main tools: symmetrization, decoupling and concentration inequalities.

Related work: Adamczak (AoP, 2006), Arcones and Giné (AoP, 1993), Giné, Latała and Zinn (HDP II, 2000), Houdré and Reynaud-Bouret (SIA, 2003), Major (PTRF, 2006).

Control of the Degenerate Part

Corollary. With probability 1 − δ,

    \tilde{W}_n \le C \left( \frac{E Z}{n^2} + \frac{E U \sqrt{\log(1/\delta)}}{n^2} + \frac{E M \log(1/\delta)}{n^2} + \frac{\log(1/\delta)}{n} \right)

Special case: F is a VC class:

    E Z \le C n V, \qquad E U \le C n \sqrt{V}, \qquad E M \le C \sqrt{V n}

Hence, with probability 1 − δ,

    \tilde{W}_n \le \frac{1}{n} \big( V + \log(1/\delta) \big)

Other Criteria for Ranking Error

AUC and beyond - Focus on the top of the list

Joint work with Stéphan Clémençon (Telecom ParisTech)

Global performance measures: ROC Curve

For a given scoring rule s : X → R and threshold t ∈ R:

- True positive rate: β_s(t) = P{s(X) ≥ t | Y = +1}
- False positive rate: α_s(t) = P{s(X) ≥ t | Y = −1}
- ROC curve: (s, t) ↦ (α_s(t), β_s(t)), plus a continuous extension
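A plug-in sketch (illustrative, not from the slides) of the empirical ROC curve of a scoring rule: for each observed threshold t we estimate β_s(t) and α_s(t) from simulated scores and labels.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
Y = rng.choice([-1, 1], size=n)
scores = np.where(Y == 1, 1.0, 0.0) + rng.normal(size=n)   # illustrative scores s(X_i)

def empirical_roc(scores, Y):
    """Return (alpha, beta): false/true positive rates over all observed thresholds."""
    thresholds = np.sort(np.unique(scores))[::-1]
    pos, neg = scores[Y == 1], scores[Y == -1]
    beta = np.array([(pos >= t).mean() for t in thresholds])    # beta_s(t), TPR
    alpha = np.array([(neg >= t).mean() for t in thresholds])   # alpha_s(t), FPR
    return alpha, beta

alpha, beta = empirical_roc(scores, Y)
print(f"{len(alpha)} ROC points, from ({alpha[0]:.2f}, {beta[0]:.2f}) to ({alpha[-1]:.2f}, {beta[-1]:.2f})")
```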

Optimality, Metrics for ROC Curves

By the Neyman-Pearson lemma, optimal scoring rules are in

    S^* = \{ T \circ \eta : T \text{ strictly increasing} \}

Optimal ROC curve: α ∈ [0, 1] ↦ ROC^*(α) = β_η ∘ α_η^{-1}(α)

L_1 metric on ROC curves:

    d_1(s, \eta) = \int_0^1 \big( \mathrm{ROC}^*(\alpha) - \mathrm{ROC}_s(\alpha) \big)\, d\alpha = \mathrm{AUC}(\eta) - \mathrm{AUC}(s)

What about stronger metrics?

    d_\infty(s, \eta) = \sup_{\alpha \in [0,1]} \big( \mathrm{ROC}^*(\alpha) - \mathrm{ROC}_s(\alpha) \big)

Connection to the AUC criterion

Consider a real-valued scoring rule s : X → R and (X, Y), (X', Y') i.i.d. copies:

    \mathrm{AUC}(s) = \int_0^1 \mathrm{ROC}_s(\alpha)\, d\alpha = P\{ s(X) \ge s(X') \mid Y > Y' \}

Ranking rule: r(X, X') = 2\, I\{s(X) > s(X')\} - 1

Ranking error and AUC, with p = P{Y = +1}:

    \mathrm{AUC}(s) = 1 - \frac{L(r)}{2 p (1 - p)}

Maximization of AUC = minimization of the ranking error.
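A quick numerical check of this connection (illustrative data, not part of the slides): the empirical AUC over positive/negative pairs and the quantity 1 − L_n(r)/(2p(1−p)) computed from the pairwise ranking error agree up to O(1/n) terms.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
Y = rng.choice([-1, 1], size=n, p=[0.6, 0.4])
S = np.where(Y == 1, 0.8, 0.0) + rng.normal(size=n)     # illustrative scores s(X_i)

# Empirical AUC over positive/negative pairs
pos, neg = S[Y == 1], S[Y == -1]
auc = (pos[:, None] > neg[None, :]).mean()

# Empirical pairwise ranking error L_n(r) for r(X, X') = 2 I{s(X) > s(X')} - 1
R = np.sign(Y[:, None] - Y[None, :])                    # pair labels sign(Y_i - Y_j)
r = np.sign(S[:, None] - S[None, :])                    # ranking rule on pairs
err = (R * r) < 0
np.fill_diagonal(err, False)
L_r = err.sum() / (n * (n - 1))

p_hat = (Y == 1).mean()
print(f"AUC(s)                      = {auc:.4f}")
print(f"1 - L_n(r) / (2 p (1 - p))  = {1 - L_r / (2 * p_hat * (1 - p_hat)):.4f}")
```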

Beyond the AUC - Truncation of the ROC curve

Focus on the "best" instances

Question: where to cut off the ROC curve?

Constraint: fix u ∈ (0, 1) to be the rate of "best" X's.

Best instances according to scoring function s at rate u:

    C_{s,u} = \{ x \in X \mid s(x) > Q(s, u) \}

where Q(s, u) is the (1 − u)-quantile of s(X).

- Mass constraint property: μ(C_{s,u}) = P{X ∈ C_{s,u}} = u
- Invariance property: if T is nondecreasing, then C_{T∘s, u} = C_{s,u}
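A small sketch (assumed toy scores, not from the slides) of the plug-in best-instance set C_{s,u} built from the empirical (1 − u)-quantile, checking the mass constraint and the invariance under an increasing transform.

```python
import numpy as np

rng = np.random.default_rng(6)
scores = rng.normal(size=1000)               # illustrative scores s(X_i)
u = 0.1                                      # target rate of "best" instances

q_hat = np.quantile(scores, 1 - u)           # empirical (1 - u)-quantile Q(s, u)
best = scores > q_hat                        # membership in C_{s,u}
print(f"Q(s, u) = {q_hat:.3f}, selected fraction = {best.mean():.3f}")   # close to u

# Invariance: an increasing transform T of s selects the same instances
best_T = np.exp(scores) > np.quantile(np.exp(scores), 1 - u)
print(f"Same set under T o s: {np.array_equal(best, best_T)}")
```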

Reparameterization of the ROC curve

- True positive rate at level u: β(s, u) = P{s(X) ≥ Q(s, u) | Y = +1}
- False positive rate at level u: α(s, u) = P{s(X) ≥ Q(s, u) | Y = −1}
- Control line at level u: u = p β(s, u) + (1 − p) α(s, u), with p = P{Y = +1}

Partial AUC

[Figure: ROC curve and partial AUC; x-axis: false positive rate α, y-axis: true positive rate β.]

Definition (Partial AUC). For a scoring function s and a rate u of best instances:

    \mathrm{PartAUC}(s, u) = \int_0^{\alpha(s, u)} \beta(s, t)\, dt

Partial AUC is not consistent!

[Figure: ROC curve and partial AUC; x-axis: false positive rate α, y-axis: true positive rate β.]

For any scoring function s and all u ∈ (0, 1):

    \beta(s, u) \le \beta(\eta, u), \qquad \alpha(s, u) \ge \alpha(\eta, u)

Correction - Local AUC

Local AUC vs. Partial AUC

Set u ∈ (0, 1). For any scoring function s:

    \mathrm{LocAUC}(s, u) = \mathrm{PartAUC}(s, u) + \beta(s, u)\,\big( 1 - \alpha(s, u) \big)

Double goal:
- Find the best instances: C_u^* = \{ x \in X \mid \eta(x) > Q(\eta, u) \}
- Rank them with a scoring function
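For illustration (not part of the slides), a plug-in estimate of LocAUC(s, u): a trapezoidal partial area under the empirical ROC curve up to α(s, u), plus the correction term β(s, u)(1 − α(s, u)); the data and the estimators are simple placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
Y = rng.choice([-1, 1], size=n)
S = np.where(Y == 1, 1.0, 0.0) + rng.normal(size=n)    # illustrative scores s(X_i)

def local_auc(S, Y, u):
    q = np.quantile(S, 1 - u)                          # empirical quantile Q(s, u)
    pos, neg = S[Y == 1], S[Y == -1]
    beta_u = (pos >= q).mean()                         # beta(s, u)
    alpha_u = (neg >= q).mean()                        # alpha(s, u)
    # Empirical ROC restricted to thresholds above q, then a trapezoidal area
    thresholds = np.sort(S[S >= q])[::-1]
    alpha = np.array([(neg >= t).mean() for t in thresholds])
    beta = np.array([(pos >= t).mean() for t in thresholds])
    part_auc = np.sum(np.diff(alpha) * (beta[1:] + beta[:-1]) / 2)
    return part_auc + beta_u * (1 - alpha_u)

print(f"LocAUC(s, 0.2) = {local_auc(S, Y, 0.2):.3f}")
```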

A Subproblem - ERM for Finding the Best Instances

Take sets of the form C_{s,u} = {x ∈ X | s(x) > Q(s, u)}, where s is a positive real-valued scoring function.

Empirical risk:

    \hat{L}_n(s) = \frac{1}{n} \sum_{i=1}^n I\big\{ Y_i \cdot \big( s(X_i) - \hat{Q}(s, u) \big) < 0 \big\}

Conditions for consistency and (fast) rates:
- behavior of η around Q(η, u)
- class of scoring functions neither too flat nor too steep

Result: fastest rate in n^{-2/3}.
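A sketch (not from the slides) of this empirical risk with a plug-in empirical quantile Q̂(s, u), evaluated over a small assumed family of candidate scoring functions.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
X = rng.normal(size=(n, 2))
eta = 1 / (1 + np.exp(-3 * X[:, 0]))                   # eta depends on the first coordinate
Y = np.where(rng.uniform(size=n) < eta, 1, -1)
u = 0.25

def erm_risk(s_values, Y, u):
    """(1/n) sum_i I{ Y_i * (s(X_i) - Q_hat(s, u)) < 0 }."""
    q_hat = np.quantile(s_values, 1 - u)
    return np.mean(Y * (s_values - q_hat) < 0)

candidates = {
    "s(x) = x1":      X[:, 0],
    "s(x) = x2":      X[:, 1],
    "s(x) = x1 + x2": X[:, 0] + X[:, 1],
}
for name, s_vals in candidates.items():
    print(f"{name:15s} -> empirical risk {erm_risk(s_vals, Y, u):.3f}")
```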

Typical Scoring Functions

Signed rank statistics

Take Z_1, ..., Z_n i.i.d., Φ : [0, 1] → [0, 1] a score generating function, and R_i^+ = rank(|Z_i|).

Definition. The statistic

    \sum_{i=1}^n \Phi\!\left( \frac{R_i^+}{n+1} \right) \mathrm{sgn}(Z_i)

is a linear signed rank statistic.
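An illustrative computation (not in the original slides) of a linear signed rank statistic; with Φ the identity it is closely related to the classical Wilcoxon signed rank statistic.

```python
import numpy as np

rng = np.random.default_rng(9)
Z = rng.normal(loc=0.2, size=50)           # illustrative sample

def signed_rank_statistic(Z, phi=lambda x: x):
    n = len(Z)
    ranks = np.argsort(np.argsort(np.abs(Z))) + 1      # R_i^+ = rank of |Z_i|, in 1..n
    return np.sum(phi(ranks / (n + 1)) * np.sign(Z))

print(f"Signed rank statistic: {signed_rank_statistic(Z):.3f}")
```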

Structure of the empirical risk

Notations:

    K(s, u) = E\big( Y\, I\{ s(X) \le Q(s, u) \} \big)

    \hat{K}_n(s, u) = \frac{1}{n} \sum_{i=1}^n Y_i\, I\{ s(X_i) \le \hat{Q}(s, u) \}

We have:

    L(s) = 1 - p + K(s, u), \qquad \hat{L}_n(s) = \frac{n_-}{n} + \hat{K}_n(s, u), \quad \text{where } n_- = \sum_{i=1}^n I\{Y_i = -1\}

Observe: set Z_i = Y_i\, s(X_i). For fixed s and u, the statistic \hat{K}_n(s, u) is a linear signed rank statistic.

Hoeffding-type decomposition

Notations:

    Z_n(s, u) = \frac{1}{n} \sum_{i=1}^n \big( Y_i - K'(s, u) \big)\, I\{ s(X_i) \le Q(s, u) \} - K(s, u) + u\, K'(s, u)

where K'(s, u) = \partial_u K(s, u).

Proposition. For all s and u ∈ [0, 1]:

    \hat{K}_n(s, u) = K(s, u) + Z_n(s, u) + \Lambda_n(s), \quad \text{with } \Lambda_n(s) = O_P(n^{-1}) \text{ as } n \to \infty.

General ROC summaries

Score generating function Φ : [0, 1] → [0, 1], increasing.

Empirical performance functional:

    W_n(s) = \sum_{i=1}^n I\{Y_i = 1\}\, \Phi\!\left( \frac{\mathrm{rank}(s(X_i))}{n+1} \right)

Choices of Φ:
- Φ(x) = x ⇒ AUC
- Φ(x) = x I{x ≥ 1 − u} ⇒ Local AUC
- Φ(x) = c(x) I{x ≥ k/(n+1)} ⇒ DCG
- smooth Φ's

Most ranking criteria are conditional linear rank statistics.
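To illustrate (not from the slides), a sketch of W_n(s) for several choices of Φ mirroring the list above; the DCG-style discount below is an assumed placeholder, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
Y = rng.choice([-1, 1], size=n)
S = np.where(Y == 1, 1.0, 0.0) + rng.normal(size=n)   # illustrative scores s(X_i)

def W_n(S, Y, phi):
    n = len(S)
    ranks = np.argsort(np.argsort(S)) + 1             # rank(s(X_i)) in 1..n
    return np.sum((Y == 1) * phi(ranks / (n + 1)))

u = 0.2
phis = {
    "AUC-type (Phi(x) = x)":                   lambda x: x,
    "Local AUC-type (Phi(x) = x I{x >= 1-u})": lambda x: x * (x >= 1 - u),
    "DCG-type (assumed log discount)":         lambda x: 1 / np.log2(2 + (1 - x) * n),
}
for name, phi in phis.items():
    print(f"{name:42s} W_n = {W_n(S, Y, phi):.1f}")
```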

This talk:
- Statistical theory for learning summaries of the optimal ROC curve
- Analysis of higher-order statistics
- Orthogonal decompositions and control of the remainder term
- Generic form for risk criteria in ranking

Not in this talk:
- Approximation and estimation schemes for the optimal ROC curve
- Design of scoring/ranking algorithms based on decision trees
- Aggregation of ranking trees involves rank aggregation techniques
- Application to multivariate homogeneity tests
- R implementation of TreeRank available!

Today, Yves Meyer received the Carl Friedrich Gauss Prize in Hyderabad.

References

S. Clémençon, M. Depecker, and N. Vayatis (2010). Adaptive partitioning schemes for bipartite ranking. Machine Learning Journal. To appear.

S. Clémençon and N. Vayatis (2010). Overlaying classifiers: a practical approach for optimal scoring. Constructive Approximation. To appear.

S. Clémençon, M. Depecker, and N. Vayatis (2009). AUC maximization and the two-sample problem. Proceedings of NIPS'09, Advances in Neural Information Processing Systems 22, pp. 360-368, MIT Press.

S. Clémençon and N. Vayatis (2009). Tree-based ranking methods. IEEE Transactions on Information Theory.

S. Clémençon and N. Vayatis (2008). Empirical performance maximization for linear rank statistics. Proceedings of NIPS'08, MIT Press.

S. Clémençon, G. Lugosi, and N. Vayatis (2008). Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2):844-874.

S. Clémençon and N. Vayatis (2007). Ranking the best instances. Journal of Machine Learning Research, 8(Dec):2671-2699.
