Transductive Learning with Multi-class Volume Approximation

Gang Niu (NIUGANG@BAIDU.COM)
Tokyo Institute of Technology, Tokyo, 152-8552, Japan / Baidu Inc., Beijing, 100085, China

Bo Dai (BODAI@GATECH.EDU)
Georgia Institute of Technology, Atlanta, GA 30332, USA

Marthinus Christoffel du Plessis (CHRISTO@SG.CS.TITECH.AC.JP)
Masashi Sugiyama (SUGI@CS.TITECH.AC.JP)
Tokyo Institute of Technology, Tokyo, 152-8552, Japan

Given a hypothesis space, the large volume principle by Vladimir Vapnik prioritizes equivalence classes according to their volume in the hypothesis space. The volume approximation has hitherto been successfully applied to binary learning problems. In this paper, we propose a novel generalization to multiple classes, allowing applications of the large volume principle to more learning problems, such as multi-class, multi-label and serendipitous learning, in a transductive manner. Although the resultant learning method involves a non-convex optimization problem, the globally optimal solution is almost surely unique and can be obtained in O(n³) time. Novel theoretical analyses are presented for the proposed method, and experimental results show that it compares favorably with the one-vs-rest extension.

1. Introduction

The history of the large volume principle (LVP) goes back to the early days of statistical learning theory, when Vapnik (1982) introduced it for the case of hyperplanes. But it did not gain much attention until a creative approximation was proposed by El-Yaniv et al. (2008) to implement LVP for the case of soft response vectors. Since then, it has been successfully applied to various binary learning problems, such as binary transductive learning (El-Yaniv et al., 2008), binary clustering (Niu et al., 2013a), and outlier detection (Li & Ng, 2013).

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32.
Copyright 2014 by the author(s).

Figure 1. The large volume principle and its approximation.

LVP is a learning-theoretic principle which views learning as hypothesis selection from a certain hypothesis space H. Regardless of the form of the hypotheses, H can always be partitioned into a finite number of equivalence classes on some observed data set, where each equivalence class is a set of hypotheses that generate the same labeling of the observed data. LVP, as one of the learning-theoretic principles of statistical learning theory, prioritizes these equivalence classes according to the volume they occupy in H. See the illustration in Figure 1: the blue ellipse represents H, which is partitioned into C₁, . . . , C₄, each occupying the intersection of H with one quadrant of the Cartesian coordinate system R². LVP claims that C₁ and C₃ are preferable to C₂ and C₄, since C₁ and C₃ have larger volume than C₂ and C₄.

In practice, the hypothesis space H cannot be as simple as the one in Figure 1. It is often located in a very high-dimensional space where exact or even approximate volume estimation is challenging. Therefore, El-Yaniv et al. (2008) proposed a volume approximation to bypass the volume estimation. Instead of focusing on the equivalence classes of H, it directly focuses on the hypotheses in H, since learning is regarded as hypothesis selection in LVP. It defines H via an ellipsoid, measures the angles from hypotheses to the principal axes of H, and then prefers hypotheses near the long principal axes to those near the short ones. This is reasonable, since the long principal axes of H lie in large-volume regions. In Figure 1, h and h′ are two hypotheses, and v₁/v₂ is the long/short principal axis; LVP advocates that h is preferable to h′, as h is close to v₁ and h′ is close to v₂.
We can adopt this volume approximation to regularize our loss function, which has been demonstrated to be helpful for various binary learning problems. Nevertheless, despite its potential advantages, the volume approximation of El-Yaniv et al. (2008) only fits binary learning problem settings. In this paper we extend it naturally to a more general definition, which can be applied to several transductive problem settings, including but not limited to multi-class learning (Zhou et al., 2003), multi-label learning (Kong et al., 2013), and serendipitous learning (Zhang et al., 2011). We adopt the same strategy as El-Yaniv et al. (2008): for n data and c labels, a hypothesis space is defined in R^{n×c} and linked to an ellipsoid in R^{nc}, such that the equivalence classes and the volume approximation can be defined accordingly.

We name the learning method that realizes the above approximation multi-class approximate volume regularization (MAVR). It involves a non-convex optimization problem, but the globally optimal solution is almost surely unique and accessible in O(n³) time following Forsythe & Golub (1965). Moreover, we theoretically provide novel stability and error analyses for MAVR, and experimentally show that MAVR compares favorably with the one-vs-rest extension of El-Yaniv et al. (2008).

The rest of this paper is organized as follows. In Section 2 the problem settings are discussed. In Section 3 the binary volume approximation is reviewed and the multi-class volume approximation is derived. The proposed method MAVR is then developed and analyzed in Section 4. Finally, the experimental results are reported in Section 5.

2. Transductive Problem Settings

Recall the setting of transductive binary problems (Vapnik, 1998, p. 341). Suppose that X is the domain of input data, and most often but not necessarily X ⊆ R^d, where d is a natural number. A fixed set X_n = {x_1, . . . , x_n} of n points from X is observed, and the labels y_1, . . .
, y_n ∈ {−1, +1} of these points are also fixed but unknown. A subset X_l ⊆ X_n of size l is picked uniformly at random, and then y_i is revealed if x_i ∈ X_l. We call S_l = {(x_i, y_i) | x_i ∈ X_l} the labeled data and X_u = X_n \ X_l the unlabeled data. Using S_l and X_u, the goal is to predict y_i of x_i ∈ X_u (while any unobserved x ∈ X \ X_n is currently left out of account).

Transductive learning (TL) (e.g., Blum & Chawla, 2001; Szummer & Jaakkola, 2001; Joachims, 2003; Zhou et al., 2003; El-Yaniv et al., 2008) differs slightly from semi-supervised learning (SSL) (e.g., Bennett & Demiriz, 1998; Zhu et al., 2003; Grandvalet & Bengio, 2004; Belkin et al., 2006; Li et al., 2009; Li & Zhou, 2011; Niu et al., 2013b): TL focuses on predicting X_u while SSL aims at predicting X \ X_l, and TL is distribution free but SSL is not.¹ More specifically, TL generally makes no assumption about the underlying distributions, and the true labels are deterministic; SSL usually assumes that S_l is sampled from p(x, y) and X_u is sampled from p(x), and then the true labels are stochastic. Moreover, if there is any distributional change, SSL should specify the form of the change, but TL might deal with it directly. To sum up, SSL is inductive learning in nature, and the advantage of TL over inductive learning is conceptually critical for us.

As an extension of El-Yaniv et al. (2008), the volume approximation to be proposed can be applied to many transductive problem settings, where the differences lie in the encoding of labels and the decoding of hypotheses. The first setting is multi-class learning: instead of y_i ∈ {−1, +1}, we have y_i ∈ Y where Y = {1, . . . , c} and c is a natural number. Each of the c labels here has some labeled data, in spite of any distributional change. The second setting is multi-label learning: y_i ⊆ Y with Y = {1, . . . , c} where y_i is a label set, or y_i ∈ Y with Y = {−1, 0, 1}^c where y_i is a label vector (cf. Kong et al., 2013).
The third setting is serendipitous learning, which is a multi-class setting with missing classes in S_l; that is, some of the c labels have no labeled data (cf. Zhang et al., 2011). It is non-trivial to see whether the distributional change is covariate shift (Yamada et al., 2010) or class-prior change (du Plessis & Sugiyama, 2012) from the semi-supervised point of view, whereas it is unnecessary to specify the form of the change in our settings.

In principle, all transductive methods can solve multi-class problems with the one-vs-rest extension. But this may not be a good idea for methods defined in terms of non-convex optimizations like El-Yaniv et al. (2008). Furthermore, the encoding of labels for multi-label and serendipitous problems is an issue when using the one-vs-rest extension. The volume approximation to be proposed can handle all these settings in a unified manner, but in this paper we focus on multi-class and serendipitous learning, since they do not require sophisticated post-processing as in Kong et al. (2013).

3. Volume Approximations

In this section we review the binary volume approximation and propose our multi-class volume approximation.

3.1. Binary volume approximation

The binary volume approximation involves a few key concepts (El-Yaniv et al., 2008): the soft response vector, the hypothesis space and the equivalence class, and the power and volume of equivalence classes.

¹ Some methods lie between them, e.g., Wang et al. (2013).

Given a set of n data X_n = {x_1, . . . , x_n} where x_i ∈ X, a soft response vector is an n-dimensional vector

    h := (h_1, . . . , h_n)^⊤ ∈ R^n,    (1)

so that h_i stands for a soft or confidence-rated label of x_i. For binary problems, h suggests that x_i is from the positive class if h_i > 0, that x_i is from the negative class if h_i < 0, and that the two cases are equally possible if h_i = 0.

A hypothesis space is a collection of hypotheses.
The volume approximation requires a symmetric positive-definite matrix Q ∈ R^{n×n} which contains the pairwise information about X_n. Consider the hypothesis space

    H_Q := {h | h^⊤Qh ≤ 1},    (2)

where the hypotheses are soft response vectors. The set of sign vectors {sign(h) | h ∈ H_Q} contains all of the N = 2^n possible dichotomies of X_n, and H_Q can be partitioned into a finite number of equivalence classes C_1, . . . , C_N, such that for fixed k, all hypotheses in C_k generate the same labeling of X_n.

Then, in statistical learning theory, the power of an equivalence class C_k is defined as the probability mass of all hypotheses in it (Vapnik, 1998, p. 708), i.e.,

    P(C_k) := ∫_{C_k} p(h) dh,  k = 1, . . . , N,

where p(h) is the underlying probability density of h over H_Q. Hypotheses chosen from an equivalence class C_k with a large power should be preferred according to Vapnik (1998). When no specific domain knowledge is available (i.e., p(h) is unknown), it would be natural to assume the continuous uniform distribution p(h) = 1 / Σ_{k=1}^N V(C_k), where

    V(C_k) := ∫_{C_k} dh,  k = 1, . . . , N,

is the volume of C_k. That is, the volume of an equivalence class is defined as the geometric volume of all hypotheses in it. As a result, P(C_k) is proportional to V(C_k), and the larger the value of V(C_k) is, the more confident we are of the hypotheses chosen from C_k.

However, it is very hard to accurately compute the geometric volume of even a single convex body in R^n, let alone all 2^n such bodies, so El-Yaniv et al. (2008) introduced an efficient approximation. Let λ_1 ≤ · · · ≤ λ_n be the eigenvalues of Q, and v_1, . . . , v_n be the associated orthonormal eigenvectors. Geometrically, the hypothesis space H_Q in Eq. (2) is an origin-centered ellipsoid in R^n with v_i and 1/√λ_i as the direction and length of its i-th principal axis. Note that a small angle from a hypothesis h in C_k to some v_i with a small/large index i (i.e., a long/short principal axis) implies that V(C_k) is large/small (cf. Figure 1).
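This geometric picture can be illustrated numerically (a sketch of ours, not from the paper): sample uniformly from a two-dimensional hypothesis space H_Q with an assumed, elongated Q, and estimate the volume fraction of each quadrant equivalence class. The two quadrants crossed by the long principal axis should dominate, just as C₁ and C₃ do in Figure 1.

```python
import numpy as np

# Monte Carlo estimate of equivalence-class volumes for an assumed
# 2-D hypothesis space H_Q = {h : h'Qh <= 1} (cf. Figure 1).
rng = np.random.default_rng(0)
Q = 5.0 * np.array([[1.0, -0.9], [-0.9, 1.0]])   # SPD, long axis along (1, 1)
L = np.linalg.cholesky(np.linalg.inv(Q))         # maps the unit disk onto H_Q

m = 100_000
u = rng.standard_normal((m, 2))
u /= np.linalg.norm(u, axis=1, keepdims=True)    # uniform directions
r = np.sqrt(rng.uniform(size=(m, 1)))            # uniform radii in the disk
h = (r * u) @ L.T                                # uniform samples in H_Q

# Quadrant codes: 3 = (+,+), 0 = (-,-), 1 = (-,+), 2 = (+,-)
quadrant = (h[:, 0] > 0).astype(int) * 2 + (h[:, 1] > 0).astype(int)
fractions = np.bincount(quadrant, minlength=4) / m
```

With this Q, the long axis points along (1, 1), so the (+,+) and (−,−) equivalence classes together hold most of the volume.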
Based on this crucial observation, we define

    V(h) := h^⊤Qh / ‖h‖²₂ = Σ_{i=1}^n λ_i (h^⊤v_i / ‖h‖₂)²,    (3)

where h^⊤v_i / ‖h‖₂ is the cosine of the angle between h and v_i. We subsequently expect V(h) to be small when h lies in a large-volume equivalence class, and conversely to be large when h lies in a small-volume equivalence class.

3.2. Multi-class volume approximation

The multi-class volume approximation can deal with the aforementioned transductive problem settings in a unified manner. In order to extend the definition in Eq. (3), we only need to extend the hypothesis and the hypothesis space. To begin with, we allocate a soft response vector as in Eq. (1) for each of the c labels:

    h_1 = (h_{1,1}, . . . , h_{n,1})^⊤, . . . , h_c = (h_{1,c}, . . . , h_{n,c})^⊤.

The value h_{i,j} is a soft or confidence-rated label of x_i concerning the j-th label, and it suggests that x_i should possess the j-th label if h_{i,j} > 0, that x_i should not possess the j-th label if h_{i,j} < 0, and that the two cases are equally possible if h_{i,j} = 0. For multi-class and serendipitous problems, y_i is predicted by ŷ_i = arg max_j h_{i,j}. For multi-label problems, we need a threshold Th that is either preset or learned, since positive and negative labels are usually imbalanced, and y_i can be predicted by ŷ_i = {j | h_{i,j} ≥ Th}; alternatively, we can employ the prediction methods proposed in Kong et al. (2013).

Then, a soft response matrix as our transductive hypothesis is an n-by-c matrix defined by

    H = (h_1, . . . , h_c) ∈ R^{n×c},    (4)

and a stacked soft response vector as an equivalent hypothesis is an nc-dimensional vector defined by

    h = vec(H) = (h_1^⊤, . . . , h_c^⊤)^⊤ ∈ R^{nc},

where vec(H) is the vectorization of H formed by stacking its columns into a single vector. As in the binary definition of the hypothesis space, a symmetric positive-definite matrix Q ∈ R^{n×n} which contains the pairwise information about X_n is provided, and we assume further that a symmetric positive-definite matrix P ∈ R^{c×c} which contains the pairwise information about Y is available.
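The two forms in Eq. (3) agree because Q = Σ_i λ_i v_i v_i^⊤. A small numerical check with a randomly generated SPD matrix (our illustration, not the paper's code) makes the identity concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)        # symmetric positive definite
lam, V = np.linalg.eigh(Q)         # lam[0] <= ... <= lam[n-1]; columns of V orthonormal

h = rng.standard_normal(n)
V_direct = h @ Q @ h / (h @ h)                 # h'Qh / ||h||^2
cosines = (V.T @ h) / np.linalg.norm(h)        # cosines of the angles to each v_i
V_spectral = np.sum(lam * cosines**2)          # eigenvalue-weighted squared cosines
```

A hypothesis aligned with v₁ (the longest principal axis) attains V(h) = λ₁, the smallest possible value, which is exactly the preference the approximation encodes.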
Consider the hypothesis space

    H_{P,Q} := {H | tr(H^⊤QHP) ≤ 1},    (5)

where the hypotheses are soft response matrices. Let P ⊗ Q ∈ R^{nc×nc} be the Kronecker product of P and Q. Due to the symmetry and positive definiteness of P and Q, the Kronecker product P ⊗ Q is also symmetric and positive definite, and H_{P,Q} in Eq. (5) can be defined equivalently as

    H_{P,Q} := {H | vec(H)^⊤ (P ⊗ Q) vec(H) ≤ 1}.    (6)

The equivalence between Eqs. (5) and (6) comes from the fact that tr(H^⊤QHP) = vec(H)^⊤ (P ⊗ Q) vec(H), which follows from the well-known identity (see, e.g., Theorem 13.26 of Laub, 2005)

    (P ⊗ Q) vec(H) = vec(QHP).

As a consequence, there is a bijection between H_{P,Q} and

    E_{P,Q} := {h | h^⊤(P ⊗ Q)h ≤ 1},

which is geometrically an origin-centered ellipsoid in R^{nc}. The set of sign vectors {sign(h) | h ∈ E_{P,Q}} spreads over all the N = 2^{nc} quadrants of R^{nc}, and thus the set of sign matrices {sign(H) | H ∈ H_{P,Q}} contains all of the N possible dichotomies of X_n × {1, . . . , c}. In other words, H_{P,Q} can be partitioned into N equivalence classes C_1, . . . , C_N, such that for fixed k, all soft response matrices in C_k generate the same labeling of X_n × {1, . . . , c}. The definition of the power is the same as before, and so is the definition of the volume:

    V(C_k) := ∫_{C_k} dH,  k = 1, . . . , N.

Because of the bijection between H_{P,Q} and E_{P,Q}, V(C_k) is likewise the geometric volume of all stacked soft response vectors in the intersection of the k-th quadrant of R^{nc} and E_{P,Q}. By a similar argument to the definition of V(h), we define

    V(H) := h^⊤(P ⊗ Q)h / ‖h‖²₂ = tr(H^⊤QHP) / ‖H‖²_Fro,    (7)

where h = vec(H) and ‖H‖_Fro is the Frobenius norm of H. We subsequently expect V(H) to be small when H lies in a large-volume equivalence class, and conversely to be large when H lies in a small-volume equivalence class.

Note that V(H) and V(h) are consistent for binary learning problems. When c = 2, we may constrain h_1 + h_2 = 0_n, where 0_n is the all-zero vector in R^n.
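The identity behind the equivalence of Eqs. (5) and (6), and the two forms of V(H) in Eq. (7), can be verified numerically with arbitrary SPD matrices (our illustration; note that vec(·) stacks columns, which matches numpy's Fortran order):

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 6, 3
A = rng.standard_normal((n, n)); Q = A @ A.T + n * np.eye(n)   # SPD, n x n
B = rng.standard_normal((c, c)); P = B @ B.T + c * np.eye(c)   # SPD, c x c
H = rng.standard_normal((n, c))

h = H.reshape(-1, order='F')            # vec(H): stack the columns of H
lhs = np.trace(H.T @ Q @ H @ P)         # tr(H'QHP)
rhs = h @ np.kron(P, Q) @ h             # vec(H)'(P (x) Q)vec(H)

V_H = lhs / np.linalg.norm(H, 'fro')**2   # the volume approximation V(H), Eq. (7)
```

The same column-stacking convention also realizes (P ⊗ Q)vec(H) = vec(QHP).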
Let P = I_2, where I_2 is the identity matrix of size 2; then

    V(H) = (h_1^⊤Qh_1 + h_2^⊤Qh_2) / (‖h_1‖²₂ + ‖h_2‖²₂) = h_1^⊤Qh_1 / ‖h_1‖²₂ = V(h_1),

which coincides with V(h) defined in Eq. (3). Similarly to V(h), for two soft response matrices H and H′ from the same equivalence class, V(H) and V(H′) may not necessarily take the same value. In addition, the domain of V(H) can be extended to R^{n×c}, though the definition of V(H) is originally null for H outside H_{P,Q}.

4. Multi-class Approximate Volume Regularization

The proposed volume approximation motivates a family of new transductive methods that take it as a regularization. We develop and analyze an instantiation in this section whose optimization problem is non-convex but can be solved exactly and efficiently.

First of all, we define the label indicator matrix Y ∈ R^{n×c} for convenience, whose entries can be from either {0, 1} or {−1, 0, 1} depending on the problem setting and on whether negative labels ever appear. Specifically, we can set Y_{i,j} = 1 if x_i is labeled to have the j-th label and Y_{i,j} = 0 otherwise; alternatively, we can set Y_{i,j} = 1 if x_i is labeled to have the j-th label, Y_{i,j} = −1 if x_i is labeled to not have the j-th label, and Y_{i,j} = 0 otherwise. Let ℓ(Y, H) be our loss function measuring the difference between Y and H. The multi-class volume approximation motivates the following family of transductive methods:

    min_{H ∈ H_{P,Q}}  ℓ(Y, H) + γ tr(H^⊤QHP) / ‖H‖²_Fro,

where γ > 0 is a regularization parameter. The denominator ‖H‖²_Fro is annoying, so we would like to get rid of it as in El-Yaniv et al. (2008) or Niu et al. (2013a). We fix τ > 0 as a scale parameter, constrain H to be of norm τ, replace the feasible region H_{P,Q} with R^{n×c}, and the problem becomes

    min_{H ∈ R^{n×c}}  ℓ(Y, H) + γ tr(H^⊤QHP)  s.t.  ‖H‖_Fro = τ.    (8)

Although the optimization is done in R^{n×c}, the regularization is relative to H_{P,Q}, since tr(H^⊤QHP) is a weighted sum of squared cosines between vec(H) and the principal axes of E_{P,Q} under the constraint ‖H‖_Fro = τ. Subsequently, we denote by y_1, . . .
, y_n and r_1, . . . , r_n the c-dimensional vectors that satisfy Y = (y_1, . . . , y_n)^⊤ and H = (r_1, . . . , r_n)^⊤. Consider the following loss functions as ℓ(Y, H) in optimization (8):

1. Squared losses over all data: Σ_{x_i ∈ X_n} ‖y_i − r_i‖²₂;
2. Squared losses over labeled data: Σ_{x_i ∈ X_l} ‖y_i − r_i‖²₂;
3. Linear losses over all data: Σ_{x_i ∈ X_n} −y_i^⊤r_i;
4. Linear losses over labeled data: Σ_{x_i ∈ X_l} −y_i^⊤r_i.

They or their binary counterparts have been used in Zhou et al. (2003), El-Yaniv et al. (2008) and Niu et al. (2013a). Actually, the third and fourth ones are identical, since y_i is zero for x_i ∈ X_u, and the first one is equivalent to them in (8), since Σ_{x_i ∈ X_n} ‖y_i‖²₂ and Σ_{x_i ∈ X_l} ‖y_i‖²₂ are constants and Σ_{x_i ∈ X_n} ‖r_i‖²₂ = τ² is also a constant. The second one is undesirable due to an issue of time complexity. Thus, we instantiate

    ℓ(Y, H) := Σ_{x_i ∈ X_n} ‖y_i − r_i‖²₂ = ‖Y − H‖²_Fro,

and optimization (8) becomes

    min_{H ∈ R^{n×c}}  ‖Y − H‖²_Fro + γ tr(H^⊤QHP)  s.t.  ‖H‖_Fro = τ.    (9)

We refer to the constrained optimization problem (9) as multi-class approximate volume regularization (MAVR). An unconstrained version of MAVR is then

    min_{H ∈ R^{n×c}}  ‖Y − H‖²_Fro + γ tr(H^⊤QHP).    (10)

4.2. Algorithm

Optimization (9) is non-convex, but we can rewrite it using the stacked soft response vector h = vec(H) as

    min_{h ∈ R^{nc}}  ‖y − h‖²₂ + γ h^⊤(P ⊗ Q)h  s.t.  ‖h‖₂ = τ,    (11)

where y = vec(Y) is the vectorization of Y. In this representation, the objective is a second-degree polynomial and the constraint is an origin-centered sphere, and fortunately we can solve it exactly and efficiently following Forsythe & Golub (1965). To this end, a fundamental property of the Kronecker product is necessary (see, e.g., Theorems 13.10 and 13.12 of Laub, 2005):

Theorem 1. Let λ_{Q,1} ≤ · · · ≤ λ_{Q,n} be the eigenvalues and v_{Q,1}, . . . , v_{Q,n} be the associated orthonormal eigenvectors of Q, let λ_{P,1} ≤ · · · ≤ λ_{P,c} and v_{P,1}, . . . , v_{P,c} be those of P, and let the eigen-decompositions of Q and P be Q = V_Q Λ_Q V_Q^⊤ and P = V_P Λ_P V_P^⊤.
Then, the eigenvalues of P ⊗ Q are λ_{P,j}λ_{Q,i} associated with the orthonormal eigenvectors v_{P,j} ⊗ v_{Q,i} for j = 1, . . . , c and i = 1, . . . , n, and the eigen-decomposition of P ⊗ Q is P ⊗ Q = V_{P⊗Q} Λ_{P⊗Q} V_{P⊗Q}^⊤, where Λ_{P⊗Q} = Λ_P ⊗ Λ_Q and V_{P⊗Q} = V_P ⊗ V_Q.

After we ignore the constants ‖y‖²₂ and ‖h‖²₂ in the objective of optimization (11), the Lagrange function is

    Φ(h, ρ) = −2h^⊤y + γ h^⊤(P ⊗ Q)h − ρ(h^⊤h − τ²),

where ρ ∈ R is the Lagrangian multiplier for ‖h‖²₂ = τ². The stationary conditions are

    ∂Φ/∂h = 2(−y + γ(P ⊗ Q)h − ρh) = 0_{nc},    (12)
    ∂Φ/∂ρ = −(h^⊤h − τ²) = 0.    (13)

Hence, for any locally optimal solution (h, ρ) where ρ/γ is not an eigenvalue of P ⊗ Q, we have

    h = (γ P ⊗ Q − ρ I_{nc})^{−1} y    (14)
      = V_{P⊗Q} (γΛ_{P⊗Q} − ρ I_{nc})^{−1} V_{P⊗Q}^⊤ y
      = (V_P ⊗ V_Q)(γΛ_{P⊗Q} − ρ I_{nc})^{−1} vec(V_Q^⊤ Y V_P)    (15)

based on Eq. (12) and Theorem 1.

Algorithm 1 MAVR
  Input: P, Q, Y, γ and τ
  Output: H and ρ
  1: Eigen-decompose P and Q;
  2: Construct the function g(ρ);
  3: Find the smallest root of g(ρ);
  4: Recover h using ρ and reshape h into H.

Next, we search for the ρ feasible for (12) and (13), which will lead to the globally optimal h. Let z = vec(V_Q^⊤ Y V_P); then plugging (15) into (13) gives us

    z^⊤(γΛ_{P⊗Q} − ρ I_{nc})^{−2} z − τ² = 0.    (16)

Let us sort the eigenvalues λ_{P,1}λ_{Q,1}, . . . , λ_{P,c}λ_{Q,n} into a non-descending sequence {λ_{P⊗Q,1}, . . . , λ_{P⊗Q,nc}}, rearrange {z_1, . . . , z_{nc}} accordingly, and find the smallest k₀ which satisfies z_{k₀} ≠ 0. As a result, Eq. (16) implies that

    g(ρ) := Σ_{k=k₀}^{nc} z_k² / (γλ_{P⊗Q,k} − ρ)² − τ² = 0    (17)

for any stationary ρ. By Theorem 4.1 of Forsythe & Golub (1965), the smallest root of g(ρ) determines a unique h such that (h, ρ) is the globally optimal solution of Φ(h, ρ), i.e., h minimizes the objective of (11) globally. The only exception, in which we cannot determine h by Eq. (14) for a specific value of ρ, is when ρ/γ is an eigenvalue of P ⊗ Q. This, however, happens with probability zero. The theorem below points out the location of the optimal ρ (the proof is in the appendix):

Theorem 2. The function g(ρ) defined in Eq.
(17) has exactly one root in the interval [ρ₀, γλ_{P⊗Q,k₀}) and no root in the interval (−∞, ρ₀), where ρ₀ = γλ_{P⊗Q,k₀} − ‖y‖₂/τ.

The algorithm of MAVR is summarized in Algorithm 1. It is easy to see that fixing ρ = −1 in Algorithm 1, instead of finding the smallest root of g(ρ), suffices to solve optimization (10). Moreover, for the special case P = I_c, where I_c is the identity matrix of size c, any stationary H is simply

    H = (γQ − ρI_n)^{−1} Y = V_Q (γΛ_Q − ρI_n)^{−1} V_Q^⊤ Y.

Let z = V_Q^⊤ Y 1_c, where 1_c is the all-one vector in R^c, and let k₀ be the smallest number that satisfies z_{k₀} ≠ 0. Then the smallest root of g(ρ) = Σ_{k=k₀}^n z_k² / (γλ_{Q,k} − ρ)² − τ² gives us the feasible ρ leading to the globally optimal H.

The asymptotic time complexity of Algorithm 1 is O(n³). More specifically, eigen-decomposing Q in the first step of Algorithm 1 costs O(n³), and this is the dominating computation. Eigen-decomposing P needs just O(c³) and is negligible under the assumption that n ≥ c, which holds without loss of generality. In the second step, it requires O(nc log(nc)) to sort the eigenvalues of P ⊗ Q and O(n²c) to compute z. Finding the smallest root of g(ρ) by a binary search algorithm uses O(log(‖y‖₂)) in the third step. In the fourth step, recovering h is essentially the same as computing z and costs O(n²c).

We would like to comment a little more on the asymptotic time complexity of MAVR. Given fixed P and Q but different Y, γ and τ, the computational complexity is O(n²c) if we reuse the eigen-decompositions of P and Q and the sorted eigenvalues of P ⊗ Q. This property is particularly advantageous for trying different hyperparameters. It is also quite useful for choosing different X_l ⊆ X_n to be labeled following transductive problem settings. Finally, the asymptotic time complexity O(n³) for solving MAVR exactly can hardly be improved based on existing techniques. Even if ρ is fixed as in optimization (10), the stationary condition Eq.
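Putting Eqs. (14)-(17) and Theorem 2 together, Algorithm 1 admits a compact implementation. The sketch below is our reading of the text (bisection for the smallest root of g(ρ), then recovery in the eigenbasis); the function name is ours and this is not the authors' reference code:

```python
import numpy as np

def mavr(P, Q, Y, gamma, tau, iters=200, tol=1e-12):
    """Sketch of Algorithm 1 (MAVR) for the constrained problem (9)."""
    lamQ, VQ = np.linalg.eigh(Q)            # step 1: eigen-decompose Q and P
    lamP, VP = np.linalg.eigh(P)
    lam = np.kron(lamP, lamQ)               # eigenvalues of P (x) Q (Theorem 1)
    Z = VQ.T @ Y @ VP
    z = Z.reshape(-1, order='F')            # z = vec(VQ' Y VP)

    nz = np.abs(z) > tol                    # terms with z_k = 0 drop out of g
    lam_nz, z_nz = lam[nz], z[nz]
    lam_k0 = lam_nz.min()                   # smallest eigenvalue with z_k != 0

    def g(rho):                             # step 2: the function of Eq. (17)
        return np.sum(z_nz**2 / (gamma * lam_nz - rho)**2) - tau**2

    # Step 3: Theorem 2 puts the smallest root in [rho0, gamma*lam_k0),
    # where g is increasing, so bisection applies.
    lo = gamma * lam_k0 - np.linalg.norm(z) / tau   # rho0 (note ||y||_2 = ||z||_2)
    hi = gamma * lam_k0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)

    # Step 4: recover H via Eq. (15), entrywise in the eigenbasis;
    # entries with z_k = 0 are set to zero explicitly.
    denom = gamma * np.outer(lamQ, lamP) - rho      # gamma*lamQ_i*lamP_j - rho
    Zc = np.where(np.abs(Z) > tol, Z, 0.0)
    H = VQ @ (Zc / denom) @ VP.T
    return H, rho
```

Reusing the eigen-decompositions across calls with different Y, γ and τ gives the O(n²c) per-call cost noted above; fixing ρ = −1 instead of running the bisection solves the unconstrained problem (10).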
(12) is a discrete Sylvester equation, which consumes O(n³) to solve (Sima, 1996).

4.3. Theoretical analyses

We provide two theoretical results. Under certain assumptions, the stability analysis upper-bounds the difference between two optimal solutions H and H′ trained with two different label indicator matrices Y and Y′, and the error analysis bounds the difference of H from the ground truth.

Theorem 2 guarantees that ρ < γλ_{P⊗Q,k₀}. In fact, with high probability over the choice of Y, it holds that k₀ = 1 (we never encountered k₀ > 1 in our experiments). For this reason, we make the following assumption: fix P and Q, and allow Y to change according to the partition of X_n into different X_l and X_u. There is a constant C_{γ,τ} > 0, which depends only on γ and τ, such that for all optimal ρ trained with different Y, ρ ≤ γλ_{P⊗Q,1} − C_{γ,τ}. Note that for unconstrained MAVR, there must be C_{γ,τ} > 1, since γλ_{P⊗Q,1} > 0 and ρ = −1. We can prove the theorem below based on the assumption above and the lower bound of ρ in Theorem 2.

Theorem 3 (Stability of MAVR). Assume the existence of C_{γ,τ}. Let (H, ρ) and (H′, ρ′) be two globally optimal solutions trained with two different label indicator matrices Y and Y′, respectively. Then,

    ‖H − H′‖_Fro ≤ ‖Y − Y′‖_Fro / C_{γ,τ} + |ρ − ρ′| · min{‖Y‖_Fro, ‖Y′‖_Fro} / C²_{γ,τ}.    (18)

Consequently, for MAVR in optimization (9) we have

    ‖H − H′‖_Fro ≤ ‖Y − Y′‖_Fro / C_{γ,τ} + ‖Y‖_Fro ‖Y′‖_Fro / (τ C²_{γ,τ}),

and for unconstrained MAVR in optimization (10) we have

    ‖H − H′‖_Fro ≤ ‖Y − Y′‖_Fro / C_{γ,τ}.

In order to present an error analysis, we assume there is a ground-truth soft response matrix H* with two properties. Firstly, the value of V(H*) should be bounded, namely,

    V(H*) = tr(H*^⊤QH*P) / ‖H*‖²_Fro ≤ C_h,

where C_h > 0 is a small number. This ensures that H* lies in a large-volume region; otherwise MAVR, which implements the large volume principle, can by no means learn some H close to H*. Secondly, Y should contain certain information about H*.
MAVR makes use of P, Q and Y only, and the meanings of P and Q are fixed already, so MAVR may access the information about H* only through Y. To make Y and H* correlated, we assume that Y = H* + E, where E ∈ R^{n×c} is a noise matrix of the same size as Y and H*. All entries of E are independent with zero mean, and the variance of each entry is σ_l² or σ_u² depending on whether it corresponds to a labeled or an unlabeled position in Y. We expect that σ_l ≪ σ_u, such that the entries of Y in labeled positions are close to the corresponding entries of H*, whereas the entries of Y in unlabeled positions are completely corrupted and uninformative for recovering H*. Notice that we need this generating mechanism of Y even if C_h/γ is the smallest eigenvalue of P ⊗ Q, since P ⊗ Q may have multiple smallest eigenvalues and then H* may have totally different meanings. Based on these assumptions, we can prove the theorem below.

Theorem 4 (Accuracy of MAVR). Assume the existence of C_{γ,τ} and C_h, and the generating process of Y from H* and E. Let e_l and e_u be the numbers of labeled and unlabeled positions in Y, and assume that E_E ‖Y‖²_Fro ≤ e_l, where the expectation is with respect to the noise matrix E. For each possible Y, let H be the globally optimal solution trained with it. Then,

    E_E ‖H − H*‖_Fro ≤ (√(C_h γλ_{P⊗Q,1}) / C_{γ,τ}) ‖H*‖_Fro
        + (√e_l / τ) · max{γλ_{P⊗Q,1} − 1, γλ_{P⊗Q,1} − C_{γ,τ} + 1} / C_{γ,τ}
        + √(e_l σ_l² + e_u σ_u²) / C_{γ,τ}    (19)

for MAVR in optimization (9), and

    E_E ‖H − H*‖²_Fro ≤ (C_h γ / 4) ‖H*‖²_Fro + e_l σ_l² + e_u σ_u²    (20)

for unconstrained MAVR in optimization (10).

The proofs of Theorems 3 and 4 are in the appendix. Considering the instability bounds in Theorem 3 and the error bounds in Theorem 4, unconstrained MAVR is superior to constrained MAVR in both cases. That being said, bounds are just bounds. We will demonstrate the potential of constrained MAVR by experiments in the next section.

5. Experiments

In this section, we numerically evaluate MAVR.
The baseline methods include the one-vs-rest extension of the binary approximate volume regularization (BAVR), as well as a multi-class transductive method named learning with local and global consistency (LGC) (Zhou et al., 2003).

Figure 2. Serendipitous learning by MAVR ((b) Data 2; (c) Data 3; (d) Data 4; (e) Data 5; (f) Data 1, P₁; (g) Data 1, P₂; (h) Data 1, P₃; (i) Data 2, P₄; (j) Data 3, P₅; (k) Data 4, P₅; (l) Data 5, P₅).

5.1. Serendipitous learning

We show how to handle serendipitous problems by MAVR directly, without performing clustering (Hartigan & Wong, 1979; Ng et al., 2001; Sugiyama et al., 2014) or estimating the class-prior change (du Plessis & Sugiyama, 2012). The experimental results are displayed in Figure 2. There are 5 artificial data sets in total, where the latter 3 data sets come from Zelnik-Manor & Perona (2004). The matrix Q was specified as the normalized graph Laplacian (see, e.g., von Luxburg, 2007)²

    Q = L_nor = I_n − D^{−1/2} W D^{−1/2},

where W ∈ R^{n×n} is a similarity matrix and D ∈ R^{n×n} is the degree matrix of W. The matrix P was specified as one of five matrices P₁, . . . , P₅ [entries not recoverable from the extracted text]. For data sets 1 and 2 we used the Gaussian similarity

    W_{i,j} = exp(−‖x_i − x_j‖²₂ / (2σ²)),  W_{i,i} = 0,

with the kernel width σ = 0.25, and for data sets 3 to 5 we applied the local-scaling similarity (Zelnik-Manor & Perona, 2004)

    W_{i,j} = exp(−‖x_i − x_j‖²₂ / (2σ_i σ_j)),  W_{i,i} = 0,

with the nearest-neighbor number k = 7, where σ_i = ‖x_i − x_i^{(k)}‖₂ is the scale parameter of x_i and x_i^{(k)} is the k-th nearest neighbor of x_i in X_n. For the hyperparameters, we set γ = 99 and τ = l.

² Though the graph Laplacian matrices have zero eigenvalues, they do not cause algorithmic problems when used as Q.

Figure 3. Serendipitous learning by LGC ((a) Data 1; (b) Data 2; (c) Data 3; (d) Data 4; (e) Data 5).
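The construction of Q used here can be sketched in a few lines (the function name is ours, and the paper's exact preprocessing may differ):

```python
import numpy as np

def normalized_laplacian(X, sigma=0.25):
    """Q = L_nor = I - D^{-1/2} W D^{-1/2} with the Gaussian similarity
    W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), W_ii = 0."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                       # W_ii = 0
    d_isqrt = 1.0 / np.sqrt(W.sum(axis=1))         # D^{-1/2} as a vector
    return np.eye(len(X)) - d_isqrt[:, None] * W * d_isqrt[None, :]
```

L_nor is symmetric positive semi-definite with eigenvalues in [0, 2]; as footnote 2 remarks, its zero eigenvalues do not cause algorithmic problems when it is used as Q.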
Furthermore, a class-balance regularization was imposed for data sets 2 to 5, which tries to minimize

    γ′ tr(H^⊤(1_n 1_n^⊤) H (I_c − 1_c 1_c^⊤ / c)).

The detailed derivation is omitted due to limited space, but the idea is to encourage balanced total responses among the c classes. The regularization parameter of this term was set to γ′ = 1.

We can see in Figure 2 that MAVR successfully classified the data belonging to the known classes and simultaneously clustered the data belonging to the unknown classes. Moreover, we can control the influence of the known classes on the unknown classes by specifying different P, as shown in subfigures (f), (g) and (h) of Figure 2. On the other hand, BAVR cannot benefit from the class-balance regularization, and LGC with the class-balance regularization on data sets 2 to 5 in Figure 3 was not as perfect as MAVR.

5.2. Multi-class learning

As commented at the end of our theoretical analyses, we demonstrate the potential of constrained MAVR by experiments. Actually, LGC can be subsumed in MAVR as a special case of unconstrained MAVR: although LGC is motivated from the label propagation point of view, it can be rewritten as the optimization problem

    min_{H ∈ R^{n×c}}  ‖Y − H‖²_Fro + γ tr(H^⊤ L_nor H).

Therefore, unconstrained MAVR reduces exactly to LGC if P = I_c and Q = L_nor. We now specify P = I_c and Q = L_nor and illustrate the nuance between constrained MAVR, LGC, and BAVR using an artificial data set.

The artificial data set 3circles is generated as follows. There are three classes with the class ratio 1/6 : 1/3 : 1/2. Let y_i be the ground-truth label of x_i; then x_i is generated by

    x_i = (6y_i cos(a_i) + ε_{i,1}, 5y_i sin(a_i) + ε_{i,2})^⊤ ∈ R²,

where a_i is an angle drawn i.i.d. from the uniform distribution U(0, 2π), and ε_{i,1} and ε_{i,2} are noises drawn i.i.d. from the normal distribution N(0, σ_ε²). In our experiments, we varied one factor while fixing all other factors.
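The generating process of 3circles above is easy to reproduce (a sketch; the function name and the use of numpy's random generator are ours):

```python
import numpy as np

def make_3circles(n=300, sigma_eps=0.5, seed=0):
    """Three classes with ratio 1/6 : 1/3 : 1/2; a point of class y lies on
    an ellipse of radii (6y, 5y) plus Gaussian noise N(0, sigma_eps^2)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([1, 2, 3], size=n, p=[1/6, 1/3, 1/2])   # class ratio
    a = rng.uniform(0.0, 2.0 * np.pi, size=n)              # angles a_i ~ U(0, 2pi)
    eps = rng.normal(0.0, sigma_eps, size=(n, 2))          # noises eps_{i,1}, eps_{i,2}
    X = np.column_stack([6 * y * np.cos(a), 5 * y * np.sin(a)]) + eps
    return X, y
```

Feeding X to the normalized-Laplacian construction of Section 5.1 yields the Q used in these experiments.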
Figure 4. Visualization of the artificial data set 3circles ((a) default; (b) small n; (c) large n; (d) small σ_ε; (e) large σ_ε).

Figure 5. Means with standard errors of LGC, BAVR and MAVR on 3circles ((a) varying σ_ε; (b) varying l; (c) varying n; (d) varying σ; (e) varying γ; (f) varying τ).

The default values of the factors were σ_ε = 0.5, σ = 0.5, l = 3, n = 300, γ = 99, and τ = l, and the ranges of these factors were

    σ_ε ∈ 0.5 · exp{−1.5, −1.4, −1.3, . . . , 0.5};
    l ∈ {3, 4, 5, . . . , 20};
    n ∈ {120, 138, 156, . . . , 480};
    σ ∈ 0.5 · exp{−1, −0.9, −0.8, . . . , 1};
    γ ∈ 99 · exp{−4, −3, −2, . . . , 16};
    τ ∈ l · exp{−2, −1, 0, . . . , 18}.

Note that there was a distributional change, since we sampled the labeled data as balanced as possible across the three classes. Figure 4 exhibits several realizations of 3circles given different values of the factors. Figure 5 shows the experimental results, where the means with standard errors of the classification error rates are plotted. For each task that corresponds to a full specification of those factors, the three methods were repeatedly run on 100 random samplings. We can see from Figure 5 that the performance of LGC and BAVR was usually not as good as that of MAVR. The drawback of LGC is that we always have ρ = −1, since it is unconstrained. Though ρ is adaptive in BAVR, there are c different values of ρ, since BAVR is based on the one-vs-rest extension.

6.
6. Conclusions

We proposed a multi-class volume approximation that can be applied to several transductive problem settings such as multi-class, multi-label and serendipitous learning. The resultant learning method involves a non-convex optimization problem, but it can nevertheless be solved exactly and efficiently. The method was theoretically justified by our stability and error analyses, and empirically demonstrated to be a promising approach via experiments.

Acknowledgments

GN was supported by the FIRST Program and the 973 Program No. 2014CB340505. MCdP was supported by KAKENHI 23120004, and MS was supported by KAKENHI 25700022 and AOARD.

References

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
Bennett, K. and Demiriz, A. Semi-supervised support vector machines. In NIPS, 1998.
Blum, A. and Chawla, S. Learning from labeled and unlabeled data using graph mincuts. In ICML, 2001.
du Plessis, M. C. and Sugiyama, M. Semi-supervised learning of class balance under class-prior change by distribution matching. In ICML, 2012.
El-Yaniv, R., Pechyony, D., and Vapnik, V. Large margin vs. large volume in transductive learning. Machine Learning, 72(3):173–188, 2008.
Forsythe, G. and Golub, G. On the stationary values of a second-degree polynomial on the unit sphere. Journal of the Society for Industrial and Applied Mathematics, 13(4):1050–1068, 1965.
Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In NIPS, 2004.
Hartigan, J. A. and Wong, M. A. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
Joachims, T. Transductive learning via spectral graph partitioning. In ICML, 2003.
Kong, X., Ng, M., and Zhou, Z.-H. Transductive multilabel learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 25(3):704–719, 2013.
Laub, A. J.
Matrix Analysis for Scientists and Engineers. Society for Industrial and Applied Mathematics, 2005.
Li, S. and Ng, W. Maximum volume outlier detection and its applications in credit risk analysis. International Journal on Artificial Intelligence Tools, 22(5), 2013.
Li, Y.-F. and Zhou, Z.-H. Towards making unlabeled data never hurt. In ICML, 2011.
Li, Y.-F., Kwok, J., and Zhou, Z.-H. Semi-supervised learning using label mean. In ICML, 2009.
Ng, A., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
Niu, G., Dai, B., Shang, L., and Sugiyama, M. Maximum volume clustering: A new discriminative clustering approach. Journal of Machine Learning Research, 14:2641–2687, 2013a.
Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, 2013b.
Sima, V. Algorithms for Linear-Quadratic Optimization. Marcel Dekker, 1996.
Sugiyama, M., Niu, G., Yamada, M., Kimura, M., and Hachiya, H. Information-maximization clustering based on squared-loss mutual information. Neural Computation, 26(1):84–131, 2014.
Szummer, M. and Jaakkola, T. Partially labeled classification with Markov random walks. In NIPS, 2001.
Vapnik, V. N. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, 1998.
von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
Wang, J., Jebara, T., and Chang, S.-F. Semi-supervised learning using greedy max-cut. Journal of Machine Learning Research, 14:771–800, 2013.
Yamada, M., Sugiyama, M., and Matsui, T. Semi-supervised speaker identification under covariate shift. Signal Processing, 90(8):2353–2361, 2010.
Zelnik-Manor, L. and Perona, P. Self-tuning spectral clustering. In NIPS, 2004.
Zhang, D., Liu, Y., and Si, L. Serendipitous learning: Learning beyond the predefined label space.
In KDD, 2011.
Zhou, D., Bousquet, O., Navin Lal, T., Weston, J., and Schölkopf, B. Learning with local and global consistency. In NIPS, 2003.
Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.