# subspace_learning_with_partial_information__3f15630d.pdf

Journal of Machine Learning Research 17 (2016) 1-21 Submitted 10/14; Revised 11/15; Published 4/16

Subspace Learning with Partial Information

Alon Gonen ALONGNN@CS.HUJI.AC.IL School of Computer Science and Engineering The Hebrew University Jerusalem, Israel

Dan Rosenbaum DANRSM@CS.HUJI.AC.IL School of Computer Science and Engineering The Hebrew University Jerusalem, Israel

Yonina C. Eldar YONINA@EE.TECHNION.AC.IL Department of Electrical Engineering Technion, Israel Institute of Technology Haifa, Israel

Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL School of Computer Science and Engineering The Hebrew University Jerusalem, Israel

Editor: Kevin Murphy

The goal of subspace learning is to ﬁnd a k-dimensional subspace of Rd, such that the expected squared distance between instance vectors and the subspace is as small as possible. In this paper we study subspace learning in a partial information setting, in which the learner can only observe r d attributes from each instance vector. We propose several efﬁcient algorithms for this task, and analyze their sample complexity.

Keywords: principal components analysis, budgeted learning, statistical learning, learning with partial information, learning theory

1. Introduction

Subspace learning is a dimensionality reduction technique in a variety of applications such as face recognition (Yang et al., 2004), image compression (Du and Fowler, 2007), and document classiﬁcation (Papadimitriou et al., 1998). Recently, there has been growing interest in subspace learning from partially observed data (e.g., (Chen et al., 2013; Wang and Poor, 1998; Chi et al., 2013)). As motivation, consider the scenario of subspace learning from corrupted data. As discussed in Chen et al. (2013), data corruption may cause some (or even most) of the attributes to be missing. Another typical scenario is subspace learning from multiple sources. Many applications (e.g., wireless sensor networks (Chi et al., 2013)) rely on data which is collected from multiple sources (e.g., sensors). When the data dimension is high, it may be impossible or prohibitively expensive to collect every data entry from every source. Note that in the ﬁrst scenario, we have no control over which

c 2016 Alon Gonen, Dan Rosenbaum, Yonina C. Eldar and Shai Shalev-Shwartz.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

attributes are missing, while in the second one a learner may actively choose which attributes to observe. The subspace learning problem is formally deﬁned as follows. Let X be a subset of the Euclidean unit ball in Rd, and let P be some unknown distribution over X. Our goal is to ﬁnd a rank-k projection matrix Π Rd d such that the expected squared distance, Ex P [ x Πx 2 2], is as small as possible. When X is a ﬁnite set and P is the uniform distribution over X, the optimal solution to the subspace learning problem is given by the Principal Component Analysis (PCA) algorithm, which returns the projection matrix that corresponds to the k leading eigenvectors of the matrix 1 |X| P x X xx . In the more general stochastic optimization setting of subspace learning, X is not restricted to be a ﬁnite set, P is an arbitrary distribution over X, and the information given to the learner has the form of an i.i.d. training sequence (x1, . . . , xm) P m. In the usual full information setting of subspace learning, the learner has access to all attributes of the sampled vectors. In this paper, we consider subspace learning in a partial information setting, in which only a subset of indices from each vector can be observed. Inspired by the two applications presented above, we study two variants of this problem, which will be named the passive setting, and the active setting, respectively. In the passive setting, we assume that each attribute is observed with probability p = r/d (r d). Therefore, the expected number of observed attributes from each vector is r. In the active setting, the learner can choose (possibly at random) r attributes to be revealed. The sample complexity of a subspace learning algorithm is deﬁned as the number of samples that are needed by the algorithm in order to ﬁnd a projection matrix Π with expected squared distance, Ex P [ x Πx 2 2], of at most ϵ more than the optimal expected squared distance. The sample complexity for each of the models is deﬁned as the minimal sample complexity attained by any algorithm. In this paper we propose efﬁcient algorithms for both settings and analyze their sample complexity. We also provide several lower bounds on the sample complexity that can be attained by any algorithm.

1.1 Our Contribution

Our ﬁrst observation is that subspace learning (in both the passive and active models) using r = 1 attributes is impossible. Consider the task of learning a 1-dimensional subspace (a line) in R2 (see Figure 1). Let u1 = (α, α) , u2 = ( α, α). Denote by P1 the uniform distribution over {u1, u1}, and by P2 be the uniform distribution over {u2, u2}. Suppose that the actual distribution P is chosen uniformly at random from {P1, P2}. The task of subspace learning in this case amounts to distinguishing between P1 and P2. However, since each single coordinate is distributed uniformly over { α, α}, the learner does not obtain any information from a single observation, and hence cannot identify the right subspace. For the case 2 r d we propose two efﬁcient algorithms, named Partially Observed PCA (Section 2.1) and Matrix Bandit Exponentiated Gradient (Section 3.1), which are designed for the passive and active models, respectively. Partially Observed PCA (POPCA) is an extension of the PCA algorithm. Denote the population covariance matrix E[xx ] by C. The optimal projection matrix, denoted Π , is the projection onto the subspace that is spanned by the k leading eigenvectors of C. An ERM (empirical risk minimizer, i.e., an algorithm that minimizes the empirical loss) for the full information setting is given by PCA which returns the projection matrix corresponding to the k leading eigenvectors of

SUBSPACE LEARNING WITH PARTIAL INFORMATION

Figure 1: Impossibility of subspace learning using r = 1 observed attributes. The distribution of the observed attribute is identical for (a) and for (b), therefore they are indistinguishable.

the matrix m 1 Pm i=1 xix i (whose expected value is C). Similarly, the POPCA algorithm uses the random observations in order to construct an estimate ˆC of C, and approximates Π using the projection matrix onto the k leading eigenvectors of ˆC. We analyze the sample complexity of this algorithm, showing (see Corollary 3) that it is bounded from above by (d/r)2 k

In the full information setting, the sample complexity is known to be O(k/ϵ2) (This bounds coincides with our bound when r = d). Hence, the (multiplicative) price of partial information, that is, how many more examples we need in order to compensate for the lack of full information on each individual example, is O((d/r)2). It is interesting to understand whether a lower price can be achieved. In Theorem 4 we prove that the sample complexity of every algorithm in the passive model is Ω (d/r)2 k

ϵ2 . The optimality of POPCA in the passive mode is thus established. Another appealing property of POPCA is that, in terms of computational complexity, the challenge of partial information does not incur any additional cost; While the sample complexity grows as r decreases, the runtime per iteration decreases by the same order.

Next, we investigate the active model and ask whether the price of partial information can be reduced due to the ability of the learner to actively choose the observed attributes. Intuitively, one may hope that the learning process would reveal some useful information that can be utilized while choosing which coordinates to observe. Our second algorithm, called Matrix Bandit Exponentiated Gradient (MBEG), exploits the active setting by maintaining a weight matrix which is updated with every observation, and induces a non-uniform attribute sampling distribution. For MBEG, we derive an upper bound of max n 8k d+r

ϵ2 , d4r 2(d+r) o log(d) (see Theorem 7) on the sample complexity. We note that if ϵ is small enough, then the right term in the bound becomes irrelevant, and thus a linear dependence on d/r is obtained (for a detailed comparison, see Section 3.2). The (almost trivial) fact that every subspace learner, even in the active model, must have a sample complexity that grows linearly with d/r is proved in Appendix C. Hence, the dependence of MBEG on d/r is optimal, in the regime where ϵ is small.

The results in the active model immediately lead to the following question: Can we attain a linear price for partial information in the active model, independently of the required accuracy? We discuss possible directions for tackling this question in Section 4.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

1.2 Related Work

In the full information setting, it has been shown by Shawe-Taylor et al. (2005) and Blanchard et al. (2007) that the optimal sample complexity of subspace learning is at most O(k/ϵ2), and this upper bound is achievable by applying PCA on i.i.d. samples according to the distribution P. A similar result is obtained by applying the Stochastic Gradient Descent algorithm (Arora et al., 2013). Subspace learning in the partial information setting has been studied in Chi et al. (2013), where an algorithm named PETRELS is proposed. However, no formal guarantees are derived for this method. The setting in Mitliagkas et al. (2014) is similar to our passive setting, but they assume that the distribution that generates the instance vectors is Gaussian. A closely related problem to the task of subspace learning (in the passive setting) is the approximation of the covariance matrix using partially observed attributes. We discusses this relation in Section 2.2. One possible way to tackle the challenge of subspace learning with partial information is based on the matrix completion method. For example, we may think of the partially observed examples as a data matrix with unobserved entries. Then, one could ﬁrst ﬁll in the missing entries using a matrix completion technique (e.g., as described in Cand es and Recht (2009)), and then apply PCA to the data matrix. We note that this approach may work1 provided that the average number of observed attributes per example is sufﬁciently large. More precisely, according to the main result of Cand es and Recht (2009), the number of observed attributes per example (column of the data matrix) should scale with the rank of the data matrix (which may be large as d). In contrast, our focus in this paper is on methods that work even when the number of observed attributes per example is much smaller (two attributes per instance sufﬁce). The active setting resembles the setting of the multi-armed bandit problem (Auer et al., 2002), in which the learner obtains limited feedback at each time, namely, it receives only the reward of the chosen arm. The challenge of learning linear predictors in Rd with partially observed attributes (e.g., as in Cesa-Bianchi et al. (2011)) may be seen as an extension of this problem. One of the most signiﬁcant challenges in this work is to adapt the technique used in the vector setting (e.g., those employed by the Exp3 algorithm of Auer et al. (2002)) to the corresponding matrix setting. We already observed one difference: While a single arm sufﬁces in the vector case, subspace learning with r = 1 attributes is impossible. Our MBEG algorithm can be seen as an extension of the Online PCA algorithm of Warmuth and Kuzmin (2008) (see also Nie et al. (2013)) to the partial information setting. Similarly to their approach, MBEG maintains a weight matrix which induces a probability distribution over projection matrices. In MBEG, this weight matrix also induces a distribution over which attributes to observe.

2. The Passive Setting

We begin by investigating the passive setting. We start by describing an algorithm for this case. We then analyze its sample complexity, and discuss its implementation.

2.1 Partially Observed Principal Component Analysis (POPCA)

In this section we describe the POPCA algorithm. We start by reviewing the deﬁnition of the loss function and characterizing the minimizer of the loss.

1. Under some additional assumptions on the data such as the incoherence assumption made in Cand es and Recht (2009)

SUBSPACE LEARNING WITH PARTIAL INFORMATION

Denote the set of projection matrices from Rd onto Rk by Pd k. Since Π2 = Π for any Π Pd k, the loss of a projection matrix Π Pd k can be expressed as

L(Π) = E x Πx 2 2 = E h x 2 2 2x Πx + x Π Πx i = E h x 2 2 x Πx i , (1)

Deﬁne the inner product of matrices A, B by A, B = tr(A B). We can further rewrite the loss as

L(Π) = E h x 2 2 Π, xx i = E x 2 2 E h Π, xx i = E x 2 2 Π, E[xx ] .

Denote the covariance matrix E[xx ] by C. Since x 2 2 does not depend on Π, the goal of subspace learning is equivalent to ﬁnding a projection matrix Π such that

Π, C min Π Pd k Π , C + ϵ . (2)

It is well-known that the rank-k matrix which minimizes the expression Π , C is the matrix Pk i=1 viv i , where v1, . . . , vk are the leading eigenvectors of C. The PCA approach for subspace learning in the full information setting replaces C with C(S) = 1 m Pm i=1 xix i , where S = (x1, . . . , xm) is an i.i.d. training sequence drawn according to the distribution P. Clearly, E[C(S)] = C. In our case, we will construct an unbiased estimate of C based on partially observed examples, as detailed below. Consider an instance vector x P and let ˆx be the observed vector. According to our assumptions, for each i [d], the i-th coordinate of ˆx satisﬁes

( xi w.p. r/d 0 w.p. 1 r/d . (3)

Denote p = r/d. Similarly to Mitliagkas et al. (2014), we form the estimate

p2 ˆxˆx + 1

diag(ˆxˆx ) . (4)

Indeed, it is easy to verify that ˆCˆx forms an unbiased estimate of C:

( 1 p E[x2 i ] i = j

E[xixj] i = j ,

diag(ˆxˆx )i,j

( E[x2 i ] 1

p E[x2 i ] i = j

Summing the corresponding entries, we see that E[( ˆCˆx)i,j] = Ex P [xixj] = Ci,j. Therefore, E[ ˆCˆx] = C. The POPCA algorithm (see Algorithm 1) assumes access to m i.i.d. vectors sampled according to P. For each instance vector, it forms an estimate according to 4. By averaging these m estimates we obtain ˆC. The returned projection corresponds to the k leading eigenvectors of ˆC.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

Algorithm 1 POPCA

Input: r, k d

ˆC = 0 Rd d

for i = 1 to m do

Let xi P and let ˆxi be the observed vector

r2 ˆxiˆx i + d r d2

r2 diag(ˆxiˆx i ) ˆC = ˆC + 1

m ˆCˆxi end for Compute the eigendecomposition ˆC = Pd j=1 λjvjv j Assuming λ1 . . . λd, return ˆΠ = Pk j=1 vjv j

2.2 Analysis of POPCA

The following lemma relates the success of Algorithm 1 to the quality of the estimation ˆC of the covariance matrix C.

Lemma 1 Suppose that the ﬁnal estimate ˆC of POPCA satisﬁes

E[ C ˆC F ] ϵ/

Then, the resulting projection matrix ˆΠ satisﬁes the desired bound

E[ ˆΠ, C ] min Π Pd k Π , C + ϵ .

Proof Using the Cauchy-Schwarz inequality, we get

E[ sup Π Pd k Π, C Π, ˆC ] E[ sup Π Pd k Π F ˆC C F ]

sup Π Pd k Π F E[ ˆC C F ]

Consequently, since ˆΠ = argminΠ Pd k Π, ˆC , we have

E[ ˆΠ, C Π , C ] = E[ ˆΠ, C Π , ˆC ]

E[ ˆΠ, ˆC Π , ˆC ] + ϵ

which completes the proof.

In view of Lemma 1, obtaining upper bound on the sample complexity of POPCA reduces to analyzing the quality of the covariance estimation. A fairly vast body of literature exists on the latter task in the full-information scenario, and several results are known in the case of missing entries. For example, Lounici (2014) considered a setting similar to our passive setting. Corollary 1 in Lounici

SUBSPACE LEARNING WITH PARTIAL INFORMATION

(2014) gives a bound on the Frobenius error that depends on the spectral decay of C. We will derive a slightly different (worst-case) bound under the assumption that the instances are bounded in the Euclidean unit ball. A comparison between the two bounds is given Appendix E.

Lemma 2 Let ϵ (0, 1). If m (d/r)2 k

E[ C ˆC F ] ϵ/

Proof Using Jensen s inequality, we have

E[ ˆC C F ] = E[( ˆC C 2 F )1/2] ( E[ ˆC C 2 F ])1/2 .

Since the observations are i.i.d.,

E[ ˆC C 2 F ] = X

i,j Var[ ˆCi,j] = X

q=1 ˆCˆxq,i,j

i,j Var[ ˆCˆx1,i,j] 1

i,j E[ ˆC2 ˆx1,i,j] ,

where ˆCˆxq,i,j is the [i, j]-th entry of ˆCq. Denote ˆx = ˆx1 and x = x1 (subscript indices will now correspond to the entries of these vectors). According to 4,

E[ ˆC2 ˆx,i,j|x] =

( p 4 E[ˆx2 i ˆx2 j|x] = p 2x2 i x2 j i = j p 2 E[ˆx4 i |x] = p 1x4 i p 2x4 i i = j .

Therefore, since the ℓ2 norm of the instances is at most 1, we have

i,j E[ ˆC2 ˆx1,i,j|x] p 2 X

i,j x2 i x2 j = p 2 x 4 p 2 = d2

and P i,j E[ ˆC2 ˆx1,i,j] d2

r2 . Combining the above bounds, we obtain

E[ ˆC C F ] 1 m d

For m l d2 r2 k

ϵ2 m , we arrive at the claimed bound.

Let mp(d, k, r, ϵ) be the sample complexity of subspace learning in the passive partial information model, namely, how many examples are needed (for the optimal learner) to guarantee that (2) holds. Based on Lemma 2 and Lemma 1, we now conclude the following bound on the sample complexity.

Corollary 3 Using POPCA (Algorithm 1), we have the following bound on the sample complexity for any integer r 2:

mp(d, k, r, ϵ) (d/r)2 k

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

2.3 Optimality of POPCA

In this section we prove the following lower bound on the sample complexity of subspace learning with partial information in the passive model.

Theorem 4 Assume that k d/2 and ϵ (0, 1/128). The sample complexity in the passive model is at least Ω (d/r)2 k

ϵ2 . Therefore, we have

mp(d, k, r, ϵ) = Θ (d/r)2 k

Note that up to a constant factor, our lower bound coincides with the upper bound obtained by POPCA (Corollary 3). Therefore, Theorem 4 establishes the optimality of POPCA in the passive model. The proof of Theorem 4 is divided into two parts. First, in Theorem 5 we prove that the sample complexity in the full-information setting is at least Ω(k/ϵ2). Then, we complete the proof of Theorem 4 by showing that the multiplicative price of partial information is at least Ω((d/r)2).

Theorem 5 Assume that k d/2 and let ϵ (0, 1/128). The sample complexity of subspace learning with full information is bounded below by

m(d, k, r = d, ϵ) = Ω(k/ϵ2) .

We now sketch the proof of Theorem 5. A detailed proof is provided in Appendix A. Proof (sketch) The idea is to reduce the problem of coin identiﬁcation (see Section 5.2 in Anthony and Bartlett (2009)) to that of subspace learning. Assume that k d/2 and ϵ (0, 1/128). Let U = {uj}2k j=1 Rd be a set of 2k orthonormal vectors. A distribution over U is deﬁned as follows. First, we draw a sequence b = (b1, . . . , bk) { 1, 1}k uniformly at random. We associate the pair {ui, ui+k} with a Bernoulli random variable Bi with parameter pi = 1+biα

2 , where α = 16ϵ. To deﬁne a distribution P := Pb, we now describe the process of drawing an instance x P:

1. An integer i [k] is chosen uniformly at random.

2. The i-th coin is ﬂipped (according to pi). Denote the corresponding random variable by Z.

3. If Z = 1, then x is chosen uniformly at random from the set {ui, ui}. Otherwise (Z = 0), x is chosen uniformly at random from the set {ui+k, ui+k}.

In Lemma 10 we show that a successful subspace learner must identify the bias of most of the coins. Thus, we can reduce k independent tasks of coin identiﬁcation to the task of subspace learning. A well-known result in statistics (Anthony and Bartlett, 2009)[Lemma 5.1] tells us that Ω(1/α2) samples are needed to identify a coin with bias α. Hence, each of the pairs must be observed Ω(1/ϵ2) times.

Proof (of Theorem 4) Consider the construction presented in the proof (sketch) of Theorem 5. We next specify the set U and prove that the price of partial information in the passive model is Ω(d2/r2). Consequently, this will conclude the proof of Theorem 4.

SUBSPACE LEARNING WITH PARTIAL INFORMATION

For every i [k], let ui =

2 2 (ei + ei+k), ui+k =

2 2 ( ei + ei+k). To specify a distribution P, ﬁx a vector b = (b1, . . . , bk) { 1, 1}k. Consider now a single interaction between the learner and the environment. A vector x is drawn according to P as described in the proof (sketch) of Theorem 5. Let i [k] denote the index of the coin which is associated with x. As we observed in Section 1, each of the coordinates i, i + k is distributed uniformly over {

2 2 }. Hence, to obtain any information, the learner must observe both of the coordinates i and i + k. The probability that both coordinates are observed is at most O(r2/d2). Hence, in expectation, (only) a single meaningful observation is obtained every Ω(d2/r2) iterations. Therefore, the price of partial information is Ω(d2/r2).

2.4 Implementation of POPCA

As we discussed above, when the data is fully visible, the sample complexity of the PCA algorithm (which computes the k leading eigenvectors of the empirical covariance matrix, m 1 Pm i=1 xix i , and returns the corresponding projection matrix) is mf := O(k/ϵ2). When the number of samples, mf, is larger than the dimension, a standard implementation of this algorithm costs O(mfd2). We next show that POPCA has a similar runtime. We established above that POPCA requires a training set of size O((d/r)2mf). Consider a single iteration of POPCA. Note that the construction of ˆCˆxi (see 4) costs O(r2). Therefore, it costs O(mfd2) to obtain the average of the estimates, ˆC. The computation of the SVD of ˆC costs O(d3). Therefore, when mf d, the overall runtime of POPCA is indeed O(mfd2).

3. The Active Setting

We now consider the active model of subspace learning with partial information. In particular, we will present and analyze the Matrix Bandit Exponentiated Gradient (MBEG) algorithm. Before describing MBEG, it should be noticed that POPCA (along with its analysis) can be modiﬁed to ﬁt the active model; An active version of POPCA simply selects the r observed attributes uniformly at random (with replacement) and then proceeds similarly to POPCA2. It is not hard to verify that the sample complexity of this algorithm is asymptotically equivalent to the sample complexity of POPCA. In particular, the price of partial information is quadratic in d/r. The main differences between MBEG and POPCA are as follows:

1. MBEG is designed for the active model. It employs a non-uniform sampling method which gives higher priority to stronger directions.

2. MBEG is an iterative algorithm which maintains a weight matrix that belongs to the convex hull of the set Pd k of projection matrices. It can be thought of as a Bandit version of the extension of the Exponentiated Gradient (EG) algorithm to matrices. The EG algorithm, and its extension to matrices are due to Kivinen and Warmuth (1997), and Tsuda et al. (2005), respectively.

2. Note that here the attributes are no longer independent and therefore we should replace the weights in the deﬁnition of ˆCˆx.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

3.1 Matrix Bandit Exponentiated Gradient (MBEG)

3.1.1 CONVEXIFICATION

In order to be able to apply the Matrix EG algorithm, we ﬁrst need to formulate our task as a convex optimization problem. Recall that our problem is equivalent to approximately minimizing the objective argminΠ Pd k Π, C , where C = E[xx ]. The objective is linear and thus convex.

We will replace the non-convex set Pd k with its convex hull, Cd k := conv(Pd k). Note that for every W in Cd k, the gradient is given by C. Therefore, the EG procedure would start with some W1 Cd k, and, at iteration i, would update according to

1. Ui+1 = exp(log(Wi) + ηC)

2. Wi+1 = argmin W Cd k DR(W, Ui+1) , (6)

where D(R, U) = tr(W log W W log U W + U) is the Bregman divergence induced by the quantum entropy regularizer, R(W) = tr(W log W W). Additional details regarding this regualarizer can be found in Tsuda et al. (2005) and Warmuth and Kuzmin (2008). As in POPCA, the gradient C is unknown but can be estimated. We would like to exploit the active setting, and therefore, we will employ a non-uniform sampling method which relies on the current weight matrix Wt (see Section 3.1.2). Next, we recall that the output of the algorithm should be a projection matrix, while our algorithm maintains weight matrices, which may not belong to the set Pd k. Therefore, the ﬁnal step of the algorithm is to construct an element from Pd k that performs similarly to the average of the weight matrices maintained during the run of the algorithm. For this purpose, we rely on the following lemma, due to Warmuth and Kuzmin (2008):

Lemma 6 Every matrix W Cd k can be decomposed in time O(d3) into a convex combination of at most d elements from Pd k.

For completeness, we recall the decomposition procedure of Warmuth and Kuzmin (2008) in Appendix D. Getting back to our algorithm, let ˆW = 1 m Pm i=1 Wi be the average weight matrix, and denote by ˆW = Pd j=1 βjΠj a decomposition of ˆW into a convex combination of elements from Pd k. The ﬁnal step of MBEG sets the output matrix ˆΠ to be Πj with probability βj. This guarantees that the expected performance of ˆΠ is the same as the performance of ˆW.

3.1.2 PRIORITIZING STRONGER DIRECTIONS

In this part we present the non-uniform sampling mechanism which is employed by MBEG for attribute sampling. We ﬁrst consider the case r = 2. Let W := Wi be a weight matrix obtained during the run of MBEG. Recall3 that P i Wi,i = k, and for every i [d], Wi,i [0, 1]. Therefore, a natural distribution for attribute sampling is to choose each pair (s, q) [d]2 with probability ps,q = Ws,s

k . Unfortunately, we were not able to obtain a good bound using this sampling technique. Instead, we pick attributes according to

ps,q = (1 α)Ws,s + Wq,q

3. These properties clearly hold for projection matrices, and thus hold for any convex combination of projection matrices.

SUBSPACE LEARNING WITH PARTIAL INFORMATION

for some parameter α (0, 1/2), which is tuned below. That is, we mix a uniform distribution over [d]2, with a distribution which gives higher sampling probability to pairs for which Ws,s, Wq,q are high, reﬂecting a bias toward sampling from stronger directions. Mixing with the uniform distribution guarantees that every pair has large enough probability to be sampled, which will later help us ensure that we perform enough exploration . Based on this probability distribution over pairs, we deﬁne an unbiased estimate of the matrix C by ˆC = 1 2ps,q xsxq(Es,q + Eq,s) , (8)

where Es,q is the all zeros matrix except 1 in the i, j coordinate. The extension of MBEG to any (even) r > 2 is straightforward. We simply pick r/2 independent estimates ˆC1, . . . , ˆCr/2, each of which is constructed as in the case r = 2, and set ˆC = 2

r Pr/2 j=1 ˆCj. The algorithm is summarized in Algorithm 2. As explained in Tsuda et al. (2005), the projection step w.r.t. the Bregman divergence, Wi+1 = argmin W Cd k DR(W, Ui+1) can be performed in time O(d3). Overall, the running time per iteration is O(d3).

Algorithm 2 Matrix Bandit Exponentiated Gradient

Input: r, k < d (r mod 2 = 0)

r log(d) 2m(d+r) α = ηd2 (we assume that m is large enough so that α 1/2) W1 = k

d I for i = 1 to m do

denote W = Wi for j = 1 to r/2 do

pick (s, q) [d]2 with probability ps,q = (1 α)Ws,s+Wq,q

d2 for x P, let (xs, xq) be the corresponding attributes

ˆCi,j = 1 2ps,q xsxq(Es,q + Eq,s) end for

r Pr/2 j=1 ˆCi,j Ui+1 = exp(log(W) + η ˆCi) Wi+1 = argmin W Cd k DR(W, Ui+1) end for

m Pm i=1 Wi decompose ˆW into ˆW = Pd j=1 βjΠj using Algorithm 3 return ˆΠ = Πj with probability βj

3.1.3 ANALYSIS OF MBEG

Let ma(d, k, r, ϵ) be the sample complexity of subspace learning in the active partial information model. We denote the Frobenius norm, the spectral norm, and the trace norm by F , sp, and tr, respectively. Throughout this section we prove the following result.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

Theorem 7 Using MBEG (Algorithm 2), we have the following bound on the sample complexity for any (even) integer r 2:

ma(d, k, r, ϵ) max 8k d + r

In order to prove Theorem 7, we next apply the general analysis of Matrix EG (Hazan et al., 2012)[Theorem 13] to our case.

Theorem 8 Assume that the sequence ˆC1, . . . , ˆCm obtained during the run of MBEG satisﬁes ˆCi sp 1/η for every i [m]. Then, for every Π Pd k,

i=1 Wi Π , ˆCi k log(d)

i=1 Wi, ˆC2 i . (9)

The right-most term in 9 can be thought as the variance which is associated with the estimation process of MBEG. We now show that in contrast to POPCA, the variance scales only linearly with d/r.

Lemma 9 For any matrix Wi Cd k maintained by MBEG at time i, denote the conditional expectation given Wi by Ei. Then,

Ei Wi, ˆC2 i 2k(d + r)

We now prove the lemma while assuming that r = 2. The extension to any r > 2 is detailed in Appendix B. Proof Let W = Wi, ˆC = ˆCi. Then,

Ei W, ˆC2 = X

(s,q) [d]2 ps,q(Ws,s + Wq,q) 1 4p2s,q x2 sx2 q

(Ws,s + Wq,q)

4ps,q x2 sx2 q .

Since α (0, 1/2), we have

Ws,s + Wq,q

ps,q = Ws,s + Wq,q (1 α)Ws,s+Wq,q

2dk + α/d2 Ws,s + Wq,q (1 α)Ws,s+Wq,q

Therefore, since r = 2, we obtain

Ei W, ˆC2 dk X

(s,q) [d]2 x2 sx2 q dk < 2k(d + r)

completing the proof of the lemma.

SUBSPACE LEARNING WITH PARTIAL INFORMATION

Proof (of Theorem 7) By deﬁnition of ˆCi,j we have that ˆC2 i,j = ˆCi,j ˆC i,j is a diagonal matrix with all elements on the diagonal equal to zero except the (s, s) and (q, q) elements, which are both

bounded above by x2 sx2 q p2s,q . It follows that

ˆCi,j sp |xsxq|

ps,q 1 ps,q d2

j=1 ˆCi,j sp 2

r r 2 1 η = 1

Therefore, the conditions of Theorem 8 hold. Taking expectation over 9, we obtain

i=1 E Wi Π , ˆCi k log(d)

i=1 E Wi, ˆC2 i . (10)

Let Ei denote the conditional expectation given Wi. Then, Ei[ ˆCi] = C and therefore, by the law of total expectation,

E Wi Π , ˆCi = E Wi Π , C . (11)

Combining 11 with Lemma 9, and plugging into 10, we obtain that

i=1 Wi Π , C k log(d)

η + ηm2k(d + r)

Dividing by m, denoting ˆW = 1 m Pm i=1 Wi, substituting η = q

r log(d) 2(d+r)m, rearranging terms, and

observing that E[ˆΠ| ˆW] = ˆW, we have that

E ˆΠ Π , C = E [ ˆW] Π , C

8k2(d + r) log(d)

The right-hand side of the above is smaller than ϵ if m 8k log(d) d+r

ϵ2 . Note, however, that we also require that m is large enough so that α 1/2. Since α = ηd2, it follows that m should also satisfy m 2d4r log(d)

3.2 Comparison between the bounds in the passive and the active models

The left term in the bound of MBEG is smaller than the bound of POPCA if d2

r2 > 8k log(d) d+r

r . The right term in the bound of MBEG is smaller than the bound of POPCA if

(d + r)k 2d2r3 log(d) .

Hence, if both of the conditions hold, then the bound of MBEG is better than the bound of POPCA. Also, if ϵ < 2(d+r)k

d2r , then the dependence of the sample complexity of MBEG on d/r is linear.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

A reasonable regime in which MBEG enjoys a linear price is when r and k are constants, and ϵ is proportional to 1/d. Note that in this regime, the bound of POPCA scales with d4, while the bound of MBEG scales only with d3 (in the full-information setting, the sample complexity for this case scales with d2). To summarize, ignoring the dependence on k (and logarithmic factors), MBEG attains the desired linear price when ϵ is small .

4. Discussion

We introduced the problem of subspace learning with partial information, and considered both a passive and active model. Our ﬁrst observation was that looking at a single coordinate does not give any information. Therefore, our algorithms look at the products of attribute pairs. Using the POPCA algorithm for the passive model, we showed that the sample complexity is tightly characterized by Θ((d/r)2k/ϵ2). Hence, the price of partial information in this case is quadratic. For the active model we introduced the MBEG algorithm which exploits the gathered information in order to make a better choice over which attributes to observe. We showed that if the desired accuracy ϵ is small, then MBEG achieves a linear price. Since the expected number of observed attributes from each vector is r, we can not hope for a better dependence on d/r. At this point, a natural question arises: can we attain a linear price of partial information in the active model, independently of the required accuracy? We conclude with an observation regarding MBEG that provides a partial answer to this question. Assume that ϵ, r and k are constants. Examining the implementation of MBEG, we note that as long as the products, xsxq, between two consecutive observed attributes is zero, MBEG does not change the sampling distribution (which is initially uniform) nor its current estimation of the covariance matrix. Fix some attribute j and consider the distribution Pj which is concentrated on ej. The probability that the product between two consecutive observed attributes is not zero is (1/d)2. Since r is constant, only zero products are observed for at least Ω(d2) iterations, implying the same bound on the sample complexity. The implication of this result is that in order to achieve a linear price for any value of ϵ, it is necessary to extract more information from the partial observations (e.g., take into account both the products of attribute pairs and the single attributes). Developing such algorithms or alternatively, tightening the lower bounds, is left for future research.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. We also thank Amit Daniely for helpful discussions. This research has been supported by ISF no 1673/14.

SUBSPACE LEARNING WITH PARTIAL INFORMATION

Appendix A. Proof of Theorem 5

In this section we prove Theorem 5.

A.1 The adversarial distribution

Let U = {uj}2k j=1 Rd be a set of 2k orthonormal vectors. In the proof sketch of Theorem 5 we described a family F of distributions over the set {u, u : u U}. Our lower bounds will be proved to hold in expectation when choosing P F at random. According to Yao s minimax principle, such a result implies that the lower bound holds for some distribution in F.

A.2 A Successful Subspace Learner Is Also a Successful Coin identiﬁer

In this part we formalize the reduction from coin identiﬁcation to subspace learning. As we sketched before, a key ingredient of our analysis of the lower bounds is the relation between subspace learning and coin identiﬁcation . This relation is formalized next. We ﬁrst need to introduce some additional notation. To specify the distribution P F, ﬁx a vector (b1, . . . , bk) { 1, 1}k. Let ˆΠ = Pk i=1 ˆuiˆu i Pd k (where ˆu1, . . . , ˆuk are orthonormal). For every (i, j) [k] [2k], deﬁne θi,j = | ˆui, uj | to be the covariance between ˆui and uj. Next, for each j [2k], deﬁne θ2 j = Pk i=1 θ2 i,j. Also, deﬁne the set

J = {j [k] : bj = 1 θ2 j > θ2 j+k} {j [k] : bj = 1 θ2 j < θ2 j+k} . (12)

For reasons that will become apparent shortly, we name J the set of identiﬁed coins. The following lemma asserts that a successful subspace learner must identify most of the coins.

Lemma 10 Let ϵ (0, 1). Assume that L(ˆΠ) minΠ Pd k L(Π) ϵ. Let J be deﬁned as in 12.

Then, |J|/k > 1 2ϵ

Proof Assume w.l.o.g. that b1 = . . . = bk = 1. Note that

2 uju j + 1 α

2 uj+ku j+k

The optimal projection is obtained by picking the largest eigenvectors of E[xx ]. Precisely, the optimal projection matrix is Π = Pk i=1 uiu i . We assume the loss function as formulated in 1.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

The loss of ˆΠ is calculated as follows:

E[ x 2 2] L(ˆΠ) =

i=1 ˆuiˆu i , E[xx ]

2 ˆuiˆu i , uju j + 1 α

2 ˆuiˆu i , uj+ku j+k

(1 + α)θ2 i,j 2 + (1 α)θ2 i,j+k 2

(1 + α)θ2 j 2 + (1 α)θ2 j+k 2

j=1 θ2 j + α

j=1 (θ2 j θ2 j+k) . (13)

Since {uj}2k j=1 is orthonormal, for each i [k] we have P2k j=1 θ2 i,j 1. Hence, P2k j=1 θ2 j k, so that the second-to-last term of 13 is at most 1

2. This value is attained by Π , and for simplicity, we will assume that is attained by ˆΠ as well. For the same reasons, for each i [k], we have Pk j=1(θ2 i,j θ2 i,j+k) 1. Hence, Pk j=1(θ2 j θ2 j+k) k, so that the last term of 13 is at most α/2. Once again, this value is attained by Π . Our last observation is that if the j-th coin is not identiﬁed, i.e., j / J, then θ2 j θ2 j+k 0. Combining these observations, we obtain that

L(ˆΠ) L(Π ) α(1 |J|/k)/2 .

The proof is completed by combining the assumption that L(ˆΠ) L(Π ) ϵ.

A.2.1 LOWER BOUND ON COIN IDENTIFICATION

We previously informally argued that Ω(1/α2) samples are needed to identify a coin with bias 1 α

2 . If we have k such independent coins, then Ω(k/α2) are needed. Let us formalize this result. A coin identiﬁcation problem with parameter α is deﬁned as follows. Consider a binary classiﬁcation problem with a domain [k] and label set {0, 1}. The hypothesis class is the set H = {0, 1}[k]. The underlying distribution over [k] {0, 1} is chosen at random using the following mechanism. The marginal distribution over [k] is uniform, and the conditional probability over the label (coin) is determined by P(y = 1|x = j) = 1+bjα

2 , where each bj is an independent Rademacher random variable (drawn in advance). We observe that this distribution is identical to the distribution deﬁned (over a shattered set of size k) in (Anthony and Bartlett, 2009, Theorem 5.2). As usual, an algorithm for coin identiﬁcation obtains an i.i.d. training sequence S according to P, and has to return a hypothesis h H. The generalization error of h H, denoted err(h), is deﬁned to be the probability that it misclassiﬁes a new generated point (x, y), i.e., err(h) = P(h(x) = y). The next theorem follows from (Anthony and Bartlett, 2009, Theorem 5.2).

SUBSPACE LEARNING WITH PARTIAL INFORMATION

Theorem 11 Let B be an algorithm for coin identiﬁcation, and let ϵ (0, 1/64). Consider a coin identiﬁcation problem with α = 8 ϵ. If m < k 8 ϵ2 , then there exists a distribution P for which

ES P m err(B(S)) min h H err(h) > ϵ .

A.3 Concluding the Theorem

We are now in position to complete the reduction from the task of coin identiﬁcation to the task of subspace learning, and consequently conclude Theorem 5. Let A be a subspace learner whose sample complexity is m(d, k, r = d, ϵ). We describe an algorithm B for coin identiﬁcation (with parameter α) which uses A as a subroutine. To this end, we shall provide a (randomized) map between the input of B to the input of A. These inputs have the form of training sequences, which will be denoted by Ssubspace and Scoins, respectively. For each j [k], the pair (j, 1) is associated with uj or uj with equal probability. The pair (j, 0) is associated with uj+k of uj+k with equal probability. Clearly, the sequence provided to A is generated according to the distribution discussed in the proof sketch of Theorem 5. Given an accuracy parameter ϵ (0, 1/64) for B, we will require accuracy ϵ = ϵ/2 from A. To complete the reduction, we will specify how the output of B is determined using the output of A. For each j [k], the output hypothesis returned by B is deﬁned by

( 1 θj > θj+k 0 otherwise .

Denote the set of identiﬁed coins by J (as in Lemma 10). Observe that each coin that is not identiﬁed, adds α/k to the relative error of B. It follows from Lemma 10 that

err(B(Scoins)) min h H err(h) = α(1 |J|/k)

L(A(Ssubspace)) min Π Pd k L(Π)

If m m(d, k, r = d, ϵ), then the right-hand side of 14 is at most 2ϵ = ϵ. The proof is completed by applying Theorem 11 (with α = 8 ϵ).

Appendix B. Extending Lemma 9 to r > 2

We will now extend the proof of Lemma 9 to any (even) r > 2. Fix an iteration i, and denote ˆC = ˆCi, W = Wi. We have

Ei W, ˆC2 = 4

j=1 Ei W, ˆC2 j + X

j =t Ei W, ˆCj ˆCt

From the case r = 2, we already know that the term Ei W, ˆC2 j is at most dk. Fix some pair j = t. Note that ˆCj ˆCt = 0 if and only if (at least) one of the indices sj, qj is equal to one of the indices

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

st, qt. Hence,

Ei W, ˆCj ˆCt 2 X

(s,q) [d]2 p2 s,q(Ws,s + Wq,q)x2 sx2 q 4p2s,q

(s,q,q ) [d]3: q =q

ps,qps,q Wq,q x2 s|xq||x q| 4ps,qps,q .

The ﬁrst term can be bounded as follows:

(s,q) [d]2 p2 s,q(Ws,s + Wq,q)x2 sx2 q 4p2s,q = 2

(s,q) [d]2 (Ws,s + Wq,q)x2 sx2 q

4 (Ws,s + Wq,q)(s,q) [d]2, (x2 sx2 q)(s,q) [d]2

4 (Ws,s + Wq,q)(s,q) [d]2 (x2 sx2 q)(s,q) [d]2 1

The second term can be bounded as:

(s,q,q ) [d]3: q =q

ps,qps,q Wq,q x2 s|xq||xq | 4ps,qps,q 4 1

(q,q ) [d]2 |Wq,q | |xqxq |

W tr xx sp k .

Combining the above, we obtain

Ei W, ˆC2 = 4

2 1 (k + 1) i

= 2k(d + r)

Appendix C. The Price of Partial Information is at Least Linear

Theorem 12 Let k = 1. Then, the sample complexity of subspace learning is bounded below by:

m(d, k = 1, r, ϵ) Ω d

Proof Let ϵ (0, 1/4), and let A be a bandit subspace learner. For each s [d], we deﬁne a distribution Ps as follows. The zero vector is drawn with probability 1 cϵ (where c > 2), and the

SUBSPACE LEARNING WITH PARTIAL INFORMATION

vector es is drawn with probability cϵ. Note that E[xx ] = cϵ ese s , and thus the optimal projection is given by Π := ese s .

We next show that a successful subspace learner must identify the distribution. Let Ps be a concrete distribution. Denote by ˆΠ = ˆuˆu the output of the learner. Deﬁne θs = u2 s. It can be easily seen that if L(ˆΠ) L(Π ) < ϵ < cϵ/2, then θs > θq for any q = s. That is, a successful subspace learner must identify the index s.

Next we observe that if the size of the sample is at most o d

rϵ , then with a non-negligible probability, all the observations made by the learner are equal to zero. It follows that there exists a distribution Pj for which, E[L(ˆΠ) L(Π )] > cϵ/2 > ϵ.

Appendix D. Decomposing Elements in Cd k into a Convex Combination of Elements From Pd k

In this part we detail the decomposition procedure mentioned in Lemma 6. For a symmetric d d matrix A, we denote its eigenvalues by λ(A)1 . . . λ(A)d. For convenience, we deﬁne two subroutines. The procedure sort(x, s) returns a set of s indices, corresponding to the s largest values of x. Given a vector x Rd, the function diag : Rd Rd d returns a diagonal matrix X with Xj,j = xj.

Algorithm 3 Decomposition Procedure

Input: W Cd k = conv(Pd k) perform eigendecomposition W = Udiag(λ(A))U

λ := k 1(λ(A)1, . . . , λ(A)d) i := 1 repeat

J := sort(λ, k) ci := k 1 P j J ej s := minj J λj l := maxj [d]\J λj βi := min n sk, Pd j=1 λj lk o

λ := λ βici Πi := Udiag(kci)U

i := i + 1 until λ = (0, . . . , 0) For each j [i 1] choose A = Πj with probability βj

It has been proved by Warmuth and Kuzmin (2008) that the loop inside Algorithm 3 repeats at most d times, and decomposes W into a convex combination P βiΠi of elements from Pd k.

GONEN, ROSENBAUM, ELDAR AND SHALEV-SHWARTZ

Appendix E. Covariance Estimation with Missing Entries

Corollary 1 in Lounici (2014) provides bounds under the assumption that the instances are drawn from a sub-gaussian distribution (see assumption 1 in Lounici (2014)). This assumption is weaker than our boundedness assumption. Translating the results to our notation yields:

Lemma 13 Given m partial observations, let ˆC be the unbiased estimation constructed by POPCA. Let λ > 0 be a parameter.

C := C(λ) = argmin S Sd + { S ˆC 2 F + λ S 1} ,

where Sd + is the set of symmetric positive semi-deﬁnite d d matrices. Then, for a suitable choice of λ, with probability at least 1 1/(2d), we have

inf S Sd + { S C 2 F + c1λ2 1(C) tr(C)

r2m rank(S) log(2d)} ,

where c1 is a constant and λ1(C) is the leading eigenvalue of C.

Substituting m = (d/r)2 k

ϵ2 from Lemma 2 into the above bound, we obtain that the RHS is at most s

inf S Sd + { S C 2 F + c1λ2 1(C) tr(C)

k rank(S) log(2d))} .

This bound is always worse than our bound (i.e., larger than ϵ/

Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009.

Raman Arora, Andrew Cotter, and Nathan Srebro. Stochastic optimization of pca with capped msg. ar Xiv preprint ar Xiv:1307.1674, 2013.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002.

Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3):259 294, 2007.

Emmanuel J Cand es and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717 772, 2009.

Nicolo Cesa-Bianchi, Shai Shalev-Shwartz, and Ohad Shamir. Efﬁcient learning with partially observed attributes. The Journal of Machine Learning Research, 12:2857 2878, 2011.

Yudong Chen, Ali Jalali, Sujay Sanghavi, and Constantine Caramanis. Low-rank matrix recovery from errors and erasures. Information Theory, IEEE Transactions on, 59(7):4324 4337, 2013.

SUBSPACE LEARNING WITH PARTIAL INFORMATION

Yuejie Chi, Y.C. Eldar, and Robert Calderbank. Petrels: Parallel subspace estimation and tracking by recursive least squares from partial observations. ar Xiv preprint ar Xiv:1207.6353, 2013.

Qian Du and James E Fowler. Hyperspectral image compression using jpeg2000 and principal component analysis. Geoscience and Remote Sensing Letters, IEEE, 4(2):201 205, 2007.

Elad Hazan, Satyen Kale, and Shai Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. ar Xiv preprint ar Xiv:1204.0136, 2012.

Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1 63, 1997.

Karim Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029 1058, 2014.

Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Streaming pca with many missing entries. Preprint, 2014.

Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online pca with optimal regrets. In Algorithmic Learning Theory, pages 98 112. Springer, 2013.

Christos H Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the seventeenth ACM SIGACTSIGMOD-SIGART symposium on Principles of database systems, pages 159 168. ACM, 1998.

John Shawe-Taylor, Christopher KI Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the gram matrix and the generalization error of kernel-pca. Information Theory, IEEE Transactions on, 51(7):2510 2522, 2005.

Koji Tsuda, Gunnar R atsch, and Manfred K Warmuth. Matrix exponentiated gradient updates for on-line learning and bregman projection. In Journal of Machine Learning Research, pages 995 1018, 2005.

Xiaodong Wang and H Vincent Poor. Blind multiuser detection: A subspace approach. Information Theory, IEEE Transactions on, 44(2):677 690, 1998.

M.K. Warmuth and D. Kuzmin. Randomized online pca algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287 2320, 2008.

Jian Yang, David Zhang, Alejandro F Frangi, and Jing-yu Yang. Two-dimensional pca: a new approach to appearance-based face representation and recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(1):131 137, 2004.