# Classification with Low Rank and Missing Data

Elad Hazan (EHAZAN@CS.PRINCETON.EDU), Princeton University and Microsoft Research, Herzliya
Roi Livni (ROI.LIVNI@MAIL.HUJI.AC.IL), The Hebrew University of Jerusalem and Microsoft Research, Herzliya
Yishay Mansour (MANSOUR.YISHAY@GMAIL.COM), Microsoft Research, Herzliya, and Tel Aviv University

*Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).*

**Abstract.** We consider classification and regression tasks where we have missing data and assume that the (clean) data resides in a low-rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as well as the best linear classifier coupled with the best low-dimensional subspace in which the data resides. A direct implication is that our algorithm can linearly (and non-linearly through kernels) classify provably as well as the best classifier that has access to the full data.

## 1. Introduction

Correctly handling missing data is a fundamental and classical challenge in machine learning. There are many reasons why data might be missing. Consider, for example, the medical domain: some data might be missing because certain procedures were not performed on a given patient, other data might be missing because the patient chose not to disclose it, and some data might be missing due to malfunctioning equipment. While it would certainly be better to always have complete and accurate data, this utopian desire is in many cases unfulfilled. For this reason we need to utilize the available data even if some of it is missing.

Another, very different motivation for missing data is recommendation. For example, a movie recommendation dataset might contain users' opinions on certain movies, as is the case in the Netflix motion picture dataset. Clearly, no user has seen or reviewed all movies, or even close to it. In this respect recommendation data is an extreme case: the vast majority of it is usually missing (i.e., it is sparse in the extreme).

Many times we can overcome the missing data problem because the data resides on a lower-dimensional manifold. In the above examples, if there are prototypical users (or patients) and every user is a mixture of the prototypical users, then this implicitly suggests that the data is low rank. Another way to formalize this assumption is to consider the data in matrix form, say with users as rows and movies as columns; our assumption is then that the true complete matrix has low rank.

Our starting point is to adopt the low-rank assumption, but to avoid any explicit matrix completion and instead dive directly into the classification problem. At the end of the introduction we show that matrix completion is neither sufficient nor necessary. We consider perhaps the most fundamental data-analysis technique in the machine learning toolkit: linear (and kernel) classification, applied to data where some (or even most) of the attributes in an example may be missing. Our main result is an efficient algorithm for linear and kernel classification that performs as well as the best classifier that has access to all the data, under a low-rank assumption with natural non-degeneracy conditions.
We stress that our result is worst case: we do not assume that the missing data follows any probabilistic rule other than the underlying matrix having low rank. This is in clear contrast to most existing matrix completion algorithms. We also cast our results in a distributional setting, showing that the classification error we achieve is close to the best classification using the subspace of the examples (and with no missing data). Notably, many variants of the problem of finding a hidden subspace are computationally hard (see, e.g., Berthet & Rigollet, 2013); yet, as we show, learning a linear classifier on a hidden subspace is non-properly learnable.

At a high level, we assume that a sample is a triplet $(x, o, y)$, where $x \in \mathbb{R}^d$ is the complete example, $o \subseteq \{1, \ldots, d\}$ is the set of observable attributes, and $y \in \mathcal{Y}$ is the label. The learner observes only $(x_o, y)$, where $x_o$ omits any attribute not in $o$. Our goal, given a sample $S = \{(x^{(i)}_{o_i}, y^{(i)})\}_{i=1}^m$, is to output a classifier $f_S$ such that, with high probability,

$$\mathbb{E}\left[\ell(f_S(x_o), y)\right] \le \min_{w \in \mathbb{R}^d,\ \|w\| \le 1} \mathbb{E}\left[\ell(w^\top x, y)\right] + \epsilon,$$

where $\ell$ is the loss function. Namely, we would like our classifier $f_S$ to compete with the best linear classifier for the completely observable data. Our main result is achieving this task (under mild regularity conditions) with a computationally efficient algorithm, for any convex Lipschitz bounded loss function. Our basic result requires a sample size that is quasi-polynomial, but we complement it with a kernel construction which guarantees efficient learning under appropriate large-margin assumptions. Our kernel depends only on the intersection of the observed attributes of two inputs, and is efficiently computable. (We give a more detailed overview of our main results in Section 2.) We complement our theoretical contributions with experimental findings that show superior classification performance both on synthetic data and on publicly available recommendation data.

**Previous work.** Classification with missing data is a well-studied subject in statistics, with numerous books and papers devoted to it (see, e.g., Little & Rubin, 2002). The statistical treatment of missing data is broad, and to a fairly large extent assumes parametric models both for the data-generating process and for the process that creates the missing data. One of the most popular models for the missing-data process is Missing Completely At Random (MCAR), where the missing attributes are selected independently of the values. We outline a few of the main approaches to handling missing data in the statistics literature. The simplest method is to discard records with missing data; even this assumes independence between the examples with missing values and their labels. In order to estimate simple statistics, such as the expected value of an attribute, one can use importance-sampling methods, where the probability of an attribute being missing can depend on its value (e.g., using the Horvitz-Thompson estimator (Horvitz & Thompson, 1952)). A large body of techniques is devoted to imputation procedures, which complete the missing data.
This can be done by replacing a missing attribute with its mean (mean imputation), using a regression based on the observed values (regression imputation), or sampling other examples to complete the missing value (hot deck). (We remark that our model implicitly includes the mean-imputation and 0-imputation methods, and therefore will always outperform them.) The imputation methodologies share a similar goal with matrix completion, namely reducing the problem to one with complete data, but their methodologies and motivating scenarios are very different. Finally, one can build a complete Bayesian model for both the observed and unobserved data and use it to perform inference (e.g., Vishwanathan et al., 2005). As with almost any Bayesian methodology, its success depends largely on selecting the right model and prior, and this is even ignoring the computational issues that make inference in many of those models intractable.

In the machine learning community, missing data has been considered in the framework of limited attribute observability (Ben-David & Dichterman, 1998) and its many refinements (Dekel et al., 2010; Cesa-Bianchi et al., 2010; 2011; Hazan & Koren, 2012). However, to the best of our knowledge, the low-rank property is not captured by previous work, nor is the extreme amount of missing data. More importantly, much of that research focuses on selecting which attributes to observe, or on attributes missing only at test or only at train time (see also Eban et al., 2014; Globerson & Roweis, 2006). In our case the learner has no control over which attributes are observable in an example, and the domain is fixed. The latter setting is captured in the work of Chechik et al. (2008), who rescale inner products according to the amount of missing data. Their method, however, does not come with worst-case theoretical guarantees on reconstruction, and gives rise to non-convex programs.

A natural and intuitive methodology is to treat the labels (both known and unknown) as an additional column in the data matrix and complete the data using a matrix completion algorithm, thereby obtaining the classification. Indeed, exactly this was proposed in the innovative work of Goldberg et al. (2010), who connected the methodology of matrix completion to prediction from missing data. Although this is a natural approach, we show that completion is neither necessary nor sufficient for classification. Furthermore, the techniques for provably completing a low-rank matrix are known only under probabilistic models with restricted distributions (Srebro, 2004; Candes & Recht, 2009; Lee et al., 2010; Salakhutdinov & Srebro, 2010; Shamir & Shalev-Shwartz, 2011). We were not able to use agnostic and non-probabilistic matrix completion algorithms, such as those of Srebro et al. (2004) and Hazan et al. (2012), for our purposes.

**Is matrix completion sufficient and/or necessary?** We demonstrate that classification with missing data is provably different from matrix completion. We start by considering a learner that tries to complete the missing entries in an unsupervised manner and then performs classification on the completed data; this approach is closely akin to imputation techniques, generative models, and any other two-step unsupervised/supervised algorithm. Our example shows that even under realizability assumptions such an algorithm may fail. We then proceed to analyze the previously mentioned approach of treating the labels as an additional column.
To see that unsupervised completion is insufficient for prediction, consider the example in Figure 1: the original data, represented by filled red and green dots, is linearly separable. Each data point has one of its two coordinates missing (this can even be done at random); in the figure, the arrow from each instance points to the observed attribute. A rank-one completion (projection onto the pink hyperplane) is possible, but it admits no separation. The problem is clearly that the mapping to a low dimension is independent of the labels, and therefore we should not expect properties that depend on the labels, such as linear separability, to be maintained.

*Figure 1. Linearly separable data for which certain completions make the data non-separable.*

Next, consider a learner that treats the labels as an additional column. Goldberg et al. (2010) considered the following problem:

$$\min_{Z}\ \mathrm{rank}(Z) \quad \text{subject to:}\quad Z_{i,j} = x_{i,j},\ \ (i,j) \in \Omega, \tag{G}$$

where $\Omega$ is the set of observed attributes (or observed labels, for the corresponding column). Now assume that we always see one of the following examples (with $*$ denoting a missing entry): $[1, *, 1, *, *]$, $[*, -1, *, -1, *]$, or $[1, *, 1, -1, 1]$. The observed labels are, respectively, $1$, $-1$ and $1$. A typical data matrix with one test point (labels in the last column) might be of the form

$$M = \begin{pmatrix} 1 & * & 1 & * & * & 1\\ * & -1 & * & -1 & * & -1\\ 1 & * & 1 & -1 & 1 & 1\\ 1 & * & 1 & * & * & ? \end{pmatrix} \tag{1}$$

First note that there is no rank-1 completion of this matrix. On the other hand, there is more than one rank-2 completion, and each leads to a different classification of the test point. One possible completion is to complete the odd columns to a constant $1$ vector and the even columns to a constant $-1$ vector, and then complete the label column whichever way you choose. We can also complete the first and last rows to a constant $1$ vector and the second row to a constant $-1$ vector. All of these completions are optimal solutions of Problem (G), but they have different outcomes with respect to classification. We stress that this is not a sample complexity issue: even if we observe an abundant amount of data, the completion task remains ill-posed.

Finally, matrix completion is also not necessary for prediction. Consider a movie recommendation dataset with two separate populations, French and Chinese, where each population reviews a different set of movies. Even if each population is low rank, successful matrix completion is impossible in this case (and intuitively does not even make sense in such a setting). However, linear classification is still possible via a single linear classifier, for example by setting all non-observed entries to zero. For a numerical example, return to the matrix $M$ in Eq. (1): we observe only three instances, so the classification task is easy, but it does not lead to a reconstruction of the missing entries.

## 2. Problem Setup and Main Result

We begin by presenting the general setting. A vector with missing entries can be modeled as a tuple $(x, o)$, where $x \in \mathbb{R}^d$ and $o \in 2^{[d]}$ is a subset of indices. The vector $x$ represents the full data and the set $o$ represents the observed attributes. Given such a tuple, we denote by $x_o$ a vector in $(\mathbb{R} \cup \{*\})^d$ such that

$$(x_o)_i = \begin{cases} x_i & i \in o\\ * & \text{else.}\end{cases}$$

The task of learning a linear classifier with missing data is to return a target function over $x_o$ that competes with the best linear classifier over $x$. Specifically, a sequence of triplets $\{(x^{(i)}, o^{(i)}, y_i)\}_{i=1}^m$ is drawn i.i.d. according to some distribution $D$.
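To make the representation concrete, here is a minimal sketch (ours, not from the paper) of one way to encode a sample $(x, o, y)$ and its observed vector $x_o$; the names, and the use of NaN in place of $*$, are illustrative choices.

```python
# Minimal sketch of the sample representation: a full vector x in R^d,
# a set o of observed coordinates, and a label y. NaN encodes "*".
import numpy as np

def observe(x, o):
    """Return x_o: x with every coordinate outside o replaced by NaN ('*')."""
    xo = np.full_like(x, np.nan, dtype=float)
    xo[list(o)] = x[list(o)]
    return xo

d = 5
x = np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # full (partly unobserved) example
o = {0, 2}                                   # observed coordinates
y = 1                                        # label

xo = observe(x, o)        # array([ 1., nan,  1., nan, nan])
w = np.ones(d) / np.sqrt(d)
full_data_score = w @ x   # what the comparator min_{||w||<=1} E[l(w^T x, y)] sees
```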
An algorithm is provided with the sample $S = \{(x^i_{o_i}, y_i)\}_{i=1}^m$ and should return a target function $f_S$ over missing data such that, with high probability,

$$\mathbb{E}\left[\ell(f_S(x_o), y)\right] \le \min_{w \in B_d(1)} \mathbb{E}\left[\ell(w^\top x, y)\right] + \epsilon, \tag{2}$$

where $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is the loss function and $B_d(r)$ denotes the Euclidean ball of radius $r$ in dimension $d$. For brevity, we say that a target function $f_S$ is $\epsilon$-good if Eq. (2) holds.

Without any assumptions on the distribution $D$, the task is ill-posed: one can construct examples where the learner over missing data does not have enough information to compete with the best linear classifier. Such is the case when, e.g., the label equals an attribute that is always concealed and independent of all other features. Therefore, certain assumptions on the distribution must be made. One reasonable assumption is that the marginal distribution of $x$ under $D$ is supported on a low-dimensional linear subspace $E$, and that for every set of observations we can linearly reconstruct the vector $x$ from the vector $P_o x$, where $P_o : \mathbb{R}^d \to \mathbb{R}^{|o|}$ is the projection onto the observed attributes. In other words, we demand that the mapping $P_o|_E : E \to P_o E$, the restriction of $P_o$ to $E$, is full rank. As the learner does not have access to the subspace $E$, the learning task is still far from trivial. We give a precise statement of this assumption in Assumption 1. Though our results hold under the low-rank assumption, the convergence rates we give depend on a certain regularity parameter: roughly, we parameterize the distance of $P_o|_E$ from singularity, and our results depend quantitatively on this distance. We defer all rigorous definitions to Section 3.2.

### 2.1. Main Result

Our first result is an upper bound on the sample complexity of the problem. We then proceed to a more general statement that entails an efficient kernel-based algorithm. Proofs are available in the supplementary material (see (Hazan et al., 2015) for a full version).

**Theorem 1 (Main Result).** Assume that $\ell$ is an $L$-Lipschitz convex loss function. Let $D$ be a $\lambda$-regular distribution (see Definition 1), let $\gamma(\epsilon) = \frac{\log\left(2L/(\lambda\epsilon)\right)}{\lambda}$ and $\Gamma(\epsilon) = \frac{d^{\gamma(\epsilon)+1} - d}{d - 1}$. There exists an algorithm (independent of $D$) that receives a sample $S = \{(x^i_{o_i}, y_i)\}_{i=1}^m$ of size

$$m \in \Omega\!\left(\frac{L^2\,\Gamma(\epsilon)^2\,\log(1/\delta)}{\epsilon^2}\right)$$

and returns a target function $f_S$ that is $\epsilon$-good with probability at least $1 - \delta$. The algorithm runs in time $\mathrm{poly}(|S|)$.

As the sample complexity in Theorem 1 is quasi-polynomial, the result has limited practical value in many situations. However, as the next theorem states, $f_S$ can actually be computed by applying a kernel trick. Thus, under further large-margin assumptions we can significantly improve performance.

**Theorem 2.** For every $\gamma \ge 0$ there exists an embedding over missing data $\varphi_\gamma : x_o \mapsto \mathbb{R}^\Gamma$, where $\Gamma = \sum_{k=1}^{\gamma} d^k = \frac{d^{\gamma+1} - d}{d - 1}$, such that the scalar product between two embedded samples $\varphi_\gamma(x^{(1)}_{o_1})$ and $\varphi_\gamma(x^{(2)}_{o_2})$ can be computed efficiently; specifically, it is given by the formula

$$k_\gamma(x^{(1)}_{o_1}, x^{(2)}_{o_2}) := \frac{|o^{(1)} \cap o^{(2)}|^{\gamma} - 1}{|o^{(1)} \cap o^{(2)}| - 1} \sum_{i \in o^{(1)} \cap o^{(2)}} x^{(1)}_i x^{(2)}_i.$$

In addition, let $\ell$ be an $L$-Lipschitz loss function, let $S = \{(x^i_{o_i}, y_i)\}_{i=1}^m$ be a sample drawn i.i.d. according to a distribution $D$, and assume that $\|P_o x\| \le 1$ a.s. Then the following hold:

1. At each iteration of Alg. 1 we can efficiently compute $v_t^\top \varphi_\gamma(x^t_{o_t})$ for any new example $x^t_{o_t}$; specifically, it is given by the formula $v_t^\top \varphi_\gamma(x^t_{o_t}) = \sum_{i=1}^{t-1} \alpha^{(t-1)}_i k_\gamma(x^i_{o_i}, x^t_{o_t})$. Hence Alg. 1 runs in $\mathrm{poly}(T)$ time and sequentially produces target functions $f_t(x_o) = v_t^\top \varphi_\gamma(x_o)$ that can be computed at test time in $\mathrm{poly}(T)$ time.

2. Run Alg. 1 with fixed $T > 0$, $0 < \rho < \frac{1}{4}$ and $\eta_t = \frac{1}{\rho t}$, and let $\bar{v} = \frac{1}{T}\sum_{t=1}^{T} v_t$. For every $v$ let
$$\hat{F}_\rho(v) = \frac{\rho}{2}\|v\|^2 + \frac{1}{m}\sum_{i=1}^{m} \ell\!\left(v^\top \varphi_\gamma(x^i_{o_i}),\, y_i\right); \tag{3}$$
then with probability at least $1 - \delta$:
$$\hat{F}_\rho(\bar{v}) \le \min_{v} \hat{F}_\rho(v) + O\!\left(\frac{L^2\,\Gamma\,\ln(T/\delta)}{\rho\, T}\right).$$

3. For any $\epsilon > 0$, if $D$ is a $\lambda$-regular distribution and $\gamma \ge \frac{\log\left(2L/(\lambda\epsilon)\right)}{\lambda}$, then for some $v \in B_\Gamma(\Gamma)$,
$$\mathbb{E}\left[\ell(v^\top \varphi_\gamma(x_o), y)\right] \le \min_{w \in B_d(1)} \mathbb{E}\left[\ell(w^\top x, y)\right] + \epsilon.$$
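The closed-form kernel in Theorem 2 is straightforward to compute. The sketch below is our own illustrative implementation (not the authors' code), again using NaN to mark missing attributes; note how the limit convention handles a single common coordinate, and how $\gamma = 1$ reduces to the 0-imputation inner product mentioned later in this section.

```python
# Sketch of the kernel k_gamma from Theorem 2, with NaN encoding a missing
# attribute. The limit convention for |o1 ∩ o2| = 1 gives the factor gamma.
import numpy as np

def k_gamma(x1o, x2o, gamma):
    """k_gamma(x1_o1, x2_o2) = (n^gamma - 1)/(n - 1) * sum_{i in o1∩o2} x1_i x2_i,
    where n = |o1 ∩ o2| and missing entries are NaN."""
    common = ~np.isnan(x1o) & ~np.isnan(x2o)     # indices in o1 ∩ o2
    n = int(common.sum())
    if n == 0:
        return 0.0                                # empty sum
    factor = gamma if n == 1 else (n**gamma - 1) / (n - 1)
    return factor * float(np.dot(x1o[common], x2o[common]))

# gamma = 1 recovers the 0-imputation inner product:
x1 = np.array([1.0, np.nan, 0.5, np.nan])
x2 = np.array([0.2, 1.0, np.nan, np.nan])
print(k_gamma(x1, x2, gamma=1))   # = 1 * (1.0 * 0.2) = 0.2
print(k_gamma(x1, x2, gamma=3))   # factor = 3 (single common coordinate)
```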
To summarize, Theorem 2 states that we can embed the sample points with missing attributes in a finite, high-dimensional Hilbert space of dimension $\Gamma$ such that:

- The scalar product between embedded points can be computed efficiently. Hence, by the standard representer argument, the task of empirical risk minimization is tractable.
- Following the conventional analysis of kernel methods, under large-margin assumptions in the ambient space we can compute a predictor with scalable sample complexity and computational efficiency.
- Finally, the best linear predictor over embedded sample points in a ball of radius $\Gamma$ is comparable to the best linear predictor over fully observed data.

Taken together, we can learn an $\epsilon$-good predictor with sample complexity $\Omega\!\left(\frac{\Gamma(\epsilon)^2}{\epsilon^2}\log\frac{1}{\delta}\right)$, and Theorem 1 follows. For completeness we present the method together with an efficient algorithm that optimizes the RHS of Eq. (3) via an SGD method. The optimization analysis is derived in a straightforward manner from the work of (Shalev-Shwartz et al., 2011). Other optimization algorithms exist in the literature; we chose this one because it also allows us to derive regret bounds, which are formally stronger (see Section 2.2). We point out that the main novelty of this paper is the introduction of a new kernel, and our guarantees do not depend on a specific optimization algorithm. Finally, note that $\varphi_1$ induces the same scalar product as 0-imputation. In that respect, by considering different $\gamma = 1, 2, \ldots$ and using a holdout set, we can guarantee that our method will outperform the 0-imputation method.

### 2.2. Regret minimization for joint subspace learning and classification

A significant technical contribution of this manuscript is the agnostic learning of a subspace coupled with a linear classifier. A subspace is represented by a projection matrix $Q \in \mathbb{R}^{d\times d}$, which satisfies $Q^2 = Q$. Denote the following class of target functions: $\mathcal{F}_0 = \{f_{w,Q} : w \in B_d,\ Q \in M_{d\times d},\ Q^2 = Q\}$, where $f_{w,Q}(x_o)$ is the linear predictor defined by $w$ over the subspace defined by the matrix $Q$, as formally defined in Definition 2. Given the aforementioned efficient kernel mapping $\varphi_\gamma$, we consider the following kernel-gradient-based online algorithm for classification, called KARMA (Kernelized Algorithm for Risk-minimization with Missing Attributes):

**Algorithm 1** KARMA: Kernelized Algorithm for Risk-minimization with Missing Attributes

1: **Input:** parameters $\gamma > 1$, step sizes $\{\eta_t > 0\}$, $0 < \rho < 1$.
2: **Initialize:** $v_1 = 0$, $\alpha^{(0)} = 0$.
3: **for** $t = 1$ **to** $T$ **do**
4: &nbsp;&nbsp;&nbsp;&nbsp;Observe the example $(x^t_{o_t}, y_t)$ and suffer loss $\ell\!\left(v_t^\top \varphi_\gamma(x^t_{o_t}),\, y_t\right)$.
5: &nbsp;&nbsp;&nbsp;&nbsp;Update ($\ell'$ denotes the derivative of $\ell$ w.r.t. its first argument):
$$\alpha^{(t)}_i = \begin{cases} (1 - \eta_t\rho)\,\alpha^{(t-1)}_i & i < t\\[2pt] -\eta_t\,\ell'\!\left(v_t^\top \varphi_\gamma(x^t_{o_t}),\ y_t\right) & i = t\\[2pt] 0 & \text{else,}\end{cases} \qquad v_{t+1} = \sum_{i=1}^{t} \alpha^{(t)}_i\, \varphi_\gamma(x^i_{o_i}).$$
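For concreteness, here is a compact sketch of Algorithm 1 for the hinge loss, reusing the `k_gamma` function from the earlier sketch. It is our own illustrative code following the stated update rule, not the authors' implementation; labels are assumed to be in $\{\pm 1\}$ and samples carry NaN for missing attributes.

```python
# Sketch of KARMA (Algorithm 1) for the hinge loss, kept in dual (alpha) form so
# that only kernel evaluations k_gamma are needed. Illustrative code only.
def hinge_derivative(z, y):
    """Derivative of l(z, y) = max(0, 1 - y z) with respect to z."""
    return -y if y * z < 1 else 0.0

def karma(samples, gamma, rho, T=None):
    """samples: list of (x_o, y) with NaN marking missing attributes."""
    T = T if T is not None else len(samples)
    alpha, seen = [], []                       # dual coefficients and their examples
    for t in range(1, T + 1):
        xo, y = samples[(t - 1) % len(samples)]
        eta = 1.0 / (rho * t)                  # eta_t = 1 / (rho t)
        # v_t . phi(x_t) = sum_{i<t} alpha^(t-1)_i k_gamma(x_i, x_t)
        score = sum(a * k_gamma(xi, xo, gamma) for a, xi in zip(alpha, seen))
        # alpha^(t)_i = (1 - eta rho) alpha^(t-1)_i for i < t,
        # alpha^(t)_t = -eta * l'(v_t . phi(x_t), y_t)
        alpha = [(1 - eta * rho) * a for a in alpha]
        alpha.append(-eta * hinge_derivative(score, y))
        seen.append(xo)
    predict = lambda xo: sum(a * k_gamma(xi, xo, gamma) for a, xi in zip(alpha, seen))
    return predict
```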
Our main result for the fully adversarial online setting is given next; it is proved in the supplementary material and in the full version. Notice that the subspace $E$ and its associated projection matrix $Q^*$ are chosen by an adversary and are unknown to the algorithm.

**Theorem 3.** For any $\gamma > 1$, $\lambda > 0$, $X > 0$, $\rho > 0$, $B > 0$, $L$-Lipschitz convex loss function $\ell$, and $\lambda$-regular sequence $\{(x_t, o_t, y_t)\}$ w.r.t. a subspace $E$ with associated projection matrix $Q^*$ such that $\|x_t\| \le X$, Algorithm 1 run with $\{\eta_t = \frac{1}{\rho t}\}$ sequentially outputs predictors $\{v_t\}$ that satisfy

$$\sum_{t=1}^{T} \ell\!\left(v_t^\top \varphi_\gamma(x^t_{o_t}),\, y_t\right) - \min_{\|w\| \le 1} \sum_{t=1}^{T} \ell\!\left(f_{w,Q^*}(x^t_{o_t}),\, y_t\right) \le \frac{L^2 X^2\,\Gamma}{\rho}\,(1 + \log T) + \frac{\rho}{2}\,T B + L X\, T\, e^{-\lambda\gamma}.$$

In particular, taking $\rho = L X \sqrt{\Gamma/(B T)}$ and $\gamma = \frac{1}{\lambda}\log T$, we obtain for every $\|w\| \le 1$:

$$\sum_{t=1}^{T} \left[\ell\!\left(v_t^\top \varphi_\gamma(x^t_{o_t}),\, y_t\right) - \ell\!\left(f_{w,Q^*}(x^t_{o_t}),\, y_t\right)\right] \le O\!\left(X L \sqrt{\Gamma B T}\,\log T\right).$$

## 3. Preliminaries and Notation

### 3.1. Notation

As discussed, we consider a model in which a distribution $D$ is fixed over $\mathbb{R}^d \times \mathcal{O} \times \mathcal{Y}$, where $\mathcal{O} = 2^{[d]}$ consists of all subsets of $\{1, \ldots, d\}$. We generally denote elements of $\mathbb{R}^d$ by $x, w, v, u$ and elements of $\mathcal{O}$ by $o$. We denote by $B_d$ the unit ball of $\mathbb{R}^d$ and by $B_d(r)$ the ball of radius $r$. Given a subset $o$, we denote by $P_o : \mathbb{R}^d \to \mathbb{R}^{|o|}$ the projection onto the indices in $o$; i.e., if $i_1 < i_2 < \cdots < i_k$ are the elements of $o$ in increasing order, then $(P_o x)_j = x_{i_j}$. Given a matrix $A$ and a set of indices $o$, we let $A_{o,o} = P_o A P_o^\top$. Finally, given a subspace $E \subseteq \mathbb{R}^d$, we denote by $P_E : \mathbb{R}^d \to \mathbb{R}^d$ the projection onto $E$.

### 3.2. Model Assumptions

**Definition 1 ($\lambda$-regularity).** We say that $D$ is $\lambda$-regular with associated subspace $E$ if the following hold with probability 1 (w.r.t. the joint random variables $(x, o)$):

3. $\ker(P_o P_E) = \ker(P_E)$;
4. if $\lambda_o > 0$ is a strictly positive singular value of the matrix $P_o P_E$, then $\lambda_o \ge \lambda$.

**Assumption 1 (Low-Rank Assumption).** We say that $D$ satisfies the low-rank assumption with associated subspace $E$ if it is $\lambda$-regular with associated subspace $E$ for some $\lambda > 0$.

Let $o$ be a set of observables that occurs with positive probability, and let $X_o$ be the matrix obtained by taking only the columns in $o$ from the data matrix $X$. If $\mathrm{rank}(X_o) < \mathrm{rank}(X)$ then Assumption 1 is not satisfied. Since $\mathrm{rank}(X_o) \le |o|$, Assumption 1 is met only if $\mathrm{rank}(X)$ is bounded by the minimal number of observations. As in the previous example, without further restrictive assumptions, if the rank were larger than the assumption requires, the problem of finding an $\epsilon$-good function would become ill-posed.

## 4. Learning under the Low-Rank Assumption and $\lambda$-Regularity

**Definition 2 (the class $\mathcal{F}_0$).** We define the following class of target functions: $\mathcal{F}_0 = \{f_{w,Q} : w \in B_d(1),\ Q \in M_{d\times d},\ Q^2 = Q\}$, where
$$f_{w,Q}(x_o) = (P_o w)^\top \left(Q_{o,o}\right)^{\dagger} (P_o x).$$
(Here $M^{\dagger}$ denotes the pseudo-inverse of $M$.)

The following lemma states that, under the low-rank assumption, the problem of linear learning with missing data reduces to the problem of learning the class $\mathcal{F}_0$, in the sense that the hypothesis class $\mathcal{F}_0$ is no less expressive.

**Lemma 1.** Let $D$ be a distribution that satisfies the low-rank assumption. For every $w^* \in \mathbb{R}^d$ there is $f_{w,Q} \in \mathcal{F}_0$ such that a.s. $f_{w,Q}(x_o) = w^{*\top} x$; in particular, $Q = P_E$ and $w = P_E\, w^*$.

### 4.1. Approximating $\mathcal{F}_0$ under regularity

We next define a surrogate class of target functions that approximates $\mathcal{F}_0$.

**Definition 3 (the classes $\mathcal{F}_\gamma$).** For every $\gamma$ we define the class $\mathcal{F}_\gamma = \{f^{\gamma}_{w,Q} : w \in B_d(1),\ Q \in \mathbb{R}^{d\times d},\ Q^2 = Q\}$, where
$$f^{\gamma}_{w,Q}(x_o) = (P_o w)^\top \sum_{j=0}^{\gamma-1} \left(Q_{o,o}\right)^{j} (P_o x).$$

**Lemma 2.** Let $(x, o)$ be a sample drawn according to a $\lambda$-regular distribution $D$ with associated subspace $E$. Let $Q = P_E$ and $\|w\| \le 1$. Then a.s.
$$\left| f^{\gamma}_{w,\, I-Q}(x_o) - f_{w,Q}(x_o) \right| \le (1 - \lambda)^{\gamma}.$$

**Corollary 1.** Let $\ell$ be an $L$-Lipschitz loss function. Under $\lambda$-regularity, for every $\gamma \ge \frac{\log\left(L/(\lambda\epsilon)\right)}{\lambda}$ the class $\mathcal{F}_\gamma$ contains an $\epsilon$-good target function.
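The following numerical sketch (ours, not from the paper) illustrates Definition 2 and Lemmas 1-2: with $Q = P_E$, the pseudo-inverse predictor $f_{w,Q}(x_o)$ recovers $w^\top x$ on clean data, and the truncated Neumann-series surrogate $f^{\gamma}_{w,\,I-Q}$ approaches it as $\gamma$ grows. The dimensions, seed, and observed set are arbitrary choices.

```python
# Numerical illustration of the pseudo-inverse target f_{w,Q} (Definition 2) and
# its Neumann-series surrogate f^gamma_{w,I-Q} (Definition 3, Lemma 2), with Q = P_E.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3
B = np.linalg.qr(rng.normal(size=(d, r)))[0]      # orthonormal basis of E
P_E = B @ B.T                                      # projection onto the subspace E

x = P_E @ rng.normal(size=d)                       # a clean example, x in E
w = rng.normal(size=d); w /= np.linalg.norm(w)     # comparator, ||w|| <= 1
o = np.array([0, 2, 3, 5, 7])                      # observed coordinates, |o| >= r

Q_oo = P_E[np.ix_(o, o)]                           # Q_{o,o} = P_o Q P_o^T
w_o, x_o = (P_E @ w)[o], x[o]                      # P_o (P_E w) and P_o x

f_wQ = w_o @ np.linalg.pinv(Q_oo) @ x_o            # f_{w,Q}(x_o), Definition 2
print("w^T x        :", w @ x)                     # Lemma 1: equals f_{w,Q}(x_o)
print("f_{w,Q}(x_o) :", f_wQ)

for gamma in (1, 5, 25, 100):                      # surrogate of Definition 3
    S = sum(np.linalg.matrix_power(np.eye(len(o)) - Q_oo, j) for j in range(gamma))
    print(gamma, abs(w_o @ S @ x_o - f_wQ))        # Lemma 2: error decays with gamma
```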
### 4.2. Improper learning of $\mathcal{F}_\gamma$ and a kernel trick

Let $G$ be the set of all finite, non-empty sequences of length at most $\gamma$ over $[d]$. For each $s \in G$ denote by $|s|$ the length of the sequence and by $s_{\mathrm{end}}$ its last element. Given a set of observations $o$, we write $s \subseteq o$ if all elements of the sequence $s$ belong to $o$. We let $\Gamma = \sum_{j=1}^{\gamma} d^j = |G| = \frac{d^{\gamma+1} - d}{d - 1}$, and we index the coordinates of $\mathbb{R}^\Gamma$ by the elements of $G$.

**Definition 4.** We let $\varphi_\gamma : (\mathbb{R}^d \times \mathcal{O}) \to \mathbb{R}^\Gamma$ be the embedding
$$\left(\varphi_\gamma(x_o)\right)_s = \begin{cases} x_{s_{\mathrm{end}}} & s \subseteq o\\ 0 & \text{else.}\end{cases}$$

**Lemma 3.** For every $Q$ and $w$ we have
$$f^{\gamma}_{w,Q}(x_o) = \sum_{s_1 \in o} w_{s_1} x_{s_1} + \sum_{\{s\,:\, s \subseteq o,\ 2 \le |s| \le \gamma\}} w_{s_1}\, Q_{s_1,s_2}\, Q_{s_2,s_3} \cdots Q_{s_{|s|-1},\,s_{\mathrm{end}}}\, x_{s_{\mathrm{end}}}.$$

**Corollary 2.** For every $f^{\gamma}_{w,Q} \in \mathcal{F}_\gamma$ there is $v \in B_\Gamma(\Gamma)$ such that $f^{\gamma}_{w,Q}(x_o) = v^\top \varphi_\gamma(x_o)$. As a corollary, for every loss function $\ell$ and distribution $D$ we have
$$\min_{v \in B_\Gamma(\Gamma)} \mathbb{E}\left[\ell(v^\top \varphi_\gamma(x_o), y)\right] \le \min_{f^{\gamma}_{w,Q} \in \mathcal{F}_\gamma} \mathbb{E}\left[\ell(f^{\gamma}_{w,Q}(x_o), y)\right].$$

By Corollary 2, learning $\mathcal{F}_\gamma$ can be done improperly by learning a linear classifier over the embedded sample set $\{\varphi_\gamma(x^i_{o_i})\}_{i=1}^m$. The ambient space $\mathbb{R}^\Gamma$ may be very large; however, as we next show, the scalar product between two embedded points can be computed efficiently:
$$\varphi_\gamma(x^{(1)}_{o_1})^\top \varphi_\gamma(x^{(2)}_{o_2}) = \frac{|o_1 \cap o_2|^{\gamma} - 1}{|o_1 \cap o_2| - 1} \sum_{k \in o_1 \cap o_2} x^{(1)}_k x^{(2)}_k.$$
(We use the convention $\frac{1^\gamma - 1}{1 - 1} = \lim_{x \to 1} \frac{x^\gamma - 1}{x - 1} = \gamma$.)

## 5. Experiments

We conducted both toy and real-data experiments to compare the performance of our new method against existing methods: 0-imputation, mcb, mc1, and geom. We conducted experiments on binary classification, multi-class classification, and regression tasks. We begin by describing the implementation for each scenario.

**karma:** We applied our method in all experiments. We evaluated the loss with $\gamma \in \{1, 2, 3, 4\}$ and $C \in \{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$, and chose $\gamma$ and the regularization parameter using a holdout set. A constant feature was added to allow a bias term, and the data was normalized. For binary classification we let $\ell$ be the hinge loss, for multi-class we used the multi-class hinge loss as described in (Crammer & Singer, 2002), and for regression tasks we used the squared loss (which was also used at test time).

**0-imputation:** 0-imputation is simply our method with $\gamma = 1$. A constant feature was added to allow a bias term, and the data was normalized so that the mean of each feature was 0.

**mcb / mc1:** mcb and mc1 are the methods suggested by (Goldberg et al., 2010). They are inspired by a convex relaxation of Problem (G) and differ in the way a bias term is introduced. When applying mcb and mc1 to binary classification tasks, we used the algorithm as suggested there: we ran their iterative algorithm until the incremental change was smaller than $10^{-5}$ and used the logistic loss to fit the entries. In multi-class tasks the label matrix was given by a $\{0,1\}$ matrix with a 1 entry at the correct label. In regression tasks we used the squared loss to fit the labels (as this is the loss we tested the results against).

**geom:** Finally, we applied geom (Chechik et al., 2008). This algorithm tries to maximize the margin of the classifier, where the margin is computed with respect to the subspace of revealed entries. We applied this method only to binary classification, as it was designed specifically to correct the hinge loss. Again we used the iterative algorithm as suggested there, with 5 iterations. Regularization parameters were chosen using a holdout set.

### 5.1. Toy Examples

#### 5.1.1. MCAR Model

The first toy experiment is a regression task. We randomly picked $m = 10^3$ examples (normally distributed) in an $r = 10$-dimensional subspace embedded in a $d = 20$-dimensional linear space. We then randomly picked a classifier $w$ and let $\hat{y}_i = w^\top x^{(i)}$. The labels were normalized to have mean 0 and standard deviation 1, i.e., $y_i = \frac{\hat{y}_i - \mathbb{E}_i[\hat{y}_i]}{\sqrt{\mathbb{E}_i\left[(\hat{y}_i - \mathbb{E}_i[\hat{y}_i])^2\right]}}$. We then randomly chose a subset of observed features using an i.i.d. Bernoulli distribution: each feature had probability $p \in \{0.1, 0.2, \ldots, 0.9, 1\}$ of being observed. The results appear in Figure 2.

*Figure 2. Missing Completely At Random features: squared loss vs. fraction of observed features, for karma, 0-imputation, and mc1/mcb.*
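The data-generating protocol just described is easy to reproduce; the following is our own sketch of it, with an assumed observation probability $p = 0.5$ (the experiment sweeps $p$ over $\{0.1, \ldots, 1\}$).

```python
# Sketch of the MCAR toy protocol of Section 5.1.1: low-rank Gaussian data,
# a random linear target, normalized labels, and a Bernoulli(p) observation mask.
import numpy as np

rng = np.random.default_rng(1)
m, d, r, p = 1000, 20, 10, 0.5

B = rng.normal(size=(d, r))                 # spans an r-dimensional subspace of R^d
X = rng.normal(size=(m, r)) @ B.T           # clean examples, rank <= r
w = rng.normal(size=d)
y = X @ w
y = (y - y.mean()) / y.std()                # labels normalized to mean 0, std 1

observed = rng.random(size=(m, d)) < p      # each attribute kept with probability p
X_obs = np.where(observed, X, np.nan)       # NaN marks a missing attribute
```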
#### 5.1.2. Toy Data with Large Margin

As discussed earlier, under a large-margin assumption our algorithm enjoys strong guarantees. For this experiment we considered a different type of noise model, one that guarantees a large margin. First we constructed a fully observed sample of size $m = 10^4$: the data resides in an $r = 20$-dimensional subspace of a $d = 200$-dimensional linear space. We then divided the features into three types, $\{A_1, A_2, A_3\}$. Each subset contained enough information, and we made sure that for each subset $A_i$ there is a classifier $w_{A_i}$ with a large margin that correctly classifies $x_o$ whenever $o = A_i$ (in fact, a random division of the features led to this). Thus, if at each turn $o = A_i$ for some $i$, the 0-imputation method would be guaranteed to perform well. However, in our noise model each type had probability $1/3$ of appearing in each sample (independently, conditioned on the event that at least one type of feature appears), so there is no guarantee that 0-imputation will work well. Such a noise distribution can model, for example, a scenario where we collect data by letting people answer a survey: a person may be inclined to answer a certain type of question and not answer a different type (say, he is disinclined to reveal information on his financial matters but cares less about his recreational preferences).

Let
$$\chi(o; A) = \begin{cases} 1 & A \subseteq o\\ 0 & \text{else.}\end{cases}$$

One can show that the class $\{v^\top \varphi_\gamma(x_o) : v \in \mathbb{R}^\Gamma\}$ can express the following target function whenever $\gamma \ge 3$:

$$h(x_o) = \sum_i \chi(o; A_i)\, w_{A_i}^\top \bar{x} \;+\; \sum_{i_1 \neq i_2} \chi(o; A_{i_1} \cup A_{i_2})\, (w_{A_{i_1}} + w_{A_{i_2}})^\top \bar{x} \;+\; \chi(o; A_1 \cup A_2 \cup A_3)\, (w_{A_1} + w_{A_2} + w_{A_3})^\top \bar{x},$$

where $\bar{x}$ is the vector obtained by filling zeros in place of the non-observed features. One can also verify that $h(x_o)$ has zero expected loss, and we can make sure that the corresponding $v$ has norm at most 7.

*Figure 3. Block-type revealed entries: mean squared error vs. sample size, for karma, 0-imputation, and mcb/mc1.*

Our method indeed picks a good classifier with a comparably small sample size. Moreover, due to its higher expressiveness, it significantly outperforms 0-imputation.
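For completeness, a small sketch (ours, not from the paper) of the block-type observation model described above; the group split and the random seed are arbitrary.

```python
# Sketch of the block-type observation model of Section 5.1.2: the d features are
# split into three groups A1, A2, A3, and each group is revealed with probability
# 1/3, conditioned on at least one group being revealed.
import numpy as np

rng = np.random.default_rng(2)
d = 200
groups = np.array_split(np.arange(d), 3)      # A1, A2, A3

def sample_observed_set():
    while True:
        present = rng.random(3) < 1.0 / 3.0   # which feature types appear
        if present.any():                      # condition on at least one type
            return np.concatenate([g for g, keep in zip(groups, present) if keep])

o = sample_observed_set()                      # indices of the revealed attributes
```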
### 5.2. Real Data

Finally, we tested our method on real data sets. We emphasize that the missing data is inherent in the real data. The comparison appears in Table 1. The data sets were collected primarily from (Alcalá-Fdez et al., 2011). The Jester data set was collected from (Goldberg et al., 2001), the books data set from (Ziegler et al., 2005), and we also used the MovieLens data set (available at http://grouplens.org/datasets/movielens/). From the Jester data set we took around 1000 users who rated at least 36 of the 100 jokes; one joke that was rated by almost all users was used as the label (specifically, joke number 5). The MovieLens data set includes users who rated various movies, together with metadata such as age, gender and occupation; we used the metadata to construct three different tasks. In the appendix we give the data set sizes and the percentage of missing values in each task. Throughout, regression targets were normalized to have mean zero and standard deviation 1.

Table 1. Experiment results.

(a) Multiclass labeling

| name | karma | 0 imp | mcb | mc1 |
|---|---|---|---|---|
| Cleveland | 0.44 | 0.42 | 0.48 | 0.42 |
| dermatology | 0.03 | 0.04 | 0.04 | 0.04 |
| marketing | 0.70 | 0.71 | 0.70 | 0.70 |
| movielens (occupation) | 0.81 | 0.87 | 0.86 | 0.87 |

(b) Regression

| name | karma | 0 imp | mcb | mc1 |
|---|---|---|---|---|
| jester | 0.23 | 0.24 | 0.27 | 0.27 |
| books | 0.25 | 0.25 | 0.25 | 0.25 |
| movielens (age) | 0.16 | 0.22 | 0.25 | 0.25 |

(c) Binary labeling

| name | karma | 0 imp | mcb | mc1 | geom |
|---|---|---|---|---|---|
| mammographic | 0.17 | 0.17 | 0.17 | 0.18 | 0.17 |
| bands | 0.24 | 0.34 | 0.41 | 0.40 | 0.35 |
| hepatitis | 0.23 | 0.17 | 0.23 | 0.21 | 0.22 |
| Wisconsin | 0.03 | 0.03 | 0.03 | 0.04 | 0.04 |
| horses | 0.35 | 0.36 | 0.55 | 0.37 | 0.36 |
| movielens (gender) | 0.22 | 0.26 | 0.28 | 0.28 | 0.25 |

## 6. Discussion and Future Work

We have described the first theoretically sound method to cope with low-rank missing data, giving rise to a classification algorithm that attains error competitive with that of the optimal linear classifier that has access to all the data. Our non-proper agnostic framework for learning a hidden low-rank subspace comes with provable guarantees, whereas heuristics based on separate data reconstruction and classification are shown to fail in certain scenarios.

Our technique is directly applicable to classification with low-rank missing data and polynomial kernels via kernel (polynomial) composition. General kernels can be handled by polynomial approximation, but it would be interesting to develop a more direct approach. It is also possible to derive all of our results under a less stringent condition than $\lambda$-regularity: instead of bounding the smallest eigenvalue associated with the hidden subspace, it suffices to bound only the ratio of the largest to smallest eigenvalue. This yields better bounds in a straightforward plug-and-play manner in our analysis, but was omitted for simplicity.

**Acknowledgements.** EH: This research was supported in part by the European Research Council project SUBLRN and the Israel Science Foundation grant 810/11. RL is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. YM: This research was supported in part by The Israeli Centers of Research Excellence (I-CORE) program (Center No. 4/11), by a grant from the Israel Science Foundation, and by a grant from the United States-Israel Binational Science Foundation (BSF).

## References

Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255-287, 2011. URL http://sci2s.ugr.es/keel/missing.php.

Ben-David, S. and Dichterman, E. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3):277-298, 1998.

Berthet, Q. and Rigollet, P. Complexity theoretic lower bounds for sparse principal component detection. J. Mach. Learn. Res., W&CP, 30:1046-1066 (electronic), 2013.

Candes, E. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717-772, 2009.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907-7931, 2011.

Chechik, G., Heitz, G., Elidan, G., Abbeel, P., and Koller, D. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1-21, 2008.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2002.

Dekel, O., Shamir, O., and Xiao, L. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149-178, 2010.

Eban, E., Mezuman, E., and Globerson, A. Discrete Chebyshev classifiers. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Globerson, A. and Roweis, S. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pp. 353-360. ACM, 2006.

Goldberg, A. B., Zhu, X., Recht, B., Xu, J.-M., and Nowak, R. D. Transduction with matrix completion: Three birds with one stone. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 757-765, 2010.

Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133-151, 2001. URL http://www.ieor.berkeley.edu/~goldberg/jester-data/.

Hazan, E. and Koren, T. Linear regression with limited observation. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Hazan, E., Kale, S., and Shalev-Shwartz, S. Near-optimal algorithms for online matrix prediction. In COLT, pp. 38.1-38.13, 2012.

Hazan, E., Livni, R., and Mansour, Y. Classification with low rank and missing data. arXiv preprint arXiv:1501.03273, 2015.

Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663-685, 1952.

Lee, J., Recht, B., Salakhutdinov, R., Srebro, N., and Tropp, J. A. Practical large-scale optimization for max-norm regularization. In NIPS, pp. 1297-1305, 2010.

Little, R. J. A. and Rubin, D. B. Statistical Analysis with Missing Data, 2nd Edition. Wiley-Interscience, 2002.

Salakhutdinov, R. and Srebro, N. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In NIPS, pp. 2056-2064, 2010.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.

Shamir, O. and Shalev-Shwartz, S. Collaborative filtering with the trace norm: Learning, bounding, and transducing. JMLR Workshop and Conference Proceedings, 19:661-678, 2011.

Srebro, N. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.

Srebro, N., Rennie, J., and Jaakkola, T. S. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pp. 1329-1336, 2004.

Vishwanathan, S. V. N. and Smola, A. J. Kernel methods for missing variables. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
Ziegler, C.-N., McNee, S. M., Konstan, J. A., and Lausen, G. Improving recommendation lists through topic diversification. In Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005. URL http://www2.informatik.uni-freiburg.de/~cziegler/BX/.