Online Matrix Completion with Side Information

Mark Herbster, Stephen Pasteris, Lisa Tse
Department of Computer Science
University College London
London WC1E 6BT, England, UK
{m.herbster,s.pasteris,l.tse}@cs.ucl.ac.uk

Abstract

We give an online algorithm and prove novel mistake and regret bounds for online binary matrix completion with side information. The mistake bounds we prove are of the form O(D/γ²). The term 1/γ² is analogous to the usual margin term in SVM (perceptron) bounds. More specifically, if we assume that there is some factorization of the underlying m × n matrix into PQ^⊤, where the rows of P are interpreted as "classifiers" and the rows of Q as "instances", then γ is the maximum (normalized) margin over all factorizations PQ^⊤ consistent with the observed matrix. The quasi-dimension term D measures the quality of side information. In the presence of vacuous side information, D = m + n. However, if the side information is predictive of the underlying factorization of the matrix, then in an ideal case D ∈ O(k + ℓ), where k is the number of distinct row factors and ℓ is the number of distinct column factors. We additionally provide a generalization of our algorithm to the inductive setting. In this setting, we provide an example where the side information is not directly specified in advance. For this example, the quasi-dimension D is now bounded by O(k² + ℓ²).

1 Introduction

We consider the problem of online binary matrix completion with side information. In our setting the learner receives data sequentially, so that on trial t = 1, . . . , T: 1) the learner is queried by the environment to predict matrix entry (i_t, j_t); 2) the learner predicts a label ŷ_t ∈ {−1, 1}; 3) the learner receives a label y_t ∈ {−1, 1} from the environment; and 4) a mistake is incurred if y_t ≠ ŷ_t. There are no probabilistic assumptions on how the environment generates its instances or their labels; it is an arbitrary process which in fact may be adversarial. The only restriction on the environment is that it does not see the learner's ŷ_t until after it reveals y_t. The learner's aim will be to minimize its expected regret,

∑_{t=1}^T E[y_t ≠ ŷ_t] − min_U ∑_{t=1}^T [y_t ≠ U_{i_t j_t}] ,

where the minimization is over U ∈ {−1, 1}^{m×n} and the expectation is with respect to the learner's internal randomization. We will also consider mistake bounds, where the aim is to minimize the learner's mistakes under the assumption (realizability) that there exists a U such that U_{i_t j_t} = y_t for all t ∈ [T].

To aid the learner, side information is associated with each row and column. For instance, in the classic Netflix challenge [1], the rows of the matrix correspond to viewers and the columns to movies, with entries representing movie ratings. It is natural to suppose that we have side information in the form of demographic information for each user and metadata for the movies. In this work, we consider both transductive and inductive models. In the former model, the side information associated with each row and column is specified completely in advance in the form of a pair of positive definite matrices that inform similarity between row pairs and column pairs. For the inductive model, a pair of kernel functions is specified over potentially continuous domains, where one is for the rows and the other is for the columns. What is not specified is the mapping from the domain of the kernel function to specific rows or columns, which is only revealed sequentially. In the Netflix example, the inductive model is especially natural if new users or movies are introduced during the learning process.
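To make the protocol concrete, the following is a minimal sketch of the online interaction described at the start of this section, with a placeholder random learner standing in for the algorithm developed later; the helper names (`run_protocol`, `predict`, `update`) are hypothetical and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_protocol(entries, labels, predict, update):
    """Online binary matrix completion protocol.

    entries: sequence of (i_t, j_t) index pairs chosen by the environment
    labels:  sequence of y_t in {-1, +1}
    predict: callable (i, j) -> yhat in {-1, +1}  (may be randomized)
    update:  callable (i, j, y), called after y_t is revealed
    """
    mistakes = 0
    for (i, j), y in zip(entries, labels):
        yhat = predict(i, j)           # learner commits to a prediction
        mistakes += int(yhat != y)     # the label is revealed only afterwards
        update(i, j, y)
    return mistakes

# Placeholder learner: predicts uniformly at random, ignoring side information.
predict = lambda i, j: int(rng.choice([-1, 1]))
update = lambda i, j, y: None

m, n, T = 20, 30, 100
entries = [(int(rng.integers(m)), int(rng.integers(n))) for _ in range(T)]
labels = [int(rng.choice([-1, 1])) for _ in range(T)]
print("mistakes:", run_protocol(entries, labels, predict, update))
```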
[Figure 1: A (9, 9)-biclustered 50 × 50 binary matrix, before and after permuting the rows and columns into latent blocks.]

In Theorem 1, we will give regret and mistake bounds for online binary matrix completion with side information. Although this theorem has a broader applicability, our interpretation will focus on the case that the matrix has a latent block structure. Hartigan [2] introduced the idea of permuting a matrix by both the rows and columns into a few homogeneous blocks. This is equivalent to assuming that each row (column) has an associated row (column) class, and that the matrix entry is completely determined by its corresponding row and column classes. This has since become known as co-clustering or bi-clustering. This same assumption has become the basis for probabilistic models which can then be used to complete a matrix with missing entries. The authors of [3] give rate-optimal results for this problem in the batch setting and provide a literature overview. It is natural to compare this assumption to the dominant alternative, which assumes that there exists a low-rank decomposition of the matrix to be completed; see for instance [4]. Common to both approaches is that associated with each row and column there is an underlying latent factor, so that the given matrix entry is determined by a function on the appropriate row and column factor. The low-rank assumption is that the latent factors are vectors and that the matrix entry is determined by their inner product.

2 Preliminaries

For d ∈ ℕ we write [d] := {1, . . . , d}. We denote the inner product of vectors x, w ∈ ℝ^d by ⟨x, w⟩ := x^⊤w, and the concatenation of vectors p and q by [p; q], so that ⟨[p; q], [p; q]⟩ = p^⊤p + q^⊤q. For a matrix X we let X^+ and X^⊤ denote its pseudoinverse and transpose, respectively, and we denote by S^m_{++} the set of m × m symmetric positive definite matrices. The trace norm of a matrix X is ∥X∥_tr := tr(√(X^⊤X)), where √· indicates the unique positive square root of a positive semi-definite matrix and tr(·) denotes the trace of a square matrix, given by tr(Y) = ∑_{i=1}^n Y_{ii} for Y ∈ ℝ^{n×n}. For U ∈ {−1, 1}^{m×n} we define SP(U) := {V ∈ ℝ^{m×n} : V_{ij}U_{ij} > 0}, the set of matrices which are sign consistent with U. We also define SP1(U) := {V ∈ ℝ^{m×n} : V_{ij}U_{ij} ≥ 1}, the set of matrices which are sign consistent with U with margin at least one.

The max-norm of a matrix U is

∥U∥_max := min_{PQ^⊤ = U} max_{1≤i≤m} ∥P_i∥ max_{1≤j≤n} ∥Q_j∥ ,

where the minimum is over all matrices P ∈ ℝ^{m×d}, Q ∈ ℝ^{n×d} and every integer d. The margin complexity of U is

mc(U) := min_{PQ^⊤ ∈ SP(U)} max_{i,j} ∥P_i∥ ∥Q_j∥ / |⟨P_i, Q_j⟩| .   (2)

This is the counterpart of the max-norm for learning a sign matrix. Observe that for U ∈ {−1, 1}^{m×n},

1 ≤ mc(U) ≤ ∥U∥_max ≤ min(√m, √n) ,

where the lower bound follows from the right hand side of (2) and the upper bound follows since we may decompose U = U I_n or as U = I_m U. Note there may be a large gap between the margin complexity and the max-norm. In [6] a matrix U ∈ {−1, 1}^{n×n} was given such that mc(U) ∈ O(log n) while ∥U∥_max ∈ Ω(√n / log n).

We denote the class of m × d row-normalized matrices by 𝒩^{m,d} := {P̂ ∈ ℝ^{m×d} : ∥P̂_i∥ = 1, i ∈ [m]}, and the class of m × k block expansion matrices by B^{m,k}, the matrices in {0, 1}^{m×k} with exactly one non-zero entry per row and at least one non-zero entry per column. The quasi-dimension of U ∈ {−1, 1}^{m×n} at margin γ with respect to side information matrices M and N is

D^γ_{M,N}(U) := inf { tr(P̂^⊤ M P̂) R_M + tr(Q̂^⊤ N Q̂) R_N : P̂ Q̂^⊤ = γ U } ,

where the infimum is over all row-normalized matrices P̂ ∈ 𝒩^{m,d} and Q̂ ∈ 𝒩^{n,d} and every integer d. If the infimum does not exist then D^γ_{M,N}(U) := +∞. Note that the infimum exists iff ∥U∥_max ≤ 1/γ. The quasi-dimension quantifies how aligned the rows of P̂ (Q̂) are with the matrices M (N). Note that D^γ_{M,N}(U) = m + n if ∥U∥_max ≤ 1/γ, M = I_m and N = I_n.

We now introduce notation specific to the graph setting. Let G be an m-vertex connected, undirected graph with positive weights. The Laplacian L of G is defined as D − A, where D is the m × m degree matrix and A is the m × m adjacency matrix. Observe that as G is connected, L is a rank m − 1 matrix with the all-ones vector 1 in its null space.
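The Laplacian facts just stated can be checked numerically. The sketch below builds L = D − A for a small connected weighted graph (the weights are illustrative) and verifies that 1 lies in the null space and that the rank is m − 1.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric, non-negative weight matrix A."""
    return np.diag(A.sum(axis=1)) - A

# A small connected weighted graph on m = 4 vertices (illustrative weights).
A = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 3.],
              [0., 1., 3., 0.]])
L = laplacian(A)

print(np.allclose(L @ np.ones(4), 0))   # the all-ones vector is in the null space
print(np.linalg.matrix_rank(L))         # 3 = m - 1, since the graph is connected
```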
From L we define the (strictly) positive definite PDLaplacian L^◦. Observe that if u ∈ [−1, 1]^m then (u^⊤L^◦u)R_{L^◦} ≤ 2(u^⊤Lu R_L + 1) and, similarly, (u^⊤Lu)R_L + 1 ≤ 2(u^⊤L^◦u)R_{L^◦} (see [28] for details of this construction).

3 Transductive Matrix Completion

Algorithm 1: Predicting a binary matrix with side information in the transductive setting.

Parameters: learning rate 0 < η, quasi-dimension estimate 1 ≤ D̂, margin estimate 0 < γ ≤ 1, non-conservative flag [NON-CONSERVATIVE] ∈ {0, 1}, and side information matrices M ∈ S^m_{++} and N ∈ S^n_{++} with m + n ≥ 3.
Initialization: 𝓜 ← ∅ ; W^1 ← (D̂/(m + n)) I_{m+n}.
For t = 1, . . . , T:
  Receive pair (i_t, j_t) ∈ [m] × [n].
  Define X_t := x_t (x_t)^⊤ with x_t := [ √(M^+) e^m_{i_t} / √(2R_M) ; √(N^+) e^n_{j_t} / √(2R_N) ] ,   (4)
  where e^m_i denotes the i-th standard basis vector of ℝ^m.
  Predict: Ỹ_t ← [NON-CONSERVATIVE] · UNIFORM(−γ, γ) ; ȳ_t ← tr(W^t X_t) − 1 ; ŷ_t ← sign(ȳ_t − Ỹ_t).
  Receive label y_t ∈ {−1, 1}.
  If y_t ≠ ŷ_t then 𝓜 ← 𝓜 ∪ {t}.
  If y_t ȳ_t < γ · [NON-CONSERVATIVE] then W^{t+1} ← exp(log(W^t) + η y_t X_t) ; else W^{t+1} ← W^t.

Algorithm 1 corresponds to an adapted MATRIX EXPONENTIATED GRADIENT (MEG) algorithm [25], used here to perform transductive matrix completion with side information. Although the algorithm is a special case of MEG, the following theorem does not follow as a special case of the analysis in [25]. The underlying proof techniques are discussed in further detail in Appendix B.2. The following theorem includes an expected regret bound and a mistake bound, where the regret bound is the more flexible, noise-tolerant bound. Hence, in practice, the algorithm should always be run non-conservatively to obtain the regret bound, as per the theorem assumptions. We only include the conservative case for further insight, as the bound is simpler, being analogous to Novikoff's (perceptron) bound. In this case, where we also assume realizability and exact tuning (γ = 1/mc(U)), the mistakes are bounded by O(D mc(U)²). The term D evaluates the predictive quality of the side information provided to the algorithm. In order to evaluate D, we provide an upper bound in Theorem 3 that is more straightforward to interpret. Examples are given in Sections 4.1 and 5.1, where Theorem 3 is applied to evaluate the quality of side information in idealized scenarios. The mc(U) term is analogous to the inverse margin in the perceptron, SVM, and other large-margin classifiers.

Theorem 1. The expected regret of Algorithm 1 with non-conservative updates ([NON-CONSERVATIVE] = 1) and parameters γ ∈ (0, 1], D̂ ≥ D^γ_{M,N}(U), η = √(D̂ log(m + n)/(2T)), and p.d. matrices M ∈ S^m_{++} and N ∈ S^n_{++} is bounded by

∑_{t=1}^T E[y_t ≠ ŷ_t] − ∑_{t=1}^T [y_t ≠ U_{i_t j_t}] ≤ 4 √( 2 (D̂/γ²) log(m + n) T )   (5)

for all U ∈ {−1, 1}^{m×n} with ∥U∥_max ≤ 1/γ. The mistakes in the realizable case with conservative updates ([NON-CONSERVATIVE] = 0) and parameters 1/η = 1/γ ≥ mc(U) and D̂ ≥ min_{V ∈ SP1(U)} D^γ_{M,N}(V) are bounded by

|𝓜| ≤ 3.6 (D̂/γ²) log(m + n) ,   (6)

for all U ∈ {−1, 1}^{m×n} with mc(U) ≤ 1/γ and y_t = U_{i_t j_t} for all t ∈ 𝓜.

If the side information is vacuous, that is M = I_m and N = I_n, then D = m + n. In this scenario, we recover a special case¹ of the analysis of [21] up to constant factors; with the additional assumption of realizability we recover [24, Theorem 3.1]. The term D is difficult to directly quantify. In the next section, we specialize our analysis to the case that the matrix U has a latent block structure.

¹ In [21], a regret bound for general loss functions for matrix completion without side information is given for (β, τ)-decomposable matrices. When β is at its minimum over all possible decompositions, we recover the bound up to constant factors with respect to the expected 0-1 loss. On the algorithmic level, our works are similar except that the algorithm of [21] contains an additional projection step that dominates the computation time of the update.
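The core of Algorithm 1 is the matrix exponentiated gradient step W^{t+1} = exp(log(W^t) + η y_t X_t), applied on margin-error trials. The sketch below shows only this update, assuming SciPy for the matrix exponential and logarithm; the construction of X_t from the side information matrices and the randomized thresholding of Algorithm 1 are omitted, and the helper name `meg_update` is hypothetical.

```python
import numpy as np
from scipy.linalg import expm, logm

def meg_update(W, X, y, eta):
    """One matrix exponentiated gradient step: W' = exp(log(W) + eta * y * X)."""
    S = np.real(logm(W)) + eta * y * X   # matrix logarithm of the SPD parameter
    S = (S + S.T) / 2                    # symmetrize against numerical drift
    return expm(S)

# Toy usage with an initialization proportional to the identity and a rank-one
# instance X = x x^T, mirroring the form of X_t in (4).
d = 5                                    # plays the role of m + n
W = np.eye(d) / d
x = np.random.default_rng(1).normal(size=d)
x /= np.linalg.norm(x)
X = np.outer(x, x)
W = meg_update(W, X, y=+1, eta=0.1)
print(np.allclose(W, W.T), np.all(np.linalg.eigvalsh(W) > 0))
```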
4 Latent Block Structure

We introduce the concept class of (k, ℓ)-binary-biclustered matrices (previously defined in [24, Section 5]) in the following definition. We then give an upper bound on D^γ_{M,N}(U) when a matrix has this type of latent structure in Theorem 3. The magnitude of the bound will depend on how predictive the matrices M and N are of the latent block structure. In Sections 4.1 and 4.2, we will use a variant of the discrete Laplacian matrix for M and N to encode side information and illustrate the resultant bounds for idealized scenarios.

Definition 2. The class of (k, ℓ)-binary-biclustered matrices is defined as

B^{m,n}_{k,ℓ} := {U ∈ {−1, 1}^{m×n} : ∃ r ∈ [k]^m, c ∈ [ℓ]^n, U* ∈ {−1, 1}^{k×ℓ}, U_{ij} = U*_{r_i c_j}, i ∈ [m], j ∈ [n]} .

Thus each row i is associated with a latent factor r_i ∈ [k], each column j is associated with a latent factor c_j ∈ [ℓ], and the interaction of factors is determined by a matrix U* ∈ {−1, 1}^{k×ℓ}. More visually, a binary matrix is (k, ℓ)-biclustered if there exists some permutation of the rows and columns into a k × ℓ grid of blocks, each uniformly labeled −1 or +1, as illustrated in Figure 1. Determining if a matrix is in B^{m,n}_{k,ℓ} may be done directly by a greedy algorithm. However, the problem of determining if a matrix with missing entries may be completed to a matrix in B^{m,n}_{k,ℓ} was shown in [29, Lemma 8] to be NP-complete by reducing the problem to CLIQUE COVER.

Many natural functions of matrix complexity are invariant to the presence of block structure. A function f : 𝒳 → ℝ with respect to a class of matrices 𝒳 is block-invariant if for all m, k, n, ℓ ∈ ℕ₊ with m ≥ k and n ≥ ℓ, R ∈ B^{m,k} and C ∈ B^{n,ℓ}, we have that f(X) = f(RXC^⊤) for any k × ℓ matrix X ∈ 𝒳. The max-norm, margin complexity, rank and VC-dimension² are all block-invariant. From the block-invariance of the max-norm, we may conclude that for U ∈ B^{m,n}_{k,ℓ},

mc(U) ≤ ∥U∥_max = ∥U*∥_max ≤ min(√k, √ℓ) .   (7)

This follows since we may decompose U = RU*C^⊤ for some U* ∈ {−1, 1}^{k×ℓ}, R ∈ B^{m,k} and C ∈ B^{n,ℓ}, and then use the observation in the preliminaries that the max-norm of any matrix in {−1, 1}^{m×n} is bounded by min(√m, √n).

In the following theorem, we give a bound for the quasi-dimension D^γ_{M,N}(U) which will scale with the dimensions of the latent block structure and the predictivity of M and N with respect to that block structure. The bound is independent of γ in so far as D^γ_{M,N}(U) is finite.

Theorem 3. If U ∈ B^{m,n}_{k,ℓ}, define D^⊛_{M,N}(U) as the minimum, over all decompositions of U = RU*C^⊤ with R ∈ B^{m,k}, C ∈ B^{n,ℓ} and U* ∈ {−1, 1}^{k×ℓ}, of

  2 tr(R^⊤MR)R_M + 2 tr(C^⊤NC)R_N + 2k + 2ℓ   if M and N are PDLaplacians,
  k tr(R^⊤MR)R_M + ℓ tr(C^⊤NC)R_N             if M ∈ S^m_{++} and N ∈ S^n_{++} .   (8)

Thus for U ∈ B^{m,n}_{k,ℓ},

  D^γ_{M,N}(U) ≤ D^⊛_{M,N}(U)  (if ∥U∥_max ≤ 1/γ)   and   min_{V ∈ SP1(U)} D^γ_{M,N}(V) ≤ D^⊛_{M,N}(U)  (if mc(U) ≤ 1/γ) .

The bound D^⊛_{M,N}(U) allows us to bound the quality of the side information in terms of a hypothetical learning problem. Recall that min{(r^⊤Mr)R_M : r_i y_i ≥ 1, i ∈ [m]} is the upper bound on the mistakes per Novikoff's theorem [30] for predicting the elements of a vector y ∈ {−1, 1}^m with a kernel perceptron using M^{−1} as the kernel. Hence the term O(tr(R^⊤MR)R_M) in (8) may be interpreted as a bound for a one-versus-all k-class kernel perceptron where R encodes a labeling from [k]^m as one-hot vectors. We next show an example where D^⊛_{M,N}(U) ∈ O(k + ℓ) with ideal side information.

² Here, a hypothesis class H defines a matrix via U := (h(x))_{h∈H, x∈X}.
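Definition 2 can be turned directly into a sampler: draw row factors r ∈ [k]^m, column factors c ∈ [ℓ]^n and a block label matrix U*, and set U_ij = U*_{r_i c_j}. The sketch below (hypothetical helper name `biclustered`) also shows the permutation by factors that exposes the block structure of Figure 1.

```python
import numpy as np

def biclustered(m, n, k, ell, seed=0):
    """Sample a (k, ell)-biclustered sign matrix as in Definition 2:
    U[i, j] = U_star[r[i], c[j]] for row factors r and column factors c."""
    rng = np.random.default_rng(seed)
    r = rng.integers(k, size=m)                    # latent row factors in [k]
    c = rng.integers(ell, size=n)                  # latent column factors in [ell]
    U_star = rng.choice([-1, 1], size=(k, ell))    # block labels
    return U_star[np.ix_(r, c)], r, c, U_star

U, r, c, U_star = biclustered(50, 50, 9, 9)
# Sorting rows/columns by their factors reveals the latent blocks of Figure 1.
blocks = U[np.argsort(r)][:, np.argsort(c)]
print(U.shape, np.unique(U), blocks.shape)
```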
4.1 Graph-based Side Information

We may use a pair of separate graph Laplacians to represent the side information on the rows and the columns. A given row (column) corresponds to a vertex in the row graph (column graph). The weight of edge (i, j) represents our prior belief that row (column) i and row (column) j share the same underlying factor. Such graphs may be inherent to the data: for example, we may have a social network of users and a network based on shared actors or genres for the movies in a Netflix-type scenario. Alternatively, as is common in graph-based semi-supervised learning [31, 32], we may build a graph based on vectorial data associated with the rows (columns), for example user demographics. Although the value of D will vary smoothly with how predictive M and N are of the factor structure, in the following we give an example to quantify D in a best-case scenario.

Bounding D for ideal graph-based side information. In this ideal case we are assuming that we know the partition of [m] that maps rows to factors. The rows that share factors have an edge between them and there are no other edges. Given k factors, we then have a graph that consists of k disjoint cliques. However, to meet the technical requirement that the side information matrix M (N) is positive definite, we need to connect the cliques in a minimal fashion. We achieve this by connecting the cliques like a star graph. Specifically, a clique is arbitrarily chosen as the center and a vertex in that clique is arbitrarily chosen as the central vertex. From each of the other cliques, a vertex is chosen arbitrarily and connected to the central vertex. Observe that a property of this construction is that there is a path of length at most 4 between any pair of vertices. Now we can use the bound from Theorem 3,

D^⊛ = 2 tr(R^⊤MR)R_M + 2 tr(C^⊤NC)R_N + 2k + 2ℓ ,

to bound D ≤ D^⊛ in this idealized case. We focus on the rows, as a parallel argument may be made for the side information on the columns. Consider the term tr(R^⊤MR)R_M, where M := L^◦ is the PDLaplacian formed from a graph with Laplacian L. Then, using the observation from the preliminaries that (u^⊤L^◦u)R_{L^◦} ≤ 2(u^⊤Lu R_L + 1), we have that tr(R^⊤MR)R_M ≤ 2 tr(R^⊤LR)R_L + 2k. To evaluate this, we use the well-known equality tr(R^⊤LR) = ∑_{(i,j)∈E} ∥R_i − R_j∥². Observing that each of the m rows of R is a one-hot encoding of the corresponding factor, only the edges between classes contribute to the sum of the norms, and thus by construction tr(R^⊤LR) = 2(k − 1). We bound R_L ≤ 4, using the fact that the graph diameter is a bound on R_L (see [33, Theorem 4.2]). Combining terms and assuming similar idealized side information on the columns, we obtain D ∈ O(k + ℓ).

Observe then that since the comparator matrix is (k, ℓ)-biclustered, we have in the realizable case (with exact tuning) that mc(U)² ≤ min(k, ℓ) by (7). Thus, the mistakes of the algorithm are bounded by O(mc(U)² D) = O(kℓ). This upper bound is tight up to logarithmic factors, as we may decompose U = RU*C^⊤ for some U* ∈ {−1, 1}^{k×ℓ}, R ∈ B^{m,k} and C ∈ B^{n,ℓ}, and force a mistake for each of the kℓ entries in U*.

Can side information provably help? Unsurprisingly, yes. Consider the set of matrices such that each row is either all "+1" or all "−1". This set is exactly B^{m,n}_{2,1}. Clearly, an adversary can force m mistakes, whereas with ideal side information the upper bound is O(1).
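The star-of-cliques construction above is easy to write down explicitly. The sketch below assumes unit edge weights, builds the graph for a given row-to-factor assignment, and checks that tr(R^⊤LR) only picks up the k − 1 connecting edges; the helper name `ideal_row_graph` is hypothetical.

```python
import numpy as np

def ideal_row_graph(r, k):
    """Adjacency matrix of the ideal side-information graph: one clique per factor,
    with each clique joined to an arbitrary central vertex by a single edge."""
    m = len(r)
    A = np.zeros((m, m))
    reps = []
    for f in range(k):
        idx = np.flatnonzero(r == f)
        A[np.ix_(idx, idx)] = 1.0        # clique on the rows of factor f
        reps.append(idx[0])              # arbitrary representative of the clique
    np.fill_diagonal(A, 0.0)
    for rep in reps[1:]:                 # star edges into the central clique
        A[reps[0], rep] = A[rep, reps[0]] = 1.0
    return A

k = 4
r = np.repeat(np.arange(k), [8, 7, 8, 7])   # row-to-factor assignment, m = 30
A = ideal_row_graph(r, k)
L = np.diag(A.sum(axis=1)) - A
R = np.eye(k)[r]                            # one-hot encoding of the factors
print(np.trace(R.T @ L @ R))                # 2(k - 1) = 6: only star edges contribute
```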
Similar results to the above can be obtained via alternative positive definite embeddings. For example, consider a k-partition kernel on [m], where K_{α,S_1,...,S_k}(i, j) := [i, j ∈ S_r for some r ∈ [k]] + α[i = j] for some partition of [m] into disjoint sets S_1, . . . , S_k. By using M^{−1} = (K_{α,S_1,...,S_k}(i, j))_{i,j∈[m]} one can obtain, for small α, bounds that are tighter than those achieved by the Laplacian with respect to constant factors. We have focused on the Laplacian as a method for encoding side information as it is more straightforward to encode softer knowledge of relationships [32].

4.2 Online Community Membership Prediction

A special case of matrix completion is the case where there are m objects which are assumed to lie in k classes (communities). In this case, the underlying matrix U ∈ {−1, 1}^{m×m} is given by U_{ij} = 1 if i and j are in the same class and U_{ij} = −1 otherwise. Thus this may be viewed as an online version of community detection or similarity prediction. In [22], this problem was addressed when the side information was encoded in a graph and the aim was to perform well when there were few edges between classes (communities). Observe that this is an example of a (k, k)-biclustered m × m matrix where U* = 2I_k − 11^⊤ and there exists R ∈ B^{m,k} such that U := RU*R^⊤. Since the max-norm is block-invariant, we have that ∥U∥_max = ∥U*∥_max. In the case of a general k × k biclustered matrix, ∥U*∥_max ≤ √k (see (7)). However, in the case of similarity prediction we have ∥U*∥_max ∈ O(1). This follows since U* admits a decomposition U* = PQ^⊤ in which the rows of P and Q have constant norm, independent of k. Hence, with ideal graph side information (so that D ∈ O(tr(R^⊤LR)R_L)), a mistake bound of O(tr(R^⊤LR)R_L) is obtained, which recovers the bound of [22, Proposition 4] up to constant factors. This work extends the results in [22] for similarity prediction to regret bounds, and to the inductive setting with general p.d. matrices. In the next section, we will see how this type of result may be extended to an inductive setting.

5 Inductive Matrix Completion

In the previous section, the learner was assumed to have complete foreknowledge of the side information through the matrices M and N. In the inductive setting, the learner has instead kernel side-information functions M⁺ and N⁺. With complete foreknowledge of the rows (columns) that will be observed, one may use M⁺ (N⁺) to compute M (N), which corresponds to an inverse of a submatrix of M⁺ (N⁺). In the inductive setting, unlike the transductive setting, we do not have this foreknowledge and thus cannot compute M (N) in advance. Notice that the assumption of side information as kernel functions is not particularly limiting, as for instance the side information could be provided by vectors in ℝ^d together with, for example, the min kernel, for which tr(R^⊤K^{−1}R) can be bounded in terms of k.

Recall the bound (see (8)) on the quasi-dimension for a matrix U ∈ B^{m,n}_{k,ℓ}, where we have D ≤ D^⊛ = k tr(R^⊤MR)R_M + ℓ tr(C^⊤NC)R_N for positive definite matrices. If we assume that the side information on the rows (columns) lies in [−r, r]^d, then R_M ≤ (4r/δ)^d (respectively R_N ≤ (4r/δ)^d) for the min kernel. Thus, applying the above bound separately for the rows and columns and substituting into (8), we have that D ≤ D^⊛ = k²(4r/δ)^d + ℓ²(4r/δ)^d. We then observe that, for this example, with an optimal tuning and well-separated side information on the rows and columns, the mistake bound for a (k, ℓ)-biclustered matrix in the inductive setting is of O(min(k, ℓ) max(k, ℓ)²). However, our best lower bound in terms of k and ℓ is just kℓ, as in the transductive setting. An open problem is to resolve this gap.
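The relationship between the kernel side-information functions and the matrices M and N can be illustrated directly: with foreknowledge of the rows to be observed, M is the inverse of the kernel Gram submatrix on those rows, as described above. The sketch below uses a shifted min kernel on scalar features as a stand-in; the kernel shift and the helper name `side_info_matrix` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def side_info_matrix(kernel, items):
    """Transductive side-information matrix from a kernel function: the inverse
    of the kernel Gram (sub)matrix restricted to the items that will be observed."""
    G = np.array([[kernel(a, b) for b in items] for a in items])
    return np.linalg.inv(G)

# Hypothetical kernel on scalar row features in [-r, r]: a shifted min kernel,
# min(a, b) + r + 1, which is positive definite on distinct points (illustrative).
r = 1.0
min_kernel = lambda a, b: min(a, b) + r + 1.0

row_features = [-0.8, -0.1, 0.3, 0.9]      # side information of the observed rows
M = side_info_matrix(min_kernel, row_features)
print(np.all(np.linalg.eigvalsh(M) > 0))   # M is positive definite
```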
6 Discussion

We have presented a regret bound with respect to the 0-1 loss for predicting the elements of matrices with latent block structure in the presence of side information. The bound scales with √(D^⊛_{M,N}(U) min(k, ℓ) T), up to a logarithmic factor, for U ∈ B^{m,n}_{k,ℓ}. In the case of idealized side information, the term D scaled linearly with k and ℓ in the transductive setting, and quadratically in the inductive setting.

Problems for further research include resolving the gap between the upper and lower bounds in the inductive setting and building richer models of side information. In this work, the interrelation was restricted to equivalence relations on the row (column) spaces. A direct generalization would be to consider partial orderings. Such a generalization would be natural for ranking-based concept classes such as the online gambling scenario given in [21]. A broader generalization would allow side information to hint at interrelationships between rows and columns rather than keeping them independent.

7 Acknowledgements

We would like to thank Robin Hirsch for valuable discussions. This research was supported by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. This research was further supported by the Engineering and Physical Sciences Research Council grant number EP/L015242/1.

Broader Impact

In general this work does not present any foreseeable specific societal consequence in the authors' joint opinion. This is foundational research in regret-bounded online learning. As such it is not targeted towards any particular application area. Although this research may have societal impact for good or for ill in the future, we cannot foresee its shape or extent.

Funding Transparency Statement

The authors were supported by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under agreement number W911NF-16-3-0001 and by the Engineering and Physical Sciences Research Council grant number EP/L015242/1.

Competing Interests

The authors assert no competing interests.

References

[1] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of the KDD Cup Workshop 2007, pages 3–6, New York, August 2007. ACM.
[2] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
[3] C. Gao, Y. Lu, Z. Ma, and H. H. Zhou. Optimal estimation and completion of matrices with biclustering structures. Journal of Machine Learning Research, 17:161:1–161:29, 2016.
[4] E. Candès and B. Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, June 2012.
[5] S. Ben-David, N. Eiron, and H. U. Simon. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3:441–461, 2003.
[6] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign matrices. Combinatorica, 27(4):439–463, 2007.
[7] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560, 2005.
[8] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.
[9] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, May 2010.
[10] A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization.
In Proceedings of the 27th Conference on Learning Theory, pages 55–76, 2013.
[11] K.-Y. Chiang, I. S. Dhillon, and C.-J. Hsieh. Using side information to reliably learn low-rank matrices from missing and corrupted observations. Journal of Machine Learning Research, 19, 2018.
[12] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.
[13] J. Abernethy, F. Bach, T. Evgeniou, and J. Vert. Low-rank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.
[14] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, 2013.
[15] V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst. Matrix completion on graphs. Technical report, EPFL, 2014.
[16] N. Rao, H.-F. Yu, P. Ravikumar, and I. Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, 2015.
[17] X. Zhang, S. S. Du, and Q. Gu. Fast and sample efficient inductive matrix completion via multi-phase Procrustes flow. In Proceedings of Machine Learning Research, 2018.
[18] S. A. Goldman, R. L. Rivest, and R. E. Schapire. Learning binary relations and total orders. SIAM Journal on Computing, 22(5), 1993.
[19] S. A. Goldman and M. K. Warmuth. Learning binary relations using weighted majority voting. In Proceedings of the 6th Annual Conference on Computational Learning Theory, pages 453–462, 1993.
[20] N. Cesa-Bianchi and O. Shamir. Efficient online learning via randomized rounding. In Advances in Neural Information Processing Systems 24, pages 343–351, 2011.
[21] E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In Proceedings of the 23rd Annual Conference on Learning Theory, volume 23, pages 38.1–38.13, 2012.
[22] C. Gentile, M. Herbster, and S. Pasteris. Online similarity prediction of networked data from known and unknown graphs. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.
[23] M. Herbster, S. Pasteris, and M. Pontil. Predicting a switching sequence of graph labelings. Journal of Machine Learning Research, 16:2003–2022, 2015.
[24] M. Herbster, S. Pasteris, and M. Pontil. Mistake bounds for binary matrix completion. In Advances in Neural Information Processing Systems 29, pages 3954–3962, 2016.
[25] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.
[26] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, April 1988.
[27] S. Sabato, S. Shalev-Shwartz, N. Srebro, D. J. Hsu, and T. Zhang. Learning sparse low-threshold linear classifiers. Journal of Machine Learning Research, 16:1275–1304, 2015.
[28] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Advances in Neural Information Processing Systems 19, pages 577–584, 2006.
[29] R. Ganian, I. A. Kanj, S. Ordyniak, and S. Szeider. Parameterized algorithms for the matrix completion problem. In Proceedings of the 35th International Conference on Machine Learning, pages 1642–1651, 2018.
[30] A. B. Novikoff. On convergence proofs on perceptrons.
In Proceedings of the Symposium on the Mathematical Theory of Automata, pages 615–622, 1962.
[31] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56:209–239, 2004.
[32] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2009.
[33] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, 2005.
[34] M. K. Warmuth, W. Kotłowski, and S. Zhou. Kernelization of matrix updates, when and how? In Algorithmic Learning Theory, pages 350–364. Springer, Berlin, Heidelberg, 2012.
[35] R. Bhatia. Matrix Analysis. Springer-Verlag, New York, 1997.
[36] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.