# A Unifying View of Representer Theorems

Andreas Argyriou (ARGYRIOUA@ECP.FR), École Centrale Paris, Center for Visual Computing
Francesco Dinuzzo (FRANCESD@IE.IBM.COM), IBM Research, Dublin and Max Planck Institute for Intelligent Systems, Tübingen

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

It is known that the solution of regularization and interpolation problems with Hilbertian penalties can be expressed as a linear combination of the data. This very useful property, called the representer theorem, has been widely studied and applied to machine learning problems. Analogous optimality conditions have appeared in other contexts, notably in matrix regularization. In this paper we propose a unified view, which generalizes the concept of representer theorems and extends necessary and sufficient conditions for such theorems to hold. Our main result shows a close connection between representer theorems and certain classes of regularization penalties, which we call orthomonotone functions. This result not only subsumes previous representer theorems as special cases but also yields a new class of optimality conditions, which goes beyond the classical linear combination of the data. Moreover, orthomonotonicity provides a useful criterion for testing whether a representer theorem holds for a specific regularization problem.

1. Introduction

One of the dominant approaches in machine learning and statistics is to formulate a learning problem as an optimization problem to be solved. In particular, regularization has been widely used for learning or estimating functions or models from input and output data, especially in supervised and semisupervised learning. Regularization in a Hilbert space $H$ frames the problem of learning from data as a minimization of the type

$$\min\{ f(\langle w, w_1\rangle, \ldots, \langle w, w_m\rangle) + \gamma\,\Omega(w) : w \in H \}. \qquad (1)$$

The objective function is the sum of an error term $f$, which depends on prescribed data^1 $w_1, \ldots, w_m \in H$, and a regularization penalty $\Omega$, which favors certain desirable properties of the solution. An optimal solution of problem (1) yields the desired function or vector, depending on the context of the original learning problem.

^1 We use the term data in a more general sense than input vectors in Euclidean space; see Section 3 for examples.

It is known that, for a certain class of regularization and interpolation problems, one of the optimal solutions of (1) can be expressed as a linear combination of the data. More specifically, this is the case when the penalty $\Omega$ is a Hilbertian norm (or a nondecreasing function of it). This property, known as the representer theorem, has proven very useful because it renders many high or infinite dimensional regularization problems amenable to practical computation. This classical representer theorem was formulated in various guises in (Girosi, 1998; Kimeldorf & Wahba, 1970; Schölkopf et al., 2001) and has been the topic of extensive further study (Argyriou et al., 2009; De Vito et al., 2004; Dinuzzo & Schölkopf, 2012; Dinuzzo et al., 2007; Gnecco & Sanguineti, 2010; Mukherjee & Wu, 2006; Steinwart, 2003; Yu et al., 2013). In machine learning, the representer theorem is the main factor that enables application of the so-called kernel trick and underpins all of the widely used kernel methods (Schölkopf & Smola, 2002), such as support vector machines, regularization networks, etc. Besides the classical result, new types of representer theorems have more recently been proven and studied.
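As a rough numerical illustration (our own, not code from the paper), the following Python sketch instantiates problem (1) with $H = \mathbb{R}^5$, a squared error term and the Hilbertian penalty $\Omega(w) = \|w\|^2$, and checks empirically that a minimizer lies in the span of the data, as the classical representer theorem predicts. All names and the numeric setup are illustrative assumptions.

```python
# Minimal sketch (ours) of problem (1): squared-error term, Omega(w) = ||w||^2.
# The classical representer theorem predicts a minimizer in span{w_1, ..., w_m}.
import numpy as np

rng = np.random.default_rng(0)
d, m, gamma = 5, 3, 0.1
W = rng.standard_normal((m, d))          # rows are the data w_1, ..., w_m
y = rng.standard_normal(m)               # illustrative targets

def J(w):
    # f(<w,w_1>, ..., <w,w_m>) + gamma * Omega(w), with squared error
    return np.sum((W @ w - y) ** 2) + gamma * w @ w

w = rng.standard_normal(d)               # start off the span of the data
eta = 0.01
for _ in range(20000):                   # plain gradient descent
    grad = 2 * W.T @ (W @ w - y) + 2 * gamma * w
    w -= eta * grad

P = W.T @ np.linalg.pinv(W @ W.T) @ W    # orthogonal projector onto span of the data
print("distance of minimizer from span of data:", np.linalg.norm(w - P @ w))  # ~ 0
print("J(w) =", J(w), " J(P w) =", J(P @ w))
```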
For example, it has been realized that analogous optimality conditions apply to the learning of vector-valued functions (Micchelli & Pontil, 2005), $\ell_2$-regularized multitask learning (Evgeniou et al., 2005) and structured prediction (Lafferty et al., 2004). Further developments occurred with the advent of matrix regularization problems used for multitask learning or collaborative filtering. Thus it has been shown that a type of representer theorem holds when the penalty $\Omega$ is a spectral function of matrices (Amit et al., 2007; Argyriou et al., 2009; 2010) or operators (Abernethy et al., 2009). Very recently these results have been extended to matricizations of tensors as well (Signoretto et al., 2013). Other related results have appeared in the contexts of domain adaptation (Kulis et al., 2011), dimensionality reduction (Jain et al., 2010) and metric learning (Jain et al., 2012). Some variants of the classical theorem were shown in the contexts of semisupervised learning (Belkin et al., 2006), semiparametric representer theorems and kernel PCA (Schölkopf et al., 2001). Moreover, alternative approaches have appeared which lie outside the scope of this paper, such as a Bayesian variant of the classical theorem (Pillai et al., 2007), the theory of reproducing kernel Banach spaces (Zhang & Zhang, 2012) and an algorithmic theorem for matrices (Warmuth et al., 2012). Clearly, therefore, representer theorems are important and ubiquitous tools in regularization and underlie a wide range of frequently used machine learning methodologies.

In this paper, we address the topic of representer theorems from a new and more abstract viewpoint. One of our contributions is to provide a unifying framework which subsumes the results that have already appeared in the literature. These include the classical, vector-valued, structured prediction, multitask, tensor, semisupervised, semiparametric, dimensionality reduction, domain adaptation and metric learning results, among others. In particular, we show that these theorems are only examples from a larger family. Each theorem in this family corresponds to a class of regularization penalties which are characterized by an orthomonotonicity property that we introduce. Another implication of our results is that we can now put the study of representer theorems on a formal basis and provide calculus rules and recipes for deriving new results. Most commonly used kernel methods (support vector machines, kernel ridge regression etc.), as well as methods for multitask learning, collaborative filtering and metric learning, fall within our framework. As an illustration of the theory, we demonstrate that regularization problems with a generalized family of matrix penalties, as well as similar problems on the positive semidefinite cone, admit appropriate representer theorems. In many practical situations this implies that the number of degrees of freedom, and hence the complexity of solving the learning problem, decreases significantly.^2

^2 The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7 2007-2013) under grant agreement No. 246556.

2. Mathematical Preliminaries

In this section, we introduce the notation and the mathematical concepts necessary for our framework and the main results of Section 3.
2.1. Notational Conventions

Let $H$ be a real Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$. We use $\mathcal{L}(H)$ to denote the set of linear operators from $H$ to itself. We denote the identity operator by $\mathrm{Id} \in \mathcal{L}(H)$ and the set of linear subspaces of $H$ by $\mathcal{V}(H)$. Also let $\mathbb{N}_m$ denote the set of integers $\{1, \ldots, m\}$, $M_{d,n}$ the set of real $d \times n$ matrices and $M_n$ the set of real $n \times n$ matrices. Moreover, let $S^n_+$ denote the set of $n \times n$ positive semidefinite matrices and $S^n_{++}$ the set of positive definite ones. We denote the $t$-th column of a matrix $W \in M_{d,n}$ by $w_t$. We use the following notation for operations on sets $A, B \subseteq H$:

$$A + B := \{a + b : a \in A,\ b \in B\}, \quad A - B := \{a - b : a \in A,\ b \in B\}, \quad \lambda A := \{\lambda a : a \in A\} \ \text{for every } \lambda \in \mathbb{R}.$$

In the following, we will be working with subspace-valued maps $S : H \to \mathcal{V}(H)$. This choice is natural, since representer theorems are statements that solutions of certain optimization problems belong to certain subspaces. For more details, see Section 3 and our general definition of representer theorems. Given two subspace-valued maps $S_1$ and $S_2$, their sum $S_1 + S_2$ maps every $x \in H$ to the sum of subspaces $S_1(x) + S_2(x)$.

2.2. Quasilinear Subspace-Valued Maps

To extend the concept of representers, we first introduce a variant of linearity appropriate for subspace-valued maps.^3

^3 Proofs of the lemmas appearing throughout the paper can be found in the supplement.

Definition 2.1. We call the map $S : H \to \mathcal{V}(H)$ quasilinear if $S(\alpha x + \beta y) \subseteq \alpha S(x) + \beta S(y)$ for every $x, y \in H$ and $\alpha, \beta \in \mathbb{R}$.

Proposition 2.1. Let $S$ denote a quasilinear subspace-valued map. Then
$$S(\alpha x) = \alpha S(x) \qquad (2)$$
and
$$\alpha S(x) \subseteq S(\alpha x + \beta y) + \beta S(y) \qquad (3)$$
for every $x, y \in H$ and $\alpha, \beta \in \mathbb{R}$.

Definition 2.2. We call the map $S : H \to \mathcal{V}(H)$ idempotent if $S(S(x)) = S(x)$ for all $x \in H$.

Lemma 2.1. Let $S$ be a quasilinear and idempotent subspace-valued map. Then, for every $m \in \mathbb{N}$ and every set $\{x_i : i \in \mathbb{N}_m\} \subseteq H$, it holds that
$$S\Big(\sum_{i \in \mathbb{N}_m} S(x_i)\Big) \subseteq \sum_{i \in \mathbb{N}_m} S(x_i).$$

Thus addition of subspaces can be used to generate subspaces invariant under $S$. In addition to the quasilinearity and idempotence assumptions, we require that sums of images under $S$ be closed. This ensures that orthogonal projection onto such subspaces is feasible, which is a crucial step in the proof of representer theorems. For simplicity, to satisfy this property we assume that all images under $S$ are finite dimensional. Another assumption necessary for the proof of our main result is that any point belongs to its image under $S$. Summarizing, we collect all of the above assumptions in the following definition.

Definition 2.3. Let $r \in \mathbb{N}$. We call the subspace-valued map $S : H \to \mathcal{V}(H)$ r-regular quasilinear if it is quasilinear, idempotent and if, for all $x \in H$, $S(x)$ has dimension at most $r$ and contains $x$.

The simplest example of a regular quasilinear subspace-valued map $S$ is the map that associates a given vector with its own linear span, which thus has dimension one.

Example 2.1. Suppose that $S$ maps $x \mapsto \mathrm{span}\{x\}$. Then $S$ is 1-regular quasilinear.

More generally, we can map each point to a subspace by applying a set of linear transformations and taking the linear subspace spanned by the resulting vectors.

Example 2.2. Let $r \in \mathbb{N}$ and suppose that $S$ maps $x \mapsto \mathrm{span}\{T_i x : i \in \mathbb{N}_r\}$, where $T_i \in \mathcal{L}(H)$ for all $i \in \mathbb{N}_r$. Then $S$ is quasilinear. This map is r-regular quasilinear if $\mathrm{Id} \in \mathrm{span}\{T_i : i \in \mathbb{N}_r\}$ and $T_j T_\ell \in \mathrm{span}\{T_i : i \in \mathbb{N}_r\}$ for all $j, \ell \in \mathbb{N}_r$.

Remark 2.1. It may not hold that $S$ maps every linear subspace of $H$ to a linear subspace (as illustrated in Example 2.2 when, say, $r = 2$ and $T_1 x, T_2 x, T_1 y, T_2 y$ are linearly independent for some $x, y \in H$). Even when this condition holds, the image of a linear subspace may be a different subspace (consider $S(x) = H$ for $x \ne 0$, and the subspace $\mathrm{span}\{x\}$).
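The quasilinearity inclusion of Definition 2.1 can be checked numerically for the map of Example 2.2 by comparing matrix ranks. The sketch below is our own illustration under assumed random data; note that for nonzero $\alpha, \beta$ the subspace $\alpha S(x) + \beta S(y)$ coincides with $S(x) + S(y)$.

```python
# Sketch (ours): check S(a x + b y) is contained in a S(x) + b S(y) for the map
# of Example 2.2, S(x) = span{T_i x : i = 1..r}, via ranks.
import numpy as np

rng = np.random.default_rng(1)
d, r = 10, 3
Ts = [np.eye(d)] + [rng.standard_normal((d, d)) for _ in range(r - 1)]  # T_1 = Id

def basis(x):
    """Columns spanning S(x) = span{T_i x}."""
    return np.column_stack([T @ x for T in Ts])

x, y = rng.standard_normal(d), rng.standard_normal(d)
a, b = 2.0, -0.7

lhs = basis(a * x + b * y)                       # spans S(a x + b y)
rhs = np.column_stack([basis(x), basis(y)])      # spans a S(x) + b S(y) for nonzero a, b

# Inclusion holds iff appending lhs to rhs does not increase the rank.
print(np.linalg.matrix_rank(rhs) ==
      np.linalg.matrix_rank(np.column_stack([rhs, lhs])))   # True
```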
A special case of Example 2.2 is the following, defined for a space of matrices. As we shall see, this example is relevant to representer theorems for multitask learning.

Example 2.3. Let $H = M_{d,n}$, equipped with the standard inner product, and suppose that $S$ maps $X \mapsto \{XC : C \in M_n\}$. Then $S$ is $\min\{n^2, dn\}$-regular quasilinear.

3. Characterization of the General Representer Theorem

Our focus of interest is the variational problem of minimizing, over a Hilbert space, a regularization functional of the form
$$J(w) = f(\langle w, w_1\rangle, \ldots, \langle w, w_m\rangle) + \gamma\,\Omega(w). \qquad (4)$$
The functional $J$ is the sum of an error term $f : \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$, which depends on prescribed data $w_1, \ldots, w_m \in H$, and a regularization term $\Omega : H \to \mathbb{R} \cup \{+\infty\}$, which enforces certain desirable properties on the solution, scaled by a regularization parameter $\gamma > 0$. We allow both $f$ and $\Omega$ to take the value $+\infty$, so that interpolation problems and regularization problems of the Ivanov type can also be taken into account. Since the same functional $J$ might be decomposed into a form like (4) in multiple ways, we fix $m \in \mathbb{N}$ and use the tuple $(f, \Omega, \gamma, w_1, \ldots, w_m)$ to describe such a regularization functional.

Example 3.1 (Interpolation in a Hilbert space). Let $w_1, \ldots, w_m \in H$ and $y_1, \ldots, y_m \in \mathbb{R}$ be prescribed data and
$$f(z) = \begin{cases} 0 & \text{if } z = y \\ +\infty & \text{otherwise} \end{cases}, \qquad z \in \mathbb{R}^m,$$
where $y = (y_1, \ldots, y_m)$. Then the interpolation problem
$$\min\{\Omega(w) : w \in H,\ \langle w, w_i\rangle = y_i \ \forall i \in \mathbb{N}_m\}$$
is equivalent to the problem of minimizing (4) over $H$.

Example 3.2 (Ivanov regularization). Ivanov regularization amounts to solving a problem of the form
$$\min\{f(\langle w, w_1\rangle, \ldots, \langle w, w_m\rangle) : w \in H,\ \omega(w) \le 1\},$$
where $\omega : H \to \mathbb{R}$ is a prescribed constraining function. Defining
$$\Omega(w) = \begin{cases} 0 & \text{if } \omega(w) \le 1 \\ +\infty & \text{otherwise} \end{cases},$$
this problem can be rewritten as the minimization of a functional of the form (4).

Example 3.3 (Regularization in an RKHS). Reproducing kernel Hilbert spaces (RKHS) are Hilbert spaces $H$ of functions $w : X \to \mathbb{R}$ defined over a nonempty set $X$ such that all point-wise evaluation functionals are bounded, that is, for all $x \in X$ there exists a constant $C_x < +\infty$ such that $|w(x)| \le C_x \|w\|$ for all $w \in H$. It can be shown that RKHS exhibit the so-called reproducing property
$$w(x) = \langle w, K_x\rangle, \qquad (x, w) \in X \times H,$$
where the representers $K_x \in H$ are expressible as sections of a symmetric and positive semidefinite kernel $K : X \times X \to \mathbb{R}$ such that $K_x(y) = K(x, y)$ for all $y \in X$. The reproducing property of an RKHS allows for rewriting any regularization functional of the form
$$J(w) = f(w(x_1), \ldots, w(x_m)) + \gamma\,\Omega(w)$$
in the standard form (4), where the representers $w_i$ coincide with the kernel sections $K_{x_i}$.

Example 3.4 (Regularization with averaged data). In some estimation problems, it may be appropriate to assume that the measured output data are obtained by averaging a function $w$ (to be estimated) with respect to suitable probability measures. Let $P_1, \ldots, P_m$ denote probability measures on the measurable space $(X, \mathcal{A})$, where $\mathcal{A}$ is a $\sigma$-algebra of subsets of $X$, and let $H$ denote a Hilbert space of functions $w : X \to \mathbb{R}$. If, for every $i \in \mathbb{N}_m$, the expectation
$$\mathbb{E}_{P_i}(w) = \int_X w(x)\,dP_i(x)$$
is a bounded linear functional over $H$, then one may consider synthesizing a function $w$ by minimizing a functional of the form
$$J(w) = f(\mathbb{E}_{P_1}(w), \ldots, \mathbb{E}_{P_m}(w)) + \gamma\,\Omega(w),$$
which can be rewritten in the form (4) by introducing suitable representers $w_i$. In particular, if $H$ is an RKHS with a bounded reproducing kernel, the representers of the expectation functionals $\mathbb{E}_{P_i}$ are called kernel mean embeddings (see, for example, (Muandet et al., 2012; Sriperumbudur et al., 2010) and references therein) and can be explicitly expressed as
$$w_i(t) = \int_X K(x, t)\,dP_i(x), \qquad i \in \mathbb{N}_m,\ t \in X.$$
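As a small numerical sketch of Example 3.4 (our own, assuming a Gaussian kernel and an empirical measure $P = \frac{1}{N}\sum_k \delta_{s_k}$), the empirical kernel mean embedding $w_P = \frac{1}{N}\sum_k K_{s_k}$ represents the averaging functional: $\mathbb{E}_P(w) = \langle w, w_P\rangle$ for $w$ in the span of kernel sections.

```python
# Sketch (ours): the empirical kernel mean embedding represents the averaging
# functional, E_P(w) = <w, w_P>, for w = sum_j c_j K_{x_j} in a Gaussian RKHS.
import numpy as np

rng = np.random.default_rng(2)

def K(a, b, sigma=1.0):
    """Gaussian kernel matrix between 1-d sample arrays a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

X = rng.standard_normal(4)             # centers x_j of the expansion of w
c = rng.standard_normal(4)             # coefficients: w = sum_j c_j K_{x_j}
S = rng.standard_normal(50)            # samples s_k defining the empirical measure P

def w(t):
    return K(np.atleast_1d(t), X) @ c  # w(t) = sum_j c_j K(x_j, t)

lhs = w(S).mean()                                   # E_P(w), by evaluating w pointwise
rhs = c @ K(X, S) @ (np.ones(len(S)) / len(S))      # <w, w_P> via the RKHS inner product
print(lhs, rhs)                                     # equal up to round-off
```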
Clearly, we are interested only in cases in which the optimization problem $\min\{J(w) : w \in H\}$ is well defined, that is, a minimizer of $J$ exists. This always holds by construction in machine learning and statistics applications. More generally, existence of a minimizer can be ensured under lower semicontinuity and coercivity conditions on $J$. We will avoid specifying such precise conditions, since they are not relevant to our purposes; instead, we assume existence of minimizers for each problem of interest.

The main question we address in this paper is to characterize the functions $\Omega$ for which minimizers of the regularization functional (4) admit certain convenient representations. As already mentioned in the introduction, representer theorems have been proven for regularization with the Hilbertian norm, Schatten $\ell_p$ regularization (Abernethy et al., 2009; Argyriou et al., 2009; 2010) and in some other cases. These theorems state that a minimizer must lie in a subspace which depends on the data points $w_1, \ldots, w_m$. This dependence on the data varies according to the regularization penalty $\Omega$. For example, in the classical representer theorem ($\Omega$ is the Hilbertian norm), the subspace is simply the span of the data points. In the multitask theorem ($\Omega$ is a spectral function on matrices), the subspace is generated by the columns of the data matrices. Our goal is to unify this prior work under one framework and at the same time to extend the applicability of representer theorems to other regularization problems. The key to this is to associate representations of minimizers with the data points in an abstract way, specifically to associate a subspace to each data point. Hence we assume a subspace-valued map $S : H \to \mathcal{V}(H)$ and require that the representation for a minimizer of (4) be spanned by the elements of $S(w_i)$, $i \in \mathbb{N}_m$.

Definition 3.1. Let $m \in \mathbb{N}$, let $S : H \to \mathcal{V}(H)$ be a subspace-valued map and let $J = (f, \Omega, \gamma, w_1, \ldots, w_m)$ be a regularization functional of the form (4). Then $J$ is said to admit a representer theorem with respect to $S$ if there exists a minimizer $\hat{w}$ of $J$ such that
$$\hat{w} \in \sum_{i=1}^m S(w_i).$$

Definition 3.2. Let $m \in \mathbb{N}$, let $S : H \to \mathcal{V}(H)$ be a subspace-valued map and let $\mathcal{F}$ be a family of regularization functionals of the form (4). Then $\mathcal{F}$ is said to admit a representer theorem with respect to $S$ if every $J \in \mathcal{F}$ admits a representer theorem with respect to $S$.

Our main tool for characterizing regularization functionals that admit representer theorems is the property defined below, which we call orthomonotonicity. The connection between orthomonotonicity and representer theorems has appeared in (Argyriou et al., 2009) in the context of regularization with the Hilbertian norm or with orthogonally invariant matrix penalties. In Theorem 3.1, we extend this connection to a broader class of regularization penalties $\Omega$, which arise by varying the choice of the map $S$.

Definition 3.3. We call the function $\Omega : H \to \mathbb{R} \cup \{+\infty\}$ orthomonotone with respect to the map $S : H \to \mathcal{V}(H)$ if
$$\Omega(x + y) \ge \Omega(x), \qquad \forall x \in H,\ \forall y \in S(x)^\perp. \qquad (5)$$
Note that in this definition the left-hand side of (5), or even both sides, may equal $+\infty$.

Theorem 3.1. Let $r, m \in \mathbb{N}$, $f : \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$, $\Omega : H \to \mathbb{R} \cup \{+\infty\}$ and suppose that $S : H \to \mathcal{V}(H)$ is an r-regular quasilinear map. Then the following hold:
1. If $\Omega$ is orthomonotone w.r.t. $S$ then, for any $w_1, \ldots, w_m \in H$ and any $\gamma > 0$ such that the regularization functional $J = (f, \Omega, \gamma, w_1, \ldots, w_m)$ of the form (4) admits a minimizer, $J$ admits a representer theorem w.r.t. $S$.

2. Let $\mathcal{F}$ denote the following family of regularization functionals of the form (4),
$$\mathcal{F} = \{(f, \Omega, \gamma, w_1, \ldots, w_m) : w_1, \ldots, w_m \in H,\ \gamma > 0\}, \qquad (6)$$
and assume that
- $f$ is lower semicontinuous, admits a unique minimizer $\hat{z} \ne 0$, and there exists $\varepsilon > 0$ such that the sublevel set $\{z \in \mathbb{R}^m : f(z) \le f(\hat{z}) + \varepsilon\}$ is bounded;
- $\Omega$ is lower semicontinuous and is minimized at 0;
- $r \le m$.
If $\mathcal{F}$ admits a representer theorem w.r.t. $S$ then $\Omega$ is orthomonotone w.r.t. $S$.

Proof. The first part of the theorem (sufficiency) can be proven by adapting a classical orthogonality argument. Take any $w_1, \ldots, w_m \in H$ and $\gamma > 0$. Let $L = \sum_{i=1}^m S(w_i)$ and let $L^\perp$ denote its orthogonal complement. Due to the regular quasilinearity of $S$, $L$ is a finite dimensional subspace that contains $R = \mathrm{span}\{w_1, \ldots, w_m\}$. Therefore any minimizer $\hat{w}$ of the regularization functional $J = (f, \Omega, \gamma, w_1, \ldots, w_m)$ can be decomposed as $\hat{w} = u + v$, with $u \in L$ and $v \in L^\perp \subseteq R^\perp$. Applying Lemma 2.1, we obtain that $S(u) \subseteq L$ and hence that $v \in S(u)^\perp$. If $\Omega$ is orthomonotone then
$$J(\hat{w}) = f(\langle u + v, w_1\rangle, \ldots, \langle u + v, w_m\rangle) + \gamma\,\Omega(u + v) = f(\langle u, w_1\rangle, \ldots, \langle u, w_m\rangle) + \gamma\,\Omega(u + v) \ge f(\langle u, w_1\rangle, \ldots, \langle u, w_m\rangle) + \gamma\,\Omega(u) = J(u),$$
so that $u \in L$ is also a minimizer.

Now, let us prove the second part of the theorem (necessity). Fix arbitrary $x \in H$ and $y \in S(x)^\perp$. The goal of the proof is to establish orthomonotonicity, namely the inequality
$$\Omega(x + y) \ge \Omega(x). \qquad (7)$$
The proof is organized in three cases.

1. First, we observe that for $x = 0$ the inequality follows directly from the hypothesis on $\Omega$.

2. Secondly, observe that if $\Omega(x + y) = +\infty$, inequality (7) is trivially satisfied.

3. It remains to prove (7) in the case when $x \ne 0$ and
$$\Omega(x + y) = C < +\infty. \qquad (8)$$
Since $S$ is r-regular quasilinear and $r \le m$, $S(x)$ has dimension $\mu \le m$. Let us choose a set of vectors $\{b_i(x) : i \in \mathbb{N}_m\}$ spanning $S(x)$ in such a way that
$$\langle x, b_i(x)\rangle = \hat{z}_i, \qquad i \in \mathbb{N}_m. \qquad (9)$$
Such a set can always be constructed, since $x \in S(x)$, $x \ne 0$ and $\hat{z} \ne 0$, as follows. Pick $u_1, \ldots, u_{\mu-1} \in S(x)$ such that $\{u_1, \ldots, u_{\mu-1}, x\}$ forms an orthogonal basis of $S(x)$. Without loss of generality, we may assume that $\hat{z}_m \ne 0$. Then we may define $b_i(x) = u_i + \frac{\hat{z}_i}{\|x\|^2}\,x$ for $1 \le i \le \mu - 1$ and $b_i(x) = \frac{\hat{z}_i}{\|x\|^2}\,x$ for $\mu \le i \le m$. Clearly these vectors span $S(x)$ and satisfy (9).

For every $\gamma > 0$, the functional $J^\gamma_x = (f, \Omega, \gamma, b_1(x), \ldots, b_m(x))$ belongs to $\mathcal{F}$. Therefore, by the representer theorem w.r.t. $S$, $J^\gamma_x$ admits a minimizer
$$w^\gamma_x \in \sum_{i=1}^m S(b_i(x)) \subseteq S(x)$$
(where the last inclusion follows from the idempotence of $S$). Let $z^\gamma_x = (\langle w^\gamma_x, b_1(x)\rangle, \ldots, \langle w^\gamma_x, b_m(x)\rangle)$. Using the facts that $f$ is minimized at $\hat{z}$, $J^\gamma_x$ is minimized at $w^\gamma_x$ and $y \in S(x)^\perp$, we obtain
$$f(\hat{z}) + \gamma\,\Omega(w^\gamma_x) \le f(z^\gamma_x) + \gamma\,\Omega(w^\gamma_x) = J^\gamma_x(w^\gamma_x) \le J^\gamma_x(x + y) = f(\hat{z}) + \gamma\,\Omega(x + y) \qquad (10)$$
for all $\gamma > 0$. Note that $f(\hat{z})$ is finite since $\hat{z}$ is the unique minimizer of $f$. Hence we conclude that
$$\Omega(x + y) \ge \Omega(w^\gamma_x), \qquad \forall \gamma > 0. \qquad (11)$$

In order to conclude the proof, we show that $w^\gamma_x$ converges to $x$ as $\gamma \to 0^+$. Using (10), (8) and the hypothesis that $\Omega$ is minimized at 0, we obtain
$$0 \le f(z^\gamma_x) - f(\hat{z}) \le \gamma\,(\Omega(x + y) - \Omega(w^\gamma_x)) = \gamma\,(C - \Omega(w^\gamma_x)) \le \gamma\,(C - \Omega(0)) < +\infty, \qquad \forall \gamma > 0. \qquad (12)$$
Now, let $\gamma_k$ denote a sequence of positive real numbers such that $\lim_{k\to\infty} \gamma_k = 0$. From (12) it follows that
$$\lim_{k\to\infty} f(z^{\gamma_k}_x) = f(\hat{z}).$$
It follows that there exists an index $M$ such that, for all $k \ge M$, $z^{\gamma_k}_x$ belongs to the bounded set $\{z \in \mathbb{R}^m : f(z) \le f(\hat{z}) + \varepsilon\}$. Therefore, the sequence $z^{\gamma_k}_x$ is bounded and has a convergent subsequence. Now, take an arbitrary convergent subsequence of $z^{\gamma_k}_x$ and let $z^*$ denote its limit. Since $f$ is lower semicontinuous and $\hat{z}$ is its only minimizer, it must be that $z^* = \hat{z}$. Hence, the whole sequence $z^{\gamma_k}_x$ converges to $\hat{z}$, namely
$$\lim_{k\to\infty} \langle w^{\gamma_k}_x, b_i(x)\rangle = \hat{z}_i, \qquad i \in \mathbb{N}_m.$$
In view of (9) we have, for every $i \in \mathbb{N}_m$,
$$\lim_{k\to\infty} \langle w^{\gamma_k}_x - x, b_i(x)\rangle = \lim_{k\to\infty} \langle w^{\gamma_k}_x, b_i(x)\rangle - \hat{z}_i = 0$$
and therefore
$$\lim_{k\to\infty} \langle w^{\gamma_k}_x - x, u\rangle = 0, \qquad \forall u \in S(x). \qquad (13)$$
Since $x, w^{\gamma_k}_x \in S(x)$, the sequence $w^{\gamma_k}_x - x$ is confined to the subspace $S(x)$. Since $S(x)$ is finite dimensional, (13) implies that $w^{\gamma_k}_x$ converges strongly to $x$. By passing to the limit inferior in (11) and using the lower semicontinuity of $\Omega$, inequality (7) follows.
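The sufficiency argument above can be mirrored numerically. The following sketch (ours, assuming $S(x) = \mathrm{span}\{x\}$ as in Example 2.1 and the orthomonotone penalty $\Omega(w) = \|w\|^2$) decomposes an arbitrary candidate $w$ as $u + v$ with $u$ the orthogonal projection onto $L = \mathrm{span}\{w_1, \ldots, w_m\}$: the error term only sees $u$, and the penalty does not increase.

```python
# Sketch (ours) of the sufficiency argument in Theorem 3.1 for S(x) = span{x}
# and Omega(w) = ||w||^2: projecting onto L = span{w_1,...,w_m} cannot increase J.
import numpy as np

rng = np.random.default_rng(3)
d, m, gamma = 8, 4, 0.5
W = rng.standard_normal((m, d))                     # rows are the data w_i
y = rng.standard_normal(m)

f = lambda z: np.sum(np.maximum(0.0, 1.0 - y * z))  # any error term works; hinge-type here
J = lambda w: f(W @ w) + gamma * (w @ w)            # J(w) = f(<w, w_i>) + gamma * Omega(w)

w = rng.standard_normal(d)                          # arbitrary candidate minimizer
P = W.T @ np.linalg.pinv(W @ W.T) @ W               # orthogonal projector onto L
u, v = P @ w, w - P @ w                             # w = u + v, with v orthogonal to L

print(np.allclose(W @ w, W @ u))                    # error term unchanged: <w, w_i> = <u, w_i>
print(J(u) <= J(w) + 1e-12)                         # penalty can only decrease: J(u) <= J(w)
```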
3.1. Loss Functions Which Lead to Orthomonotonicity

Observe that part 1 of Theorem 3.1 (sufficiency of orthomonotonicity) only requires existence of minimizers of $J$, without any specific additional assumptions on the error term. On the other hand, part 2 (necessity of orthomonotonicity) holds under additional assumptions on $f$. In the following, we provide examples of functions $f$ that satisfy such assumptions, showing that most of the error functions considered in practice do so. The vast majority of error functions used are additively separable, namely of the form
$$f(z) = \sum_{i=1}^m V(z_i, y_i), \qquad (14)$$
where $V : \mathbb{R} \times \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ and $y_i \in \mathbb{R}$ are prescribed output data.

3.1.1. Regression Loss Functions

In this section we show that, for a broad class of regression loss functions, it is possible to find output data such that, if the family of regularization functionals (6) admits a representer theorem, then $\Omega$ is orthomonotone.

Definition 3.4. We call the function $V : \mathbb{R} \times \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ a regression loss function if $V(z, y) = \varphi(z - y)$, where $\varphi : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is lower semicontinuous with bounded sublevel sets and minimized at zero.

The class of functions defined above includes any loss of the form $\varphi(t) = |t|^p$ with $p > 0$ (in particular, the square and absolute losses), the interpolation loss
$$\varphi(t) = \begin{cases} 0 & \text{if } t = 0 \\ +\infty & \text{otherwise} \end{cases},$$
as well as the $\varepsilon$-insensitive loss $\varphi(t) = \max\{0, |t| - \varepsilon\}$, which is not uniquely minimized at zero.

Lemma 3.1. Assume that $V$ is a regression loss function. Then, for every $p \in \mathbb{N}$, there exist output data $\{y_i : i \in \mathbb{N}_{2p}\} \subseteq \mathbb{R}$ and a function $f_u : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ satisfying the hypotheses of Theorem 3.1, Part 2, such that the error functional
$$w \mapsto f(\langle w, w_1\rangle, \ldots, \langle w, w_p\rangle, \langle w, -w_1\rangle, \ldots, \langle w, -w_p\rangle),$$
with $f$ defined by (14) for $m = 2p$, equals the error functional $w \mapsto f_u(\langle w, w_1\rangle, \ldots, \langle w, w_p\rangle)$.
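As a rough illustration of Definition 3.4 (ours, using a coarse grid check that is only suggestive, not a proof), the following snippet confirms that the square, absolute and $\varepsilon$-insensitive losses are of the form $V(z, y) = \varphi(z - y)$ with $\varphi$ minimized at zero and with bounded sublevel sets on the sampled window.

```python
# Sketch (ours): grid check of the conditions of Definition 3.4 for a few losses.
import numpy as np

eps = 0.5
losses = {
    "square":          lambda t: t ** 2,
    "absolute":        lambda t: np.abs(t),
    "eps-insensitive": lambda t: np.maximum(0.0, np.abs(t) - eps),
}

t = np.linspace(-50, 50, 200001)
for name, phi in losses.items():
    vals = phi(t)
    minimized_at_zero = np.isclose(vals.min(), phi(np.array([0.0]))[0])
    sublevel = t[vals <= 1.0]                       # crude check: {t : phi(t) <= 1}
    bounded = sublevel.min() > t.min() and sublevel.max() < t.max()
    print(name, "minimized at 0:", minimized_at_zero,
          "sublevel {phi<=1} within (%.2f, %.2f):" % (sublevel.min(), sublevel.max()), bounded)
```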
3.1.2. Binary Classification Loss Functions

Definition 3.5. We call the function $V : \mathbb{R} \times \{-1, +1\} \to \mathbb{R}$ a regular binary classification loss function if $V(z, y) = \varphi(yz)$, where $\varphi : \mathbb{R} \to \mathbb{R}$ is lower semicontinuous and nonincreasing, and there exists $\alpha > 0$ such that the function $\psi_\alpha(t) = \varphi(t) + \varphi(-\alpha t)$ admits a unique minimizer $\hat{t} \ne 0$ and there exists $\varepsilon > 0$ such that the sublevel set $\{t \in \mathbb{R} : \psi_\alpha(t) \le \min \psi_\alpha + \varepsilon\}$ is bounded.

It can easily be seen that this definition is satisfied by most commonly used binary classification loss functions, including the logistic loss $\varphi(t) = \log(1 + e^{-t})$, the exponential loss $\varphi(t) = e^{-t}$ and the hinge loss $\varphi(t) = \max\{0, 1 - t\}$. To verify the uniqueness of the minimizer for these three losses, choose for instance $\alpha = 1/2$.

Lemma 3.2. Assume that $V$ is a regular binary classification loss function. Then, for every $p \in \mathbb{N}$, there exist output data $\{y_i : i \in \mathbb{N}_{2p}\} \subseteq \{-1, +1\}$ and a function $f_u : \mathbb{R}^p \to \mathbb{R}$ satisfying the hypotheses of Theorem 3.1, Part 2, such that the error functional
$$w \mapsto f(\langle w, w_1\rangle, \ldots, \langle w, w_p\rangle, \langle w, \alpha w_1\rangle, \ldots, \langle w, \alpha w_p\rangle),$$
with $f$ defined by (14) for $m = 2p$, equals the error functional $w \mapsto f_u(\langle w, w_1\rangle, \ldots, \langle w, w_p\rangle)$.

3.2. Properties of Orthomonotone Functions

An obvious first fact about orthomonotone functions is that nesting of maps preserves orthomonotonicity.

Proposition 3.1. If $S, S' : H \to \mathcal{V}(H)$ are such that $S(x) \subseteq S'(x)$ for all $x \in H$, then any $\Omega : H \to \mathbb{R} \cup \{+\infty\}$ orthomonotone with respect to $S$ is also orthomonotone with respect to $S'$.

Thus, enlarging the map $S$ enlarges the class of orthomonotone functions as well. In the extreme case when $S$ maps every point to $H$, the orthomonotone class includes all functions. At the other extreme, when $S$ maps every point to $\{0\}$, the orthomonotone class equals the set of constant functions.

A convenient way to obtain new orthomonotone functions (and hence new representer theorems) is by applying simple operations to known orthomonotone functions. For example, shifting the argument inside an orthomonotone function yields an orthomonotone function with respect to a larger map. This fact implies that Theorem 3.1 can be modified to apply to functions $\Omega$ that are minimized at points other than 0.

Proposition 3.2. Let $a \in H$ and let $\Omega : H \to \mathbb{R} \cup \{+\infty\}$ be orthomonotone with respect to the map $S : H \to \mathcal{V}(H)$. If $S$ is quasilinear then the function $x \mapsto \Omega(x + a)$ is orthomonotone with respect to the map $x \mapsto S(x) + S(a)$.

Another useful rule combines functions which are orthomonotone with respect to different maps.

Proposition 3.3. Let $\Omega_1 : H \to \mathbb{R} \cup \{+\infty\}$ be orthomonotone with respect to a map $S_1 : H \to \mathcal{V}(H)$ and let $\Omega_2 : H \to \mathbb{R} \cup \{+\infty\}$ be orthomonotone with respect to a map $S_2 : H \to \mathcal{V}(H)$. Also let $h : (\mathbb{R} \cup \{+\infty\})^2 \to \mathbb{R} \cup \{+\infty\}$ be elementwise nondecreasing, that is, $h(a', b') \ge h(a, b)$ whenever $a' \ge a$ and $b' \ge b$. Then the function $\Omega : H \to \mathbb{R} \cup \{+\infty\}$ defined as $\Omega(w) = h(\Omega_1(w), \Omega_2(w))$, $w \in H$, is orthomonotone with respect to the map $S_1 + S_2$.

This rule holds more generally for any finite number of orthomonotone functions. In particular, any nonnegative linear combination of orthomonotone functions is also orthomonotone with respect to the sum of the corresponding maps. The same applies to the maximum and to the minimum of orthomonotone functions. Finally, there is a composition rule for orthomonotone functions, similar to the chain rule for differentiation.

Proposition 3.4. Let $\Omega : H \to \mathbb{R} \cup \{+\infty\}$ be orthomonotone with respect to a map $S : H \to \mathcal{V}(H)$ and let $T \in \mathcal{L}(H)$ be a continuous operator. Then the function $\Omega \circ T$ is orthomonotone with respect to $T^* \circ S \circ T$.

4. Examples of Representer Theorems

We now proceed to describe the set of orthomonotone functions for specific regularization problems of interest. For each problem, we describe the map $S$, provide a class of orthomonotone functions and state the resulting representer theorem.

Example 4.1. Assume that the dimension of $H$ is at least two and let $S$ be defined as in Example 2.1. Then the definition of representer theorem (Definition 3.1) reduces to the classical linear combination of the representers,
$$\hat{w} = \sum_{i=1}^m c_i w_i,$$
where $c_i \in \mathbb{R}$, and the definition of orthomonotonicity (5) reduces to
$$\Omega(x + y) \ge \Omega(x), \qquad \forall x, y \in H : \langle x, y\rangle = 0.$$
If $\Omega$ is lower semicontinuous, this last condition is satisfied if and only if $\Omega(w) = h(\|w\|)$ with $h : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ nondecreasing. See Theorem 1 of (Dinuzzo & Schölkopf, 2012) and (Argyriou et al., 2009; Yu et al., 2013) for related results. This is a generalized version of the well-known classical representer theorem (Girosi, 1998; Kimeldorf & Wahba, 1970; Schölkopf et al., 2001), which has found wide application to regularization methods in Hilbert spaces. The case of regularization with a bias term can be recovered easily by choosing the error function $f$ as a minimum with respect to the bias variable. This technique also yields semiparametric theorems (Schölkopf & Smola, 2002).
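The practical payoff of Example 4.1 is the usual finite-dimensional reduction. The sketch below (ours, assuming the square loss and $\Omega(w) = \|w\|^2$) plugs $\hat{w} = \sum_i c_i w_i$ into (4): the problem becomes $m$-dimensional in $c$ and, in this particular case, is solved in closed form by $c = (G + \gamma I)^{-1} y$ with Gram matrix $G_{ij} = \langle w_i, w_j\rangle$, i.e., kernel ridge regression when the $w_i$ are kernel sections.

```python
# Sketch (ours): the reduction of Example 4.1 for square loss and Omega = ||.||^2.
import numpy as np

rng = np.random.default_rng(4)
d, m, gamma = 50, 10, 0.3
W = rng.standard_normal((m, d))               # data w_1, ..., w_m (rows)
y = rng.standard_normal(m)

G = W @ W.T                                   # Gram matrix of the data
c = np.linalg.solve(G + gamma * np.eye(m), y) # coefficients of the representer expansion
w_hat = W.T @ c                               # w_hat = sum_i c_i w_i

# Compare with the minimizer computed directly in the ambient space R^d.
w_direct = np.linalg.solve(W.T @ W + gamma * np.eye(d), W.T @ y)
print(np.allclose(w_hat, w_direct))           # True: both solve min ||Ww - y||^2 + gamma ||w||^2
```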
Example 4.2. Let $H$ and $S$ be defined as in Example 2.3. Then the representer theorem of Definition 3.1 reduces to
$$\hat{W} = \sum_{i=1}^m W_i C_i,$$
where the $C_i$ are matrices in $M_n$, and the definition of orthomonotonicity (5) reduces to
$$\Omega(X + Y) \ge \Omega(X), \qquad \forall X, Y \in M_{d,n} : X^\top Y = 0.$$
Moreover, in this case the orthomonotonicity property is equivalent to $\Omega(W) = h(W^\top W)$ with $h : S^n_+ \to \mathbb{R} \cup \{+\infty\}$ a matrix nondecreasing function (with respect to the partial order of positive semidefinite matrices). See (Argyriou et al., 2009; Yu et al., 2013) as well as (Amit et al., 2007; Argyriou et al., 2008; 2010; Evgeniou et al., 2005) for special cases.

The above representation extends the classical representer theorem to matrix learning problems, such as regularization with penalties involving the Frobenius norm, the trace norm and general spectral penalties. These methods have been used for multitask learning, collaborative filtering, kernel learning, domain adaptation and other problems. Problems like multitask learning benefit substantially from the representer theorem, since in those cases the data matrices are rank-one (and in collaborative filtering they are also sparse). Indeed, whenever the data matrices are rank-one, that is, $W_i = a_i b_i^\top$, we can let $v_i^\top = b_i^\top C_i$ and write $\hat{W} = \sum_{i=1}^m a_i v_i^\top$, so that an equivalent optimization problem with substantially fewer degrees of freedom is obtained. This last representation ensures that (4) is equivalent to an optimization problem whose number of variables is $mn$, which can be much smaller than $dn$, the size of the matrix $W$.

It can also be seen that, for other penalties of the type $\Omega(W) = g(R^\top W^\top G W R)$, with $R \in M_{n,k}$, $G \in S^d_{++}$ and $g$ matrix nondecreasing, other representer theorems can be derived from the above result by applying the change of variable $\tilde{W} = G^{\frac{1}{2}} W$. These apply, for example, to spectral functions of $QWR$, with $Q \in M_{\ell,d}$, which have been proposed for multi-task learning (Dinuzzo, 2013; Dinuzzo & Fukumizu, 2011). In addition, Example 4.2 relates to certain optimization problems with positive semidefinite matrix variables. Indeed, problem (4) with $H = M_n$, $\Omega(W) = g(R^\top W^\top W R)$, $g$ matrix nondecreasing and rank-one data yields a problem of the type
$$\min\{f(y_1^\top Z x_1, \ldots, y_m^\top Z x_m) + \gamma\,g(R^\top Z R) : Z \in S^n_+\} \qquad (15)$$
by the change of variable $Z = W^\top W$. Thus a representer theorem for this family of problems follows directly from Example 4.2. Some results for special cases of (15), applied to metric and semisupervised learning, have already appeared in (Jain et al., 2010; 2012).
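The structure predicted by Example 4.2 can be observed numerically. The following sketch (ours, under assumed random data) solves a small trace-norm penalized problem with rank-one data $W_i = a_i b_i^\top$ by proximal gradient (singular value soft-thresholding) and checks that the computed solution has the form $\hat{W} = \sum_i a_i v_i^\top$, i.e., its range is contained in $\mathrm{span}\{a_1, \ldots, a_m\}$.

```python
# Sketch (ours) of the matrix representer theorem of Example 4.2 for the
# trace norm with rank-one data, solved by proximal gradient.
import numpy as np

rng = np.random.default_rng(5)
d, n, m, gamma = 20, 15, 5, 0.5
A = rng.standard_normal((d, m))                  # columns a_i
B = rng.standard_normal((n, m))                  # columns b_i
y = rng.standard_normal(m)

def inner(W):                                    # <W, W_i> = a_i^T W b_i, for all i
    return np.einsum('di,dn,ni->i', A, W, B)

def svt(M, tau):                                 # prox of tau * ||.||_*: soft-threshold singular values
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

G = (A.T @ A) * (B.T @ B)                        # Gram matrix <W_i, W_j>
eta = 1.0 / (2 * np.linalg.eigvalsh(G).max())    # safe step size for the smooth part

W = np.zeros((d, n))
for _ in range(3000):
    r = inner(W) - y                             # residuals
    grad = 2 * (A * r) @ B.T                     # sum_i 2 r_i a_i b_i^T
    W = svt(W - eta * grad, eta * gamma)

P = A @ np.linalg.pinv(A)                        # projector onto span{a_i}
print("residual outside span{a_i}:", np.linalg.norm(W - P @ W))   # ~ 0
```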
Example 4.3. Assume $H = M_{d,n}$, equipped with the standard inner product, and suppose that $S$ maps $X \mapsto \{XC + DX : C \in M_n,\ D \in M_d\}$. Then $S$ is $nd$-regular quasilinear. For this map, Definition 3.1 reads
$$\hat{W} = \sum_{i=1}^m (W_i C_i + D_i W_i),$$
and the definition of orthomonotonicity (5) reduces to
$$\Omega(X + Y) \ge \Omega(X), \qquad \forall X, Y \in M_{d,n} : X^\top Y = 0 \ \text{ and } \ X Y^\top = 0,$$
which is satisfied by all functions such that $\Omega(W) = h(W^\top W, W W^\top)$, where $h : S^n_+ \times S^d_+ \to \mathbb{R} \cup \{+\infty\}$ is matrix nondecreasing in each matrix argument.

The family of regularizers described in the last example includes, for instance, functions of the form $\Omega(W) = \|QW + WR\|$, where $\|\cdot\|$ is any orthogonally invariant norm. Such penalties are of considerable interest in many matrix learning problems, since they allow for incorporating information about both row and column dependencies by designing the matrices $Q$ and $R$. This can be applied, for instance, to collaborative filtering problems when side information about both users and items is available. When the data matrices are rank-one (such as in multi-task learning and collaborative filtering problems), the representer theorem above again makes it possible to obtain a significant reduction in the number of degrees of freedom, since the solution $\hat{W}$ can be rewritten in the form $\hat{W} = \sum_{i=1}^m (a_i v_i^\top + u_i b_i^\top)$, where the number of variables, $m(n + d)$, can be much smaller than $nd$, the size of $W$.

5. Conclusion

We have presented a framework which unifies existing results about representer theorems for regularization problems and allows for a more formal study of these results. We introduced a new definition of representer theorem that includes a broader family of representation results. We showed that each theorem in this family corresponds to a regular quasilinear subspace-valued map. Moreover, we characterized the class of regularization penalties corresponding to each representer theorem via the orthomonotonicity property. Orthomonotone functions exhibit simple calculus rules, which can be used to obtain new representer theorems by combining existing ones.

Our new framework opens a number of possibilities for further investigation. First of all, it calls for more detailed characterizations of regular quasilinear subspace-valued maps and orthomonotone functions, given their importance in the mathematical construction that leads to the representer theorems. Secondly, it can lead to the derivation of new families of regularization penalties and corresponding methodologies, for example for matrix and tensor regularization. Finally, it lays the foundation for a new and more general class of kernel methods, obtained by plugging the expression of the generalized representer theorem into the objective functional $J$ and considering the resulting optimization problem.

References

Abernethy, J., Bach, F., Evgeniou, T., and Vert, J.-P. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

Amit, Y., Fink, M., Srebro, N., and Ullman, S. Uncovering shared structures in multiclass classification. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007.

Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

Argyriou, A., Micchelli, C. A., and Pontil, M. When is there a representer theorem? Vector versus matrix regularizers. Journal of Machine Learning Research, 10:2507–2529, 2009.

Argyriou, A., Micchelli, C. A., and Pontil, M. On spectral learning. Journal of Machine Learning Research, 11:935–953, 2010.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

De Vito, E., Rosasco, L., Caponnetto, A., Piana, M., and Verri, A. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5:1363–1390, 2004.

Dinuzzo, F. Learning output kernels for multi-task problems. Neurocomputing, 118:119–126, 2013.
Dinuzzo, F. and Fukumizu, K. Learning low-rank output kernels. Journal of Machine Learning Research, 20:181–196, 2011.

Dinuzzo, F. and Schölkopf, B. The representer theorem for Hilbert spaces: a necessary and sufficient condition. In Advances in Neural Information Processing Systems 25, pp. 189–196, 2012.

Dinuzzo, F., Neve, M., De Nicolao, G., and Gianazza, U. P. On the representer theorem and equivalent degrees of freedom of SVR. Journal of Machine Learning Research, 8:2467–2495, 2007.

Evgeniou, T., Micchelli, C. A., and Pontil, M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

Girosi, F. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455–1480, 1998.

Gnecco, G. and Sanguineti, M. Regularization techniques and suboptimal solutions to optimization problems in learning from data. Neural Computation, 22(3):793–829, 2010.

Jain, P., Kulis, B., and Dhillon, I. S. Inductive regularized learning of kernel functions. In Advances in Neural Information Processing Systems, pp. 946–954, 2010.

Jain, P., Kulis, B., Davis, J. V., and Dhillon, I. S. Metric and kernel learning using a linear transformation. Journal of Machine Learning Research, 13:519–547, 2012.

Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502, 1970.

Kulis, B., Saenko, K., and Darrell, T. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1785–1792, 2011.

Lafferty, J., Zhu, X., and Liu, Y. Kernel conditional random fields: representation and clique selection. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 64, 2004.

Micchelli, C. A. and Pontil, M. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.

Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems 25, pp. 10–18. MIT Press, 2012.

Mukherjee, S. and Wu, Q. Estimation of gradients and coordinate covariation in classification. Journal of Machine Learning Research, 7:2481–2514, 2006.

Pillai, N. S., Wu, Q., Liang, F., Mukherjee, S., and Wolpert, R. L. Characterizing the function space for Bayesian kernel models. Journal of Machine Learning Research, 8:1769–1797, 2007.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002.

Schölkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, 2001.

Signoretto, M., De Lathauwer, L., and Suykens, J. A. K. Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. arXiv preprint arXiv:1310.4977, 2013.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

Steinwart, I. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.

Warmuth, M., Kotłowski, W., and Zhou, S. Kernelization of matrix updates, when and how? In Algorithmic Learning Theory, pp. 350–364. Springer, 2012.

Yu, Y., Cheng, H., Schuurmans, D., and Szepesvari, C. Characterizing the representer theorem. In Proceedings of the 30th International Conference on Machine Learning, pp. 570–578, 2013.
Zhang, H. and Zhang, J. Regularized learning in Banach spaces as an optimization problem: representer theorems. Journal of Global Optimization, 54(2):235–250, 2012.