Journal of Machine Learning Research 22 (2021) 1-40. Submitted 8/19; Revised 10/20; Published 1/21.

Entangled Kernels - Beyond Separability

Riikka Huusari (riikka.huusari@aalto.fi)
Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, 02150 Espoo, Finland

Hachem Kadri (hachem.kadri@lis-lab.fr)
Department of Computer Science, Aix-Marseille University, CNRS, LIS, 13013 Marseille, France

Editor: Arthur Gretton

Abstract

We consider the problem of operator-valued kernel learning and investigate the possibility of going beyond the well-known separable kernels. Borrowing tools and concepts from the field of quantum computing, such as partial trace and entanglement, we propose a new view on operator-valued kernels and define a general family of kernels that encompasses previously known operator-valued kernels, including separable and transformable kernels. Within this framework, we introduce another novel class of operator-valued kernels called entangled kernels that are not separable. We propose an efficient two-step algorithm for this framework, where the entangled kernel is learned based on a novel extension of kernel alignment to operator-valued kernels. We illustrate our algorithm with an application to supervised dimensionality reduction, and demonstrate its effectiveness with both artificial and real data for multi-output regression.

Keywords: Kernel Learning, Entangled Kernels, Operator-valued Kernels, Vector-valued RKHS, Multi-output Learning

1. Introduction

There is a growing body of learning problems for which each instance in the training set is naturally associated with a set of discrete and/or continuous labels (Izenman, 1975; Caruana, 1997; Micchelli and Pontil, 2005; Alvarez and Lawrence, 2011; Dembczyński et al., 2012; Baldassarre et al., 2012). Output kernel learning algorithms approach these problems by learning simultaneously a vector-valued function in a reproducing kernel Hilbert space (RKHS) and a positive semi-definite matrix that describes the relationships between the labels (Dinuzzo et al., 2011; Dinuzzo and Fukumizu, 2011; Ciliberto et al., 2015; Jawanpuria et al., 2015). The main idea of these methods is to learn a separable operator-valued kernel. Operator-valued kernels appropriately generalize the well-known notion of reproducing kernels and provide a means for extending the theory of reproducing kernel Hilbert spaces from scalar- to vector-valued functions. They were introduced as a machine learning tool in Micchelli and Pontil (2005) and have since been investigated for use in various machine learning tasks, including multi-task learning (Evgeniou et al., 2005), functional regression (Kadri et al., 2016), structured output prediction (Brouard et al., 2016), quantile learning (Sangnier et al., 2016), multi-view learning (Minh et al., 2016) and reinforcement learning (Lever et al., 2016).

Figure 1: Illustration of the restrictiveness of a separable kernel. K is the scalar-valued kernel matrix, T is the output similarity matrix and G is the operator-valued kernel matrix. Every block of the big kernel matrix G = K ⊗ T has the same structure, and thus models the output dependencies in almost the same way no matter the input interactions in K.
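As a small numerical illustration of the block structure described in Figure 1 (this sketch is ours, not part of the paper), the Gram matrix of a separable kernel is a single Kronecker product, so every p × p block is the same output matrix T rescaled by a scalar kernel value:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 5, 3, 2                      # samples, input dimension, output dimension

X = rng.standard_normal((n, d))
K = X @ X.T                            # scalar-valued (linear) kernel matrix, n x n
B = rng.standard_normal((p, p))
T = B @ B.T                            # a psd output similarity matrix, p x p

G = np.kron(K, T)                      # separable operator-valued Gram matrix, np x np

# every p x p block of G is K[i, j] * T: the output structure never changes with the inputs
for i in range(n):
    for j in range(n):
        block = G[i * p:(i + 1) * p, j * p:(j + 1) * p]
        assert np.allclose(block, K[i, j] * T)
```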
The kernel function evaluated on two data samples in this setting outputs a linear operator (a matrix in the case of finite-dimensional output spaces, K(x, z) Rp p with p the dimension of the output space) which encodes information about multiple output variables. A challenging question in vector-valued learning is what sort of interactions should the operator-valued kernel learn and quantify, and how should one build and design these kernels. This is the main question investigated in the paper in the context of non-separability between input and output variables. Some classes of operator-valued kernels have been proposed in the literature (Caponnetto et al., 2008; Alvarez et al., 2012), with separable kernels being one of the most widely used for learning vector-valued functions due to their simplicity and computational efficiency. These kernels are formulated as a product between a kernel function for the input space alone, and a matrix that encodes the interactions among the outputs. Indeed, the name of the class refers to the fact that dependencies between input and output variables are considered separately. In order to overcome the need for choosing a kernel before the learning process, output kernel learning methods learn the output matrix from data (Dinuzzo et al., 2011; Ciliberto et al., 2015; Jawanpuria et al., 2015). However there are limitations in using separable kernels. These kernels use only one output matrix and one input kernel function, and then cannot capture different kinds of dependencies and correlations. Moreover the kernel matrix associated to separable operator-valued kernels is a rank-one kronecker product matrix (i.e, computed by only one kronecker product K T, where K is the scalar-valued kernel matrix and T is the output similarity matrix), which is restrictive as it assumes a strong repetitive structure in the operator-valued kernel matrix that models input and output interactions as illustrated in Figure 1. To go beyond separable kernels, some attempts have been made to learn a weighted sum of them in the multiple kernel learning framework (Kadri et al., 2012; Sindhwani et al., 2013; Gregorov a et al., 2017). Another approach, proposed by Lim et al. (2015), is to Entangled Kernels learn a combination of a separable and a transformable kernel, the latter being a type of non-separable kernel based on representing the data via label-dependent transformations. In that work, the form of the transformable kernel is fixed in advance but allows to encode nonseparable dependencies between inputs and outputs. Despite these previous investigations, the lack of knowledge about the full potential of operator-valued kernels and how to go beyond the restrictive separable kernel clearly hampers their widespread use in machine learning and other fields. This paper deals with the problem of learning non-separable kernels. It is a significant extension of our previous conference paper (Huusari and Kadri, 2019), giving more thorough treatment of the background material, additional theoretical results, full proofs, and more insights to the developed framework. It also provides a theoretical analysis of the generalization error of the learning method, along with expanded experimental section. Our main contributions are: By leveraging tools from the field of quantum computing, we introduce a novel class of kernels based on the notion of partial trace which generalizes the trace operation to block matrices. 
This class of partial trace kernels we propose is very broad and encompasses previously known operator-valued kernels, including separable and transformable kernels, which we illustrate with examples. From the new class of partial-trace kernels we derive another new class of operator-valued kernels, called entangled kernels, that are not separable. As far as we are aware, this is the first time such an operator-valued kernel categorization has been performed. We further study this class of kernels and develop a new algorithm called EKL (Entangled Kernel Learning) that in two steps learns an entangled kernel and a vector-valued function. For the first step of kernel learning, we propose a novel definition of alignment between an operator-valued kernel and the labels of a multi-output learning problem. To our knowledge, this is the first proposition on how to extend alignment to the context of operator-valued kernels. Our algorithm offers improvements to the high computational cost usually associated with learning with general operator-valued kernels. We prove a bound on the generalization error of our method using the notion of Rademacher complexity. We provide an empirical evaluation of EKL. First, we illustrate how EKL works by applying it to the task of supervised dimensionality reduction in the multi-task setting. We also thoroughly study its performance and demonstrate its effectiveness on artificial data as well as real benchmarks. Finally, we compare the running times of learning with various classes of operator-valued kernels.

The remainder of this paper is organized as follows. We begin in Section 2 with a short background on quantum entanglement and learning with operator-valued kernels. In Section 3, we describe some known classes of operator-valued kernels and review previous work on learning separable operator-valued kernels. Section 4 then introduces the new classes of partial-trace and entangled kernels. Our new algorithm EKL for learning entangled kernels is given in Section 5, along with a generalization analysis. In Section 6, we present our experimental results for both synthetic and real-life data. We conclude in Section 7 and present some technical details in the appendix.

1.1 Notation

We denote scalars, vectors and matrices as a, a and A respectively. The notation A ⪰ 0 will be used to denote a positive semi-definite (psd) matrix. Throughout the paper we use n as the number of labeled data samples and p as the number of outputs corresponding to one data sample. We denote our set of data samples by {(x_i, y_i)}_{i=1}^n on X × Y, where X is a Polish space and Y is a separable Hilbert space. Usually, X and Y are respectively R^d and R^p equipped with the standard Euclidean metric.
Without loss of generality, we can assume that X = R^d and Y = R^p, and thus denote our data set as {(x_i, y_i)}_{i=1}^n. We use k(·,·) as a scalar-valued and K(·,·) as an operator-valued kernel function; the corresponding kernel matrices are K ∈ R^{n×n} and G ∈ R^{np×np}, the latter containing blocks of size p × p. We denote by K and H the reproducing kernel Hilbert spaces (RKHS) associated to the kernels k and K, respectively. Table 1 summarizes the notation used in this paper.

Table 1: Notation summary.
X : input space
Y : output space
k(·,·) : scalar-valued kernel
K(·,·) : operator-valued kernel
K : reproducing kernel Hilbert space of k
H : reproducing kernel Hilbert space of K
K : the kernel matrix of k
G : the (block) kernel matrix of K
φ : feature map of k (from X to K)
Γ : feature map of K (from X to L(Y, H))
L(A, B) : the set of trace-class operators from A to B
L(A) : the set L(A, A)
A^⊤, u^⊤ : the transpose of a matrix A or a vector u
A^*, u^* : the adjoint of an operator A or a vector u
A ⪰ 0 : a positive semi-definite (psd) matrix
A(A, B) : alignment between matrices A and B
A_1 ⊗ A_2 : the tensor product of Hilbert spaces A_1 and A_2
A_1 ⊗ A_2 : the tensor product of operators A_1 and A_2
tr(A) : the trace of a matrix or an operator A ∈ L(A)
tr_{A_2}(A) : the partial trace of a matrix or an operator A ∈ L(A_1 ⊗ A_2)

2. Background

We now give some background about quantum entanglement, and review the basics of learning with operator-valued kernels.

2.1 Quantum Entanglement

The field of quantum computing is vast and rapidly growing. This section is not intended to provide a broad overview or exhaustive survey of the literature on quantum entanglement, but gives some notions on entanglement as a quantum property of mixed composite quantum systems that inspired our entangled kernel design. We refer the reader to Horodecki et al. (2009), Bengtsson and Zyczkowski (2017), and Rieffel and Polak (2011, chap. 10) for more background information.

We will now start with the very basics of quantum computation; for these we refer the reader to Rieffel and Polak (2011, chap. 2-3). A major difference of quantum computing and quantum information theory to their classical counterparts is that instead of bits the particles carrying information are qubits. Unlike bits, which have only two possible states, 0 and 1, qubits can exist in those and in any combination of them. More formally, a qubit takes values aψ_0 + bψ_1, where ψ_0 and ψ_1 are orthonormal basis vectors and a, b ∈ C are such that |a|² + |b|² = 1.¹ In quantum information theory the actual state aψ_0 + bψ_1 cannot be recovered by measurement; it is always measured as either ψ_0 or ψ_1, with probabilities governed by the multipliers a and b (namely |a|² and |b|²).

1. In the field of quantum computing, it is more usual to use Dirac's bra-ket notation for the basis vectors. In this notation bra ⟨x| denotes a row vector and ket |x⟩ a column vector, and a qubit would take values a|0⟩ + b|1⟩.

At the heart of quantum computing there is the notion of quantum systems, consisting of one or more qubits. For a system of one qubit, the system's basis consists of two-dimensional vectors. For a system of n qubits, the description requires a 2^n-dimensional Hilbert space to capture all possible combinations of the qubit values. This brings forward the notion of entanglement: any state that cannot be written as a tensor product of n single-qubit states is said to be entangled. Perhaps the simplest example of an entangled state is

(1/√2)(ψ_{00} + ψ_{11}),    (1)

as it cannot be written as

(a_1ψ_0 + b_1ψ_1) ⊗ (a_2ψ_0 + b_2ψ_1) = a_1a_2 ψ_{00} + a_1b_2 ψ_{01} + b_1a_2 ψ_{10} + b_1b_2 ψ_{11}

with any multipliers a_1, a_2, b_1 and b_2, where ψ_{01} = ψ_0 ⊗ ψ_1 and similarly for the others.

A quantum system exists in a state, describing all the information that can be learned of the system. A quantum system can also be divided into parts or subsystems. We focus here only on bipartite quantum systems, i.e., systems composed of two distinct subsystems. The Hilbert space F associated with a bipartite quantum system is given by the tensor product F_1 ⊗ F_2 of the spaces F_1 and F_2 corresponding to each of the subsystems. A question to ask in this context is: what information about the system can be obtained by only considering a part of it, either F_1 or F_2?
A quantum state can be either pure or mixed. While pure states can be represented by a state vector ψ ∈ F, mixed states are characterized by density operators (or matrices) ρ: positive Hermitian operators with trace equal to one. Pure states can also be modeled with a density operator, allowing for uniform treatment; in this case ρ = ψψ^*.

The entanglement present in a bipartite quantum system can be modeled through the partial trace of the system. Given the density operator ρ modeling the whole system, the state of, say, the first subsystem is described by a reduced density matrix, given by taking the partial trace of ρ over F_2. If the system can be accurately represented with only the two subsystems, then it is not entangled. In the following we review the notions of partial trace, separability and entanglement of bipartite quantum systems in more detail. We denote the set of bounded linear operators from a Hilbert space B to B with finite trace norm as L(B). Let F_1 and F_2 be separable Hilbert spaces.

Definition 1 (partial trace) Let {e_i}_i be an orthonormal basis for F_2. For an operator A in L(F_1 ⊗ F_2) its partial trace, tr_{F_2} A, is an operator in L(F_1) defined by the relation

⟨x, (tr_{F_2} A) y⟩ = Σ_i ⟨x ⊗ e_i, A(y ⊗ e_i)⟩    for all x, y ∈ F_1.

This definition follows the ones in Bhatia (2009, Equation 4.34) and Attal (2015b, Equation 2.10). In the finite-dimensional case where F_1 = R^p and F_2 = R^N, the operator A ∈ R^{pN×pN} is a block matrix where each block is of size N × N, and the partial trace is obtained by computing the trace of each block (see Figure 2). To see this, let us denote the orthonormal bases of F_1 and F_2 by {g_j}_{j=1}^p and {e_i}_{i=1}^N, respectively. Any block matrix A ∈ R^{pN×pN} can be written as A = Σ_{s,t=1}^p g_s g_t^⊤ ⊗ A_{st}, where A_{st} ∈ R^{N×N} is the (s, t)th block of A. Now, when investigating one element of the operator tr_{F_2} A at position (l, m), we see that it exactly corresponds to the trace of the block A_{lm}:

[tr_{F_2} A]_{lm} = ⟨g_l, (tr_{F_2} A) g_m⟩
= Σ_{i=1}^N ⟨g_l ⊗ e_i, A(g_m ⊗ e_i)⟩
= Σ_{i=1}^N Σ_{s,t=1}^p ⟨g_l ⊗ e_i, (g_s g_t^⊤ ⊗ A_{st})(g_m ⊗ e_i)⟩
= Σ_{i=1}^N Σ_{s,t=1}^p ⟨g_l ⊗ e_i, g_s g_t^⊤ g_m ⊗ A_{st} e_i⟩
= Σ_{i=1}^N Σ_{s=1}^p ⟨g_l ⊗ e_i, g_s ⊗ A_{sm} e_i⟩
= Σ_{i=1}^N Σ_{s=1}^p ⟨g_l, g_s⟩ ⟨e_i, A_{sm} e_i⟩
= Σ_{i=1}^N ⟨e_i, A_{lm} e_i⟩ = tr(A_{lm}).

The last equality is easy to see from the definition of trace. The following theorem shows how to compute the partial trace of separable operators (Attal, 2015b, Theorem 2.29).

Theorem 2 (partial trace of a tensor product of operators) Let B and C be operators in L(F_1) and L(F_2), respectively. If A is an operator in L(F_1 ⊗ F_2) of the form B ⊗ C, then tr_{F_2}(A) = B tr(C).

Figure 2: Illustration of the partial trace operation. The partial trace operation applied to the N × N blocks of a pN × pN matrix gives a p × p matrix as an output.

The notion of partial trace is a generalization of the trace operation to block structured matrices (Rieffel and Polak, 2011, chap. 10). Note that there are two ways of generalizing trace to block matrices. Another possibility would be the so-called block trace (Filipiak et al., 2018) which, informally, is defined as the sum of the diagonal blocks of a matrix; with A ∈ R^{pN×pN} it would be the sum Σ_{t=1}^p A_{tt}, in which each A_{tt} is of size N × N. However, in this work we only consider the blockwise trace definition discussed above.

In the case where the density matrix ρ of a mixed bipartite state can be written as ρ = ρ_1 ⊗ ρ_2, where ρ_1 and ρ_2 are density matrices on F_1 and F_2 of the subsystems, the partial trace of ρ with respect to F_2 is ρ_1. This form of mixed product states is restrictive and does not exhibit correlations between the two subsystems.
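To make the blockwise computation of the partial trace and Theorem 2 concrete in the finite-dimensional case, here is a small NumPy sketch (our own illustration, not code from the paper):

```python
import numpy as np

def partial_trace_blocks(A, p, N):
    """Partial trace over the second factor of A acting on R^p tensor R^N:
    view A as a p x p array of N x N blocks and take the trace of each block."""
    blocks = A.reshape(p, N, p, N)              # axes: (block row, inner row, block col, inner col)
    return np.trace(blocks, axis1=1, axis2=3)   # p x p matrix of blockwise traces

# sanity check of Theorem 2: tr_{F2}(B tensor C) = B * tr(C)
rng = np.random.default_rng(0)
p, N = 3, 4
B, C = rng.standard_normal((p, p)), rng.standard_normal((N, N))
A = np.kron(B, C)                               # the (s, t) block of A is B[s, t] * C
assert np.allclose(partial_trace_blocks(A, p, N), B * np.trace(C))
```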
A convex sum of different product states, ρ = X i pi ρi 1 ρi 2, (2) with pi 0 and P i pi = 1, however, will in general represent certain types of correlations between the subsystems of the composite quantum system. These correlations can be described in terms of the classical probabilities pi, and are therefore considered classical. States of the form (2) thus are called separable mixed states. In contrast, a mixed state is entangled if it cannot be written as a convex combination of product states, i.e., ρi 1, ρi 2, pi 0 such that ρ = X i pi ρi 1 ρi 2. (3) Entangled states are one of the most commonly encountered classes of bipartite states possessing quantum correlations (Mintert et al., 2009). A challenging problem in quantum computing is to identify necessary and sufficient conditions for quantum separability. Given a density matrix ρ of a bipartite quantum state, the quantum separability problem asks whether ρ is entangled or separable. A useful and efficient necessary condition for checking if a given block density matrix is separable in some block size partition, is to use positive partial transpose (PPT) condition, sometimes also called Peres-Horodecki criterion (Peres, 1996; Horodecki, 1997). The partial transpose of a p N p N block matrix P with blocks (Pij)p i,j=1 is the block matrix of the same size containing the transposed blocks (P ij)p i,j=1. If a density matrix is separable, then it has positive partial transpose. It is necessary for any separable density matrix to have positive Huusari and Kadri partial transpose; yet in general this condition is not sufficient in guaranteeing separability, as there might be non-separable density matrices fulfilling the PPT condition. However this condition guarantees that if the partial transpose matrix has a negative eigenvalue, the state is entangled. 2.2 Learning with Operator-valued Kernels We now review the basics of operator-valued kernels (Ov Ks) and their associated vectorvalued reproducing kernel Hilbert spaces (RKHSs) in the setting of supervised learning. Vector-valued RKHSs were introduced to the machine learning community by Micchelli and Pontil (2005) as a way to extend kernel machines from scalar to vector outputs. Given a set of training samples {xi, yi}n i=1 on X Y, the optimization problem arg min f H i=1 V (yi, f(xi)) + λ f 2 H, (4) where f is a vector-valued function and V : Y Y R+ is a convex loss function, can be solved in a vector-valued RKHS H by the means of a vector-valued extension of the representer theorem. Definition 3 (vector-valued RKHS) A Hilbert space H of functions from X to Y is called a reproducing kernel Hilbert space if there is a positive semi-definite L(Y)-valued kernel K on X X such that: i. the function z 7 K(x, z)y belongs to H, z, x X, y Y, ii. f H, x X, y Y, f, K(x, )y H = f(x), y Y (reproducing property). Definition 4 (positive semi-definite operator-valued kernel) A L(Y)-valued kernel K on X X is a function K( , ) : X X L(Y); it is positive semi-definite if: i. K(x, z) = K(z, x) , where superscript denotes the adjoint operator, ii. and, for every n N and all {(xi, yi)i=1,...,n} X Y, X i,j yi, K(xi, xj)yj Y 0. Theorem 5 (bijection between vector-valued RKHS and positive semi-definite operatorvalued kernel) An L(Y)-valued kernel K on X X is the reproducing kernel of some Hilbert space H, if and only if it is positive semi-definite. Theorem 6 (representer theorem) Let K be a positive semi-definite operator-valued kernel and H its corresponding vectorvalued RKHS. 
The solution ˆf H of the regularized optimization problem (4) has the following form i=1 K(x, xi)ci, with ci Y. (5) Entangled Kernels With regard to the classical representer theorem, here the kernel K outputs a matrix and the weights ci are vectors. The proofs of Theorem 5 and 6 can be found in Micchelli and Pontil (2005) and Kadri et al. (2016). For further reading on operator-valued kernels and their associated RKHSs, see, e.g., Caponnetto et al. (2008); Carmeli et al. (2010); Alvarez et al. (2012). 3. Learning Operator-valued Kernels In this section we first review some known classes of operator-valued kernels, before moving on to describing ways to learn them. Most of the works in this field consider separable kernels, but a few specialized methods exist also for non-separable kernels. 3.1 Known Classes of Operator-valued Kernels Some well-known classes of operator-valued kernels include separable and transformable kernels. Note that here and throughout the rest of the manuscript we consider the case where the output space Y is of finite dimension p (i.e., Y = Rp and L(Y) = Rp p ). Definition 7 (Separable operator-valued kernel) A separable operator-valued kernel is a function K : X X Rp p, that can be written as K(x, z) = k(x, z)T, x, z X, (6) in which k is a scalar-valued kernel function, and T Rp p is a positive semi-definite matrix. This class of kernels is very attractive in terms of computational time, as it is easily decomposable. However the matrix T acts only on the outputs independently of the input data, which makes it difficult for these kernels to capture input-output relations. In the same spirit a more general class, sum of separable kernels, can be defined as follows. Definition 8 (Sum of separable operator-valued kernels) A sum of separable operator-valued kernels is a function K : X X Rp p, that can be written as K(x, z) = X l kl(x, z)Tl, x, z X, (7) in which kl are a scalar-valued kernels and Tl Rp p are positive semi-definite. This class of operator-valued kernels can capture more complex similarities, but still assumes that the unknown input-output dependencies can be decomposed into a product of two separate kernel functions that encode interactions among inputs and outputs independently. Definition 9 (Transformable operator-valued kernel) A transformable operator-valued kernel is a function K : X X Rp p, that can be written as K(x, z) = h ek(Slx, Smz) ip l,m=1 , x, z X, (8) in which ek : e X e X R is a scalar-valued kernel function and {St}p t=1 are mappings from X to e X. Huusari and Kadri non symmetric Figure 3: Illustration on differences of separable (left) and non-separable (right) operatorvalued kernels. For separable kernels the p p output matrix is always a psd symmetric matrix. In transformable kernels the data is transformed with the mappings {St}p t=1 before feeding it to the scalar-valued kernel function; which transformations to use depends on which outputs the element in K(x, z) corresponds to. The mappings St operate on input data while depending on outputs; however they are not intuitive nor easy to interpret and determine. One example of such kernels, which was proposed in Caponnetto et al. (2008), is the kernel function K(x, z) := (eσlm x,z : l, m {1, . . . , p}) with σ = (σlm) a positive semidefinite matrix. In this transformable kernel the matrix entries outputted by the kernel are computed using linear transformations of the data. 
Indeed, it is easy to see that K(x, z) = Qp i=1 e S(i) l x,S(i) m z p l,m=1, where σ = Pp i=1 λiuiui is the eigenvalue decomposition of σ and S(i) t x := λuitx, for all t = 1, . . . , p. Separable kernels are the most common operator-valued kernels to be used and learned. As already mentioned, they are nevertheless a relatively restrictive class of kernels, as the relationships between inputs and outputs are modelled independently of each other as was already illustrated in Figure 1. Moreover, only certain types of interactions can be modelled, as T should be a psd matrix, and symmetric. Figure 3 illustrates this. 3.2 Learning Separable Operator-valued Kernels Most works on learning operator-valued kernels consider the Output Kernel Learning (OKL) framework, where a separable operator-valued kernel is learned by fixing the scalar-valued kernel and learning the operator T. The method is named for the observation that learning T does not depend on input data values, but only on the outputs. All the output kernel learning algorithms are based on joint optimization, that is, the kernel is learned jointly with the learning problem, giving an optimization problem that generally can be written as i=1 V (yi, f T,c(xi)) + λ Ω(f T,c) + γ Θ(T). Here V is the loss function for classification/regression and Ωis the accompanying regularization term, while Θ regularizes the output matrix. With separable operator-valued kernels, applying the representer theorem we get that f T,c( ) = Pn j=1 k(xj, )Tcj. Entangled Kernels Many algorithms solve the output kernel learning problem. The first output-kernel learning algorithm was introduced in Dinuzzo et al. (2011) with Frobenius norm regularizer on the output matrix T. Dinuzzo and Fukumizu (2011) considers learning low-rank output kernels, that is, separable kernels where the rank of T is constrained to be less or equal to some r. The optimization is performed with having also a regularizer on tr(T) in addition to the rank constraint. More general or efficient formulations of output kernel learning have been proposed in Ciliberto et al. (2015) and Jawanpuria et al. (2015). To go beyond the standard OKL, Kadri et al. (2012) extended the multiple kernel learning framework (G onen and Alpaydın, 2011) that is popular in learning scalar-valued kernels into operator-valued kernel framework. The multiple kernel learning refers to paradigm where given multiple (scalar-valued) kernels ki, a combination k(x, z) = Pl i=1 αiki(x, z) is learned and then used in the predictive learning problem at hand. Similarly, Kadri et al. (2012) focus on learning a finite linear combination of separable operator-valued kernels. They consider two formulations of the optimization problem. In the first one the separable operator-valued kernels all share the same output operator T, meaning that the full kernel is i=1 αiki(x, z)T. The second formulation considers the case where also the output operators differ across the operator-valued kernels in the sum, giving i=1 αiki(x, z)Ti. In both of these versions only the multipliers αi are learned, and the kernels are fixed, comparably to the case of multiple kernel learning (MKL) for scalar-valued kernels. Notably, the operators T and Ti are fixed in advance, which makes the use of this method difficult as it is not obvious how one should choose T without learning it. Some works have continued this line of investigation. Sindhwani et al. 
(2013) considers learning a combination of separable kernels that share the operator T, while optimizing both the combination of the basis scalar-valued kernels and the matrix T. Another approach, by Gregorová et al. (2017), considers combining a set of scalar-valued kernels with a set of output matrices T_i. However, they impose a diagonal structure on the output matrices, restricting the types of relations they are able to model. In this setting the diagonal values can be interpreted as model weights of the kernels in the standard MKL setting.

3.3 Learning Non-separable Operator-valued Kernels

There are very few works that consider learning non-separable kernels. Lim et al. (2015) considers an application to modelling time series data and goes further than separability by learning a combination of a separable and a transformable kernel. They consider a transformable kernel defined as

[K_transf.(x, z)]_{st} = exp(−γ(x_s − z_t)²),

where x_s and z_t are the sth and tth elements of the vectors x and z, respectively. This transformable kernel applies a Gaussian kernel to pairs of elements of the data vectors, giving a d × d matrix as an output if the data dimension is d. The separable kernel in their work is

K_sep.(x, z) = exp(−γ‖x − z‖²) T,

and the full kernel matrix they consider in learning is K = K_transf. ∘ K_sep., a Hadamard or element-wise matrix product of the two operator-valued kernel matrices. When they learn this kernel, they learn the matrix T of its separable part. This class of kernels cannot generalize to the learning problems we consider. The greatest restriction is that the kernel outputs a d × d matrix, d being the dimension of the input data; this is very rarely the same as the dimension of the outputs.

Another specialized operator-valued kernel is that of Huusari et al. (2018), where a non-separable kernel is learned in the context of multi-view learning by incorporating into the kernel a learnable metric operating between the views. The output of the kernel is a v × v matrix, where v is the number of views in the data. Similarly to the previous work, this is not applicable to the general multi-output setting we consider. Having only a few specialized works outside the separability framework motivates our more general entangled kernel learning paradigm.

4. Partial Trace and Entangled Kernels

This section first revisits the known classes of operator-valued kernels and discusses the inclusions between them. After that we introduce the two novel classes of operator-valued kernels: the partial trace kernels, which encompass the known classes of operator-valued kernels, and the entangled kernels, which form a class of kernels distinct from the separable ones.

While it is straightforward to see that separable kernels belong to the larger class of sums of separable kernels, the picture is less clear for transformable kernels. The following examples clarify this situation.

Example 1 (transformable but not separable kernel) On the space X = R, consider the kernel

K(x, z) = [ xz    xz²  ]
          [ x²z   x²z² ].

K is a transformable kernel, but not a (sum of) separable kernel. We obtain that K is transformable simply by choosing in Def. 9 the kernel k̃(x, z) = xz, S_1(x) = x, and S_2(x) = x². From the positive semi-definiteness property of an operator-valued kernel, it is easy to see that the matrix T of a separable kernel is symmetric (see Def. 7), and since the matrix K(x, z) is not, K is not a separable kernel.
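A quick numerical check of Example 1 (our own sketch, with arbitrary sample points): the kernel outputs are asymmetric 2 × 2 matrices, so the kernel cannot be separable, yet the block Gram matrix is positive semi-definite as required of an operator-valued kernel.

```python
import numpy as np

def S(t):
    """Transformations of Example 1, S_1(t) = t and S_2(t) = t^2, stacked into a vector."""
    return np.array([t, t ** 2])

def example1_kernel(x, z):
    """[K(x, z)]_{lm} = k~(S_l(x), S_m(z)) = S_l(x) * S_m(z), with k~(a, b) = a * b."""
    return np.outer(S(x), S(z))

xs = [0.5, -1.3, 2.0, 0.7]                 # arbitrary scalar inputs

# block operator-valued Gram matrix with p x p blocks K(x_i, x_j), here p = 2
G = np.block([[example1_kernel(xi, xj) for xj in xs] for xi in xs])

print(example1_kernel(0.5, 2.0))           # asymmetric 2 x 2 output -> K is not separable
print(np.linalg.eigvalsh(G).min())         # ~0, never negative: the kernel is still psd
```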
Example 2 (transformable and separable kernel) Let K be the kernel function defined as K(x, z) = x, z T, x, z X, Entangled Kernels All operator-valued kernels Partial trace Sum of Separable Transformable Figure 4: Illustration of inclusions among various operator-valued kernel classes. where T Rp p is a rank one positive semi-definite matrix. K is both separable and transformable kernel. Since T is of rank one, it follows that K(x, z) lm = ulum x, z , with T = uu . We can see that K is transformable by replacing in Def. 9 the kernel ek(x, z) by x, z and St(x) by utx, t = 1, . . . , p. K is separable by construction. It is worth noting that separable kernels are not limited to finite-dimensional output spaces, while transformable kernels are. Figure 4 depicts inclusions among kernel classes discussed here and the two new families of operator-valued kernels we propose: partial trace kernels and entangled kernels. We now define the two novel classes of operator-valued kernels. The first one, the class of partial trace kernels, encompasses both (sum of) separable and transformable kernels, while the second, entangled kernels, is a class of non-separable kernels. We start by introducing the more general class of partial trace kernels. The intuition behind this class of kernels is that in the scalar-valued case any kernel function k can be written as the trace of an operator in L(K), where K is the reproducing kernel Hilbert space associated to the scalar-valued kernel k. It is easy to see that k(x, z) = φ(x), φ(z) = tr(φ(x)φ(z) ).2 The following definition of partial trace kernels can be motivated as a generalization of the kernel trick by using the partial trace operator instead of trace. 2. There is some abuse of notation in using the transpose symbol for a feature map representation which can be infinite-dimensional. In this case, we can write k(x, z) = tr(φ(x)φ(z) ), where φ(x)φ(z) is the rank one operator defined for all u K by (φ(x)φ(z) )u = φ(z), u φ(x). Huusari and Kadri Definition 10 (Partial trace kernel) A partial trace kernel is an operator-valued kernel function K having the following form K(x, z) = tr K(Pφ(x),φ(z)), (9) where Pφ(x),φ(z) is an operator on L(Y K), and tr K is the partial trace on K (i.e., over the inputs). Depending on the choice of the operator Pφ(x),φ(z), partial trace kernels may not be positive semi-definite. Since the partial trace preserves positive semi-definiteness (Filipiak et al., 2018), it is clear that the partial trace kernel is positive semi-definite if for every n N and x1, . . . , xn X the matrix [Pφ(xi),φ(xj)]n i,j=1 is positive. Even though all the kernel subclasses considered here and shown in Figure 4 are positive semi-definite, our definition of partial trace kernel is general and flexible enough to cover many kernels and allows the design of new ones that may or not be positive semi-definite. As for the scalar-valued case, operator-valued kernels which are not positive semi-definite may be useful for learning in reproducing kernel Kre ın spaces (Ong et al., 2004; Saha and Palaniappan, 2020). The class of partial trace kernels is very broad and encompasses the classes of separable and transformable kernels (see Figure 4). From the definition of the partial trace operation, we can see that if we choose Pφ(x),φ(z) = P l Tl φl(x)φl(z) , where φ(x)φ(z) is the rank one operator defined in Footnote 2, we recover the case of sum of separable kernels. In the same way, if we fix [Peφ(x),eφ(z)]p l,m=1 = eφ Sl(x) eφ Sm(z) in Eq. 
9, computing the trace of each block using the partial trace will give the transformable kernel. With this in mind, we can use the partial trace kernel formulation to induce a novel class for operator-valued kernels which are not separable, with the goal to characterize inseparable correlations between inputs and outputs. Definition 11 (Entangled kernel) An entangled operator-valued kernel K is defined as K(x, z) = tr K U T (φ(x)φ(z) ) U , (10) in which T L(Y) is a positive semi-definite operator, and U L(Y K) is not separable. When Y and K have finite dimensions p and N, respectively, L(Y K) is simply the set of matrices of dimensions p N p N and φ(x)φ(z) is the matrix φ(x)φ(z) , where φ(z) denotes the transpose of φ(z). In the following we abuse the notation and denote N as the dimensionality of feature representation φ(x). However we do not restrict ourselves to finite dimensions and N in this notation can also be infinite. In this definition, U not being separable means that it cannot be written as U = B C, with B Rp p and C RN N. The term T (φ(x)φ(z) ) represents a separable kernel function over inputs and outputs, while U characterizes the entanglement shared between them. We note that U not being equal to B C marks the crucial difference to separable kernels; if U were B C, then the class described above would be part of separable kernels (see Theorem 2). Some intuition to U can be seen from its role of an entangled similarity in the joint feature space. It is entangled in the sense that it cannot be decomposed into two sub - matrices of similarity between inputs and between outputs independently. The partial Entangled Kernels trace is the operation used to recover the sub-similarity matrix between the outputs from the entangled joint similarity matrix. In the particular case of separability, the partial trace will give the output metric. Theorem 12 Entangled kernels given by the Definition 11 are positive semi-definite kernels. Proof The proof is based on the observation that T (φ(x)φ(x) ) = Oφ(x)TO φ(x), where Oφ(x) is the operator defined by Oφ(x) : Y Y K y 7 y φ(x), and O φ(x) is its adjoint defined by: O φ(x) : Y K Y y u 7 φ(x), u y. Indeed, y Y, u K, we have T (φ(x)φ(z) ) (y u) = Ty φ(x)φ(z) u = φ(z), u Ty φ(x) = Oφ(x)T φ(z), u y = Oφ(x)TO φ(z)(y u). Now, to show that the entangled kernel defined by Eq. 10 is positive semi-definite, let us compute X i,j yi, K(xi, xj)yj Y = X D yi, tr K U T (φ(xi)φ(xj) ) U yj E i,j tr y i IN U T (φ(xi)φ(xj) ) U yj IN 3 i,j tr y i IN U Oφ(xi)TO φ(xj) U yj IN = tr(STS ) 0, as clearly tr(STS ) = tr(SVV S ) = tr((SV(SV) ) and trace preserves positivity. Here we have denoted S = P i y i IN UOφ(xi). Moreover, from the fact that tr K(A ) = tr K(A) , we immediately obtain that K(x, z) = K(z, x) , which completes the proof. While by definition entangled kernels cannot be separable, the following example shows that an entangled kernel can be transformable. 3. See Filipiak et al. (2018, Lemma 2.11). Huusari and Kadri Example 3 (Entangled and transformable kernel) An entangled kernel given by Definition 11 with symmetric and supersymmetric U (that is, all its N N blocks are symmetric), T = 1p1T p Rp p and with a scalar-valued kernel with finite-dimensional feature mapping, φ(x) RN, is a transformable kernel. 
This can be seen from the following calculation: Pφ(x),φ(z) = i,j=1 Uljφ(x)φ(z) U im j=1 Uljφ(x) i=1 Uimφ(z) j=1 Uljφ(x) i=1 Umiφ(z) Now Sl(x) = Pp k=1 Ulkφ(x); recall that transformable kernels can be obtained from partial trace kernels (9) with [Peφ(x),eφ(z)]p l,m=1 = eφ Sl(x) eφ Sm(z) . Choice of the matrix U is crucial to the class of entangled kernels. In the next section we develop an algorithm that learns an entangled kernel from data. 5. Entangled Kernel Learning In general, there is no knowing whether input and output data are or are not entangled. In this sense, learning the entangled K in Eq. 10 by imposing that U is inseparable can sometimes be restrictive. In our entangled kernel learning approach we do not impose any separability restriction, with the hope that our learning algorithm can automatically detect the lack or presence of entanglement. Key to our method is a reformulation of the entangled kernel K (Eq. 10) via Choi-Kraus representation. Theorem 13 (Choi-Kraus representation Choi, 1975; Kraus, 1983; Rieffel and Polak, 2011) The map K(x, z) = tr K U T (φ(x)φ(z) ) U can be generated by an operator sum representation i=1 Miφ(x)φ(z) M i , (11) where r is called the Kraus rank and Mi Rp N are the Kraus operators. Proof See Appendix A Using this formulation, entangled kernel learning consists of finding a (possibly low-rank when r is small) decomposition of the kernel by learning the matrices Mi, i = 1, . . . , r, where these matrices merge the matrices T and U. Looking at the simpler class of separable kernels, they model the relations between tasks (i.e., ouputs) using only the output matrix T of size p p. The matrix T acts as a Entangled Kernels covariance matrix on the labels independently of the inputs and if connected to the existing deep neural network approaches for multi-task learning (i.e. the task-decoder methods which learn a shared representation at a high-level layer followed by task-specific decoders, see, e.g., Meyerson and Miikkulainen, 2018), can then play the role of a decoder that predicts labels using the outputs of related tasks. Entangled kernels, however, capture task relatedness through the matrices Mi, of size p N modeling the relationships of tasks and features. A small value of r is a (Kraus) low-rank assumption which reduces the number of parameters and promotes sharing information and knowledge between tasks. This would be more similar to the column-based approaches in deep multi-task learning, which consider multiple deep neural networks in parallel and share the layer parameters between these networks (Meyerson and Miikkulainen, 2018). It is not easy to see from the theorem how exactly T, U and Mi interact. While the proof of the theorem (see Appendix A) gives the explicit relation between them, in order to make this more clear let us consider only one element of the output of the kernel. From the Choi-Kraus representation (11) we have [K(x, z)]s,t = i=1 Mi[s, :]φ(x)φ(z) (Mi[t, :]) , and on the other hand from the definition of the entangled kernels (Definition 11) we get that [K(x, z)]s,t = tr k=1 Usk Tklφ(x)φ(z) U lt l,k=1 Tkl φ(z) U lt Uskφ(x). Here we have used notation Mi[s, :] to refer to the row s of matrix Mi, Tij to refer to the element in position (i, j) in T, and Uij similarly ordered to the block of size N N in U. From this we see that both the rows of Mi and rows of (blocks of) U act on transforming the features φ(x). 
If we restrict the rank of the entangled kernel by restricting r in (11), we are in essence restricting the row space of (the blocks of) U. It is important to note that while every entangled kernel can be represented like this, the representation is not restricted only to entangled kernels. Thus, by learning the M_i we expect to learn the meaningful relationships in the data, be they entangled or not. Yet, when learning even a low-rank entangled kernel, we are bound to learn more parameters in the M_i than we could learn from just a (full-rank) T, thus making the class of kernels we consider much more expressive.

To make this explicit, let us consider separable kernels in the Choi-Kraus representation framework. For the class of separable kernels U equals the identity, and the T in (10) is the same as the T in Definition 7. We can describe the operator T with a set of vectors t_j ∈ R^p for which T = Σ_j t_j t_j^⊤.

Remark 14 (Choi-Kraus representation of separable kernels) For separable kernels, each M_i in the Choi-Kraus representation (11) can be described as M_i = M_{jk} = t_j e_k^⊤, with j = 1, ..., p and k = 1, ..., N, where {e_k}_k is an orthonormal basis of R^N.

This can be confirmed with the following calculation:

Σ_{i=1}^r M_i φ(x)φ(z)^⊤ M_i^⊤ = Σ_{j=1}^p Σ_{k=1}^N t_j e_k^⊤ φ(x)φ(z)^⊤ e_k t_j^⊤
= ( Σ_{k=1}^N e_k^⊤ φ(x)φ(z)^⊤ e_k ) Σ_{j=1}^p t_j t_j^⊤
= Σ_{k=1}^N ⟨φ(x)φ(z)^⊤ e_k, e_k⟩ T
= tr(φ(x)φ(z)^⊤) T = ⟨φ(x), φ(z)⟩ T = k(x, z) T.

It is clear that the class of entangled kernels is much more expressive than the class of separable kernels. With this in mind, we now turn our attention to describing the framework in which we can learn the M_i of entangled kernels. Because the feature space of the scalar-valued kernel can be of very large dimensionality (or infinite-dimensional), we consider an approximation to speed up the computation. For example, random Fourier features or the Nyström approximation (Rahimi and Recht, 2008; Williams and Seeger, 2001) give us φ̂ such that k(x, z) = ⟨φ(x), φ(z)⟩ ≈ ⟨φ̂(x), φ̂(z)⟩. We note that our approximation is on scalar-valued kernels, not operator-valued ones, although there are methods for approximating the latter directly, too (Brault et al., 2016; Minh, 2016). Our approximated kernel is thus

K̂(x, z) = Σ_{i=1}^r M̂_i φ̂(x)φ̂(z)^⊤ M̂_i^⊤,    (12)

where φ̂(x) ∈ R^m and M̂_i ∈ R^{p×m}, and our goal is to learn the M̂_i. We can write our np × np kernel matrix as

Ĝ = Σ_{i=1}^r vec(M̂_i Φ̂) vec(M̂_i Φ̂)^⊤    (13)
  = Σ_{i=1}^r (Φ̂^⊤ ⊗ I_p) vec(M̂_i) vec(M̂_i)^⊤ (Φ̂ ⊗ I_p),⁴

where Φ̂ = [φ̂(x_1), ..., φ̂(x_n)] is of size m × n. Further, if we denote

D = Σ_{i=1}^r vec(M̂_i) vec(M̂_i)^⊤,    (14)

we can simply write

Ĝ = (Φ̂^⊤ ⊗ I_p) D (Φ̂ ⊗ I_p).    (15)

To learn an entangled kernel, we need to learn the psd matrix D.

4. Matrix cookbook Eq. (520): http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.

It is good to note that also from this formulation of the learnable entangled kernel we can easily recover the class of separable kernels. Indeed, as we already mentioned, the representation (11) is not restricted only to entangled kernels, and thus it is possible to recover other frameworks also from (15). In particular, if we choose D = I_m ⊗ T with T ∈ R^{p×p} (note that the requirement for D to be psd and symmetric also restricts T to be such), we can write

Ĝ = (Φ̂^⊤ ⊗ I_p)(I_m ⊗ T)(Φ̂ ⊗ I_p) = (Φ̂^⊤ I_m Φ̂) ⊗ (I_p T I_p) = K ⊗ T,

and see that we have recovered a separable kernel. Notably, our proposed class of learnable kernels is much more expressive than that of separable kernels, as the latter can be recovered as a special case.
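The factorization in Eqs. (13)-(15) and the separable special case can be checked numerically. The following NumPy sketch is our own illustration (column-major vec, random placeholder matrices), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, r = 6, 4, 3, 5          # samples, approx. feature dim, outputs, Kraus rank

Phi = rng.standard_normal((m, n))                      # columns are approximated features
M = [rng.standard_normal((p, m)) for _ in range(r)]    # Kraus operators to be learned in EKL

def vec(A):
    """Column-stacking vectorization, so that vec(M Phi) = (Phi^T kron I_p) vec(M)."""
    return A.flatten(order="F")

# Eq. (13), first form: G_hat = sum_i vec(M_i Phi) vec(M_i Phi)^T
G_direct = sum(np.outer(vec(Mi @ Phi), vec(Mi @ Phi)) for Mi in M)

# Eqs. (14)-(15): G_hat = (Phi^T kron I_p) D (Phi kron I_p) with D = sum_i vec(M_i) vec(M_i)^T
D = sum(np.outer(vec(Mi), vec(Mi)) for Mi in M)
L = np.kron(Phi.T, np.eye(p))
G_factored = L @ D @ L.T
assert np.allclose(G_direct, G_factored)

# separable special case: D = I_m kron T reproduces G = K kron T with K = Phi^T Phi
T = np.eye(p) + np.ones((p, p))                        # any psd output matrix
G_sep = L @ np.kron(np.eye(m), T) @ L.T
assert np.allclose(G_sep, np.kron(Phi.T @ Phi, T))
```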
5.1 The Learning Algorithm

We will now describe how to learn the psd matrix D of the entangled kernel (15), and how to efficiently formulate the vector-valued learning problem.

Kernel alignment (Cristianini et al., 2002; Cortes et al., 2012) is a measure of similarity between two (scalar-valued) kernels. The alignment between two matrices M and N is defined as

A(M, N) = ⟨M_c, N_c⟩_F / (‖M_c‖_F ‖N_c‖_F),    (16)

where the subscript c refers to centered matrices, that is, M_c = HMH with H = I_n − (1/n) 11^⊤ if M is an n × n matrix. Here 0 ≤ A(M, N) ≤ 1, with a higher value showing better alignment and higher similarity. Kernel alignment has been used in learning a kernel by considering the so-called ideal kernel as the target of the alignment. In the context of binary classification, the ideal kernel is defined as yy^⊤, where y = [y_1, ..., y_n] is the n-vector of class labels (Cristianini et al., 2002). The work of Kandola et al. (2002) extends the ideal kernel to regression and to unbalanced classification, where there are more samples from one class than from the other.

We extend the concept of alignment to the case of multiple outputs. As already mentioned, in previous works the ideal kernel has been the linear kernel defined on the output values, yy^⊤. Extended to the multi-output setting, it is natural to consider the linear kernel K_Y = Y^⊤Y, where Y is of size p × n and contains the labels associated with data sample i in its ith column. Yet, if we wish to use this ideal kernel in learning our entangled kernel we face a problem, as we would be trying to align an np × np matrix with an n × n one. We thus propose, as the first part of our optimization problem, to align the partial trace of our entangled kernel matrix, tr_p(Ĝ), with the output kernel K_Y. As this term does not consider the full operator-valued kernel matrix, for the second term we consider y = vec(Y), a vectorization of the matrix containing the labels, and take the matrix K_y = yy^⊤ as the second ideal kernel in our setting. In this kernel each p × p block is an outer product between the labels associated with the samples, that is, [K_y]_{ij} = y_i y_j^⊤, and we can directly align the np × np matrix K_y with the full entangled kernel Ĝ. Our optimization problem is thus a convex combination

max_D (1 − γ) A(tr_p(Ĝ), Y^⊤Y) + γ A(Ĝ, yy^⊤),    (17)

where γ ∈ [0, 1]. We note that by applying Lemma 2.11 from Filipiak et al. (2018), we can write tr_p(Ĝ) = Φ̂^⊤ tr_p(D) Φ̂. Intuitively, the first alignment term learns a scalar-valued kernel matrix that can be obtained via the partial trace applied to the more complex operator-valued kernel, while the second term focuses on the possibly entangled relationships in the data. One possibility for using the entangled kernel framework is thus to learn a scalar-valued kernel for a multi-output problem using the partial-trace formulation and to use it in a kernel machine.
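A sketch of how the two alignment terms in (17) can be evaluated is given below. This is our own illustration with our own helper names, assuming D = QQ^⊤; the actual maximization of the objective over D (described next) is omitted.

```python
import numpy as np

def centered_alignment(M, N):
    """Centered kernel alignment of Eq. (16): A(M, N) = <Mc, Nc>_F / (||Mc||_F ||Nc||_F)."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Mc, Nc = H @ M @ H, H @ N @ H
    return np.sum(Mc * Nc) / (np.linalg.norm(Mc) * np.linalg.norm(Nc))

def ekl_objective(Q, Phi, Y, gamma):
    """Value of the EKL alignment objective (17) for D = Q Q^T (evaluation only)."""
    m, n = Phi.shape
    p = Y.shape[0]
    L = np.kron(Phi.T, np.eye(p))
    G = L @ Q @ Q.T @ L.T                                      # entangled Gram matrix, np x np
    G_ptr = np.trace(G.reshape(n, p, n, p), axis1=1, axis2=3)  # tr_p(G): trace of each p x p block
    y = Y.flatten(order="F")                                   # vec(Y), label vectors stacked per sample
    return (1 - gamma) * centered_alignment(G_ptr, Y.T @ Y) \
           + gamma * centered_alignment(G, np.outer(y, y))
```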
The optimization is performed on sphere manifold as a way to normalize D.5 After we have learned the entangled kernel, we solve the learning problem by choosing the squared loss function min c vec(Y) ˆGc 2 + λ ˆGc, c . (18) For this c update we can find the classical closed-form solution, c = ( ˆG+λI) 1 vec(Y). Note that this computation is, by considering the entangled structure of ˆG, more computationally efficient than a general (say, some transformable) operator-valued kernel. Generally we can say that the complexity of calculating the predictive function with nonseparable operatorvalued kernels is O(n3p3). In our proposed network, however, we can apply the Woodbury formula for the matrix inversion and only invert a r r matrix. We can assume that m n and often in multi-output data sets p n. Furthermore as r mp, we can see that n dominates the complexity calculations. Thus we write the complexity of the cupdate as O(r3 +mnp2), where we have also considered the most costly steps of the matrix multiplications involved. Moreover it is possible, with our kernel learning framework, to learn a scalar-valued kernel by first learning the entangled kernel as proposed and then extracting the scalarvalued one by using partial trace operator. The scalar-valued kernel proposed to be used in predicting is now K = trp( ˆG), and it can be used in regression setting as usual; calculating c K = (K + λI) 1Y and using that to obtain predictions. We note that even here our 5. We used the toolbox from pymanopt.github.io for the implementation (Townsend et al., 2016). Entangled Kernels learning c predicting Y No structure O(n3p3) O(tnp2) Separable O(n3 + p3) O(tnp + np2) Low-rank separable O(m3 + m2n + p3) O(tp2 + (n + t)pm) Entangled O(r3 + mnp2) O((t + n + r)pm) Entangled, ptr O(m3 + mnp) O(tm2 + (n + t)pm) Table 2: Complexities of calculating parameters of predictive function (vector c), and predicting labels (calculating Gtc with Gt of size tp np) with various operatorvalued kernels. The low-rank separable refers to the case when the scalarvalued kernel matrix of the separable kernel has a low-rank approximation, i.e. G = K T = UU T with some U of size n m with m < n. framework brings forward advancements in computational complexity; after having an entangled kernel the cost of calculating c K using again Woodbury formula is centered around inverting a m m matrix. Again when we assume m n, and consider the most dominant term arising from the matrix multiplications, the complexity of the c K-step is O(m3+mnp). There are also gains in predictive complexity. Predicting with a general operator-valued kernel has the complexity O(tnp2). With an entangled kernel of our formulation, this reduces to O((n + t + r)mp). For the ptr EKL kernel, assuming partial trace of D is already calculated, the predictive complexity is O(tm2+(n+t)mp) instead of O(tnp), beneficial with small m. The complexities of calculating the parameters (c) of the predictive function and predicting with various operator-valued kernels are summarized in Table 2. It is good to note that the complexities for separable kernels could be reduced by approximating the scalarvalued kernel matrix, as well as that it would be also possible to approximate a general operator-valued kernel matrix. While we show this approximation for separable kernels in Table 2 for completeness, it is good to remember that these kind of approximations are not intrinsic to the approaches, unlike with our entangled kernels. 
Additionally we have assumed in the complexities for general and separable kernels that the big operatorvalued kernel matrix G and scalar-valued kernel matrix K as well as the output matrix T are already known at the time of the computations. Similarly, we have assumed that for entangled kernels the Φ and Q are already available at the time of these calculations. 5.2 Rademacher Generalization Bound We now provide a generalization analysis of our EKL algorithm using Rademacher complexities (Bartlett and Mendelson, 2002). The notion of Rademacher complexity has been generalized to vector-valued hypothesis spaces (Maurer, 2006; Sindhwani et al., 2013; Sangnier et al., 2016). Previous work has analyzed the case where the matrix-valued kernel is fixed prior to learning, while our analysis considers the kernel learning problem, similarly to the bound in Huusari et al. (2018). It provides a Rademacher bound for our algorithm when both the vector-valued function f and the kernel via Q, are learnt. We start by recalling that the feature map associated to the matrix-valued kernel K is the mapping Huusari and Kadri Γ : X L(Y, HK), where X is the input space, Y = Rp, and L(Y, HK) is the set of bounded linear operators from Y to HK (see, e.g., Micchelli and Pontil, 2005; Carmeli et al., 2010 for more details). It is known that K(x, z) = Γ(x) Γ(z). We denote by ΓQ the feature map associated to our entangled kernel (Equation 15). The hypothesis class of EKL is Hβ = {x 7 fu,Q(x) = ΓQ(x) u : Q , u H β}, with = {Q : Q F = 1} and β is a regularization parameter. Let σ1, . . . , σv be an iid family of vectors of independent Rademacher variables where σi Rp, i = 1, . . . , n. The empirical Rademacher complexity of the vector-valued class Hβ is the function ˆRn(Hβ) defined as ˆRn(Hβ) = 1 sup f H sup Q i=1 σ i fu,Q(xi) Theorem 15 (Rademacher complexity bound for EKL) The empirical Rademacher complexity of Hβ can be upper bounded as ˆRn(Hβ) β rκp for kernels k that satisfy k(x, x) κ, x X. The proof for the theorem can be found in Appendix B. We now make use of this result to bound the generalization error of EKL for the case where the second stage of the algorithm is kernel ridge regression (Equation 18). Similar results can be given using other algorithms such as SVMs in the second stage. Corollary 16 (Generalization bound for EKL) Let M > 0. Assume that f(x) y Y M for all (x, y) X Y and f Hβ. Then, the following holds with probability larger than 1 δ over samples of size n for all f Hβ: R(f) ˆR(f) + 4 δ 2n , (19) where R(f) := E(x,y) f(x) y 2 Y is the expected risk w.r.t. the square loss and ˆR(f) := 1 n Pn i=1 f(xi) yi 2 Y is the empirical risk of f. The proof of Corollary 16, which is deferred to Appendix C, is based on a general Rademacher complexity learning bound provided in Bartlett and Mendelson (2002) (see also Mohri et al. 2018, Theorem 3.3) and a vector-contraction inequality for the Rademacher complexity of classes of vector-valued functions proved in Maurer (2016). 6. Experiments In this section we investigate the performance of our algorithm with real and artificial data sets. We start with an illustration of its behaviour, and move on to experiments on the predictive performance. In this latter setting, we compare our proposed Entangled Kernel Learning (EKL)6 algorithm to OKL (Dinuzzo et al., 2011); a kernel learning method for 6. Code is available at RH s personal website, riikkahuusari.com. 
separable kernels (we use the code provided by the authors⁷), and KRR, kernel ridge regression. Furthermore, we investigate the performance of predicting with the scalar-valued kernel extracted from the operator-valued kernel matrix EKL learns, and call this ptr EKL. In all the experiments we cross-validate over various regularization parameters λ, and for EKL also over the values of γ controlling the combination of alignments. In the experiments we consider (normalized) mean squared error (nMSE) and normalized improvement over KRR (nI) (Ciliberto et al., 2015) as error measures. Finally, we conclude the experimental section by comparing the performance of learning and predicting with various operator-valued kernels, showing the advantage of entangled kernels.

6.1 Illustration

We consider the digits data set available at scikit-learn⁸ (Pedregosa et al., 2011) and make it into a multi-task data set by encoding class memberships into vectors consisting of values −1, +1. We consider only the first four classes (digits 0-3) for clarity of illustration; in this case, for example, the label vector (−1, −1, +1, −1) stands for digit 2. We took 25 data samples from each class as training samples, and a further 25 from each class as test samples. We trained EKL by restricting the number of columns in the matrix Q, r, to two. This drastic reduction of learnable parameters is not expected to produce the best accuracy (as shown in the later experiments), but instead allows us to visualize and illustrate in part how EKL works. In this setting the learned EKL matrix is not separable, according to the PPT condition (end of Section 2.1).

Figure 5 shows parts of the entangled kernel matrix, so that in each part the samples making it up belong to one class of digits. We can see that for each class the task relationships are modelled differently. This is vastly different from the OKL approach, where the p × p output matrix has the same structure for all the data samples, as we illustrated in Figure 1. In our case the output matrix structure is extremely dependent on the inputs.

Another way to see this dependency is via an application to supervised multi-task dimensionality reduction. We note that Z = (Φ̂^⊤ ⊗ I_p)Q is a matrix of size np × r and acts as an approximated feature map of the operator-valued kernel matrix, in the sense that the full kernel is ZZ^⊤. We can interpret Z as giving a dimensionality reduction of the data with respect to each of the outputs individually; that is, we have p dimensionality reductions of size n × r. As a way to project data to lower dimensions, our illustrative approach closely resembles supervised dimensionality reduction (Fukumizu et al., 2004; Sugiyama, 2006). We note that the focus of our work is not in this specific task, nor is our framework directly applicable to those works. We consider multi-output learning, making the application of EKL to this problem the first in supervised multi-task dimensionality reduction, as far as we know. Figure 6 shows the results of the low-dimensional projection for each of the tasks. The supervised dimensionality reduction is successful for all the tasks: we can see that each task is modelled individually, as the samples corresponding to the task in question are projected to a cluster separate from the other samples. An advantage of our approach is that our model allows calculating new reduced features for new data samples; given Φ̂_t, a matrix of features of new samples, the predicted reduced features are simply Z_t = (Φ̂_t^⊤ ⊗ I_p)Q. These projections are for the most part very accurate, as shown in Figure 6, respecting the task clusters especially for the easiest classes.

7. http://people.tuebingen.mpg.de/fdinuzzo/okl.html
8. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
Figure 5: Small selections (bottom and right images) of the np × np entangled kernel matrix G (top left) trained on the digits data set. The data in the big matrix is ordered so that all samples of class 0 come first, then classes 1, 2 and finally 3; for clarity of illustration, white lines separate the classes. Each small selection contains 3 samples of the class(es) indicated in the title of the subfigure. For this data the output of the kernel function is a 4 × 4 matrix. The structure of the output matrix is extremely dependent on the inputs, and the off-diagonal p × p blocks of the kernel matrix are not symmetric, unlike with separable kernels.

Figure 6: Supervised dimensionality reduction results in two dimensions for each of the four tasks in the digits data set, the task in question indicated by the subfigure title. The classes of the projected data samples are indicated with colours, and the shape of the marker indicates whether the sample was used in the training stage (round) or projected as a test sample (triangle).

6.2 Performance of EKL

We now turn to the predictive performance of EKL. We first show experiments on artificial data, before turning to real-world data sets.

6.2.1 Artificial data

EKL is expected to learn complex relationships within the data. To illustrate this, we created data with the bi-linear model TCA + ICK = Y, where T, C and A are randomly created p × p, p × n and n × n matrices, respectively. K is a linear kernel calculated from randomly generated data X ∈ R^{n×d}; this scalar-valued kernel is given to all the learning algorithms along with the noisy labels Y.

The results are shown in Figure 7. We can see that when p is larger than n (or comparable to it), the predictive capabilities of EKL are much better than those of the other methods. Here predicting with the scalar-valued kernel extracted from the learned entangled kernel gives the best results.

Figure 7: Results (mean squared error) of the simulated experiments with a fixed number of inputs and a varying number of outputs p (top), and a fixed number of outputs and a varying number of inputs n (bottom). The advantage of learning complex relationships is biggest with small n.

We also investigated the effect that the choice of the rank r (the number of columns in the matrix Q) has on EKL performance (Figure 8). As the rank increases, the performance of EKL gets better. This is true to an extent also for ptrEKL, but the effect is not as strong as with EKL. This is not so surprising: ptrEKL has fewer parameters affecting predictive performance, so decreasing their number should not change the results as much as for full EKL.

Figure 8: Results (mean squared error) on the simulated data set with a varying number of columns used in learning Q, with n = 20 and p = 100. The number of columns, or rank level, is shown as a percentage of the full possible rank. OKL and KRR are plotted as a reference; naturally the rank constraint of EKL does not affect these methods.
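For concreteness, the sketch below gives one possible reading of the simulated setup described above; treating I as the p × p identity (so that the second term reduces to CK), the noise level and the exact dimensions are our assumptions rather than the precise protocol behind Figures 7 and 8.

```python
import numpy as np

def make_bilinear_data(n=20, p=100, d=10, noise=0.1, seed=0):
    """Simulated multi-output data following the bi-linear model
    Y = T C A + I C K (+ noise), with K a linear kernel on random inputs X."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    K = X @ X.T                        # scalar-valued linear kernel, n x n
    T = rng.standard_normal((p, p))    # output-output interactions
    C = rng.standard_normal((p, n))
    A = rng.standard_normal((n, n))
    Y = T @ C @ A + C @ K              # p x n matrix of noise-free labels
    Y = Y + noise * rng.standard_normal(Y.shape)
    return X, K, Y

X, K, Y = make_bilinear_data()
```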
method       n = 50                    n = 100                   n = 200                   n = 1000
KRR          0.2418 ± 0.0281 (0.0000)  0.1668 ± 0.0097 (0.0000)  0.1441 ± 0.0037 (0.0000)  0.1273 ± 0.0006 (0.0000)
OKL          0.2445 ± 0.0296 (-0.0109) 0.1672 ± 0.0099 (-0.0026) 0.1442 ± 0.0037 (-0.0009) 0.1273 ± 0.0006 (-0.0000)
EKL/ptrEKL   0.2381 ± 0.0250 (0.0139)  0.1661 ± 0.0097 (0.0040)  0.1440 ± 0.0037 (0.0003)  0.1273 ± 0.0006 (0.0001)

Table 3: Results (nMSE ± std, with nI in parentheses) on the Sarcos data set with various numbers of training samples, averaged over 10 data partitions. The advantage of learning complex relationships decreases as the number of data samples increases.

method       n = 5                   n = 10                  n = 15
KRR          0.951 ± 0.101 (0.000)   0.813 ± 0.141 (0.000)   0.761 ± 0.037 (0.000)
OKL          1.062 ± 0.250 (-0.092)  0.900 ± 0.196 (-0.094)  0.788 ± 0.058 (-0.034)
EKL/ptrEKL   0.840 ± 0.084 (0.124)   0.722 ± 0.036 (0.107)   0.728 ± 0.033 (0.044)

Table 4: Results (nMSE ± std, with nI in parentheses) on the Weather data set, averaged over 5 data partitions.

6.2.2 Real data

We consider the following regression data sets: Concrete slump test (UCI data set repository), with 103 data samples and three output variables; Sarcos (www.gaussianprocess.org/gpml/data/), a data set characterizing robot arm movements with 7 tasks; and Weather (https://www.psych.mcgill.ca/misc/fda/), which contains daily weather data (p = 365) from 35 stations. With these data sets we considered linear kernels, used the original features in EKL, and used full rank in learning Q. Furthermore, we consider the uWaveGesture data set (http://www.cs.ucr.edu/~eamonn/time_series_data/), a multi-view data set for classification, which we use in a setting of predicting one view from another, giving us a regression problem with 314 output variables. We again consider linear kernels with this data set, and also investigate the effect of the chosen rank of the matrix Q.

The main advantage of learning complex dependencies in the data lies in the setting where the number of samples is relatively low, a phenomenon already observed in the output kernel learning setting (Ciliberto et al., 2015; Jawanpuria et al., 2015). With small amounts of data, learning the complex relationships in EKL is even more beneficial than learning the output dependencies of OKL. Figure 9 shows this advantage on the Concrete data set when the number of instances used in training is small. Here, in contrast to our simulated data, EKL performs better than ptrEKL. For the Sarcos data set we consider the setting of Ciliberto et al. (2015) and show the results in Table 3 (prediction is done on all 5000 test samples). As can be expected, the results on the Sarcos data set, which has very few outputs (p = 7), do not show improvement over the compared methods when the number of samples is large. However, we can ascertain that EKL finds the same solutions as the other methods with the large sample sizes; indeed the methods perform identically when n increases. We expect that the main improvement of EKL lies in the cases where n and p are comparable, or p > n.

Figure 9: Results (normalized mean squared error) on the Concrete data set with a varying amount of training data (n). The advantage of learning complex relationships is biggest with small n.
This is clearly seen with the Weather data set, where the number of outputs is much larger than the number of data samples (Table 4).

While we investigate the running times of the compared algorithms more in depth in the next subsection, Figure 10 displays them for the Concrete data set. The figure shows the combined time for calculating the predictive function given the kernel and for predicting.

To further show the advantage of our method in the case where n < p, we turn our attention to the uWaveGesture data set with three views, each of dimensionality 314. The training partition contains in total 896 data samples from the three views and eight classes. However, as we want to investigate the case with a large distinction between n and p, we only consider the smaller sets of data belonging to each class, with around 100 samples each, which we divide into training and testing sets for the compared algorithms. Tables 6 and 7 show the results.

Figure 10: Running times in seconds (including calculating the predictive function given a kernel, and predicting) on the Concrete data set with a varying amount of training data (n).

method    n = 12                  n = 20                  n = 40
KRR       1.070 ± 0.347 (0.000)   0.710 ± 0.183 (0.000)   0.552 ± 0.065 (0.000)
OKL       1.069 ± 0.347 (0.001)   0.710 ± 0.183 (0.001)   0.552 ± 0.064 (0.000)
EKL       0.796 ± 0.164 (0.266)   0.634 ± 0.103 (0.097)   0.547 ± 0.046 (0.007)
ptrEKL    0.843 ± 0.186 (0.212)   0.726 ± 0.170 (-0.023)  0.627 ± 0.058 (-0.130)

Table 5: Results (nMSE ± std, with nI in parentheses) on the Concrete data set, averaged over 10 data partitions.

                   n = 20           n = 30           n = 40
Class 1   KRR      3.012 ± 0.2337   2.853 ± 0.0780   2.262 ± 0.3157
          OKL      3.403 ± 0.0899   3.232 ± 0.1877   2.440 ± 0.3451
          EKL      1.136 ± 0.0937   1.123 ± 0.0852   1.107 ± 0.1019
          ptrEKL   1.323 ± 0.1807   1.208 ± 0.0932   1.132 ± 0.1221
Class 2   KRR      1.773 ± 0.0655   2.008 ± 0.3391   1.937 ± 0.3170
          OKL      1.802 ± 0.0603   2.096 ± 0.3673   2.074 ± 0.3483
          EKL      0.991 ± 0.0517   0.943 ± 0.0237   0.908 ± 0.0130
          ptrEKL   1.026 ± 0.0509   0.940 ± 0.0143   0.914 ± 0.0138
Class 3   KRR      1.081 ± 0.1437   1.028 ± 0.1235   0.902 ± 0.1020
          OKL      1.173 ± 0.1720   1.146 ± 0.1753   0.993 ± 0.1486
          EKL      0.671 ± 0.0202   0.638 ± 0.0214   0.632 ± 0.0325
          ptrEKL   0.681 ± 0.0226   0.651 ± 0.0345   0.632 ± 0.0307

Table 6: Normalized mean squared errors (± std) over three runs on the uWaveGesture data set, classes 1-3, with a varying amount of training data and full rank of the EKL.

                   rank 100%        rank 60%         rank 20%
Class 1   EKL      1.123 ± 0.0852   1.134 ± 0.0762   1.171 ± 0.1199
          ptrEKL   1.208 ± 0.0932   1.228 ± 0.0970   1.325 ± 0.1630
Class 2   EKL      0.943 ± 0.0237   0.939 ± 0.0091   0.941 ± 0.0074
          ptrEKL   0.940 ± 0.0143   0.958 ± 0.0053   0.987 ± 0.0137
Class 3   EKL      0.638 ± 0.0214   0.644 ± 0.0222   0.654 ± 0.0271
          ptrEKL   0.651 ± 0.0345   0.647 ± 0.0344   0.647 ± 0.0394

Table 7: Normalized mean squared errors (± std) over three runs with n = 30 on the uWaveGesture data set, classes 1-3, with a varying rank of the EKL.
In Table 6 we present the errors for the first three classes with the full rank parameter r in the EKL algorithms. This setting is very extreme, with the tiny sample size and large p, and it is clear that our method obtains the best results. We also detail in Table 7 the EKL results with respect to the rank parameter r, at 100%, 60% and 20% of the full rank. Even though the error rises a little in most cases, the increase is not significant, especially compared to the results of KRR and OKL. This further justifies the use of our algorithm in the more efficient, low-rank setting.

Figure 11: Times for training (top row) and predicting (bottom row) with the various operator-valued kernels, with fixed n or p as indicated in the subtitle of each panel. For entangled and partial-trace entangled kernels we show results with various choices of m and r, indicated by the number after the method name in the legend; for example, "entangled 0.75" means that m = 0.75n and r = 0.75mp, i.e., both are 75% of their largest possible value. t = n for all the experiments.

6.3 Running Times of Learning with OvKs

As we show in Table 2, there is a big difference in computational complexity between the various operator-valued kernels, namely in the complexities of calculating the parameters c of the predictive function and of calculating the label predictions. In this section we perform experiments to highlight these differences. We note that here we do not learn the kernels, but only compare pre-defined entangled, separable and general operator-valued kernels.

Figure 12: A closer look at the times for training (top row) and predicting (bottom row) with the various operator-valued kernels, with fixed n or p as indicated in the subtitle of each panel. For entangled and partial-trace entangled kernels we show results with various choices of m and r, indicated by the number after the method name in the legend; for example, "entangled 0.75" means that m = 0.75n and r = 0.75mp, i.e., both are 75% of their largest possible value. t = n for all the experiments.

In our experiments we created random data (random kernel matrices, and random parameters c from which labels were calculated) with which we performed our calculations. We repeated our experiments five times with different random data, and each time timed the execution of learning and predicting five times. In our results (Figures 11 and 12) we present the averages of these runs.
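The differences measured in these runs ultimately come from how the system giving the parameters c (kernel ridge regression, Equation 18, essentially a solve with the regularized kernel matrix) can be computed for each kernel structure. The snippet below is a rough sketch of that contrast, our own illustration rather than the experiment code and with all names and sizes assumed: a dense solve for a separable kernel G = K ⊗ T versus a Woodbury-style solve exploiting the low-rank factorization G = ZZ^T available for an entangled kernel.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, p, r, lam = 150, 20, 60, 0.1
y = rng.standard_normal(n * p)

# Separable kernel: dense (n*p) x (n*p) solve of (G + lam*I) c = y.
K = rng.standard_normal((n, n)); K = K @ K.T
T = rng.standard_normal((p, p)); T = T @ T.T
G_sep = np.kron(K, T)
t0 = time.perf_counter()
c_sep = np.linalg.solve(G_sep + lam * np.eye(n * p), y)
t_sep = time.perf_counter() - t0

# Entangled kernel given through a low-rank factor Z, G = Z Z^T with Z of size (n*p) x r.
Z = rng.standard_normal((n * p, r))
t0 = time.perf_counter()
# Woodbury identity: (Z Z^T + lam*I)^{-1} y = (y - Z (lam*I_r + Z^T Z)^{-1} Z^T y) / lam
c_ent = (y - Z @ np.linalg.solve(lam * np.eye(r) + Z.T @ Z, Z.T @ y)) / lam
t_ent = time.perf_counter() - t0

print(f"dense solve: {t_sep:.4f}s   low-rank solve: {t_ent:.4f}s")
```

Separable kernels of course admit shortcuts of their own (for instance via eigendecompositions of K and T); the point of the sketch is only that the low-rank entangled structure keeps the computation tractable without assuming separability.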
Figure 11 shows the times of the runs for all the methods listed in Table 2. As expected, the operator-valued kernels with no structure are among the worst in every run, while scalar-valued kernels are much faster. The performance of entangled kernels naturally rests on the crucial parameters m and r; in the experiments we modify both at the same time, from their full values to 75% and 50% of the full value. With the maximum m and r, entangled kernels are as slow as the general operator-valued kernels, but outperform them in the prediction step. With 75% they train comparably to separable kernels, while outperforming them with 50%. The partial trace version of the entangled kernels naturally has the lowest computational cost in the training step. In predicting, entangled kernels are always faster than general operator-valued kernels. Separable and partial trace kernels perform similarly, which can be seen better in Figure 12, which gives a closer look at the separable and entangled kernels.

7. Conclusion

In this work we shed new light on the meaning of inseparable kernels by defining a general framework for constructing operator-valued kernels, based on the notion of partial trace and on ideas borrowed from the field of quantum computing. Instances of our framework include entangled kernels, a new and conceptually interesting class of kernels designed to capture more complex dependencies between input and output variables than the more restricted class of separable kernels. We have proposed a new algorithm, entangled kernel learning (EKL), that learns an entangled kernel in a kernel alignment framework. The first step uses a definition of kernel alignment, extended here for use with operator-valued kernels with the help of the partial trace operator. In contrast to output kernel learning (OKL), EKL is able to learn inseparable kernels and can model a larger variety of interactions between input and output data. Moreover, the structure of entangled kernels enables more efficient computation than with general operator-valued kernels. Our illustration on artificial data and our experiments on real data validate our approach.

As with previous work on separable kernels, the main advantage of learning complex relations in the data lies in the setting where the number of data samples is relatively low. In the experimental section we observed that the difference in performance between EKL and OKL decreases as the sample size increases. We think one possible reason is that the number of outputs relative to the number of samples is not as large in that regime, and it will be interesting to thoroughly investigate EKL in the setting where the number of outputs is even larger. Moreover, the effect of choosing the number of columns in the matrix Q (or, equivalently, the rank of the entangled kernel) warrants further study, especially with small values (the low-rank kernel setting). It is also well known that the feature representation of a kernel is not unique, and we leveraged this in our work; studying the effect of this choice, as well as any theoretical consequences it has, could be worthwhile.

Using the Kraus representation, entangled kernels can be viewed as a mapping from a covariance matrix on the input features to a covariance matrix on the output labels. Recently, Riemannian networks for symmetric positive definite (SPD) matrix learning were introduced in Huang and Gool (2017).
These networks receive SPD matrices, such as covariance descriptors, as inputs, and preserve their SPD structure across the layers. The SPD structure is preserved via the bilinear mapping (BiMap) layer, which can be seen as a particular case of a Choi-Kraus representation with Kraus rank equal to 1. Generalizing SPD networks to higher Kraus ranks could provide a way to efficiently learn entangled kernels with deep learning machinery, and would be an interesting direction to explore in future work.

Overall, we hope that our work, featuring the first comprehensive description of how to learn non-separable operator-valued kernels, will give a boost to the field of learning inseparable kernels.

Acknowledgments

We thank the anonymous reviewers for their helpful comments and suggestions, which improved the quality of the paper. We also thank F. Denis, P. Arrighi and G. Di Molfetta for fruitful discussions. This work has been funded by the French National Research Agency (ANR) project QuantML (grant number ANR-19-CE23-0011). A large part of this research was done while R.H. was at Aix-Marseille University; the part done at Aalto University has been funded in part by Academy of Finland grants 334790 (MAGITICS) and 310107 (MACOME).

Appendix A. Proof for the Choi-Kraus Representation Theorem (Theorem 13)

The following result is useful for proving Theorem 13.

Theorem 17 (Watrous, 2018, Theorem 2.22) Let Φ be a linear map from L(K) to L(Y), where K and Y are Euclidean spaces. The following statements are equivalent:
1. There exists an operator A ∈ L(K, Y ⊗ Z), for some choice of a Euclidean space Z, such that Φ(X) = tr_Z(AXA*) for all X ∈ L(K).
2. There exists a collection {A_a ∈ L(K, Y) : a ∈ Σ}, for some choice of an alphabet Σ, for which

\Phi(X) = \sum_{a \in \Sigma} A_a X A_a^*

for all X ∈ L(K).

As mentioned in Watrous (2018), this theorem is an amalgamation of results that are generally attributed to Stinespring (1955), Kraus (1971, 1983) and Choi (1975), and presents only the finite-dimensional analogues of the results they proved, which also hold for infinite-dimensional spaces.

Proof of Theorem 13 The proof follows the same arguments as the proof of Theorem 6.5 in Attal (2015a), which holds for infinite-dimensional spaces. It is based on the observation that for v ∈ Y we have

vv^* \otimes (\phi(x)\phi(x)^*) = O_v\,(\phi(x)\phi(x)^*)\,O_v^*,

in which O_v is the operator defined by

O_v : K \to Y \otimes K, \quad u \mapsto v \otimes u,

and O_v^* is its adjoint, defined by

O_v^* : Y \otimes K \to K, \quad y \otimes u \mapsto \langle v, y\rangle\, u.

Writing the decomposition T = \sum_t v_t v_t^*, the entangled kernel K can now be written as follows:

K(x, z) = \operatorname{tr}_K[\, U (T \otimes \phi(x)\phi(z)^*) U^* \,]
        = \sum_t \operatorname{tr}_K[\, U (v_t v_t^* \otimes \phi(x)\phi(z)^*) U^* \,]
        = \sum_t \operatorname{tr}_K[\, U O_{v_t} (\phi(x)\phi(z)^*) O_{v_t}^* U^* \,]
        = \sum_t \operatorname{tr}_K[\, S_t (\phi(x)\phi(z)^*) S_t^* \,],

where S_t = U O_{v_t}. Using Theorem 17, which is also valid for infinite-dimensional spaces (Watrous, 2018), we obtain that there exists a set of matrices {M_i, 1 ≤ i ≤ r} for which K(x, z) = \sum_{i=1}^r M_i \phi(x)\phi(z)^* M_i^*. This completes the proof.

Appendix B. Proof of Theorem 15 (Rademacher Bound for EKL)

We provide here the proof of our Rademacher complexity bound.

Proof of Theorem 15 We start by recalling that the feature map associated to the operator-valued kernel K is the mapping Γ : X → L(Y, H), where X is the input space, Y = R^p, and L(Y, H) is the set of bounded linear operators from Y to H (see, e.g., Micchelli and Pontil, 2005; Carmeli et al., 2010 for more details). It is known that K(x, z) = Γ(x)*Γ(z). We denote by Γ_Q the feature map associated to our entangled kernel (Equation 15).
We also define Σ = (σ_i)_{i=1}^n ∈ R^{np}, the vector obtained by stacking the Rademacher vectors σ_1, ..., σ_n. We then have

\hat{\mathcal{R}}_n(\mathcal{H}_\beta)
 = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|u\|\le\beta}\,\sup_{Q}\,\sum_{i=1}^n \langle\sigma_i, f_{u,Q}(x_i)\rangle\Big]
 = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{u}\,\sup_{Q}\,\sum_{i=1}^n \langle\sigma_i, \Gamma_Q(x_i)^* u\rangle_{\mathbb{R}^p}\Big]  (1)
 = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{u}\,\sup_{Q}\,\Big\langle \sum_{i=1}^n \Gamma_Q(x_i)\sigma_i,\, u\Big\rangle_{\mathcal{H}}\Big]
 \le \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\Big\| \sum_{i=1}^n \Gamma_Q(x_i)\sigma_i \Big\|_{\mathcal{H}}\Big]  (2)
 = \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\Big( \sum_{i,j=1}^n \langle\sigma_i, K_Q(x_i, x_j)\sigma_j\rangle_{\mathbb{R}^p} \Big)^{1/2}\Big]  (3)
 = \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\big( \langle\Sigma, G_Q\Sigma\rangle_{\mathbb{R}^{np}} \big)^{1/2}\Big]
 = \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\big\langle\Sigma, [\Phi\otimes I_p]QQ^*[\Phi\otimes I_p]^*\Sigma\big\rangle^{1/2}\Big]
 = \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\operatorname{tr}\big([\Phi\otimes I_p]^*\Sigma\Sigma^*[\Phi\otimes I_p]\,QQ^*\big)^{1/2}\Big]
 \le \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\operatorname{tr}\big(\big[[\Phi\otimes I_p]^*\Sigma\Sigma^*[\Phi\otimes I_p]\big]^2\big)^{1/4}\operatorname{tr}(QQ^*QQ^*)^{1/4}\Big]  (4)
 = \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\sup_{Q}\,\operatorname{tr}\big([\Phi\otimes I_p][\Phi\otimes I_p]^*\Sigma\Sigma^*\big)^{1/2}\operatorname{tr}(QQ^*QQ^*)^{1/4}\Big]
 \le \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\operatorname{tr}\big([\Phi\Phi^*\otimes I_p]\Sigma\Sigma^*\big)^{1/2}\Big]
 = \frac{\beta}{n}\,\mathbb{E}_\sigma\Big[\operatorname{tr}\big([K\otimes I_p]\Sigma\Sigma^*\big)^{1/2}\Big]
 \le \frac{\beta}{n}\,\Big(\mathbb{E}_\sigma\operatorname{tr}\big([K\otimes I_p]\Sigma\Sigma^*\big)\Big)^{1/2}  (5)
 = \frac{\beta}{n}\,\Big(\operatorname{tr}\big([K\otimes I_p]\,\mathbb{E}_\sigma(\Sigma\Sigma^*)\big)\Big)^{1/2}
 = \frac{\beta}{n}\,\big(p\,\operatorname{tr}(K)\big)^{1/2}.

Here (1) and (3) are obtained with the reproducing property, (2) and (4) with the Cauchy-Schwarz inequality, and (5) with Jensen's inequality. The equality following (4) uses the fact that ΣΣ^* has rank one; the supremum over Q then disappears because ‖Q‖_F = 1 implies tr(QQ^*QQ^*) ≤ 1, and the last equality uses E_σ(ΣΣ^*) = I_{np}. For kernels k that satisfy tr(K) ≤ κn, we obtain ˆR_n(H_β) ≤ β√(κp/n), which completes the proof.

Appendix C. Proof of Corollary 16 (Generalization Bound for EKL)

The following two results are useful for proving Corollary 16. First, Bartlett and Mendelson (2002) provide the following generalization bound based on Rademacher complexity (see also Mohri et al. 2018, Theorem 3.3).

Theorem 18 (Mohri et al. 2018, Theorem 3.3) Let G be a family of functions mapping from an arbitrary input space Z to [0, M]. Then, for any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample S = (z_1, ..., z_n) of size n, the following holds for all g ∈ G:

\mathbb{E}[g(z)] \le \frac{1}{n}\sum_{i=1}^n g(z_i) + 2\hat{\mathcal{R}}_n(\mathcal{G}) + 3M\sqrt{\frac{\log(2/\delta)}{2n}},    (20)

where ˆR_n(G) is the empirical Rademacher complexity of G.

The second result, from Maurer (2016), provides a contraction inequality for the Rademacher complexity of classes of vector-valued functions.

Corollary 19 (Maurer 2016, Corollary 1) Let X be any set, (x_1, ..., x_n) ∈ X^n, let F be a class of functions f : X → ℓ_2 and let h_i : ℓ_2 → R have Lipschitz norm L. Then

\mathbb{E}\,\sup_{f\in\mathcal{F}}\sum_{i} \sigma_i h_i(f(x_i)) \le \sqrt{2}\,L\,\mathbb{E}\,\sup_{f\in\mathcal{F}}\sum_{i,j} \sigma_{ij} f_j(x_i),    (21)

where σ_{ij} is an independent doubly indexed Rademacher sequence and f_j(x_i) is the j-th component of f(x_i).

We now make use of the above results to prove the generalization error bound of EKL.

Proof of Corollary 16 Since ‖f(x) − y‖_Y ≤ M for all (x, y) ∈ X × Y and f ∈ H_β, for any y the function y' ↦ ‖y' − y‖²_Y is 2M-Lipschitz. Then, by Corollary 19, for any sample S = ((x_1, y_1), ..., (x_n, y_n)), the Rademacher complexity of the family G = {(x, y) ↦ ‖f(x) − y‖²_Y : f ∈ H_β} is upper bounded as

\hat{\mathcal{R}}_n(\mathcal{G}) \le 2\sqrt{2}\,M\,\hat{\mathcal{R}}_n(\mathcal{H}_\beta).    (22)

Combining this inequality with the general Rademacher complexity learning bound of Theorem 18 and the Rademacher complexity bound for H_β given in Theorem 15 completes the proof.

References

Mauricio A. Alvarez and Neil D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12:1459-1500, 2011.
Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012.
Stéphane Attal. Lectures in quantum noise theory. Lecture 6: Quantum channels, 2015a.
Stéphane Attal. Lectures in quantum noise theory. Lecture 2: Tensor products and partial traces, 2015b.
Luca Baldassarre, Lorenzo Rosasco, Annalisa Barla, and Alessandro Verri. Multi-output learning via spectral filtering. Machine Learning, 87(3):259-301, 2012.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
Ingemar Bengtsson and Karol Życzkowski. Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press, 2017.
Rajendra Bhatia. Positive Definite Matrices, volume 24. Princeton University Press, 2009.
Romain Brault, Markus Heinonen, and Florence d'Alché-Buc. Random Fourier features for operator-valued kernels. In Asian Conference on Machine Learning (ACML), pages 110-125, 2016.
Céline Brouard, Marie Szafranski, and Florence d'Alché-Buc. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17(176):1-48, 2016.
Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9(Jul):1615-1646, 2008.
Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(1):19-61, 2010.
Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
Man-Duen Choi. Completely positive linear maps on complex matrices. Linear Algebra and its Applications, 10(3):285-290, 1975.
Carlo Ciliberto, Youssef Mroueh, Tomaso Poggio, and Lorenzo Rosasco. Convex learning of multiple tasks and their structure. In International Conference on Machine Learning (ICML), 2015.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(1):795-828, 2012.
Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pages 367-373, 2002.
Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.
Francesco Dinuzzo and Kenji Fukumizu. Learning low-rank output kernels. In Asian Conference on Machine Learning (ACML), 2011.
Francesco Dinuzzo, Cheng S. Ong, Gianluigi Pillonetto, and Peter V. Gehler. Learning output kernels with block coordinate descent. In International Conference on Machine Learning (ICML), pages 49-56, 2011.
Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.
Katarzyna Filipiak, Daniel Klein, and Erika Vojtková. The properties of partial trace and block trace operators of partitioned matrices. Electronic Journal of Linear Algebra, 33, 2018.
Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73-99, 2004.
Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211-2268, 2011.
Magda Gregorová, Alexandros Kalousis, and Stéphane Marchand-Maillet. Forecasting and Granger modelling with non-linear dynamical dependencies. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2017.
Paweł Horodecki. Separability criterion and inseparable mixed states with positive partial transposition. Physical Review Letters, 232:333, 1997.
Ryszard Horodecki, Paweł Horodecki, Michał Horodecki, and Karol Horodecki. Quantum entanglement. Reviews of Modern Physics, 81(2):865, 2009.
Zhiwu Huang and Luc Van Gool. A Riemannian network for SPD matrix learning. In AAAI Conference on Artificial Intelligence, pages 2036-2042, 2017.
Riikka Huusari and Hachem Kadri. Entangled kernels. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.
Riikka Huusari, Hachem Kadri, and Cécile Capponi. Multi-view metric learning in vector-valued kernel spaces. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
Alan J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248-264, 1975.
Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, 2015.
Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, and Francis R. Bach. Multiple operator-valued kernel learning. In Advances in Neural Information Processing Systems, 2012.
Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research, 16:1-54, 2016.
Jaz Kandola, John Shawe-Taylor, and Nello Cristianini. On the extensions of kernel alignment. 2002.
Karl Kraus. General state changes in quantum theory. Annals of Physics, 64(2):311-335, 1971.
Karl Kraus. States, Effects and Operations: Fundamental Notions of Quantum Theory. Springer Verlag, 1983.
Guy Lever, John Shawe-Taylor, Ronnie Stafford, and Csaba Szepesvári. Compressed conditional mean embeddings for model-based reinforcement learning. In AAAI Conference on Artificial Intelligence, 2016.
Néhémy Lim, Florence d'Alché-Buc, Cédric Auliac, and George Michailidis. Operator-valued kernel-based vector autoregressive models for network inference. Machine Learning, 99(3):489-513, 2015.
Andreas Maurer. The Rademacher complexity of linear transformation classes. In International Conference on Computational Learning Theory (COLT), pages 65-78, 2006.
Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory (ALT), pages 3-17, 2016.
Elliot Meyerson and Risto Miikkulainen. Beyond shared hierarchies: Deep multitask learning through soft layer ordering. In International Conference on Learning Representations (ICLR), 2018.
Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17:177-204, 2005.
Ha Quang Minh. Operator-valued Bochner theorem, Fourier feature maps for operator-valued kernels, and vector-valued learning. arXiv preprint arXiv:1608.05639, 2016.
Ha Quang Minh, Loris Bazzani, and Vittorio Murino. A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning. Journal of Machine Learning Research, 17(25):1-72, 2016.
Florian Mintert, Carlos Viviescas, and Andreas Buchleitner. Basic concepts of entangled states. In Entanglement and Decoherence, pages 61-86. Springer, 2009.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with non-positive kernels. In International Conference on Machine Learning (ICML), page 81, 2004.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
Asher Peres. Separability criterion for density matrices. Physical Review Letters, 77(8):1413, 1996.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.
Eleanor G. Rieffel and Wolfgang H. Polak. Quantum Computing: A Gentle Introduction. MIT Press, 2011.
Akash Saha and Balamurugan Palaniappan. Learning with operator-valued kernels in reproducing kernel Kreĭn spaces. In Advances in Neural Information Processing Systems, 2020.
Maxime Sangnier, Olivier Fercoq, and Florence d'Alché-Buc. Joint quantile regression in vector-valued RKHSs. In Advances in Neural Information Processing Systems, pages 3693-3701, 2016.
Vikas Sindhwani, Minh Ha Quang, and Aurélie C. Lozano. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality. In Uncertainty in Artificial Intelligence (UAI), 2013.
Forrest W. Stinespring. Positive functions on C*-algebras. Proceedings of the American Mathematical Society, 6(2):211-216, 1955.
Masashi Sugiyama. Local Fisher discriminant analysis for supervised dimensionality reduction. In International Conference on Machine Learning (ICML), pages 905-912, 2006.
James Townsend, Niklas Koep, and Sebastian Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(137):1-5, 2016.
John Watrous. The Theory of Quantum Information. Cambridge University Press, 2018.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688, 2001.