# predicting_a_switching_sequence_of_graph_labelings__5d94037b.pdf

Journal of Machine Learning Research 16 (2015) 2003-2022 Submitted 7/15; Published 9/15

Predicting a Switching Sequence of Graph Labelings

Mark Herbster m.herbster@cs.ucl.ac.uk Stephen Pasteris s.pasteris@cs.ucl.ac.uk Massimiliano Pontil m.pontil@cs.ucl.ac.uk Department of Computer Science University College London Gower Street, London WC1E 6BT, UK

Editor: Alex Gammerman and Vladimir Vovk

We study the problem of predicting online the labeling of a graph. We consider a novel setting for this problem in which, in addition to observing vertices and labels on the graph, we also observe a sequence of just vertices on a second graph. A latent labeling of the second graph selects one of K labelings to be active on the ﬁrst graph. We propose a polynomial time algorithm for online prediction in this setting and derive a mistake bound for the algorithm. The bound is controlled by the geometric cut of the observed and latent labelings, as well as the resistance diameters of the graphs. When specialized to multitask prediction and online switching problems the bound gives new and sharper results under certain conditions.

Keywords: online learning over graphs, kernel methods, matrix winnow, switching

1. Introduction

We consider the problem of learning online a set of K binary labelings of a graph. In a simple scenario this set of labelings corresponds to a switching sequence of labelings. Initially we focus on this setting before introducing our more general model. Consider the following game for predicting the labeling of a graph: Nature presents a graph; nature queries a vertex i1; the learner predicts ˆy1 { 1, 1} as the label of the vertex; nature presents a label y1; nature queries a vertex i2; the learner predicts ˆy2; and so forth. The learner s goal is to minimize the total number of mistakes M = |{t : ˆyt = yt}|. If nature is adversarial, the learner will always mispredict, but if nature is regular or simple, there is hope that a learner may make only a few mispredictions. Thus, a central goal of online learning is to design algorithms whose total mispredictions can be bounded relative to the complexity of nature s labeling. To predict a single labeling of a graph, one may employ a kernel perceptron algorithm based on the graph Laplacian (Herbster and Pontil, 2006). This method achieves a bound of M O(Rφ), where φ is the cut (the number of edges joining disagreeing labels) and R is the (resistance) diameter of the graph. Thus φ measures the complexity of the labeling and R is a structural parameter of the graph independent of the labeling. Such a bound is particularly appealing when the parameters are mildly dependent or independent of the number of vertices in the graph (see Herbster and Pontil, 2006, for a discussion).

c 2015 Mark Herbster, Stephen Pasteris and Massimiliano Pontil.

Herbster, Pasteris and Pontil

In the switching setting, we now consider a sequence colored by K graph labelings with S K switches. We illustrate a switching sequence in Figure 1. In this color illustration

t=1 t=5 t=7 t=10 t=14 t=16

k=1 k=3 k=2 k=3 k=2 k=1

Figure 1: A switching sequence over 20 trials with K=3 graph labelings and S=5 switches.

there are S = 5 switches between K = 3 graph labelings. At each trial, a vertex of the graph is labeled according to one of the K binary functions. There are at most S trials at which the binary function currently in use is changed. In the speciﬁc example, the labeling 1 is used in trials 1 4 and 16 20, labeling 2 is used in trials 5 6 and 10 13, and labeling 3 is used in trials 7 9 and 14 15. We will give an algorithm that achieves

k=1 φk K log(n) , (1)

where φk is the cut of the k-th binary labeling, n is the number of vertices in the graph, and the O(x) notation absorbs a polylogarithmic factor in x. Note that the term R PK k=1 φk is the cost of learning the K binary labelings, given the information of which labeling is active on each trial. Since this information is not available to the learner, we pay a multiplicative term K log(n) and an additive term for the number of switches S. The particularly salient feature of this bound is that we pay the cost R PK k=1 φk of learning all the binary labelings only once. This, and the fact that S K, implies that the algorithm is maintaining an implicit memory of past graph labelings learned. In the more general setting, the learner is given two graphs: an observed n-vertex graph G and a p-vertex latent graph H. Hidden from the learner is a set {ω1, . . . , ωK} of K binary labelings of G. On each trial one of these labelings is active, the learner receives a pair of vertices, i G and j H, and the learner s aim is to predict the currently active binary label of vertex i. It is the unknown K-ary label of j that determines the active labeling of G and hence the current label of i. After making its prediction the learner receives only the current label of i. The learner never receives the label of j. Note that if the learner did in fact receive the label of j, the learning problem would separate into K independent graph labeling tasks. Thus the graph H is called latent because the vertex labels of this graph are never observed, although it controls which of the K labelings of G is active at each given trial. We propose a polynomial time algorithm for predicting the labelings of the observed graph and we derive a mistake bound for this algorithm. The bound involves two additive terms, which measure the complexity of the K binary labelings, and the complexity of the latent labeling, respectively; as well as a multiplicative term of the order of K log(K(n+p)). Returning to the switching example, the latent graph can be thought of as a line graph,

Predicting a Switching Sequence of Graph Labelings

where the sequence of vertices corresponds to the sequence of trials (although as we shall see in Section 6, for technical reasons we will need instead a binary support tree). The latent K-labeling function will then have a cut equal to S, the number of switches; and the bound (1) will be obtained as a special case of the general result described in this paper.

The paper is organized in the following manner. In Section 2, we comment about related work. In Section 3, we introduce the learning problem. In Section 4, we discuss the proposed learning algorithm. Section 5 presents our main result and details its proof. In Section 6, we illustrate our result in two speciﬁc examples and make ﬁnal remarks.

2. Related Work

The problem of learning a labeling of a graph is a natural one in the online learning setting (Herbster et al., 2005; Herbster and Pontil, 2006), as well as a foundational technique for a variety of semi-supervised learning methods (Blum and Chawla, 2001; Kondor and Laﬀerty, 2002; Zhu et al., 2003). In the online setting, fast algorithms have been developed that operate on trees and path graphs (Herbster et al., 2008, 2009; Cesa-Bianchi et al., 2009, 2010; Vitale et al., 2011).

Our main application is to learning a switching sequence of graph labelings. Switching has been studied extensively in the online learning literature. The results divide largely into two directions: switching in the experts model (Herbster and Warmuth, 1998; Vovk, 1999; Bousquet and Warmuth, 2003; Gyorﬁet al., 2005; Koolen and Rooij, 2008; Hazan and Seshadhri, 2009; Adamskiy et al., 2012; Cesa-Bianchi et al., 2012); and switching in online linear prediction model, see e.g. (Herbster and Warmuth, 2001; Kivinen et al., 2004; Cesa-Bianchi and Gentile, 2006). As we may view learning a graph labeling as learning a linear classiﬁer based on a Laplacian kernel, our algorithm is directly comparable to these previous results. The implicit assumption of those switching techniques is that they learn a sequence of linear classiﬁers w1, w2, . . . and that this sequence is slowly changing over time, i.e, they are interested in predicting well when a drifting cost O(P t wt wt+1 ) is small. Our assumption is diﬀerent: we consider that there exists a small set of K distinct classiﬁers, and we switch repeatedly between classiﬁers within this set. This setting is analogous to the setting proposed in an open problem by Freund (2000). Freund s challenge was to give an eﬃcient algorithm in the expert advice model for the problem of switching repeatedly between a small set of experts within a larger set of experts. The problem was solved by Bousquet and Warmuth (2003) (see also Adamskiy et al., 2012). Those results, however, do not directly transfer to the graph labeling setting as the number of needed experts is 2n for an n-vertex graph, and computing the marginal probabilities with a natural prior (i.e., an Ising distribution) on a graph even without switching is a well-known #P-complete problem (Goldberg and Jerrum, 2007).

An example of predicting in our more general setting applies to online multitask learning and is inspired by Cavallanti et al. (2010, Corollary 3). We adapt their model to our graph labeling set-up. Further related work includes (Dekel et al., 2007), which considered learning multiple tasks related through a joint loss function; and (Avishek et al., 2011), which generalized the usual setting to include negatively correlated tasks as well as positively correlated tasks. Rather than learning a group of interelated linear classiﬁers it is also

Herbster, Pasteris and Pontil

natural to consider multi-task learning with expert advice. Two prominent results include those of Abernethy et al. (2007) and Adamskiy et al. (2012).

Our main technical debt is to the following four papers. Firstly, the mistake bound analysis of matrix winnow (Warmuth, 2007), which strongly informs the proof of our main result. Secondly, our analysis of using matrix winnow on graphs is inspired by the graph Laplacian construction in (Gentile et al., 2013). Thirdly, our ﬁrst two techniques require a modiﬁcation of the Laplacian to ensure strict positive deﬁniteness, and here we used the simple construction from (Herbster and Pontil, 2006). Finally we use the binary support tree construction (Herbster et al., 2008) to model the trial sequence in the switching setting.

In this section, we present the problem under study. We begin by introducing some graph terminology.

We are given two undirected graphs, an n-vertex graph G and a p-vertex graph H. We let V(G) and V(H) be the set of vertices in G and H, respectively, and let LG and LH be the corresponding graph Laplacians. For every positive integer d, we deﬁne Nd = {1, . . . , d}, the set of integers from 1 and up to including d. Unless confusion arises, for simplicity we identify vertices by their indices. Indices i, i , it Nn will always be associated with vertices in G, and indices j, j , jt Np will be associated with vertices in H.

A labeling of a graph is a function which maps vertices on the graph to a set of labels. We deﬁne the cut induced by a labeling of a graph as the number of edges whose end vertices have diﬀerent labels. Note that this deﬁnition is independent of the number of labels used. We will use the notation cut G(u) to denote the cut associated with the labeling u of graph G. In particular if u is a binary labelling with label set { 1, 1} then cut G(u) = 1

In the paper we refer to G as the observed graph since during the learning process we will observe both a vertex of G and a corresponding label, whereas we refer to H as the latent graph because we will only observe a vertex of H but never observe the corresponding label. As we will see, the latent graph provides side information which can guide the prediction tasks on the observed graph. The goal is to predict well the binary labels associated to vertices in G using sequential information of the form (it, jt, yt) V(G) V(H) { 1, 1} for t = 1, 2, . . . , T; the true label yt is determined by using one of the K binary classiﬁers, and which of these is active at each trial is determined by a K-class classiﬁer which acts on the latent graph H. Speciﬁcally, we let ω1, . . . ωK be the binary classiﬁers (labelings) on graph G. Each ωk is a function from V(G) to { 1, 1}. The latent labeling controls which of the K labelings of G is currently active and it is given by a function µ : V(H) NK. In the paper, when confusion does not arise, we simply regard the functions ωk as vectors in { 1, 1}n and µ as a vector in {1, . . . , K}p.

We consider the following online learning game between nature and learner. The learner knows the graphs G and H from the outset but does not initially know the labelings ω1, . . . , ωK, and as we already noted never observes the latent labeling µ. On trial t nature presents the learner with vertices (it, jt) Nn Np, the learner predicts a value ˆyt { 1, 1} and then the true label yt is revealed to the learner. This label is computed by nature as yt = ωµjt,it, that is the it-th component of the binary vector ωµjt. We deﬁne

Predicting a Switching Sequence of Graph Labelings

Algorithm 1

Input: An n-vertex graph G and p-vertex graph H.

Parameters: K, ˆθ, η.

Initialization: W 0 I K(n+p), where I is the (n + p) (n + p) identity matrix.

For t = 1, . . . , T

Get pair of vertices it, jt V(G) V(H).

Deﬁne the matrix Xt := 1

2xtx T t , with xt given by Equation (4).

( 1 if Tr (W t 1Xt) K+1

1 otherwise.

Receive label yt { 1, 1} and if ˆyt = yt update

W t exp (log (W t 1) + η(yt ˆyt)Xt) . (2)

the set of mistakes as M := {t : ˆyt = yt} and the number of mistakes M := |M|. The aim of the learner is for M to be small. Before presenting the learning algorithm we require some more notation. Given a matrix A we deﬁne A+, AT and Tr (A) to be its pseudoinverse, transpose and trace respectively. We let Sd be the set of d d symmetric matrices and let Sd + and Sd ++ be the subset of positive semideﬁnite and strictly positive deﬁnite matrices. Recall that the set of symmetric matrices Sd + has the following partial ordering: for every A, B Sd + we say that A B if and only if B A Sd +. Every real valued function f induces a spectral function f : Sd Sd

which is obtained by applying f to the eigenvalues of A. Speciﬁcally, if {λi, ui}d i=1 is an eigensystem of A, that is, u1, . . . , ud are orthonormal vectors and λi are real numbers such that A = Pd i=1 λiuiu T i , then we deﬁne f(A) = Pd i=1 f(λi)uiu T i . Examples of spectral functions which we will use are exp(t), log(t) and t log t. Note that the last two functions are well deﬁned only on Sd ++ and the last function can be extended to Sd + as a limiting process. Finally, for vectors α Rn and β Rp we deﬁne [α, β] Rn+p to be the concatenation of α and β, which we regard as a column vector. Hence [α, β] T α, β = αT α + βT β.

4. The Algorithm

The learning algorithm we propose ﬁts into the broad category of online matrix learning. At the core of the algorithm is an implicit spectral regularization, and we use a modiﬁcation of matrix winnow (Warmuth, 2007) as our base algorithm. As input the algorithm is given the graphs G and H. The algorithm then depends on two input parameters, K > 1 and ˆθ. The ﬁrst parameter is the number labelings of the observed graph, which then determines the learning rate η. The second parameter ˆθ is a scaled threshold for the linear classiﬁer. The parameter ˆθ is an upper bound on a measure of the complexity of the underlying learning problem, which is denoted by θ (cf. (6)).

Herbster, Pasteris and Pontil

We map a pair of vertices on the observed and latent graphs to a rank one positive semideﬁnite matrix, and use a linear classiﬁer in the embedded space. Speciﬁcally, we map (it, jt) V(G) V(H) to Xt Sn+p + given by the equation

2xtx T t (3)

ρ(G) (G 1 2 )it, 1 p

ρ(H) (H 1 2 )jt

matrices G Sn ++ and H Sp ++ are prescribed and we deﬁned ρ(G) := maxn i=1 Gii and ρ(H) := maxp j=1 Hjj. The algorithm works for any such embeddings but the mistake bound presented in Theorem 1 below is obtained by choosing

G = L+ G + RG11T and H = L+ H + RH11T (5)

where 1 denotes the vector (1, . . . , 1)T and RG = maxn i=1(L+ G )ii and RH = maxp j=1(L+ H)jj are (essentially) the resistance diameters1 of G and H, respectively. At each trial we predict by a linear threshold function in the embedded space, namely we predict positive if Tr (W t 1Xt) > K+1

2K ˆθ and negative otherwise, where W t Sn+p + is a parameter matrix which is updated by the algorithm after each trial and initially set to a positive multiple of the identity matrix. Speciﬁcally, W t is updated via Equation (2) only when a mistake is made. The worst case cost of an update is in the order of (n + p)3

since this requires computing an eigensystem of an (n + p) (n + p) matrix. However if the number of mistakes is much smaller than n + p then the computation per trial can be substantially reduced because the weight matrix can be decomposed as the sum of a multiple of the identity matrix plus a low rank matrix (speciﬁcally the rank at trial t is equal to the current number of mistakes plus one). In this paper we are primarily concerned with the mistake bound and postpone further discussions on large scale implementations of the algorithm to a future occasion.

5. Main Result

In this section, we present our main result and give a detailed proof.

Theorem 1 Let

k=1 cut G(ωk) + 4RHcut H(µ) + 2

j=1 I(µj = k) ,

and let c := (5 log(5/3) 2) 1 1.81. The number of mistakes made by Algorithm 1 with θ ˆθ and learning rate η := 1

2 log K+3 K+1 is upper bounded by

k=1 cut G(ωk) + RHcut H(µ) + K

log(K(n + p)) + ˆθ θ 1

1. Speciﬁcally, maxn i=1(L+ G )ii is a lower bound on the resistance diameter of G, see (Herbster and Pontil, 2006, Eq. (9)).

Predicting a Switching Sequence of Graph Labelings

To prepare for the proof we introduce some notation. The K-class labeling µ induces K boolean labelings on H, denoted by µk {0, 1}p, k NK, and is deﬁned componentwise as µk,j = 1 if µj = k and µk,j = 0 otherwise. We also deﬁne, for every k NK,

Φk := µT k H 1µk, and Φ k := ωT k G 1ωk.

For i Nn, we let ei be the i-th unit basis vector, that is, ei,i = 0 if i = i and ei,i = 1. We let zk := hp

and deﬁne the k-th embedded classiﬁer associated with the k-th labelings as

Zk := zkz T k ˆθ ,

with ˆθ θ where

k=1 zk 2 = ρ(G)

k=1 ωT k G 1ωk + ρ(H)

k=1 µT k H 1µk . (6)

Note that the representation of the k-th embedded classiﬁer depends on the k-th labeling of the observed graph and the k-th one versus all labeling of the latent graph. We have the following proposition.

Proposition 2 For all k NK and trials t it holds that

(i) Tr ZT k Xt = (ωk,it + µk,jt)2

k=1 Tr ZT k Xt = (K + 1 + 2yt)

(iii) Xt has eigenvalues in [0, 1]

(iv) zk 2 = ρ(H)Φk + ρ(G)Φ k

(v) Tr (Zk log (Zk)) < 0.

Proof (i): Note that Tr ZT k Xt = Tr(zkz T k (xtx T t )) 2ˆθ = (x T t zk)2

2ˆθ . The result then follows since x T t zk = e T itωk + e T jtµk = ωk,it + µk,jt. (ii): If k = µjt then µk,jt = 0 and by part (i) we have

Tr ZT k Xt = 1

2ˆθ (ωk,it + µk,jt)2 = 1

2ˆθ (ωk,it)2 = 1

Suppose now that k = µjt. By the deﬁnition of yt we have yt = ωµjt,it = ωk,it so since µk,jt = 1 when k = µjt we have, by part (i) that

Tr ZT k Xt = 1

2ˆθ (ωk,it + µk,jt)2 = 1

2ˆθ (1 + yt)2 = 1

2ˆθ (1 + 2yt + y2 t ) = (1 + yt)

Herbster, Pasteris and Pontil

as y2 t = 1. By summing the above over k we get the result. (iii): Note that

2xtx T t = xt 2

2 xt xt x T t xt .

Thus Xt is a rank one positive semideﬁnite matrix and its only nonzero eigenvalue is xt 2/2. A direct computation then gives that xt 2 2. The result follows. (iv): zk 2 = z T k zk = ρ(G)ωT k G 1ωk + ρ(H)µT k H 1µk = ρ(G)Φ k + ρ(H)Φk. (v): Note that Zk is a positive semideﬁnite rank one matrix. Hence denoting with λ the non-trivial eigenvalue we have Tr (Zk log (Zk)) = λ log λ. The result then follows if we show that λ (0, 1). To see this we write

Zk = zkz T k ˆθ = zk 2

zk zk z T k zk .

By deﬁnition θ = PK k=1 zk 2 so since θ ˆθ we have zk 2

ˆθ 1 as required.

We now deﬁne the quantum relative entropy, which plays a central role in the amortized analysis of the algorithm. We note that this technique was previously employed in online learning by Tsuda et al. (2005).

Deﬁnition 3 The quantum relative entropy of symmetric positive semideﬁnite square matrices A and B is

(A, B) := Tr (A log (A) A log (B) + B A) .

We will utilize the following lemmas.

Lemma 4 For t M we have that

k=1 ( (Zk, W t 1) (Zk, W t)) c

where c := 5 log(5/3) 2.

Proof When t M we have, for all k NK, that

(Zk, W t 1) (Zk, W t)

= Tr (Zk log (W t) Zk log (W t 1)) + Tr (W t 1) Tr (W t)

=η(yt ˆyt) Tr (Zk Xt) + Tr (W t 1) Tr (exp (log (W t 1) + η(yt ˆyt)Xt)) (7)

η(yt ˆyt) Tr (Zk Xt) + Tr (W t 1) Tr (exp (log (W t 1)) exp (η(yt ˆyt)Xt)) (8)

=η(yt ˆyt) Tr (Zk Xt) + Tr (W t 1(I exp (η(yt ˆyt)Xt)))

η(yt ˆyt) Tr (Zk Xt) + (1 eη(yt ˆyt)) Tr (W t 1Xt) (9)

Predicting a Switching Sequence of Graph Labelings

where Equation (7) comes from the algorithm s update of W t 1 (see Equation (2)), Equation (8) comes from Lemma A.8 with A := log (W t 1) and B := η(yt ˆyt)Xt, and Equation (9) comes by ﬁrst applying Lemma A.9 with a := η(yt ˆyt) and A := Xt (using Proposition 2-(iii)), and then applying Lemma A.10 with A := W t 1. We hence have

k=1 ( (Zk, W t 1) (Zk, W t))

k=1 Tr (Zk Xt) + K(1 eη(yt ˆyt)) Tr (W t 1Xt)

=η(yt ˆyt)(K + 1 + 2yt)

2ˆθ + K(1 eη(yt ˆyt)) Tr (W t 1Xt) (10)

where Equation (10) comes from Proposition 2-(ii). Let ρ be the right hand side of Equation (10). Noting that η := 1

2 log K+3 K+1 we have the

following. When yt = 1 and ˆyt = 1 then (1 eη(yt ˆyt)) is negative and Tr (W t 1Xt) < K+1

2K ˆθ and thus

ρ η(K + 3)(ˆθ) 1 + K + 1

2 (1 e2η)(ˆθ) 1

= 1 2 log K + 3

(K + 3)(ˆθ) 1 + K + 1

2 log K + 3

(K + 3) 1 (ˆθ) 1

K ˆθ . (11)

Alternately, when yt = 1 and ˆyt = 1 then (1 eη(yt ˆyt)) is positive and Tr (W t 1Xt) K+1

2K ˆθ and thus

ρ η(K 1)(ˆθ) 1 + K + 1

2 (1 e 2η)(ˆθ) 1

2 log K + 3

(K 1)(ˆθ) 1 + K + 1

2 log K + 3

(K 1) (ˆθ) 1

K ˆθ . (12)

The constant c in Equations (11) and (12) is derived from the following argument. For K 2 the functions

2 log K + 3

(K + 3) 1 (13)

2 log K + 3

Herbster, Pasteris and Pontil

are monotonic increasing (see Lemmas A.12 and A.13) so

2 log K + 3

(K + 3) 1 2 1

2 log 2 + 3

2 log K + 3

(K 1) 2 2 + 1

2 log 2 + 3

= 6 5 log 5

2 log K+3 K+1 (K + 3) 1 c

2 log K+3 K+1 (K 1) c

Lemma 5 It holds that K P

k=1 (Zk, W 0) |M| c

Proof We have

k=1 (Zk, W 0)

k=1 ( (Zk, W 0) (Zk, W T ))

t=1 ( (Zk, W t 1) (Zk, W t))

k=1 ( (Zk, W t 1) (Zk, W t))

k=1 ( (Zk, W t 1) (Zk, W t)) + X

k=1 ( (Zk, W t 1) (Zk, W t))

where Equation (15) comes from Lemma 4 and the fact that on a trial t / M we have W t = W t 1 and hence (Zk, W t 1) = (Zk, W t) for all k NK.

Lemma 6 It holds that K P

k=1 (Zk, W 0) θ

ˆθ log (K(n + p)) + 1 θ

Predicting a Switching Sequence of Graph Labelings

Proof of Lemma 6 Recall that W 0 = I K(n+p), where I is the (n + p) (n + p) identity matrix. We observe that

k=1 (Zk, W 0) =

k=1 (Tr (Zk log (Zk)) Tr (Zk log (W 0)) + Tr (W 0) Tr (Zk))

k=1 Tr (Zk log (W 0)) +

k=1 Tr (W 0)

k=1 Tr (Zk) (16)

k=1 Tr Zk log I K(n + p)

k=1 Tr I K(n + p)

k=1 Tr (Zk)

= log (K(n + p))

k=1 Tr (Zk) + 1 K(n + p)

k=1 Tr (Zk)

= log (K(n + p))

k=1 Tr (Zk) + 1

k=1 Tr (Zk)

= 1 + (log (K(n + p) 1))

k=1 Tr (Zk)

= 1 + (log (K(n + p) 1))

k=1 Tr zkz T k ˆθ

= 1 + (log (K(n + p) 1)) 1

k=1 z T k zk

= 1 + log (K(n + p) 1) θ

ˆθ log (K(n + p)) + 1 θ

where Equation (16) comes from Proposition 2-(v) and Equation (17) comes from the deﬁnition of θ.

We are now ready to prove our main result.

Proof of Theorem 1 Combining Lemmas 6 and 5 we have

k=1 (Zk, W 0) θ

ˆθ log (K(n + p)) + 1 θ

Herbster, Pasteris and Pontil

which gives

ˆθ log (K(n + p)) + K ˆθ

c log (K(n + p)) + Kθ

log (K(n + p)) +

Finally we compute θ by choosing the matrices H and G as per Equation (5). A direct computation gives, for any vector ω Rn, that

ωTG 1ω = ωT(L+ G + RG11T) 1ω = ωTLGω + 1

and, likewise, for any vector µ Rp

µTH 1µ = µT(L+ H + RH11T) 1µ = µTLHµ + 1 RH

For the observed labelings we have ωT k LGωk = 4cut(ωk). Using this and ρ(G) = 2RG, a direct computation gives

k=1 ωT k G 1ωk = 2RG

k=1 cut(ωk) + 1

k=1 cut(ωk) + 2K.

For the latent labeling we have PK k=1 µT k LHµk = 2cut(µ). Using this and ρ(H) = 2RH, we obtain

k=1 µT k H 1µk = 2RH

2cut(µ) + 1 RH

4RHcut(µ) + 2K.

We conclude that

k=1 zk 2 = ρ(G)

k=1 ωT k G 1ωk + ρ(H)

k=1 µT k H 1µk

k=1 cut(ωk) + RHcut(µ) + K

The result now follows by substituting the last inequality in the mistake bound.

Predicting a Switching Sequence of Graph Labelings

6. Discussion

In this section, we consider two special cases of the problem studied in this paper and make ﬁnal remarks. We tailor Theorem 1 to these cases and then compare to similar mistake bounds available in the literature.

6.1 Uniform Multitask Prediction

In the uniform multitask problem we suppose that we have p tasks corresponding to predicting the binary labeling of a graph. We assume that the tasks are interrelated so that only K p graph labelings are needed. To solve this problem we assume each task is given a number in {1, . . . , p}. Each task number denotes a unique vertex in the latent graph which is a p-vertex clique. Applying the bound of Theorem 1 gives

k=1 cut G(ωk)RG + p K log(K(n + p))

This follows immediately from the fact that the clique has resistance diameter O( 1

p) and the cut of a Kcoloring is O(p2). In (Cavallanti et al., 2010), a broad range of results are given for online multi-task learning in generic reproducing kernel Hilbert spaces. We apply their Corollary 3 to our problem with the kernel G 1 := L+ G +RG11T. In their setting there is no parameter K and instead they learn p distinct graph labelings, and thus obtain

k=1 cut G(ωk) +

i<j (ωi ωj)TG(ωi ωj) RG

This is small when each of the p binary labelings are near one another in the norm induced by the Laplacian. This is distinct from our bound where we pay a ﬁxed price for each of the p tasks of K log(K(n + p)). Thus our bound is stronger when K p and the averaged squared norm between labelings of the p tasks is larger than K log(K(n + p)).

6.2 Switching

We now consider the case where we have a switching sequence of graph labelings with S switches between K labelings. We sketch a proof of the bound announced in the introduction (cf. Equation (1)), namely

k=1 cut G(ωk) K log(n)

where the O(x) notation absorbs a polylogarithmic factor in x. Notice that the apparently natural structure for the proof would be to choose a latent graph which is a line with T vertices, where the linear ordering of the vertices reﬂects the linear trial sequence. Unfortunately, the resistance diameter of this line graph would then be equal to T which would make the bound vacuous. We overcome this diﬃculty by borrowing a trick from (Herbster

Herbster, Pasteris and Pontil

et al., 2008) and we instead use a binary tree with T leaves and thus a resistance diameter of 2 log2 T. We assume that for each trial we receive a label of a leaf along the natural linear ordering of the leaves. If φ is the cut along the leaves such a labeling may be extended to a labeling of the complete binary tree in a way that the cut increases by no more than a factor of log2 T. This extension works by choosing the label of the parent of each vertex to be consistent with the label of either of its children. The result follows since the labeling on each successive level of the tree down to the root is now a subsequence of the previous labeling, and the cut of a subsequence can only decrease. Hence with log2 T levels the cut increases by no more than a logarithmic factor. A second insight is that we do not actually need a tree with T leaves, we in fact only need M leaves corresponding to when the algorithm incurs a mistake, hence,

S(log(M))2 + RG

k=1 cut G(ωk) K log(K(n + p))

which we upper bound by

log(M)3 S + RG

k=1 cut G(ωk) K log(Kn)

Then the following technical lemma gives the result (proof in the Appendix).

Lemma 7 Given a function M : R R, a constant e > 0 such that M(x) ex log(M(x))3, then there exist constants a, b > 0 such that M(x) ax log(x)3 for all x > b.

We may also apply the technique of Herbster and Warmuth (1998) to the switching problem. Here, the underlying learning algorithm would be the perceptron with the kernel G 1 := L+ G + RG11T. As in (Cavallanti et al., 2010) the implicit assumption is that the underlying switching process is smooth and thus there is no parameter K, just a sequence of S + 1 graph labelings. The other ingredient needed for a tracking kernel perceptron is an upper bound ˆφ := maxk {1,...,S+1} ωT k Gωk. The upper bound is then used to deﬁne an additional update to the perceptron, which maintains the hypothesis vector in a ball of squared norm equal to ˆφ. This perceptron update and projection step then lead to the bound (Herbster and Warmuth, 1998, Theorem 10),

ˆφ(ωk ωk+1)TG(ωk ωk+1) + cut G(ωS+1) RG

Thus we observe with the projection kernel perceptron we pay a cost of

ˆφ(ωk ωk+1)TG(ωk ωk+1)RG

for each switch k {1, . . . , S}. Whereas when K S the dominant non poly-logarithmic term we pay per switch is O(K log n).

Predicting a Switching Sequence of Graph Labelings

6.3 Final Remarks

In this paper we presented a novel setting for online prediction over a graph. Our model is governed by K binary labelings and a latent K-labeling (deﬁned on a second graph) which determines which one of the binary labelings is active at each trial. We proposed an eﬃcient algorithm for online prediction in this setting and derived a bound on the number of mistakes made by the algorithm. An interesting feature of this bound is that it mimics the bound one would obtain having a-priori information about which binary labeling is active at each trial. A shortcoming of the bound is that it requires knowledge of the number of binary labelings K and the threshold θ. In practice these parameters are not known in advance and techniques based on the doubling trick could be employed to tune the parameters. Finally, we note that the problem considered in this paper could also be applied to the batch learning setting and our bound may be converted to a batch bound using techniques from (Cesa-Bianchi et al., 2004). In the batch setting a natural algorithm is given by empirical error minimization (Vapnik, 1998) over a hypothesis space of binary classiﬁers deﬁned on the graph. This space is obtained by a certain function composition involving the binary labelings and the latent labeling. We conjecture that the problem of performing empirical error minimization over such a hypothesis space is NP-hard. Therefore in future work our algorithm could be employed to obtain an eﬃcient sub-optimal solution to empirical error minimization in this challenging setting.

Acknowledgments

The third author acknowledges support from the EPSRC and GENES.

Appendix A. Appendix

In this appendix, we state some auxiliary results which are used in the main body of the paper. The ﬁrst result is the famous Golden-Thompson Inequality, whose proof can be found, for example, in (Bhatia, 1997).

Lemma A.8 For any symmetric matrices A and B we have that

Tr (exp (A + B)) Tr (exp (A) exp (B)) .

The next two results are taken from (Tsuda et al., 2005).

Lemma A.9 If A Sd + with eigenvalues in [0, 1] and a R, then

(1 ea)A I exp (a A) .

Lemma A.10 If A Sd + and B, C are symmetric matrices such that B C, then

Tr (AB) Tr (AC) .

Herbster, Pasteris and Pontil

Next we show that the functions (13) and (14) are monotonic increasing. We will use the following lemma.

Lemma A.11 For every x > 0 it holds that 2x 2+x < log(1 + x) < x x+1.

Proof To prove the right inequality, we let

f(x) = x x + 1 log(x + 1).

Since f(x) = 0 as x 0, the result follows if we show that f (x) > 0 for x > 0. We have that

f (x) = x 2 x + 1 + 2

2(x + 1)3/2 .

With a change of variable x z2 1, we have

x 2 x + 1 + 2

2(x + 1)3/2 = (1 z)2

which is positive for z (1, ] and hence x (0, ). The proof of the left inequality follows a similar pattern.

Lemma A.12 The following function

2(k + 3) log k + 3

is increasing for k 2.

Proof Diﬀerentiating, we have

2k2 + 5k + 3 log k+3 k+1 4k 2

We will check to see if the numerator of the above expression is positive. Using the left inequality in Lemma A.11 we have that

(2k2 + 5k + 3) log k + 3

4k 2 2(2k2 + 5k + 3)

2 + k 4k 2 = 2 2 + k > 0.

Lemma A.13 The following function

g(k) = k k + 1

2(k 1) log k + 3

is increasing for k 2.

Predicting a Switching Sequence of Graph Labelings

Proof Diﬀerentiating, we have

g (k) = 2 2k3 + 9k2 + 6k + 3 (k + 3)2 2k2 + k 1 log k+3 k+1

2(k + 1)(k + 3)2 . (18)

We will show that the numerator of the above expression is positive. The right inequality in Lemma A.11 gives that

k+3 k+1 k + 3 .

Using this, we lower bound the numerator in the r.h.s. of equation (18) by

k + 3 k + 1 2k2 + k 1 + k(k(2k + 9) + 6) + 3

With a change of variable k 3 y2

y2 1, we have

8 y6 7y3 + 12y2 + 3y4 (y 2) 3

Note k [2, ) implies y (1, q

5 3]. Since we are checking for positivity we strike the term

8 (y2 1)3 which gives

y6 + 3(y 2)y4 7y3 + 12y2 3 .

Factoring the above gives ( 1 + y)3(3 + y(3 + y)2) ,

which is positive for y (1, q

Proof of Lemma 7. Without loss of generality let e = 1 (else consider the function M , deﬁned by M (x) := M(x/e), instead of M (noting that log(ex)3 O(log(x)3))).

Note ﬁrst that we have some d such that for all y > d we have that the function y y log(y)3 is increasing.

Since exp(x) ω(x6) we have exp x 1 3 ω(x2) so 1 x exp x 1 3 ω(x). There hence

exists a c such that for all x > c we have 1

x exp x 1 3 > x.

Let b := max{c, log(d)3}. Now suppose we have some x > b. We then prove the inequality log(M(x))3 x. To show this consider the converse, that log(M(x))3 > x. Then M(x) > exp x 1 3 . Since the function y y/ log(y)3 is increasing for y > d and

exp x 1 3 > d, we then have that M(x)/ log(M(x))3 > 1

x exp x 1 3 , which is greater than

x since x > c. But this contradicts the fact that M(x) x log(M(x))3. So we have shown

Herbster, Pasteris and Pontil

that log(M(x))3 x.

If we have M(x) > 8x log(x)3 then we have 8x log(x)3 < M(x) x log(M(x))3 so we must have 2 log(x) < log(M(x)) so we have x2 < M(x). But, by above, log(M(x))3 x, and hence M(x) x log(M(x))3 x2 which is a contradiction. Hence we have that M(x) 8x log(x)3 as required.

J. Abernethy, P. Bartlett, and A. Rakhlin. Multitask learning with expert advice. In Proceedings 20th Annual Conference on Learning Theory, pages 484 498, 2007.

D. Adamskiy, M.K. Warmuth, and W.M. Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems 25, pages 135 143, 2012.

S. Avishek, R. Piyush, H. Daum e III, and S. Venkatasubramanian. Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artiﬁcial Intelligence and Statistics, pages 643 651, 2011.

R. Bhatia. Matrix Analysis. Springer, 1997.

A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning, pages 19 26, 2001.

O. Bousquet and M.K. Warmuth. Tracking a small set of experts by mixing past posteriors. The Journal of Machine Learning Research, 3:363 396, 2003.

G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask classiﬁcation. Journal of Machine Learning Research, 1:2901 2934, 2010.

N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. In Proceedings of the 18th Conference on Learning Theory, pages 483 498, 2006.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050 2057, 2004.

N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction on a labeled tree. In Proceedings of the 22nd Annual Conference on Learning, 2009.

N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. In Proceedings of the 27th International Conference on Machine Learning, pages 175 182, 2010.

N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets ﬁxed share (and feels no regret). In Advances in Neural Information Processing Systems 24, pages 989 997, 2012.

Predicting a Switching Sequence of Graph Labelings

O. Dekel, P.M. Long, and Y. Singer. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8(10):2233 2264, 2007.

Y. Freund. Private communication, 2000. Also posted on http://www.learning-theory.org.

C. Gentile, M. Herbster, and S. Pasteris. Online similarity prediction of networked data from known and unknown graphs. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.

L.A. Goldberg and M. Jerrum. The complexity of ferromagnetic Ising with local ﬁelds. Combinatorics, Probability & Computing, 16(1):43 61, 2007.

A. Gyorﬁ, T. Linder, and G. Lugosi. Tracking the best of many experts. In Proceedings 18th Annual Conference on Learning Theory, pages 204 216, 2005.

E. Hazan and C. Seshadhri. Eﬃcient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning, pages 393 400, 2009.

M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Advances in Neural Information Processing Systems 19, pages 577 584, 2006.

M. Herbster and M.K. Warmuth. Tracking the best expert. Machine Learning, 32(2): 151 178, 1998.

M. Herbster and M.K. Warmuth. Tracking the best linear predictor. The Journal of Machine Learning Research, 1:281 309, 2001.

M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning, pages 305 312, 2005.

M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In Advances in Neural Information Processing Systems 21, pages 649 656, 2008.

M. Herbster, M. Pontil, and S. Rojas-Galeano. Fast prediction on a tree. In Advances in Neural Information Processing Systems, pages 657 664, 2009.

J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52:2165 2176, 2004.

R.I. Kondor and J.D. Laﬀerty. Diﬀusion kernels on graphs and other discrete input spaces. In Proceedings of the 19th International Conference on Machine Learning, pages 315 322, 2002.

W. M. Koolen and S. Rooij. Combining expert advice eﬃciently. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 275 286, 2008.

K. Tsuda, G. R atsch, and M.K. Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6:995 1018, 2005.

Herbster, Pasteris and Pontil

V. Vapnik. Statistical Learning Theory. Wiley-Blackwell, 1998.

F. Vitale, N. Cesa-Bianchi, C. Gentile, and G. Zappella. See the tree through the lines: The shazoo algorithm. In Advances in Neural Information Processing Systems 23, pages 1584 1592, 2011.

V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247 282, 1999.

M.K. Warmuth. Winnowing subspaces. In Proceedings of the 24th International Conference on Machine Learning, pages 999 1006, 2007.

X. Zhu, Z. Ghahramani, and J. Laﬀerty. Semi-supervised learning using gaussian ﬁelds and harmonic functions. In Proceedings of the 23rd International Conference on Machine Learning, pages 912 919, 2003.