# nearest_neighbour_with_bandit_feedback__dd0ae03e.pdf

Nearest Neighbour with Bandit Feedback

Stephen Pasteris The Alan Turing Institute London UK spasteris@turing.ac.uk

Chris Hicks The Alan Turing Institute London UK c.hicks@turing.ac.uk

Vasilios Mavroudis The Alan Turing Institute London UK vmavroudis@turing.ac.uk

In this paper we adapt the nearest neighbour rule to the contextual bandit problem. Our algorithm handles the fully adversarial setting in which no assumptions at all are made about the data-generation process. When combined with a sufficiently fast data-structure for (perhaps approximate) adaptive nearest neighbour search, such as a navigating net, our algorithm is extremely efficient - having a per trial running time polylogarithmic in both the number of trials and actions, and taking only quasi-linear space. We give generic regret bounds for our algorithm and further analyse them when applied to the stochastic bandit problem in euclidean space. We note that our algorithm can also be applied to the online classification problem.

1 Introduction

In this paper we adapt the classic nearest neighbour rule to the contextual bandit problem and develop an extremely efficient algorithm. The problem proceeds in trials, where on trial t: (1) a context xt is revealed to us, (2) we must select an action at, and (3) the loss ℓt,at [0, 1] of action at on trial t is revealed to us. We assume that the contexts are points in a metric space and the distance between two contexts represents their similarity. A policy is a mapping from contexts to actions and the inductive bias of our algorithm is towards learning policies that typically map similar contexts to similar actions. Our main result has absolutely no assumptions whatsoever about the generation of the context/loss sequence and has no restriction on what policies we can compare our algorithm to.

Our algorithm requires, as a subroutine, a data-structure that performs c-nearest neighbour search. This data-structure must be adaptive - in that new contexts can be inserted into it over time. An example of such a data-structure is the Navigating net [17] which, given mild conditions on our metric and dataset, performs both search and insertion in polylogarithmic time. When utilising a data-structure of this speed our algorithm is extremely efficient - with a per-trial time complexity polylogarithmic in both the number of trials and actions, and requiring only quasi-linear space.

As an example we will apply our methodology to the special case of the contextual bandit problem in which the context sequence is drawn i.i.d. from a probability distribution over the d-dimensional hypercube. In this case, for any policy ˆy with a finite-volume decision boundary, our algorithm achieves O ˆαT d/(d+1)K1/(d+1) regret w.r.t. ˆy, where ˆα measures the magnitude of what is essentially the part of the decision boundary of ˆy that lies in the support of the probability distribution.

In the course of this paper we develop some novel algorithmic techniques, including a new algorithmic framework CANPROP and efficient algorithms for searching in trees, which may find further application.

We now describe related works. The bandit problem [20] was first introduced in [28] but was originally studied in the stochastic setting in which all losses are drawn i.i.d. at random [18], [1], [3]. However, our world is very often not i.i.d. stochastic. The work of [4] introduced the seminal EXP3 algorithm which handled the case in which the losses were selected arbitrarily. This work also

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

introduced the EXP4 algorithm for contextual bandits. In general this algorithm is exponential time but in some situations can be implemented in polynomial time - such as their EXP3.S algorithm, which was a bandit version of the classic FIXEDSHARE algorithm [14]. In [12] the EXP3.S setting was greatly generalised to the situation in which the contexts where vertices of a graph. They utilised the methodology of [8], [16] and [13] in order to develop extremely efficient algorithms. Although inspiring this work, these algorithms cannot be utilised in our situation as they inherently require the set of queried contexts to be known a-priori. In the stochastic case another class of contextual bandit problems are linear bandits [21], [5] in which the contexts are mappings from the actions into Rd. Here the queried contexts need not be known in advance but the losses must be drawn i.i.d. from a distribution that has mean linear in the respective context. The k nearest neighbour algorithm was first analysed in [6]. The work [26] utilised the k nearest neighbour methodology and the works [9] and [18] to handle a stochastic contextual bandit problem. However, their setting is extremely more restricted than ours. In particular, the context/loss pairs must be drawn i.i.d. at random and the probability distribution they are sampled from must obey certain strict conditions. In addition, on each trial the contexts seen so far must be ordered in increasing distance from the current context and operations must be performed on this sequence, making their algorithm exponentially slower than ours. Our algorithm utilises the works of [22] and [7] as subroutines. It should be noted that the later work, which was based on [23], was improved on in [11] - we leave it as an open problem as to whether we can utilise their work in our algorithm.

Algorithms for contextual bandits in metric spaces have been studied in [15, 19, 24, 25, 27, 29, 30, 31, 2, 26] but as far as we know ours is the first work to give a non-trivial bound for our problem in general (with no additional assumptions). As far as we are aware our above example bound for the fully stochastic case in euclidean space (with finite decision boundary) is also novel - the works [30, 24, 26] scale as O(T (d+1)/(d+2)) in general (and [24, 26] also require additional assumptions). As far as we are aware we are also the first work, stochastic or otherwise, to give a regret scaling as O(T 1/2) when the contexts are drawn from well-separated clusters (i.e. there is a positive distance between all pairs of clusters) in a finite-dimensional metric space, and the comparator policy is constant on each cluster (we do, however, conjecture that in the stochastic case [26] can obtain such a regret scaling here - but, as stated above, it is exponentially slower).

Let N be the set of natural numbers not including 0. Given a natural number m N we define [m] := {j N | j m}. Given a predicate p we define Jp K := 1 if p is true and Jp K := 0 otherwise. We define log( ) and ln( ) to be the logarithms with base 2 and e respectively. Given sets A and B we denote by BA the set of all functions f : A B and by 2A the set of all subsets of A. We write x [0, 1] to mean that the value x is drawn from the uniform distribution on [0, 1].

All trees in this paper are considered rooted. Given a tree J we denote its root by r(J ), its vertex set by J , its leaves by J , and its internal vertices by J . Given a vertex v in a tree J we denote its parent by J (v) and the subtree of all its descendants by J (v). Given an internal node v in a (full) binary tree J we denote its left and right children by J (v) and J (v) respectively. Internal nodes v in a (full) ternary tree J have an additional child J (v) called the centre child. Given vertices v and v in a tree J we denote by ΓJ (v, v ) the least common ancestor of v and v : i.e. the vertex of maximum depth which is an ancestor of both v and v . We will drop the subscript J in all these functions when unambiguous. Given a tree J , a subtree of J is a tree whose edge set is a subset of that of J .

3.1 The General Result

We consider the following game between Learner (us) and Nature (our adversary). We call this game the similarity bandit problem. We have K actions. Learning proceeds in T trials. A-priori Nature chooses a sequence n(t) | t [T] \ {1} where for all t [T] \ {1} we have n(t) [t 1]. A-priori Nature also chooses a sequence of loss vectors ℓt | t [T] [0, 1]K, but does not reveal them to Learner. On the t-th trial the following happens:

1. If t > 1 then Nature reveals n(t) to Learner. 2. Learner chooses some action at [K]. 3. Nature reveals ℓt,at to Learner.

Intuitively, given a trial t [T] \ {1}, n(t) is a similar trial to t. Our inductive bias is that if an action a [K] is good for trial t then it is likely that a will also be good for the similar trial n(t). We will measure our performance with respect to any policy y, where a policy is defined as a vector in [K]T . Specifically, we wish to minimise the y-regret, which is defined as the difference between the total cumulative loss suffered by Learner and that which Learner would have suffered if it had instead chosen at equal to yt for all trials t. Formally, this quantity is defined as follows: Definition 3.1. Given a policy y [K]T we define the y-regret of Learner as:

t [T ] ℓt,at X

t [T ] ℓt,yt .

The following quantity quantifies how much a policy agrees with our inductive bias. Definition 3.2. Given a policy y [K]T we define the complexity of y by:

Φ(y) := 1 + X

t [T ]\{1} Jyt = yn(t)K .

We now state our main result: Theorem 3.3. Consider the similarity bandit problem described above. Our algorithm CBNN takes a single parameter ρ > 0 and, for all policies y [K]T simultaneously, obtains an expected y-regret bounded by:

E[R(y)] O ρ + Φ(y)

where the expectation is taken over the randomisation of the algorithm. CBNN needs no initialisation time and has a per-trial time complexity of:

O(ln(T)2 ln(K)) .

3.2 Bandits in a Metric Space

We consider the following game between Learner (us) and Nature (our adversary). We call this game the metric bandit problem. We have K actions and a metric space (C, ) where C is a (possibly infinite) set of contexts and for all x, x C we have that (x, x ) is the distance from x to x . We assume that Learner does not necessarily know (C, ) a-priori but has access to an oracle for computing (x, x ) for any x, x C. Learning proceeds in T trials. A-priori Nature chooses a sequence of contexts xt | t [T] C and a sequence of loss vectors ℓt | t [T] [0, 1]K, but does not reveal them to Learner. On the t-th trial the following happens:

1. Nature reveals xt to Learner. 2. Learner chooses some action at [K]. 3. Nature reveals ℓt,at to Learner.

Here, our inductive bias is that if an action a [K] is good for a context x C then it is likely also good for contexts that are near to x with respect to the metric . Our algorithm for this problem will be based on the concept of a c-nearest neighbour which is defined as follows. Definition 3.4. Given some c 1, a finite set S C, and a context x C we have that some ˆx S is a c-nearest neighbour of x in the set S if and only if:

(x, ˆx) c min x S (x, x ) .

In order to utilise CBNN for this problem we need a data-structure for adaptive nearest neighbour search. This problem is as follows. We maintain a finite set S C. At any point in time we must either:

Insert a new context x C into the set S and update the data-structure.

Given a context x C , utilise the data-structure to find a c-nearest neighbour of x in the set S.

An efficient example of such a data-structure is the navigating net [17].

We can now reduce the metric bandit problem to the similarity bandit problem as follows. On any trial t [T] \ {1} choose ˆxt to be a c-nearest neighbour of xt in the set {xt | t [t 1]}. Then choose n(t) [t 1] such that xn(t) = ˆxt.

We will utilise the following definition in order to bound the complexity of policies when n( ) is chosen in this way.

Definition 3.5. Given a function ˆy : C [K] and a set X C then for all x C define:

γ(x, ˆy, X) := min{ (x, x ) | x X ˆy(x ) = ˆy(x)} .

We can now bound the complexity of policies as follows, noting that by Theorem 3.3 this leads directly to a regret bound.

Theorem 3.6. Assume we have a sequence xt | t [T] C and a function ˆy : C [K]. Define X := {xt | t [T]} and for all t [T] define yt := ˆy(xt). Assume that for all t [T] \ {1} we have that xn(t) is a c-nearest neighbour of xt in the set {xt | t [t 1]}. Then Φ(y) is no greater than the minimum cardinality of any set S C in which for all x X there exists x S with (x, x ) < γ(x, ˆy, X)/3c.

A strength of our algorithm is that it can be combined with binning algorithms, where the contexts xt are partitioned into sets called bins and, for each bin, all contexts in that bin are replaced by a single context called the centre of the bin. The advantage of binning is that γ(x, ˆy, X) can increase, so that by Theorem 3.6 we may have that Φ(y) decreases. However, when a context xt is binned (i.e. replaced by its bin centre) its label ˆy(xt) can change, increasing the final regret by O(1). In Section 3.3 we give an example of the utilisation of binning.

3.3 Stochastic Bandits in Euclidean Space

As an example we now consider the utilisation of the above algorithms for the problem of stochastic bandits in [0, 1]d for some arbitrary d N which we view as a constant in our bounds. Here we will only focus on what happens in the limit T . We focus on stochastic bandits for simplicity, but the same methodology can be used to study the limiting behaviour of adversarial bandits. In this problem we have an unknown probability density µ : [0, 1]d [0, 1]K R. We have T trials. On trial t the following happens:

1. Nature draws (zt, ℓt) from µ.

2. Nature reveals zt to Learner.

3. Learner chooses some action at [K].

4. Nature reveals ℓt,at to Learner.

To aid us in this problem we will first quantise the contexts zt to a grid. Note that this is an example of binning. The grid is defined as follows:

Definition 3.7. Given q N define Gd q to be the set of vectors in [0, 1]d in which each component is an integer multiple of 1/q.

On each trial t we will first quantise the vector zt by defining xt to be its nearest neighbour (w.r.t. the euclidean metric) in Gd q , where q := (T/K)1/(d+1). Note that this can be done in constant time per trial. As in Section 3.2 we then use CBNN to solve the problem by defining n(t) such that xn(t) is a c-nearest neighbour (w.r.t. the euclidean metric) of xt in the set {xt | t [t 1]}. We consider c as a constant in our bounds.

As before, we will compare our cumulative loss to that of a policy that follows a function ˆy : [0, 1]d [K]. Our regret bound will be based on the following quantities:

Definition 3.8. Let µ be the marginal of µ with respect to its first argument and let be the euclidean metric on [0, 1]d. For all ϵ > 0 define:

E(µ, ϵ) := {x [0, 1]d | x [0, 1]d : µ(x ) = 0 (x, x ) ϵ}

which is the set of contexts that are within distance ϵ of the support of µ. Given ˆy : [0, 1]d [K] we make the following definitions. For any δ > 0 define:

M(ˆy, µ, ϵ, δ) := {x [0, 1]d | x E(µ, ϵ) : (x, x ) δ ˆy(x) = ˆy(x )}

which is the set of contexts that are at distance no more than δ from the intersection of the decision boundary of ˆy and E(µ, ϵ). We then define:

α(ˆy, µ) := lim ϵ 0 lim δ 0 1 δ

x M(ˆy,µ,ϵ,δ) 1 ; α(ˆy, µ) := lim ϵ 0 lim δ 0 1 δ

x M(ˆy,µ,ϵ,δ) µ(x)

which are essentially the volumes of the part of the decision boundary of ˆy that lies in the support of µ, with respect to the uniform density and the density µ respectively.

With these definitions in hand we now present our regret bound, which utilises Theorem 3.6 in its proof:

Theorem 3.9. Let q := (T/K)1/(d+1) . For all t [T] let xt be the nearest neighbour of zt in the set Gd q . For all t [T] \ {1} let n(t) be such that xn(t) is a c-nearest neighbour of xt in the set {xt | t [t 1]}. Given some ˆy : [0, 1]d [K] let yt := ˆy(zt) for all t [T]. Then when ρ := q d 1

2 CBNN gives us:

E[R(y)] O (1 + α(ˆy, µ) + α(ˆy, µ))T d d+1 K 1 d+1

We note that varying q and ρ in Theorem 3.9 will allow us to trade off the values 1, α(ˆy, µ) and α(ˆy, µ) in different ways.

4 The Algorithm

In this section we describe our algorithm CBNN, for solving the similarity bandit problem, and give the pseudocode for the novel subroutines. In the appendix we give a more detailed description of how CBNN works and prove our theorems.

Instead of working directly with trial numbers we create a sequence of distinct nodes xt | t [T] and define, for all t [T] \ {1}, the node n(xt) := xn(t). Let X := {xt | t [T]}. We can now represent policies as functions from X into [K]. Hence, given some y : X [K] , we define the y-regret and the complexity of y as:

t [T ] ℓt,at X

t [T ] ℓt,y(xt) ; Φ(y) := 1 + X

x X\{x1} Jy(x) = y(n(x))K

respectively.

4.1 A Simple but Inefficient Algorithm

To give the reader intuition we first describe our initial idea - a simple algorithm which attains our desired regret bound but is exponentially slower - taking a per-trial time of Θ(KT). The algorithm is based on EXP4 [4] which we now describe. On every trial t we maintain a weighting ˆwt : [K]X [0, 1]. We are free to choose the initial weighting ˆw1 to be any probability distribution. On each trial t the following happens:

1. For all a [K] set pt,a P

y [K]X Jy(xt) = a K ˆwt(y) .

2. Set at a with probability proportional to pt,a . 3. Receive ℓt,at .

4. For all a [K] set ˆℓt,a Ja = at Kℓt,at pt 1/pt,at .

5. For all y [K]X set ˆwt+1(y) ˆwt(y) exp( ηˆℓt,y(xt)) .

For us we choose, for all y : X [K] , an initial weight of:

ˆw1(y) := (1/K)(T(K 1)) Φ(y) (1 1/T)(T 1 Φ(y)) .

Of course, we don t know Φ(y) a-priori, and hence we cannot implement EXP4 explicitly (and it would take exponential time even if we did know Φ(y) a-priori). Our crucial insight is the following. For any t [T] let Xt := {xt | t [t]} and for any y : X [K] let yt be the restriction of y onto Xt. Then for any t [T] and for any y : Xt [K] we have:

y [K]X : yt=y ˆw1(y) Y

Jy (n(x)) = y (x)K

T(K 1) + Jy (n(x)) = y (x)K 1 1

Note that, for all t [T] and a [K] , we can write pt,a as follows:

y [K]X Jy(xt) = a K ˆw1(y) Y

t [t 1] exp( ηˆℓt,y(xt))

y [K]Xt Jy (xt) = a K

t [t 1] exp( ηˆℓt,y (xt))

y [K]X : yt=y ˆw1(y) .

By substituting in Equation (1) we have now brought pt,a into a form that can be solved via Belief propagation [23] over the tree with vertex set Xt and in which, for all t [t] \ {1} , the parent of xt is n(xt ).

It is well known that for any y : X [K] , EXP4 attains a y-regret of at most ln( ˆw1(y))/η +ηKT/2. By setting η := ρ/

KT and noting our choice of ˆw1 we obtain our desired regret bound.

4.2 Cancellation Propagation

In the remainder of this section we describe our algorithm CBNN, which is based on the same idea as the simple algorithm of Section 4.1.

In this subsection we describe a novel algorithmic framework CANPROP for designing contextual bandit algorithms with a running time logarithmic in K. It is inspired by EXP3 [4], specialist algorithms [8] and online decision-tree pruning algorithms [10] but is certainly not a simple combination of these works. CBNN will be an efficient implementation of an instance of CANPROP. Although in general CANPROP requires a-priori knowledge, CBNN is designed in a way that, crucially, does not need it to be known.

We assume, without loss of generality, that K and T are integer powers of two. CANPROP, which takes a parameter η > 0, works on a full, balanced binary tree B with leaves B = [K]. On every trial t each pair (v, S) B 2X has a weight wt(v, S) [0, 1]. These weights induce a function θt : B [0, 1] defined by: θt(v) := X

S 2X Jxt SKwt(v, S) .

On each trial t a root-to-leaf path {vt,j | j [log(K)] {0}} is sampled such that vt,0 := r(B) and , given vt,j , we have that vt,(j+1) is sampled from { (vt,j), (vt,j)} with probability proportional to the value of θt when applied to each of these vertices. The action at is then chosen equal to vt,log(K). Once the loss has been observed we climb back up the root-to-leaf path, updating the function wt to wt+1.

CANPROP (at trial t) is given in Algorithm 1. We note that if wt+1(v, S) is not set in the pseudocode then it is defined to be equal to wt(v, S).

In Appendix B we give a general regret bound for CANPROP. For CBNN we set:

ln(K) ln(T)/KT

Algorithm 1 CANPROP at trial t

1: vt,0 r(B) 2: for j = 0, 1, , (log(K) 1) do 3: for v { (vt,j), (vt,j)} do 4: θt(v) P S 2X Jxt SKwt(v, S) 5: end for 6: zt,j θt( (vt,j)) + θt( (vt,j)) 7: for v { (vt,j), (vt,j)} do 8: πt(v) θt(v)/zt,j 9: end for 10: ζt,j [0, 1] 11: if ζt,j πt( (vt,j)) then 12: vt,j+1 (vt,j) 13: else 14: vt,j+1 (vt,j) 15: end if 16: end for

17: at vt,log(K) 18: πt Q

j [log(K)] πt(vt,j) 19: ψt,log(K) exp( ηℓt,at/ πt) 20: for j = log(K), (log(K) 1), , 1 do 21: ψt,(j 1) 1 (1 ψt,j)πt(vt,j) 22: ψ t,j ψt,j/ψt,j 1 23: if vt,j = (vt,j 1) then 24: vt,j (vt,j 1) 25: else 26: vt,j (vt,j 1) 27: end if 28: for S 2X : xt S do 29: wt+1(vt,j, S) wt(vt,j, S)ψ t,j 30: wt+1( vt,j, S) wt( vt,j, S)/ψt,j 1 31: end for 32: end for

and for all (v, S) B 2X we set:

w1(v, S) := 1

T + (1 σ(x, S)) 1 1

where: σ(x, S) := JJx SK = Jn(x) SKK. This choice gives us the regret bound in Theorem 3.3. We note that CBNN will be implemented in such a way that n need not be known a-priori.

4.3 Ternary Search Trees

As we shall see, CBNN works by storing a binary tree A(v) at each vertex v B. In order to perform efficient operations on these trees we will utilise the rebalancing data-structure defined in [22] which here we shall call a ternary search tree (TST) due to the fact that it is a generalisation of the classic binary search tree and, as we shall show, has searching applications. However, as for binary search trees, the applications of TSTs are more than just searching: we shall also utilise them for online belief propagation.

We now define what is meant by a TST. Suppose we have a full binary tree J . A TST of J is a full ternary tree D which satisfies the following. The vertex set of D is partitioned into two sets D and D where each vertex s D is associated with a vertex µ(s) J and every s D is also associated with a vertex µ (s) (µ(s)) . In addition, each internal vertex s D is associated with a vertex ξ(s) J . For all u J there exists an unique leaf ΥD(u) D in which µ(ΥD(u)) = u.

Essentially, each vertex s D corresponds to a subtree ˆ J (s) of J where ˆ J (r(D)) = J . Such a vertex s is a leaf of D if and only if | ˆ J (s)| = 1. For each internal vertex s D the subtree ˆ J (s) is split at the vertex ξ(s) into the subtrees ˆ J ( (s)), ˆ J ( (s)), and ˆ J ( (s)) corresponding to the children of s. The process continues recursively.

For completeness we now describe the rules that a TST D of J must satisfy. We have that r(D) D

and µ(r(D)) := r(J ). Each vertex s D represents a subtree ˆ J (s) of J . If s D then ˆ J (s) := (µ(s)) and otherwise ˆ J (s) is the set of all descendants of µ(s) which are not proper descendants of µ (s). Given that s D this subtree is split at the vertex ξ(s) where if s D we have that ξ(s) lies on the path from µ(s) to µ (s). The children of s are then defined so that ˆ J ( (s)) = ˆ J (s) ( (ξ(s))) and ˆ J ( (s)) = ˆ J (s) ( (ξ(s))) and ˆ J ( (s)) = ˆ J (s)\( ˆ J ( (s)) ˆ J ( (s))). i.e. ξ(s) partitions the subtree ˆ J (s) into the subtrees ˆ J ( (s)), ˆ J ( (s)), and ˆ J ( (s)). This process continues recursively until | ˆ J (s)| = 1 in which case s is a leaf of D.

For all binary trees J in our algorithm we shall maintain a TST H(J ) of J with height O(ln(|J |)). Such trees J are dynamic in that on any trial it is possible that two vertices, u and u , are added to the tree J such that u is inserted between a non-root vertex of J and its parent, and u is designated as a child of u . We define the subroutine REBALANCE(H(J ), u) as one which rebalances the TST H(J ) after this insertion, so that the height of H(J ) always remains in O(ln(|J |)). The work of [22] describes how this subroutine can be implemented in a time of O(ln(|J |)) and we refer the reader to this work for details (noting that they use different notation).

4.4 Contractions

Define the quantity ϕ0 := 0 and for all j N {0} inductively define:

ϕj+1 := 1 1

At any trial t the contexts in {xs | s [t]} naturally form a tree by designating n(xs) as the parent of xs. However, to utilise the TST data-structure we must only have binary trees. Hence, we will work with a (dynamic) full binary tree Z which, on trial t, is a binarisation of the above tree. The relationship between these two trees is given by a map γ : Zt {xs | s [t]} where Zt is the tree Z on trial t. For all x {xs | s [t]} we will always have an unique leaf γ(x) Z t in which γ( γ(x)) = x. We also maintain a balanced TST H(Z) of Z.

Algorithm 2 gives the subroutine GROWt which updates Z at the start of trial t. Note that GROWt also defines a function d : Z N such that d(u) is the number of times the function n must be applied to γ(u) to reach x1.

Algorithm 2 GROWt which works on Z

1: u γ(n(xt)) 2: u (u) 3: u NEWVERTEX 4: u NEWVERTEX 5: γ(u ) n(xt) 6: γ(u ) xt 7: γ(xt) u

8: if u = (u ) then 9: (u ) u

10: else 11: (u ) u

12: end if 13: (u ) u

14: (u ) u 15: d(u ) d(u) 16: d(u ) d(u) + 1 17: REBALANCE(H(Z), u )

A contraction (of Z) is defined as a full binary tree J in which the following holds. (1) The vertices of J are a subset of those of Z. (2) r(J ) = r(Z). (3) Given a vertex u J we have J (u) Z( Z(u)) and J (u) Z( Z(u)). (4) Any leaf of J is a leaf of Z.

CBNN will maintain, on every vertex v B, a contraction A(v) as well as a TST H(A(v)) of A(v). Given J is one of these contractions, we also maintain, for all i, i {0, 1} and all u J , a value τi,i (J , u) R+. Technically these quantities, which depend on the above function d, define a bayesian network on J which is explained in Appendix C.3. For all i {0, 1} and all u J we also maintain a value κi(J , u) initialised equal to 1.

On each of our contractions J we will define, on trial t, a subroutine INSERTt(J ) that simply modifies J so that γ(xt) is added to its leaves. This subroutine is only called on certain trials t. Specifically, it is called on the contraction A(v) only when v is involved in CANPROP on trial t. Although the effect of this subroutine is simple to describe, its polylogarithmic-time implementation is quite complex. A function that is used many times during this subroutine is ν : Z Z { , , } in which ν(u, u ) is equal to , or if u is contained in Z( Z(u)), in Z( Z(u)) or in neither, respectively. Algorithm 3 shows how to compute this function. Now that we have a subroutine for computing ν we can turn to the pseudocode for the subroutine INSERTt(J ) in Algorithm 4. In the appendix we give a full description of how and why this subroutine works.

Algorithm 3 Computing ν(u, u ) for u, u Z

1: E H(Z) 2: if u = u then 3: return 4: end if 5: s ΥE(u) 6: s ΥE(u ) 7: s ΓE( s, s ) 8: for s { (s ), (s ), (s )} do 9: if s (s) then 10: ˆs s 11: end if 12: if s (s) then 13: ˆs s 14: end if 15: end for 16: if ˆs = (s ) then 17: return 18: end if 19: if ξ(s ) = u ˆs = (s ) then

20: return 21: else if ξ(s ) = u ˆs = (s ) then 22: return 23: end if 24: s ˆs 25: while TRUE do 26: if s E then 27: return 28: else if u = ξ(s) (s) E then 29: return 30: else if u = ξ(s) (s) E then 31: return 32: end if 33: for s { (s), (s), (s)} do 34: if s (s ) then 35: s s

36: end if 37: end for 38: end while

Algorithm 4 The operation INSERTt(J ) on a contraction J of Z at trial t

1: E H(Z) 2: D H(J ) 3: s r(D) 4: ut γ(xt) 5: while s D do 6: if ν(ξ(s), ut) = then 7: s (s) 8: else if ν(ξ(s), ut) = then 9: s (s) 10: else if ν(ξ(s), ut) = then 11: s (s) 12: end if 13: end while 14: ˆu µ(s) 15: s r(E) 16: while s E do 17: if ν(ξ(s), ut) = ν(ξ(s), ˆu) then 18: if ν(ξ(s), ut) = then 19: s (s) 20: else if ν(ξ(s), ut) = then 21: s (s) 22: else if ν(ξ(s), ut) = then 23: s (s) 24: end if 25: else 26: s (s) 27: end if 28: end while

29: u µ(s) 30: u J (ˆu) 31: if ˆu = J (u ) then 32: J (u ) u

33: else 34: J (u ) u

35: end if 36: if ν(u , ˆu) = then 37: J (u ) ˆu 38: J (u ) ut 39: else 40: J (u ) ˆu 41: J (u ) ut 42: end if 43: for i {0, 1} do 44: κi(J , u ) 1 45: κi(J , ut) 1 46: end for 47: for u {u , ˆu, ut} do 48: δ(u) d(u) d( J (u)) 49: end for 50: for (i, i ) {0, 1} {0, 1} do 51: if i = i then 52: τi,i (J , u) 1 ϕδ(u) 53: else 54: τi,i (J , u) ϕδ(u) 55: end if 56: end for 57: REBALANCE(H(J ), ut)

4.5 Online Belief Propagation

In this subsection we utilise the work of [7] in order to be able to efficiently compute the function θt that appears in CANPROP.

Given a vertex u in one of our contractions J we define F(J , u) := {f {0, 1}J | f(u) = 1} and then define:

Λ(J , u) := X

u J \{r(J )} τf( J (u )),f(u )(J , u )κf(u )(J , u ) .

As stated in the previous subsection, when a vertex v B becomes involved in CANPROP on trial t, CBNN will add γ(xt) to the leaves of A(v) via the operation INSERTt(A(v)). In the appendix we shall show that for each such v we then have:

θt(v) = Λ(A(v), γ(xt))/4 .

We now outline how to compute this efficiently, deferring a full description for Appendix D.3. First note that for all contractions J and all u J we have that Λ(J , u) is of the exact form to be solved by the classic Belief propagation algorithm [23]. The work of [7] shows how to compute this term in logarithmic time by maintaining a data-structure based on a balanced TST of J - in our case the TST H(J ). Whenever, for some i {0, 1} and u J , the value κi(J , u ) changes, the data-structure is updated in logarithmic time. We define the subroutine EVIDENCE(J , u ) as that which updates this data-structure after κi(J , u ) changes. We also make sure that the data-structure is updated whenever REBALANCE(H(J ), ) is called. We then define the subroutine MARGINAL(J , u) as that which computes Λ(J , u)/4. Hence, the output of MARGINAL(A(v), γ(xt)) is equal to θt(v).

Now that we have defined all our subroutines we give, in Algorithm 5, the algorithm CBNN which is an efficient implementation of CANPROP with initial weighting given in Equation (2).

Algorithm 5 CBNN at trial t

1: GROWt 2: ut γ(xt) 3: vt,0 r(B) 4: for j = 0, 1, , (log(K) 1) do 5: for v { (vt,j), (vt,j)} do 6: INSERTt(A(v)) 7: θt(v) MARGINAL(A(v), ut) 8: end for 9: zt,j θt( (vt,j)) + θt( (vt,j)) 10: for v { (vt,j), (vt,j)} do 11: πt(v) θt(v)/zt,j 12: end for 13: ζt,j [0, 1] 14: if ζt,j πt( (vt,j)) then 15: vt,j+1 (vt,j) 16: else 17: vt,j+1 (vt,j)

18: end if 19: end for 20: at vt,log(K) 21: πt Q

j [log(K)] πt(vt,j) 22: ψt,log(K) exp( ηℓt,at/ πt) 23: for j = log(K), (log(K) 1), , 1 do 24: ψt,(j 1) 1 (1 ψt,j)πt(vt,j) 25: if vt,j = (vt,j 1) then 26: vt,j (vt,j 1) 27: else 28: vt,j (vt,j 1) 29: end if 30: κ1(A(vt,j), ut) ψt,j/ψt,j 1 31: κ1(A( vt,j), ut) 1/ψt,j 1 32: EVIDENCE(A(vt,j), ut) 33: EVIDENCE(A( vt,j), ut) 34: end for

5 Acknowledgments

We would like to thank Mark Herbster (University College London) for valuable discussions.

Research funded by the Defence Science and Technology Laboratory (Dstl) which is an executive agency of the UK Ministry of Defence providing world class expertise and delivering cutting-edge science and technology for the benefit of the nation and allies. The research supports the Autonomous Resilient Cyber Defence (ARCD) project within the Dstl Cyber Defence Enhancement programme.

[1] R. Agrawal. Sample mean based index policies by o(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054 1078, 1995.

[2] S. Arya and Y. Yang. Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. Statistics & Probability Letters, 2020.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002.

[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32:48 77, 2002.

[5] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.

[6] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory, 13:21 27, 1967.

[7] A. L. Delcher, A. J. Grove, S. Kasif, and J. Pearl. Logarithmic-time updates and queries in probabilistic networks. J. Artif. Intell. Res., 4:37 59, 1995.

[8] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Symposium on the Theory of Computing, 1997.

[9] A. Garivier and O. Cappé. The kl-ucb algorithm for bounded stochastic bandits and beyond. In Annual Conference Computational Learning Theory, 2011.

[10] D. P. Helmbold and R. E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27:51 68, 1995.

[11] M. Herbster, S. Pasteris, and F. Vitale. Online sum-product computation over trees. In NIPS, 2012.

[12] M. Herbster, S. Pasteris, F. Vitale, and M. Pontil. A gang of adversarial bandits. In Neural Information Processing Systems, 2021.

[13] M. Herbster and J. Robinson. Online prediction of switching graph labelings with cluster specialists. In Neural Information Processing Systems, 2018.

[14] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151 178, 1995.

[15] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In International Conference on Machine Learning, 2008.

[16] W. M. Koolen, D. Adamskiy, and M. K. Warmuth. Putting bayes to sleep. In NIPS, 2012.

[17] R. Krauthgamer and J. R. Lee. Navigating nets: simple algorithms for proximity search. In ACM-SIAM Symposium on Discrete Algorithms, 2004.

[18] T. L. Lai and H. E. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4 22, 1985.

[19] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS 2007, 2007.

[20] T. Lattimore and C. Szepesvári. Bandit algorithms. 2020.

[21] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In The Web Conference, 2010.

[22] K. Matsuzaki and A. Morihata. Balanced ternary-tree representation of binary trees and balancing algorithms. In Mathematical Engineering Technical Reports, 2008.

[23] J. Pearl. Reverend bayes on inference engines: A distributed hierarchical approach. Probabilistic and Causal Inference, 1982.

[24] V. Perchet and P. Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 2013.

[25] W. Qian and Y. Yang. Kernel estimation and model combination in a bandit problem with covariates. J. Mach. Learn. Res., 17:149:1 149:37, 2016.

[26] H. W. J. Reeve, J. C. Mellor, and G. Brown. The k-nearest neighbour ucb algorithm for multi-armed bandits with covariates. Algorithmic Learning Theory, 2018.

[27] P. Rigollet and A. J. Zeevi. Nonparametric bandits with covariates. In Annual Conference Computational Learning Theory, 2010.

[28] H. E. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527 535, 1952.

[29] Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, and R. Ortner. Pac-bayesian analysis of contextual bandits. In NIPS, 2011.

[30] A. Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 2014.

[31] C.-C. Wang, S. R. Kulkarni, and H. V. Poor. Arbitrary side observations in bandit problems. Adv. Appl. Math., 34:903 938, 2005.

A Introduction to the Appendix

We now turn to the full description and analysis of our algorithm CBNN. In Appendix B we describe our novel algorithmic framework CANPROP. In Appendix C we describe contractions and bayesian networks on them, showing how CANPROP can be implemented with them. Finally, in Appendix D we describe TSTs and how they are used to perform our required operations efficiently. In Appendix E we prove, in order, all of the theorems stated in this paper.

Recall that we created a sequence of distinct nodes xt | t [T] and defined, for all t [T]\{1}, the node n(xt) := xn(t). Let X := {xt | t [T]}. Given some y : X [K] , we defined the y-regret and the complexity of y as:

t [T ] ℓt,at X

t [T ] ℓt,y(xt) ; Φ(y) := 1 + X

x X\{x1} Jy(x) = y(n(x))K

respectively.

B Cancellation Propagation

B.1 The General CANPROP Algorithm

We now introduce a general algorithmic framework CANPROP for handling contextual bandit problems with a per-trial time logarithmic in K. Without loss of generality assume that K is an integer power of two. Let B be a full, balanced binary tree whose leaves are the set of actions [K]. Let B := B \ {r(B)}. CANPROP takes a parameter η R+ called the learning rate. On each trial t CANPROP maintains a function: wt : B 2X [0, 1] . The function w1 is free to be defined how one likes, as long as it satisfies the constraint that for all internal vertices v B we have: X

S 2X (w1( (v), S) + w1( (v), S)) = 1 .

We now describe how CANPROP acts on trial t. For all v B we define:

S 2X Jxt SKwt(v, S)

and for all v B we define:

πt( (v)) := θt( (v)) θt( (v)) + θt( (v)) ; πt( (v)) := θt( (v)) θt( (v)) + θt( (v)) .

As we shall see CANPROP needs only compute these values for O(ln(K)) vertices v. CANPROP samples a root-to-leaf path {vt,j | j [log(K)] {0}} as follows. vt,0 is defined equal to r(B). For all j [log(K) 1] {0}, once vt,j has been sampled we sample vt,(j+1) from the probability distribution defined by:

P[vt,(j+1) = v] := J (v) = vt,j Kπt(v) v B

noting that vt,(j+1) is a child of vt,j. We define:

Pt := {vt,j | j [log(K)] {0}} .

CANPROP then selects: at := vt,log(K) and then receives the loss ℓt,at. The function wt is then updated to wt+1 as follows. Firstly we define,

wt+1(v, S) := wt(v, S) (v, S) {v B | (v ) / Pt} 2X .

We then define:

ψt,log(K) := exp

j [log(K)] πt(vt,j)

Once we have defined ψt,j for some j [log(K)] we then define:

ψt,(j 1) := 1 (1 ψt,j)πt(vt,j)

βt(v) := Jv Pt Kψt,j + Jv / Pt K

ψt,(j 1) v { (vt,(j 1)), (vt,(j 1))}

wt+1(v, S) := (Jxt SKβt(v) + Jxt / SK)wt(v, S) (v, S) { (vt,(j 1)), (vt,(j 1))} 2X .

The regret bound of CANPROP is given by the following theorem. Theorem B.1. Suppose we have a function y : X [K]. For all v B define:

Q(y, v) := {x X | y(x) (v)} .

Then the expected y-regret of CANPROP is bounded by:

E[R(y)] ηKT

v B JQ(y, v) = K ln(w1(v, Q(y, v))) .

B.2 Our Parameter Tuning

We now describe and analyse the initial weighting w1 and the learning rate η that we will use. Define X := X \ {x1}. For all (x, S) X 2X define:

σ(x, S) := JJx SK = Jn(x) SKK .

For all (v, S) B 2X we define:

w1(v, S) := 1

T + (1 σ(x, S)) 1 1

Given our parameter ρ we choose our learning rate as:

ln(K) ln(T)

Given this initial weighting and learning rate, Theorem B.1 implies the following regret bound.

Theorem B.2. Given w1 and η are defined as above, then for any y : X [K] the expected y-regret of CANPROP is bounded by:

E[R(y)] O ρ + Φ(y)

ln(K) ln(T)KT .

C Implementation with Contractions

C.1 A Sequence of Binary Trees

For any trial t we have a natural tree-structure on the set {xt | t [t]} formed by making n(xt ) the parent of xt for all t [t]\{1}. However, in order to utilise the methodology of [22] we need to work with binary trees. Hence, we now inductively define a sequence of binary trees Zt | t [T] \ {1} where the vertices of Zt are a subset of those of Zt+1. We also define a function γ : ZT X. This function γ has the property that for any t [T] and for any distinct leaves u, u Z t we have that γ(u) = γ(u ) , and that: {γ(u ) | u Z t } = {xt | t [t]} . We define Z2 to contain three vertices {r(Z2), (r(Z2)), (r(Z2))} where:

γ(r(Z2)) := γ( (r(Z2)) := x1 ; γ( (r(Z2)) := x2 .

Now consider a trial t [T]. We have that Zt+1 is constructed from Zt via the following algorithm GROWt+1:

1. Let u be the unique leaf in Z t in which γ(u) = n(xt+1) and let u := (u).

2. Create two new vertices u and u .

3. Set γ(u ) n(xt+1) and γ(u ) xt+1.

4. If u = (u ) then set (u ) u . Else set (u ) u .

5. Set (u ) u and (u ) u.

We also define a function d : ZT N {0} as follows. Define d (x1) := 0 and for all t [T] \ {1} inductively define d (xt) := d (n(xt)) + 1. Finally define d(u) := d (γ(u)) for all u ZT . Since for all t [T] we have that the vertices of Zt are a subset of those of ZT we have that d also defines a function over Zt for all t [T].

For all t [T] we define ut to be the unique leaf of Zt for which γ(ut) = xt.

C.2 Contractions

Our efficient implementation of CANPROP will have a data-structure at every vertex v B . However, to achieve polylogarithmic time per trial we can only update a polylogarithmic number of these data-structures per trial. This necessitates the use of contractions of our trees {Zt | t [T] \ {1}} which are defined as follows. A contraction of a full binary tree Q is another full binary tree J which satisfies the following:

The vertices of J are a subset of those of Q.

r(J ) = r(Q).

Given an internal vertex u J we have J (u) Q( Q(u)) and J (u) Q( Q(u)).

Any leaf of J is a leaf of Q.

Note that any contraction of Zt is also a contraction of Zt+1 and hence, by induction, a contraction of Zt for all t t. Given a trial t and a contraction J of Zt 1 we now define the operation INSERTt(J ) which acts on J by the following algorithm:

1. Let ˆu be the unique vertex in J \ {r(J )} such that ut is in the maximal subtree of Zt with ˆu and J (ˆu) as leaves.

2. Let u := ΓZt(ut, ˆu) .

3. Add the vertices u and ut to the tree J .

4. Let u := J (ˆu).

5. If ˆu = J (u ) then set J (u ) u . Else set J (u ) u .

6. If ˆu Zt( Zt(u )) then set J (u ) ˆu and J (u ) ut. Else set J (u ) ˆu and J (u ) ut.

Later in this paper we will show how this operation can be done in polylogarithmic time. Note that after the operation we have that J is a contraction of Zt and ut has been added to it s leaves. From now on when we use the term contraction we mean any contraction of ZT .

C.3 Contraction-Based Bayesian Networks

Here we shall define a bayesian network over any contraction J and show how it can be utilised to compute certain quantities required by CANPROP. First define the quantity ϕ0 := 0 and for all j N {0} inductively define:

ϕj+1 := 1 1

The algorithm must compute these quantities for all j [T]. However, for all t [T] we have that ϕt doesn t have to be computed until trial t so computing these quantities is constant time per trial. Given a contraction J , a vertex u J \ r(J ) and indices i, i {0, 1} define:

τi,i (J , u) := Ji = i Kϕ(d(u) d( J (u))) + Ji = i K 1 ϕ(d(u) d( J (u)))

which defines the transition matrix from J (u) to u in a bayesian network over J . We shall now show how belief propagation over such bayesian networks can be used to compute the quantities we need in CANPROP. Suppose we have a contraction J and a function λ : J R+. This function λ induces a function λ : X R+ defined as follows. Given x X, if there exists a leaf u J with γ(u) = x then λ (x) = λ(u). Otherwise λ (x) = 1. For all S 2X define:

w(J , λ, S) :=

T + (1 σ(x, S)) 1 1

For the CANPROP algorithm we will need to compute X

S 2X Jγ(ˆu) SK w(J , λ, S) (3)

for some leaf ˆu J and some function λ : J R+. We shall now show how we can compute this quantity via belief propagation on the bayesian network. In particular, we shall construct a quantity Λ(J , λ, u) equal to the quantity in Equation (3). To do this, first define the function λ : J R+ so that for all u J we have λ (u) = λ(u) and for all u J we have λ (u) = 1. For all vertices u J and all indices i {0, 1} define:

κi(J , λ, u) := Ji = 0K + Ji = 1Kλ (u) .

For all ˆu J define: F(J , ˆu) := {f {0, 1}J | f(ˆu) = 1}

and then define:

Λ(J , λ, ˆu) := X

u J \r(J ) τf( J (u)),f(u)(J , u) κf(u)(J , λ, u) .

The equality of this quantity and that given in Equation (3) is given by the following theorem.

Theorem C.1. Given a contraction J , a function λ : J R+ and some leaf ˆu J we have:

Λ(J , λ, ˆu) = X

S 2X Jγ(ˆu) SK w(J , λ, S) .

Note that Λ(J , λ, ˆu) is of the exact form to be solved via belief propagation over J . However, belief propagation is too slow (taking Θ(|J |) time) - we will remedy this later.

C.4 Cancelation Propagation with Contractions

We now describe how to implement CANPROP with contractions. For each v B we maintain a contraction A(v) and a function λv : A(v) R+. We initialise with A(v) identical to Z2 and λv(u) = 1 for both leaves u Z 2. Via induction over t we will have that at the start of each trial t we have, for all sets S 2X , that:

wt(v, S) = w(A(v), λv, S)/4 . (4)

On trial t we do as follows. First we update Zt 1 to Zt using the algorithm GROWt. We will perform the necessary modifications to our contractions as we sample the path Pt. In particular, we first set vt,0 := r(B) and then for each j [log(K) 1] {0} in turn we do as follows. For each v { (vt,j), (vt,j)} run INSERTt(A(v)) and set λv(ut) 1. Since λv(ut) = 1 Equation (4) still holds and hence, by Theorem C.1, we have:

θt(v) = Λ(A(v), λv, ut)/4 .

After θt(v) has been computed for both v { (vt,j), (vt,j)} we can now sample vt,j+1.

Once we have selected the action at we then update the functions {λv | B(v) Pt} by setting λv(ut) βt(v) for all v B with B(v) Pt. It is clear now that Equation (4) holds inductively.

C.5 Notational Relationship to the Main Body

We now point out how the notation in this section relates to that of the main body. In particular, we have, for all v B , all u A(v) and all i, i {0, 1}, that:

κi(A(v), u) = κi(A(v), λv, u).

Λ(A(v), u) = Λ(A(v), λv, u).

We note that the function λv does not explicity appear in our pseudocode since it can be inferred from κi(A(v), ).

D Utilising Ternary Search Trees

There are now only two things left to do in order to achieve polylogarithmic time per trial - to make an efficient online implementation of the INSERTt( ) operation and an efficient online algorithm to perform belief propagation over our contractions. In order to do this we will utilise the methodology of [22] which we now describe. However, we do not give the full details of the rebalancing technique and refer the reader to [22] for these details (noting that [22] uses different notation).

D.1 Ternary Search Trees

In this section we will consider a full binary tree J . A (full) ternary tree D is a rooted tree in which each internal vertex s D has three children denoted by (s), (s), (s) and called the left, centre, and right children respectively. We now define what it means for a ternary tree D to be a ternary search tree (TST) of J . Firstly, the vertex set of D is partitioned into two sets D and D . Every vertex s D is associated with a vertex µ(s) J and every s D is also associated with a vertex µ (s) J (µ(s)) . The root r(D) is contained in D and µ(r(D)) := r(J ). Each internal vertex s D is associated with a vertex ξ(s) J . If s D then ξ(s) (µ(s)) and if s D then ξ(s) lies on the path (in J ) from µ(s) to (µ (s)). For all s D we have:

(s) D , µ( (s)) := µ(s) and µ ( (s)) := ξ(s). (s) satisfies:

If s D then (s) D and µ( (s)) := (ξ(s)). If s D and µ (s) ( (ξ(s))) then (s) D and µ( (s)) := (ξ(s)). Else (s) D , µ( (s)) := (ξ(s)) and µ ( (s)) := µ (s). (s) satisfies:

If s D then (s) D and µ( (s)) := (ξ(s)). If s D and µ (s) ( (ξ(s))) then (s) D and µ( (s)) := (ξ(s)). Else (s) D , µ( (s)) := (ξ(s)) and µ ( (s)) := µ (s).

Finally, for each leaf s D we have:

If s D then µ(s) is a leaf of J .

If s D then there exists u J such that µ(s) = µ (s) = u.

Intuitively, each vertex s D is associated with a subtree ˆ J (s) of J . If s D then ˆ J (s) := (µ(s)) and if s D then ˆ J (s) is the subtree of all descendants of µ(s) which are not proper descendants of µ (s). For every s D such that ˆ J (s) contains only a single vertex, we have that s is a leaf of D. Otherwise s is an internal vertex of D and its children are as follows. We say that ˆ J (s) is split at the vertex ξ(s) ˆ J (s) . If s D we require that ξ(s) is on the path in J from µ(s) to µ (s). The action of splitting ˆ J (s) at ξ(s) partitions ˆ J (s) into the subtrees ˆ J ( (s)), ˆ J ( (s)) and ˆ J ( (s)) defined as follows:

ˆ J ( (s)) := ( (ξ(s))) ˆ J (s).

ˆ J ( (s)) := ( (ξ(s))) ˆ J (s).

ˆ J ( (s)) := ˆ J (s) \ ( ˆ J ( (s)) ˆ J ( (s))).

Utilising the methodology of [22] we will maintain TSTs of Zt (at each trial t) and the trees in {A(v) | v B }, each with height O(ln(T)). Note that these trees are dynamic, in that vertices are inserted into them over time. [22] shows how, after such an insertion, the corresponding TST can be rebalanced, in time O(ln(T)), so that its height is still in O(ln(T)). This rebalancing is performed via a sequence of O(ln(T)) tree rotations, which generalise the concept of tree rotations in binary search trees.

D.2 Searching

In this section we show how we can use our TSTs to implement the operation INSERTt(J ) on any trial t and contraction J of Zt 1. To do this we need to perform the following two search operations:

1. Find the unique vertex ˆu J \ {r(J )} such that ut is in the maximal subtree of Zt with ˆu and J (ˆu) as leaves. 2. Find u := ΓZt(ut, ˆu).

To perform these tasks in polylogarithmic time we will utilise TSTs E and D of Zt and J respectively. Both the searching tasks utilise a function ν : Z2 t { , , } defined, for all u, u Zt as follows. If u Zt( (u)) or u Zt( (u)) then ν(u, u ) := or ν(u, u ) := respectively. Otherwise ν(u, u ) := . This can be computed as follows. If u = u then ν(u, u ) := . Otherwise let s and s be the unique leaves of E such that µ( s) = u and µ( s ) = u . Let s := ΓE( s, s ) and let ˆs and ˆs be the children of s which are ancestors of s and s respectively. If ˆs = (s ) then we have ν(u, u ) = . If ξ(s ) = u then we have ν(u, u ) = or ν(u, u ) = if ˆs = (s ) or ˆs = (s ) respectively. If ˆs = (s ) and ξ(s ) = u then we perform the following process. Start with s equal to ˆs. At any point in the process we do as follows. If s E then the process terminates with ν(u, u ) := . If s E and u = ξ(s) then the process terminates with ν(u, u ) = or ν(u, u ) = if (s) E or (s) E respectively. If s E and u = ξ(s) then we reset s as equal to the child of s which is an ancestor of s and continue the process.

The vertex ˆu can be found as follows. We construct a root-to-leaf path in D such that, given a vertex s in the path, the next vertex in the path is (s), (s) or (s) if ν(ξ(s), ut) is equal to , or respectively. Given that s is the leaf of D that is in this path we have ˆu = µ(s ).

The vertex u can then be found as follows. We construct a root-to-leaf path in E such that, given a vertex s in the path, the next vertex in the path is found as follows. If ν(ξ(s), ut) = ν(ξ(s), ˆu) then given ν(ξ(s), ut) is equal to , or , the next vertex is equal to (s), (s) or (s) respectively. Otherwise, the next vertex is (s). Given that s is the leaf of E that is in this path we have u = µ(s ).

The fact that these algorithms find the correct vertices is given in the following theorem: Theorem D.1. The above algorithms are correct.

D.3 Belief Propagation

Here we utilise the methodology of [7] in order to efficiently compute the function Λ that appears in the CANPROP implementation. i.e. given a contraction J , a function λ : J R+ and some leaf ˆu J we need to compute Λ(J , λ, ˆu). For brevity let us define, for all i, i {0, 1}, and all vertices u J \ {r(J )} , the quantities: ˆτi,i (u) := τi,i (J , u) ; ˆκi(u) := κi(J , λ, u) . For simplicity of presentation we will utilise a tree J which is defined as identical to J except with a single vertex added as the parent of r(J ). For all i, i {0, 1} we define ˆκi(r(J )) := 1 and ˆτi,i (r(J )) = Ji = i K. For all u J we will define (u) := J (u)

We will utilise a TST D of J by maintaining potentials on the vertices of D defined as follows. First, as in Section D.1, for any vertex s D we define the subtree ˆ J (s) of J to be equal to J (µ(s)) if s D and equal to the subtree in J of all descendants of µ(s) which are not proper descendants of µ (s) if s D . For all s D and i {0, 1} we define:

f {0,1} ˆ J (s) { (µ(s))} Jf( (µ(s))) = i K Y

u ˆ J (s) ˆτf( (u)),f(u)(u)ˆκf(u)(u)

and for all s D and i, i {0, 1} we define:

Ωi,i (s) := X

f {0,1} ˆ J (s) { (µ(s))} Jf( (µ(s))) = i KJf(µ (s)) = i K Y

u ˆ J (s) ˆτf( (u)),f(u)(u)ˆκf(u)(u) .

We have the following recurrence relations for these potentials. Suppose we have an internal vertex s D and i, i {0, 1}. If s D we have:

i {0,1} Ωi,i ( (s))Ψi ( (s))Ψi ( (s)) .

If, instead, s D then, by letting s := (s), s := (s) if (s) D and s := (s), s := (s) otherwise, we have:

Ωi,i (s) = X

i {0,1} Ωi,i ( (s))Ωi ,i (s )Ψi (s ) .

If, on a trial t, we perform the operation INSERTt(J ) or change the value of λ(ut) these recurrence relations can be used (in conjunction with the tree rotations) to update the potentials in logarithmic time.

Now that we have defined our potentials we will show how to use them to compute Λ(J , λ, ˆu) in logarithmic time. To do this we recursively define the following quantities for i {0, 1}. Let ωi(r(D)) := 1. Given an internal vertex s D we define:

ωi( (s)) := ωi(s) ; ω i( (s)) := Ψi( (s))Ψi( (s))

ωi( (s)) := Ψi( (s)) X

i {0,1} ωi (s)Ωi ,i( (s)) ; ωi( (s)) := Ψi( (s)) X

i {0,1} ωi (s)Ωi ,i( (s)) .

Given an internal vertex s D define s := (s), s := (s) if (s) D and s := (s), s := (s) otherwise. Then:

ωi( (s)) := ωi(s) ; ω i( (s)) := Ψi(s ) X

i {0,1} Ωi,i (s )ω i (s)

ωi(s ) := X

i {0,1} ωi (s)Ωi ,i(s)Ψi(s ) ; ω i(s ) := ω i(s)

ωi(s ) := X

i ,i {0,1} ωi (s)Ωi ,i( (s))ω i (s)Ωi,i (s ) .

For s D , ω i(s) is not required and hence is arbitrary. We inductively compute the values {ωi(s), ω i(s) | i {0, 1}} for all s in the path from r(D) to the unique leaf ˆs D in which µ(ˆs) = ˆu. We then have Λ(J , λ, ˆu) = ω1(ˆs).

Since this is known methodology we do not include a proof in this paper and direct the reader to [7].

E.1 Theorem 3.3

This theorem is proved in appendices B to D and the theorems therein.

E.2 Theorem 3.6

For all x C define ˆγ(x) := γ(x, ˆy, X).

Choose a set S C in which for all t [T] there exists x S with (x, xt) < ˆγ(x)/3c. For all trials t let St be the set of all contexts x S in which there exists s [t] with (x, xs) < ˆγ(x)/3c.

Now consider a trial t [T] \ {1} in which ˆy(xt) = ˆy(xn(t)) and choose x S with (x, xt) < ˆγ(x)/3c.

Assume, for contradiction, that x St 1. Then there exists s [t 1] with (x, xs) < ˆγ(x)/3c so that by the triangle inequality we have:

(xt, xs) (x, xs) + (x, xt) < 2ˆγ(x)/3c

which implies that (xt, xn(t)) < 2ˆγ(x)/3. By the triangle inequality we then have that:

(x, xn(t)) (xt, xn(t)) + (x, xt) < 2ˆγ(x)/3 + ˆγ(x)/3c 3ˆγ(x)/3 = ˆγ(x)

Since (x, xt) < ˆγ(x) we have y(x) = y(xt) and hence that y(x) = y(xn(t)). But this contradicts the fact that (x, xn(t)) < ˆγ(x).

We have hence shown that x / St 1. Since x St we then have that |St| |St 1|. Since |S1| 1 this implies that: Φ(y) = 1 + X

t [T ]\{1} Jˆy(xt) = ˆy(xn(t))K |ST | |S|

as required.

E.3 Theorem 3.9

Let ϵ be such that:

lim δ 0 1 δ

x M(ˆy,µ,2ϵ,δ) 1 2α(ˆy, µ) ; lim δ 0 1 δ

x M(ˆy,µ,2ϵ,δ) µ(x) 2 α(ˆy, µ) (5)

Since we are only interested in the behaviour as q assume, without loss of generality, that

d/q < ϵ/3. Choose λ > 0 and λ > 0 sufficiently small for this proof to work. For all x Gd q let H(x) be the set of points x [0, 1]d such that x is the nearest neighbour (or one of them if the nearest neighbour is not unique) of x in Gd q . Let L be the set of all x Gd q for which there exists x , x H(x) E(µ, ϵ) with ˆy(x ) = ˆy(x ). Note that for all x L we have H(x ) M(ˆy, µ, 2ϵ,

x M(ˆy,µ,2ϵ,

x H(x ) 1 = q

d |L|q d = 1 qd 1

By considering the limit of this inequality as q , and noting that d is being treated as a constant, we then have, by Equation (5), that:

|L| O(α(ˆy, µ)qd 1) (6)

Now define:

x H(x ) µ(x)

Since H(x ) M(ˆy, µ, 2ϵ,

d/q) for all x L , we have:

x M(ˆy,µ,2ϵ,

By considering the limit of this inequality as q , and noting that d is being treated as a constant, we then have, by Equation (5), that:

p O( α(ˆy, µ)/q) (7)

Let D be the set of all x Gd q such that there exists x H(x) with µ(x) = 0. Now define ˆy : [0, 1]d [K] such that for all x [0, 1]d we have:

If x D \ L then ˆy (x) is the unique a [K] such that there exists x H(x) with µ(x ) = 0 and a = ˆy(x ). Note that if there existed more than one such a then we would have x L which is a contradiction.

If x / D \ L then ˆy (x) = ˆy (ˆx) where ˆx is the nearest neighbour of x in D \ L

Let A be a finite set of points in Rd such that for all x Rd with (x, 0) 1 there exists x A with (x, x ) < λ. Define: A := [

i [ log(qd) ] {2ix/q | x A}

Note that: |A | O(ln(q)) (8)

We now show that for all x Rd with 1/q (x, 0) d there exists x A with:

(x, x ) < 2 (x, 0)λ (9)

To show this choose any such x and let i N be such that 2i 1/q (x, 0) < 2i/q. By the assumption on x we have that i [ log(qd) ]. Since (xq2 i, 0) < 1 choose x A such that (xq2 i, x ) < λ. Then we have:

(x, 2ix /q) = (2i/q) (xq2 i, x ) < (2i/q)λ 2 (x, 0)λ

so, since 2ix /q A , we have proved that Equation (9) is true.

We now let S be a finite set of points in Rd such that for all x [0, 1]d there exists some x S with (x, x ) < λ ϵ. Now define:

x L {x + x | x A }

Let X := {xt | t [T]}. Take any x X. We now show that there exists x S with (x, x ) < γ(x , ˆy , X)/3c. We have three cases:

First consider the case that (x, x ) > ϵ/3. Let x be the nearest neighbour of x in [0, 1]d

with ˆy (x) = ˆy (x ). Choose x S such that (x, x ) < λ ϵ. Since (x, x ) < ϵ/3 we must have ˆy (x ) = ˆy (x). Let x be the nearest neighbour of x in X with ˆy (x ) = ˆy (x ). Since ˆy (x ) = ˆy (x) we must have that ˆy (x ) = ˆy (x) so that (x, x ) > ϵ/3. By the triangle inequality we then have:

ϵ/3 < (x, x ) (x, x ) + (x , x ) < λ ϵ + (x , x ) = λ ϵ + γ(x , ˆy , X)

Hence γ(x , ˆy , X) > (1/3 λ )ϵ so that:

(x, x ) < λ ϵ < λ

1/3 λ γ(x , ˆy , X) < γ(x , ˆy , X)/3c

as required.

Now consider the case that x L. In this case we trivially have the result with x := x.

Finally consider the case that x / L and (x, x ) ϵ/3. Let x be the nearest neighbour of x in [0, 1]d with ˆy (x) = ˆy (x ). Let ˆx and ˆx be the nearest neighbours of x and x in D \ L respectively. Note that by definition of ˆy we have ˆy (ˆx) = ˆy (x) and ˆy (ˆx ) = ˆy (x ). By definition of D we must have x D, so since x / L we have ˆx = x. Noting that (x , ˆx ) (x , ˆx) we must then have that (x , ˆx ) (x , x) which means, by the triangle inequality, that (x, ˆx ) 2 (x, x ). Since x, ˆx D choose z H(x) and z H(ˆx ) such that µ(z), µ(z ) = 0. Since x, ˆx D \ L we have:

ˆy(z) = ˆy (z) = ˆy (x) = ˆy (ˆx ) = ˆy (z ) = ˆy(z )

By the triangle inequality and above we have:

(z, z ) (z, x) + (x, ˆx ) + (ˆx , z ) 2 (x, x ) +

so since µ(z), µ(z ) = 0 we must have that the straight line from z to z is entirely contained in E(µ, ϵ/2). Since ˆy(z) = ˆy(z ) we can then choose some z in the line from z to z

that is on the decision boundary of ˆy (i.e. any open set around z contains some z with ˆy( z) = ˆy(z )). Since there exists an open set around z that is entirely contained in E(µ, ϵ)

we must now have, by definition of L, that there exists ˆz L such that z H(ˆz). By the triangle inequality and Equation (10) we have:

(x, ˆz) (x, z) + (z, z ) + (z , ˆz) (z, z ) +

d/q (z, z ) +

2 (x, x ) + 2

Since x D\L we have ˆy ( z) = ˆy (x) for all z H(x) and hence we must have x / H(x) so that (x, x )

d/(2q). By Equation (11) this means that:

(x, x ) (x, ˆz)/6 (12)

By Equation (9) choose z A such that:

(x ˆz, z ) < 2 (x ˆz, 0)λ (13)

and define x = ˆz + z . Since ˆz L we have x S as required. Note that by equations (12) and (13) we have:

(x, x ) = (x ˆz, z ) < 2 (x, ˆz)λ 12 (x, x )λ (14)

Since 2λ < 1/6 we now have, from equations (12) and (14), that (x, x ) < (x, x ) so that ˆy (x) = ˆy (x ). Let x be the nearest neighbour of x in X with ˆy (x ) = ˆy (x ). Since ˆy (x ) = ˆy (x) we must have that ˆy (x ) = ˆy (x) so that (x, x ) (x, x ). By the triangle inequality and Equation (14) we then have:

(x, x ) (x, x ) (x, x ) + (x , x ) < 12 (x, x )λ + (x , x )

= 12 (x, x )λ + γ(x , ˆy , X)

Hence γ(x , ˆy , X) > (1 12λ) (x, x ) so that by Equation (14) we have:

(x, x ) < 12 (x, x )λ < 12λ 1 12λγ(x , ˆy , X) < γ(x , ˆy , X)/3c

as required.

Let y [K]T be such that y t := ˆy (xt) for all t [T]. We have shown that for all x X there exists x S with (x, x ) < γ(x , ˆy , X)/3c. By equations (6) and (8) we have that:

|S| O(|L| ln(q)) O(α(ˆy, µ)qd 1 ln(q)) O(α(ˆy, µ)qd 1)

Invoking Theorem 3.6 then gives us:

Φ(y ) O(α(ˆy, µ)qd 1)

so by Theorem 3.3 we have:

E[R(y )] O ρ + Φ(y )

KT O (1 + α(ˆy, µ))q d 1

We also have:

R(y) R(y ) = X

t [T ] (ℓt,yt ℓt,y t) = X

t [T ] (ℓt,ˆy(zt) ℓt,ˆy (xt)) X

t [T ] Jˆy(zt) = ˆy (xt)K

But ˆy(zt) = ˆy (xt) implies that xt / D \ L so since xt D we must have xt L which happens with probability p and hence, by Equation (7), we have:

E[R(y) R(y )] p T O(T α(ˆy, µ)/q) (16)

Combining equations (15) and (16) gives us:

E[R(y)] O((1 + α(ˆy, µ))q d 1

KT + T α(ˆy, µ)/q)

Since q := (T/K)1/(d+1) we have now shown that:

E[R(y)] O (1 + α(ˆy, µ) + α(ˆy, µ))T d d+1 K 1 d+1

as required.

E.4 Theorem B.1

For every trial t [T] define:

v B JQ(y, v) = K ln(wt(v, Q(y, v)))

Choose some arbitrary trial t [T]. From here until we say otherwise all probabilities and expectations (i.e. whenever we use P[ ] or E[ ]) are implicitly conditional on the state of the algorithm at the start of trial t. Note first that we have:

v B JQ(y, v) = K ln wt+1(v, Q(y, v))

wt(v, Q(y, v))

For all j [log(K)] {0} let γt,j be the ancestor (in B) of y(xt) at depth j. Note that for all v X \ {γt,j | j [log(K)] {0}} we have y(xt) / (v) so that xt / Q(y, v) and hence, directly from the CANPROP algorithm, we have wt+1(v, Q(y, v)) = wt(v, Q(y, v)). By Equation (17) and the fact that Q(y, v) = for all ancestors v of y(xt) this implies that:

j [log(K)] ln wt+1(γt,j, Q(y, γt,j))

wt(γt,j, Q(y, γt,j))

For all j [log(K)] define:

λt,j := ln wt+1(γt,j, Q(y, γt,j))

wt(γt,j, Q(y, γt,j))

and: ϵt,j := E[ln(ψt,j) | γt,j Pt] Now choose some arbitrary j [log(K)]. If γt,(j 1) Pt then γt,(j 1) = vt,(j 1) so (γt,j) = vt,(j 1) and hence, since xt Q(y, γt,j), we have λt,j = ln(βt(γt,j)). By definition of βt(γt,j) this means that:

E[λt,j | γt,j Pt , γt,(j 1) Pt] = ϵt,j E[ln(ψt,(j 1)) | γt,j Pt , γt,(j 1) Pt]

E[λt,j | γt,j / Pt , γt,(j 1) Pt] = E[ln(ψt,(j 1)) | γt,j / Pt , γt,(j 1) Pt]

Multiplying these two equations by P[γt,j Pt | γt,(j 1) Pt] and P[γt,j / Pt | γt,(j 1) Pt] respectively, and summing them together, then gives us:

E[λt,j | γt,(j 1) Pt] = P[γt,j Pt | γt,(j 1) Pt]ϵt,j E[ln(ψt,(j 1) | γt,(j 1) Pt]

Since P[γt,j Pt | γt,(j 1) Pt] = πt(γt,j) we then have:

E[λt,j | γt,(j 1) Pt] = πt(γt,j)ϵt,j ϵt,(j 1) (19)

If, on the other hand, γt,(j 1) / Pt then (γt,j) / Pt so λt,j = 0. This means that:

E[λt,j] = P[γt,(j 1) Pt]E[λt,j | γt,(j 1) Pt] (20)

Since the probability that γt,(j 1) Pt is equal to Q j [j 1] πt(γt,j ) we then have, by combining equations (19) and (20), that:

E[λt,j] = ϵt,j Y

j [j] πt(γt,j ) ϵt,(j 1) Y

j [j 1] πt(γt,j )

By substituting into Equation (18) (after taking expectations) we then have that:

E[ t t+1] = ϵt,0 + ϵt,log(K) Y

j [log(K)] πt(γt,j)

= E[ln(ψt,0)] + E[ln(ψt,log(K)) | at = γt,log(K)] Y

j [log(K)] πt(γt,j) (21)

Note that if at = γt,log(K) then γt,j = vt,j for all j [log(K)]. By definition of ψt,log(K) and the fact that γt,log(K) = y(xt), Equation (21) then gives us:

E[ t t+1] = E[ln(ψt,0)] ηℓt,y(xt) (22)

For all (v, a) B [K] define:

pt,a(v) = P[at = a | v Pt]

noting that this is non-zero only when a (v). Suppose we have some v B \ {r(B)} and some a (v) [K]. Then, since P[at = a | v / Pt] = 0, we have:

pt,a( (v)) = P[at = a | (v) Pt] = P[at = a | v Pt]P[v Pt | (v) Pt] = πt(v)pt,a(v)

Since pt,a(v) = 0 whenever a / (v), this implies that for all (v, a) B [K] we have:

pt,a(v) = πt( (v))pt,a( (v)) + πt( (v))pt,a( (v)) (23)

For all a [K] define:

ˆℓt,a = Jat = a Kℓt,a

P[at = a] We now take the inductive hypothesis that for all j [log(K)] {0} we have:

a [K] pt,a(vt,j) exp( ηˆℓt,a)

and prove this via reverse induction (i.e. from j = log(K) to j = 0). Note that given a := at we have P[at = a ] = Q

j [log(K)] πt(vt,j) and hence:

ψt,log(K) = exp( ηˆℓt,at)

so the inductive hypothesis holds for j = log(K). Now suppose that we have some j [log(K)] and that the inductive hypothesis holds for j = j . We shall now show that it holds also for j = j 1. Let v be the child of vt,(j 1) that is not equal to vt,j . Note that at / (v ) and hence exp( ηˆℓt,a) = 1 for all a (v ) (i.e. whenever pt,a(v ) = 0) which implies: X

a [K] pt,a(v ) exp( ηˆℓt,a) = 1 (24)

For all a [K], Equation (23) gives us:

pt,a(vt,(j 1)) exp( ηˆℓt,a) = πt(v )pt,a(v ) exp( ηˆℓt,a) + πt(vt,j )pt,a(vt,j ) exp( ηˆℓt,a)

Substituting Equation (24) and the inductive hypothesis into this equation (when summed over all a [K]) then gives us: X

a [K] pt,a(vt,(j 1)) exp( ηˆℓt,a) = πt(v ) + πt(vt,j )ψt,j

Since πt(v ) + πt(vt,j ) = 1 we have, direct from the algorithm, that πt(v ) + πt(vt,j )ψt,j = ψt,(j 1) so the inductive hypothesis holds for j = j 1. We have hence shown that the inductive hypothesis holds for all j [log(K)] {0} and in particular for j = 0. Since pt,a(vt,0) = P[at = a] we then have: ψt,0 = X

a [K] P[at = a] exp( ηˆℓt,a) (25)

Since exp( z) 1 z + z2/2 for all z R+ we have, from Equation (25), that:

a [K] P[at = a]

1 ηˆℓt,a + η2ˆℓt,a

a [K] P[at = a]ˆℓt,a + η2

a [K] P[at = a]ˆℓ2 t,a

so since ln(1 + z) z for all z R we have:

ln(ψt,0) η X

a [K] P[at = a]ˆℓt,a + η2

a [K] P[at = a]ˆℓ2 t,a (26)

Noting that P[at = a]ˆℓt,a = Jat = a Kℓt,a for all a [K], we have:

a [K] P[at = a]ˆℓt,a

a [K] P[at = a]ˆℓ2 t,a

Jat = a Kℓ2 t,a P[at = a]

a [K] ℓ2 t,a K

Substituting these equations into Equation (26) (after taking expectations) gives us:

E[ln(ψt,0)] ηE[ℓt,at] + η2K/2

which, upon substitution into Equation (22) gives us:

E[ t t+1] η(E[ℓt,at] ℓt,y(xt)) η2K/2 (27)

Note that this equation implies that the same equation also holds when the expectation is not implicitly conditional on the state of the algorithm at the start of trial t. Hence, we now drop the assumption that the expectation is conditional on the state of the algorithm at the start of trial t. Summing Equation (27) over all trials t [T] and then rearranging gives us:

η (E[ 1] E[ T +1]) + ηKT

Now consider a trial t. For all v B let:

S 2X Jxt SKwt+1( (v), S) + X

S 2X Jxt SKwt+1( (v), S)

Now take any j [log(K) 1] {0} and let v := vt,j. Note that:

Vt(v) = βt( (v))θt( (v)) + βt( (v))θt( (v))

so that by definition of πt( (v)) and πt( (v)) we have:

Vt(v) = (θt( (v)) + θt( (v)))(πt( (v))βt( (v)) + πt( (v))βt( (v)))

Without loss of generality assume that (v) Pt. Then the above equation implies that:

Vt(v) = (θt( (v)) + θt( (v)))πt( (v))ψt,j+1 + πt( (v))

so by definition of ψt,j we have:

Vt(v) = (θt( (v)) + θt( (v))) = X

S 2X Jxt SKwt( (v), S) + X

S 2X Jxt SKwt( (v), S)

Note that this equation trivially holds for all v B \ Pt and hence holds for all v B . Since for all such v and all S with xt / S we have wt+1( (v), S) = wt( (v), S) and wt+1( (v), S) = wt( (v), S) we then have: X

S 2X wt+1( (v), S) + X

S 2X wt+1( (v), S) = X

S 2X wt( (v), S) + X

S 2X wt( (v), S)

so, by induction on t we have, for all t [T + 1], that: X

S 2X wt( (v), S) + X

S 2X wt( (v), S) = 1

Hence, for all v B \ r(B) and S 2X , we have wt(v, S) [0, 1]. We have now shown that T +1 0 so that Equation 28 gives us:

η E[ 1] + ηKT

which, by definition of 1, gives us the desired result.

E.5 Theorem B.2

The fact that the weighting w1 is valid is given by the following lemma:

Lemma E.1. For all v B we have: X

S 2X (w1( (v), S) + wt( (v), S)) = 1

Proof. We will show that for all v B we have:

S 2X w1(v, S) = 1

which directly implies the result. So take some arbitrary v B . Define, for all t [T], the sets:

X t := {xs | s [t]} \ {x1} and Ft := {0, 1}X t {x1}

and for all x X t and f Ft define the quantity:

β(x, f) := Jf(x) = f(n(x))K1/T + Jf(x) = f(n(x))K(1 1/T)

which is defined since n(x) X t {x1}. For all t [T 1] we have:

x X t+1 β(x, f) = X

x X t β(x, f)

f(xt+1) {0,1} β(xt+1, f) (29)

Given any f Ft we have: X

f(xt+1) {0,1} β(xt+1, f) = (1 1/T) + 1/T = 1

and hence by Equation (29) we have: X

x X t+1 β(x, f) = X

x X t β(x, f)

Since X T = X this implies, by induction, that: X

x X β(x, f) = X

x X 1 β(x, f) = X

x β(x, f) = X

f F1 1 = |F1| = 2 (30)

Note that we have a bijection G : FT 2X defined by:

G(f) := {x X | f(x) = 1} f FT

and that for all (f, x) FT X we have:

β(x, f) = σ(x, G(f))/T + (1 σ(x, G(f)))(1 1/T)

Hence, Equation (30) shows us that:

T + (1 σ(x, S)) 1 1

This implies that: X

S 2X w1(v, S) = 1

which implies the result.

Now that we have shown that the weighting w1 is valid we can utilise Theorem B.1 to prove our regret bound. For any set S 2X define:

x X σ(x, S)

Note that for all v B and S 2X we have:

w1(v, S) = 1

so since T ln(1 1/T) O(1) we have:

ln(w1(v, S)) ln(4) + ϕ(S) ln(T) T ln(1 1/T) O(ϕ(S) ln(T) + 1) (31)

As in the statement of Theorem B.1 define, for all v B, the set:

Q(y, v) := {x X | y(x) (v)}

First note that the graph (with vertex set X) formed by linking x to n(x) for every x X is a tree so that Φ(y) |{y(x) | x X}| 1. So since for all v B we have Q(y, v) = if and only if v has a descendent in {y(x) | x X} and each element of {y(x) | x X} has log(K) ancestors in B we have: X

v B JQ(y, v) = K log(K)|{y(x) | x X}| log(K)(Φ(y) + 1) (32)

Now suppose we have some x X . If y(x) = y(n(x)) then for all v B we have x, n(x) Q(y, v) or x, n(x) / Q(y, v) and hence σ(x, Q(y, v)) = 0. On the other hand, if y(x) = y(n(x)) then for any v B with σ(x, Q(y, v)) = 1 we must have that either x Q(y, v) or n(x) Q(y, v) so v is an ancestor of either x or n(x) and hence there can be at most 2 log(K) such v. So in any case we have: X

v B σ(x, Q(y, v)) Jy(x) = y(n(x))K2 log(K)

Hence we have: X

v B ϕ(Q(y, v)) = X

v B σ(x, Q(y, v)) 2 log(K)Φ(y) (33)

Equation (31) gives us:

v B JQ(y, v) = K ln(w1(v, Q(y, v))) O

v B ϕ(Q(y, v)) + X

v B JQ(y, v) = K

Substituting in equations (32) and (33) then gives us:

v B JQ(y, v) = K ln(w1(v, Q(y, v))) O(ln(K) ln(T)Φ(y))

so by Theorem B.1 we have:

E[R(y)] O ηKT

2 + ln(K) ln(T)Φ(y)

Since η = ρ p

ln(K) ln(T)/KT we obtain the result.

E.6 Theorem C.1

Recall our sequence of trees Zt | t [T] \ {1} noting that each of these trees is a contraction so that τi,i (Zt, ) is defined for all i, i {0, 1}. Let ϵ := 1/T. Define λ : X R+ as follows. Given x X, if there exists a leaf u J with γ(u) = x then λ (x) = λ(u). Otherwise λ (x) = 1. Given t [T] define ˆλt : Zt R+ such that for all u Zt we have that ˆλt(u) := λ (γ(u)) if u is a leaf of Zt and ˆλt(u) := 1 otherwise. For all t [T] and f : {xt | t [t]} {0, 1} define:

N(f) := {f {0, 1}Zt | u Z t , f (u) = f(γ(u))}

t [t]:f(xt )=1 λ (xt)

t [t]\{1} (Jf(xt) = f(n(xt))Kϵ + Jf(xt) = f(n(xt))K(1 ϵ))

and: ˆν(f) := X

u Zt\{r(Zt)} τf ( Zt(u)),f (u)(Zt, u) κf (u)(Zt, ˆλt, u)

We now have the following lemma: Lemma E.2. For all t [T] and f : {xt | t [t]} {0, 1} we have:

ˆw(f) = ˆν(f)

Proof. We prove by induction on t. Suppose the result holds for t = s (for some s 2). We now show that it holds for t = s + 1 as well. Let f be the restriction of f onto the set {xt | t [s]}. Let u and u be the unique leaves in Z s+1 of which γ(u ) = n(xs+1) and γ(u ) = xs+1. By the construction of Zs+1 these vertices are siblings. Let u be the parent (in Zs+1) of both u and u . First note that:

Jf(xs+1) = 0K + Jf(xs+1) = 1Kλ (xs+1) = κf(xs+1)(Zs+1, ˆλs+1, u ) (34)

Since, by the construction of Zs+1, we have γ( Zs+1(u )) = γ(u ) = n(xs+1) we also have that d( Zs+1(u )) = d(u ) 1 so that, since ϕ1 = ϵ, we have:

Jf(xs+1) = f(n(xs+1))Kϵ + Jf(xs+1) = f(n(xs+1))K(1 ϵ) = τf(n(xs+1)),f(xs+1)(Zs+1, u ) (35) Equations (34) and (35) give us:

ˆw(f) = ˆw(f )τf(n(xs+1)),f(xs+1)(Zs+1, u ) κf(xs+1)(J , ˆλs+1, u ) (36)

Now suppose we have some f N(f). We have γ(u ) = γ(u ) and hence d( Zs+1(u )) = d(u ) = d(u ) so since f (u ) = f(n(xs+1)) and ϕ0 = 0 we have:

τf ( Zs+1(u )),f (u )(Zs+1, u ) = τf (u ),f (u )(Zs+1, u ) = Jf (u ) = f(n(xs+1))K (37)

Since, by the construction of Zs+1, we have Zs+1(u ) = Zs(u ) and (as above) we have d(u ) = d(u ), we also have:

τf ( Zs+1(u )),f (u )(Zs+1, u ) = τf ( Zs(u )),f (u )(Zs, u ) (38)

Since f (u ) = f(xs+1) and Zs+1(u ) = u we have:

τf ( Zs+1(u )),f (u )(Zs+1, u ) = τf (u ),f(xs+1)(Zs+1, u ) (39)

ζ := τf(n(xs+1)),f(xs+1)(Zs+1, u ) ; ζ := τf ( Zs(u )),f(n(xs+1))(Zs, u )

Define: g(f ) := Y

u Zs\{r(Zs)} τf ( Zs(u)),f (u)(Zs, u)

and: g (f ) := Y

u Zs+1\{r(Zs+1)} τf ( Zs+1(u)),f (u)(Zs+1, u)

Combining equations (37), (38) and (39) gives us: Y

u {u ,u ,u } τf ( Zs+1(u)),f (u)(Zs+1, u) = Jf (u ) = f(n(xs+1))Kζ ζ (40)

For all u Zs+1 \ {u , u , u } we have Zs+1(u) = Zs(u) so that:

τf ( Zs+1(u)),f (u)(Zs+1, u) = τf ( Zs(u)),f (u)(Zs, u)

and hence, since f(n(xs+1)) = f (u ), we have:

g (f ) = g(f )

u {u ,u ,u } τf ( Zs+1(u)),f (u)(Zs+1, u)

Substituting in Equation (40) gives us:

g (f ) = g(f )Jf (u ) = f(n(xs+1))Kζ (41)

We have κf (u )(Zs+1, ˆλs+1, u ) = 1 and for all u Zs we have κf (u)(Zs+1, ˆλs+1, u) = κf (u)(Zs, ˆλs, u). Substituting into Equation (41) gives us:

u Zs+1 κf (u)(Zs+1, ˆλs+1, u)

= Jf (u ) = f(n(xs+1))K κf (u )(Zs+1, ˆλs+1, u )ζ g(f ) Y

u Zs κf (u)(Zs, ˆλs, u)

Summing over all f N(f) and noting that f (u ) = f(xs+1) and that:

κf (r(Zs+1))(Zs+1, ˆλs+1, r(Zs+1)) = 1 = κf (r(Zs))(Zs, ˆλs, r(Zs))

gives us: ˆν(f) = κf(xs+1)(Zs+1, ˆλs+1, u )ζ ˆν(f ) By the inductive hypothesis we then have:

ˆν(f) = κf(xs+1)(Zs+1, ˆλs+1, u )ζ ˆw(f )

which, by Equation (36), is equal to ˆw(f). We have hence shown that if the inductive hypothesis holds for t = s then it holds for t = s + 1 also. An identical argument shows that the inductive hypothesis holds for t = 2. We have hence shown that the inductive hypothesis holds for all t [T] \ {1}.

We now define a bijection G : {0, 1}X 2X by:

G(f) := {x X | f(x) = 1} f {0, 1}X

Note that for all f : X {0, 1} and all x X \ {x1} we have:

σ(x, G(f))ϵ + (1 σ(x, G(f)))(1 ϵ) = Jf(x) = f(n(x))Kϵ + Jf(x) = f(n(x))K(1 ϵ)

x G(f) λ (x) = Y

t [T ]:f(xt )=1 λ (xt)

so that: w(J , λ, G(f)) = ˆw(f) and hence, by Lemma E.2, we have:

w(J , λ, G(f)) = ˆν(f)

S 2X Jγ(ˆu) SK w(λ, ϵ, S) = X

f {0,1}X Jf(γ(ˆu)) = 1Kˆν(f) (42)

Since: [ {N(f) | f {0, 1}X , f(γ(ˆu)) = 1} = {f {0, 1}ZT | f (ˆu) = 1}

and all sets in this union are disjoint, the right hand side of Equation (42) is equal to: X

f {0,1}ZT Jf (ˆu) = 1K Y

u ZT \{r(ZT )} τf ( ZT (u)),f (u)(ZT , u) κf (u)(ZT , ˆλT , u) (43)

Given a vertex u ZT \ {r(ZT )} define:

H(u) := ZT (u) { ZT (u)}

and for all f : H(u) {0, 1} define:

ˆζ(u, f) := Y

u ZT (u) τf( ZT (u )),f(u )(ZT , u )

Lemma E.3. Given a vertex u ZT \ {r(ZT )} and an index i {0, 1} we have: X

f {0,1}H(u ) Jf( ZT (u )) = i Kˆζ(u , f) = 1

Proof. We prove by induction on the height of ZT (u ). If this height is equal to zero then H(u ) = {u , ZT (u )} and for all f : H(u) {0, 1} we have:

ˆζ(u , f) = τf( ZT (u )),f(u )(ZT , u )

Since: τi,0(ZT , u ) + τi,1(ZT , u ) = 1 (44) we immediately have the result for the case that the height of ZT (u ) is zero. Now suppose that the result holds whenever the height of ZT (u ) is equal to j (for some j N). We will now show that it holds whenever the height of ZT (u ) is equal to j + 1 which will prove that the result holds always. By the inductive hypothesis we have, for all i {0, 1} , that: X

f {0,1}H( (u )) Jf(u ) = i Kˆζ( (u ), f) = 1

f {0,1}H( (u )) Jf(u ) = i Kˆζ( (u ), f) = 1

f {0,1}H(u ) Jf( ZT (u )) = i KJf(u ) = i Kˆζ( (u ), f)ˆζ( (u ), f) = 1

and hence: X

f {0,1}H(u ) Jf( ZT (u )) = i KJf(u ) = i Kˆζ(u , f) = τi,i (ZT , u)

Summing over i {0, 1} and noting Equation (44) then shows us the result holds for this case and hence, by induction, holds always.

Given u , u ZT with u ZT (u ) we define ˆH(u , u ) to be the maximal subtree of ZT which has u and u as leaves. Given, in addition, f : ˆH(u , u ) {0, 1} we define:

ζ(u , u , f) := Y

u ˆ H(u ,u )\{u } τf( ZT (u)),f(u)(ZT , u)

and: δ(u , u ) := d(u ) d(u ) We now have the following lemma. Lemma E.4. Given u , u ZT with u Zt(u ) \ {u } and indices i , i {0, 1} we have that: X

f {0,1} ˆ H(u ,u ) Jf(u ) = i KJf(u ) = i K ζ(u , u , f)

is equal to Ji = i Kϕδ(u ,u ) + Ji = i K(1 ϕδ(u ,u ))

Proof. We prove by induction on the distance from u to u in ZT . If this distance is one then we have u = ZT (u ) and ˆH(u , u ) = {u , u } so we have: X

f {0,1} ˆ H(u ,u ) Jf(u ) = i KJf(u ) = i K ζ(u , u , f) = τi ,i (ZT , u )

which immediately implies that the inductive hypothesis holds in this case. Now suppose that the inductive hypothesis holds whenever the distance from u to u is j. We now consider the case

that the distance from u to u is j + 1. Let u be the child of u that lies in ˆH(u , u ). Without loss of generality assume that u is a descendant of (u ). Now choose any i {0, 1}. Given f : ˆH(u , u ) {0, 1} let:

h(i , f) = Jf(u ) = i KJf(u ) = i KJf(u ) = i K

and let f and f be the restriction of f onto the sets ˆH(u , u ) and H( (u )) respectively. Note that

ζ(u , u , f) = τf(u ),f(u )(ZT , u ) ζ(u , u , f )ˆζ( (u ), f )

By Lemma E.3 and the inductive hypothesis we then have that the quantity: X

f {0,1} ˆ H(u ,u ) h(i , f) ζ(u , u , f)

is equal to the quantity:

τi ,i (ZT , u )(Ji = i Kϕδ(u ,u ) + Ji = i K(1 ϕδ(u ,u )))

Summing over i {0, 1} gives us the result. We have hence proved the result in general.

Suppose we have some f : J {0, 1}. Let:

ˆF(f) := {f {0, 1}ZT | u J , f (u) = f(u)}

Given u J we have that:

Jf( J (u)) = f(u)Kϕδ( J (u),u) + Jf( J (u)) = f(u)K(1 ϕδ( J (u),u))

is equal to τf( J (u)),f(u)(J , u) and hence Lemma E.4 implies that: X

f ˆ H( J (u),u) Jf ( J (u)) = f( J (u))KJf (u) = f(u)K ζ( J (u), u, f ) = τf( J (u)),f(u)(J , u)

so since, by the definition of a contraction, the edge sets of the subtrees in { ˆH( J (u), u) | u J \ {r(J )}} partition the edge set of ZT we have, by definition of ζ, that: X

u ZT \{r(ZT )} τf ( ZT (u)),f (u)(ZT , u) = Y

u J \{r(J )} τf( J (u)),f(u)(J , u)

Since for all f ˆF(f) and for all u ZT \ J we have κf (u)(ZT , ˆλT , u) = 1 we have now shown that the quantity: X

u ZT \{r(ZT )} τf ( ZT (u)),f (u)(ZT , u) κf (u)(ZT , ˆλT , u)

is equal to the quantity: Y

u J \{r(J )} τf( J (u)),f(u)(J , u) κf(u)(ZT , ˆλT , u)

Summing over all f F(J , ˆu) and noting Equations (42) and (43) gives us the result.

E.7 Theorem D.1

Lemma E.5. Given u, u Zt the algorithm for computing ν(u, u ) is correct.

Proof. If u = u then the proof is trivial. Otherwise we consider the following cases:

Consider first the case that ˆs = (s ). Without loss of generality assume ˆs = (s ). Then we have u ( (ξ(s ))) and since ˆs = (s ) we have u / ( (ξ(s ))). Hence u / (u) so ν(u, u ) = as required.

If u = ξ(s ) then ˆs = (s ) so either ˆs = (s ) or ˆs = (s ). In the former case we have u ( (ξ(s ))) = ( (u)) so that ν(u, u ) = and similarly in the later case we have ν(u, u ) = as required.

If ˆs = (s ) and u = ξ(s ) then we invoke the process. Consider the vertex s at any stage in the process. By induction we have that if s E then u (µ (s)). This is because if s E then µ (s) is an ancestor of µ ( E(s)). This further implies that when s = ˆs we have u (µ ( E(s))). Now suppose that s E and without loss of generality assume s = ( E(s)). Then u ( (ξ( E(s)))) and µ ( E(s)) ( (ξ( E(s)))) so that, since u (µ ( E(s))), we have u / (u) and hence ν(u, u ) = as required. Suppose now that s E and that u = ξ(s). If (s) E then we have µ (s) ( (ξ(s))) = ( (u)) so that, by above, u ( (u)) and hence ν(u, u ) = as required. Similarly, if (s) E then ν(u, u ) = as required. This completes the proof.

Lemma E.6. The algorithm correctly finds ˆu.

Proof. By induction on the depth of s we have, for all vertices s in the constructed path, that:

If s D then ut lies in the maximal subtree of Zt containing µ(s) and having J (µ(s)) as a leaf .

If s D then ut lies in the maximal subtree of Zt with J (µ(s)) and µ (s) as leaves.

Let s be the unique leaf of D that is on the constructed path. If s D then µ(s ) is a leaf of J and hence also a leaf of Zt. So by above we have that ut lies in the maximal subtree of Zt with J (µ(s )) and µ(s ) as leaves. If, on the other hand, s D then since s is a leaf of D we have that µ(s ) = µ (s ) and hence, by above, we have that ut lies in the maximal subtree of Zt with J (µ(s )) and µ(s ) as leaves. In either case we have ˆu = µ(s ) as required.

Lemma E.7. The algorithm correctly finds u .

Proof. By induction on the depth of s we have, for all vertices s in the constructed path, that:

If s E then ΓZt(ut, ˆu) lies in Zt(µ(s)).

If s E then ΓZt(ut, ˆu) lies in the maximal subtree of Zt with µ(s) and µ (s) as leaves.

Let s be the unique leaf of E that is on the constructed path. If s E then µ(s ) is a leaf of Zt and hence, by above, ΓZt(ut, ˆu) = µ(s ) as required. If s E then µ(s) = µ (s) and hence, by above, ΓZt(ut, ˆu) = µ(s ) as required.