Efficient Online Clustering with Moving Costs

Dimitris Christou (UT Austin, christou@cs.utexas.edu), Stratis Skoulakis (LIONS, EPFL, efstratios.skoulakis@epfl.ch), Volkan Cevher (LIONS, EPFL, volkan.cevher@epfl.ch)

In this work we consider an online learning problem, called Online k-Clustering with Moving Costs, in which a learner maintains a set of k facilities over T rounds so as to minimize the connection cost of an adversarially selected sequence of clients. The learner is informed of the positions of the clients at each round t only after its facility selection and can use this information to update its decision in the next round. However, updating the facility positions comes with an additional moving cost based on the moving distance of the facilities. We present the first O(log n)-regret polynomial-time online learning algorithm, guaranteeing that the time-averaged overall cost (connection + moving) is at most O(log n) times the time-averaged connection cost of the best fixed solution. Our work improves on the recent result of Fotakis et al. [31] establishing O(k)-regret guarantees only on the connection cost.

1 Introduction

Due to their various applications in diverse fields (e.g., machine learning, operations research, data science), clustering problems have been extensively studied. In the well-studied k-median problem, given a set of clients, k facilities must be placed on a metric space with the objective of minimizing the sum of the distances of the clients to their closest facility [55, 14, 13, 67, 6, 44, 52, 65, 51, 15, 54, 3].

In many modern applications (e.g., epidemiology, social media, conferences), the positions of the clients are not static but rather evolve over time [57, 56, 64, 59, 23, 5]. For example, the geographic distribution of the clients of an online store or the distribution of Covid-19 cases may drastically change from year to year or, respectively, from day to day [31]. In such settings it is desirable to update/change the positions of the facilities (e.g., the positions of warehouses or Covid-19 test units) so as to better serve the time-evolving trajectory of the clients.

The clients' positions may change in complex and unpredictable ways, and thus a priori knowledge of their trajectory is not always available. Motivated by this, a recent line of research studies clustering problems under the online learning framework, assuming that the sequence of client positions is unknown and adversarially selected [18, 28, 16, 31]. More precisely, a learner must place k facilities at each round t ≥ 1 without knowing the positions of the clients at round t, which are revealed to the learner only after its facility selection. The learner can use this information to update its decision in the next round; however, moving a facility comes with an additional moving cost that should be taken into account in the learner's updating decision, e.g., moving Covid-19 test units comes with a cost [18, 28]. Building on this line of work, we consider the following online learning problem:

Problem 1 (Online k-Clustering with Moving Costs). Let G(V, E, w) be a weighted graph with |V| = n vertices, and let k be the number of facilities. At each round t = 1, . . . , T:
1. The learner selects F_t ⊆ V, with |F_t| = k, at which facilities are placed.
2. The adversary selects the clients' positions R_t ⊆ V.
3. The learner learns the clients' positions R_t and suffers
\[
\underbrace{\sum_{j \in R_t} \min_{i \in F_t} d_G(j, i)}_{\text{connection cost of the clients}} \;+\; \underbrace{\gamma \cdot M_G(F_{t-1}, F_t)}_{\text{moving cost of the facilities}}
\]
where d_G(j, i) is the distance between vertices i, j ∈ V; M_G(F_{t−1}, F_t) is the minimum overall distance required to move the k facilities from F_{t−1} to F_t; and γ ≥ 0 is the facility weight.

An online learning algorithm for Problem 1 tries to minimize the overall (connection + moving) cost by placing k facilities at each round t ≥ 1 based only on the previous positions of the clients R_1, . . . , R_{t−1}. To the best of our knowledge, Problem 1 was first introduced in [18].² If, for any sequence of clients, the overall cost of the algorithm is at most α times the overall connection cost of the optimal fixed placement of facilities F*, then the algorithm is called α-regret, while in the special case of α = 1 the algorithm is additionally called no-regret.

Problem 1 is a special case of the well-studied Metrical Task System, obtained by considering each of the $\binom{n}{k}$ possible facility placements as a different state. In their seminal work, [11] guarantee that the famous Multiplicative Weights Update algorithm (MWU) achieves (1 + ϵ)-regret in Problem 1 for any ϵ > 0. Unfortunately, running the MWU algorithm for Problem 1 is not really an option, since it requires O(n^k) time and space complexity. As a result, the following question naturally arises:

Q. Can we achieve α-regret for Problem 1 with polynomial-time online learning algorithms?

Answering the above question is a challenging task. Even in the very simple scenario of time-invariant clients, i.e., R_t = R for all t ≥ 1, an α-regret online learning algorithm must essentially compute an α-approximate solution of the k-median problem. Unfortunately, the k-median problem cannot be approximated within ratio α < 1 + 2/e ≈ 1.736 (unless NP ⊆ DTIME[n^{log log n}] [43]), which excludes the existence of a (1 + 2/e)-regret polynomial-time online learning algorithm for Problem 1. Although many O(1)-approximation algorithms have been proposed for the k-median problem (the best current ratio is 1 + √3 [54]), these algorithms crucially rely on (offline) knowledge of the whole sequence of clients and, most importantly, are not designed to handle the moving cost of the facilities [55, 14, 13, 67, 6, 44, 52, 65, 51, 15, 54, 3].

In their recent work, Fotakis et al. [31] propose an O(k)-regret polynomial-time online learning algorithm for Problem 1 without moving costs (i.e., the special case of γ = 0). Their approach is based on designing a no-regret polynomial-time algorithm for a fractional relaxation of Problem 1 and then using an online client-oblivious rounding scheme to convert a fractional solution into an integral one. Their analysis relies on the fact that the connection cost of any possible client is at most O(k) times its fractional connection cost. However, in order to establish the latter guarantee, their rounding scheme performs abrupt changes on the facilities, leading to a huge moving cost.

Our Contribution and Techniques. In this work, we provide a positive answer to question (Q) by designing the first polynomial-time online learning algorithm for Online k-Clustering with Moving Costs that achieves O(log n)-regret for any γ ≥ 0. The cornerstone idea of our work was to realize that O(1)-regret can be established with a polynomial-time online learning algorithm in the special case of G being a Hierarchical Separation Tree (HST).
Then, by using the standard metric embedding result of [25], we can easily convert such an algorithm into an O(log n)-regret algorithm for general graphs. Our approach for HSTs consists of two main technical steps:

1. We introduce a fractional relaxation of Problem 1 for HSTs. We then consider a specific regularizer on the fractional facility placements, called the Dilated Entropic Regularizer [26], that takes into account the specific structure of the HST. Our first technical contribution is to establish that the well-known Follow the Regularized Leader algorithm [35] with dilated entropic regularization admits O(1)-regret for any γ ≥ 0.

2. Our second technical contribution is the design of a novel online client-oblivious rounding scheme, called Cut&Round, that converts a fractional solution for HSTs into an integral one. By exploiting the specific HST structure we establish that Cut&Round, despite not knowing the clients' positions R_t, simultaneously guarantees that (i) the expected connection cost of each client j ∈ R_t is upper bounded by its fractional connection cost, and (ii) the expected moving cost of the facilities is at most O(1) times the fractional moving cost.

² In [18], an easier version of Problem 1 with 1-lookahead is considered, meaning that the learner learns the positions of the clients R_t before selecting F_t. Moreover, G is considered to be the line graph and γ = 1.

Experimental Evaluation. In Section F of the Appendix we experimentally compare our algorithm with the algorithm of Fotakis et al. [31]. Our experiments verify that our algorithm is robust to increases of the facility weight γ, while the algorithm of [31] presents a significant cost increase. We additionally evaluate our algorithm on the MNIST and CIFAR-10 datasets. Our experimental evaluations suggest that the O(log n)-regret bound is a pessimistic upper bound and that in practice our algorithm performs significantly better. Finally, we evaluate our algorithm both in the random-arrival case (where the requested vertices are drawn uniformly at random from the graph) as well as in adversarial settings, where the request sequences are constructed through some arbitrary deterministic process.

Related Work. As already mentioned, our work most closely relates to the work of Fotakis et al. [31], which provides an O(k)-regret algorithm running in polynomial time for γ = 0. [16] also consider Problem 1 for γ = 0, with the difference that the connection cost of the clients is captured through the k-means objective, i.e., the sum of the squared distances. They provide a (1 + ϵ)-regret algorithm with O((k^2/ϵ^2)^{2k}) time complexity, which is still exponential in k. [18, 28] study the special case of Problem 1 in which G is the line graph and γ = 1, while assuming 1-lookahead on the request R_t. For k = 1, [18] provide a (1 + ϵ)-competitive online algorithm, meaning that its cost is at most (1 + ϵ) times the cost of the optimal dynamic solution, which directly implies (1 + ϵ)-regret. [28] extended the previous result by providing a 63-competitive algorithm for k = 2 on line graphs. Our work also relates to the works of [23] and [4] that study offline approximation algorithms for clustering problems with time-evolving metrics. Finally, our work is closely related to the line of research on online learning in combinatorial domains and to other settings of online clustering. Due to space limitations, we continue this discussion in Section A of the Appendix.
2 Preliminaries and Our Results

Let G(V, E, w) be a weighted undirected graph, where V denotes the set of vertices and E the set of edges among them. The weight w_e of an edge e = (i, j) ∈ E denotes the cost of traversing e. Without loss of generality, we assume that w_e ∈ ℕ and w_e ≥ 1 for all edges e ∈ E. The distance between vertices i, j ∈ V is denoted by d_G(i, j) and equals the cost of the minimum-cost path from i to j. We use n := |V| to denote the number of vertices of G and D_G := max_{i,j ∈ V} d_G(i, j) to denote its diameter. Given a placement of facilities F ⊆ V with |F| = k, a client placed at vertex j ∈ V connects to the closest open facility i ∈ F. This is formally captured in Definition 1.

Definition 1. The connection cost of a set of clients R ⊆ V under the facility placement F ⊆ V with |F| = k equals
\[
C_R(F) := \sum_{j \in R} \min_{i \in F} d_G(j, i).
\]

Next, consider any pair of facility placements F, F' ⊆ V such that |F| = |F'| = k. The moving distance between F and F' is the minimum overall distance needed to transfer the k facilities from F to F', formally defined in Definition 2.

Definition 2. Fix any facility placements F, F' ⊆ V where |F| = |F'| = k. Let Σ be the set of all possible matchings from F to F', i.e., each σ ∈ Σ is a one-to-one mapping σ : F → F' with σ(i) ∈ F' denoting the mapping of facility i ∈ F. The moving cost between F and F' equals
\[
M_G(F, F') := \min_{\sigma \in \Sigma} \sum_{i \in F} d_G(i, \sigma(i)).
\]

At each round t ≥ 1, an online learning algorithm A for Problem 1 takes as input all the previous positions of the clients R_1, . . . , R_{t−1} ⊆ V and outputs a facility placement F_t := A(R_1, . . . , R_{t−1}) such that F_t ⊆ V and |F_t| = k. The performance of an online learning algorithm is measured by the notion of regret, which we formally introduce in Definition 3.

Definition 3. An online learning algorithm A for Problem 1 is called α-regret with additive regret β if and only if for any sequence of clients R_1, . . . , R_T ⊆ V,
\[
\sum_{t=1}^{T} C_{R_t}(F_t) + \gamma \sum_{t=2}^{T} M_G(F_{t-1}, F_t) \;\le\; \alpha \cdot \min_{|F^*| = k} \sum_{t=1}^{T} C_{R_t}(F^*) + \beta,
\]
where F_t = A(R_1, . . . , R_{t−1}) and α, β are constants independent of T.

An online learning algorithm A selects the positions of the k facilities at each round t ≥ 1 solely based on the positions of the clients in the previous rounds, R_1, . . . , R_{t−1}. If A is α-regret, then Definition 3 implies that its time-averaged overall cost (connection + moving) is at most α times the time-averaged cost of the optimal static solution!³ Furthermore, the dependence on T is known to be optimal [11], and β is typically only required to be polynomially bounded by the size of the input, as for T → ∞ the corresponding term in the time-averaged cost vanishes.

³ Specifically, the time-averaged overall cost of A approaches this upper bound at rate β · T^{-1/2}.
⁴ Polynomial time with respect to the input parameters, namely T, n and log D_G.

As already mentioned, the seminal work of [11] implies the existence of a (1 + ϵ)-regret algorithm for Problem 1; however, this algorithm requires O(n^k) time and space complexity. Prior to this work, the only polynomial-time⁴ online learning algorithm for Problem 1 was due to Fotakis et al. [31], for the special case of γ = 0. Specifically, in their work the authors design an online learning algorithm with the following guarantee:

Theorem (Fotakis et al. [31]). There exists a randomized online learning algorithm for Problem 1 that runs in polynomial time (w.r.t. T, n and log D_G) such that
\[
\sum_{t=1}^{T} \mathbb{E}\left[C_{R_t}(F_t)\right] \;\le\; O(k) \cdot \min_{|F^*| = k} \sum_{t=1}^{T} C_{R_t}(F^*) + O\!\left(k\, n \sqrt{T}\right).
\]

Clearly, the algorithm of [31] has not been designed to account for the cost of moving the facilities, as indicated by the absence of the moving cost in the above regret guarantee. The main contribution of this work is to obtain (for the first time) regret guarantees that also account for the moving cost.
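To make Definitions 1 and 2 concrete, the following minimal sketch computes the connection cost C_R(F) and the moving cost M_G(F, F') on a toy weighted graph using SciPy; the graph, the client set and the two facility placements are hypothetical examples chosen only for illustration.

```python
# A minimal illustration of Definitions 1 and 2 on a hypothetical weighted graph.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.optimize import linear_sum_assignment

# Adjacency matrix of a small weighted graph G (0 = no edge).
W = np.array([
    [0, 1, 0, 0, 3],
    [1, 0, 2, 0, 0],
    [0, 2, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [3, 0, 0, 1, 0],
], dtype=float)
d_G = shortest_path(W, directed=False)          # all-pairs distances d_G(i, j)

def connection_cost(R, F):
    """C_R(F): each client connects to its closest open facility (Definition 1)."""
    return sum(min(d_G[j, i] for i in F) for j in R)

def moving_cost(F_old, F_new):
    """M_G(F, F'): minimum-cost matching between the two placements (Definition 2)."""
    cost = d_G[np.ix_(list(F_old), list(F_new))]
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

R_t = [0, 2, 4]                   # clients observed at round t
F_prev, F_curr = {1, 3}, {0, 3}   # k = 2 facilities before / after the update
print(connection_cost(R_t, F_curr), moving_cost(F_prev, F_curr))
```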
Theorem 1. There exists a randomized online learning algorithm for Problem 1 (Algorithm 2) that runs in polynomial time (w.r.t. T, n and log D_G) and admits the following regret guarantee:
\[
\sum_{t=1}^{T} \mathbb{E}\left[C_{R_t}(F_t)\right] + \gamma \sum_{t=2}^{T} \mathbb{E}\left[M_G(F_{t-1}, F_t)\right] \;\le\; O(\log n) \cdot \min_{|F^*| = k} \sum_{t=1}^{T} C_{R_t}(F^*) + \beta
\]
for β = O(k · n^{3/2} · D_G · max(γ, 1)) and any γ ≥ 0.

Remark 1. We remark that while our additive regret β is larger than the corresponding term in [31] by a factor of o(√n), our results apply to any γ ≥ 0, while the algorithm of [31] can in general suffer unbounded moving cost as γ → ∞, as our experimental results verify.

2.1 HSTs and Metric Embeddings

In this section we provide a preliminary introduction to Hierarchical Separation Trees (HSTs), as they constitute a key technical tool towards proving Theorem 1. A weighted tree T(V, E, w) is a weighted graph with no cycles. Equivalently, for any pair of vertices i, j ∈ V there exists a unique path that connects them. In Definition 4, we establish some basic notation for tree graphs.

Definition 4. Fix any tree T(V, E, w). For every vertex u ∈ V, cld(u) ⊆ V denotes the set of children of u and p(u) denotes its unique parent, i.e., u ∈ cld(p(u)). The root r ∈ V of T is the unique node with p(r) = ∅, and the set L(T) := {u ∈ V : cld(u) = ∅} denotes the leaves of T. We use dpt(u) to denote the depth of a vertex u ∈ V, i.e., the length of the (unique) path from the root r to u, and h(T) := max_{u ∈ L(T)} dpt(u) to denote the height of T. We use lev(u) := h(T) − dpt(u) to denote the level of a vertex u ∈ V. Finally, T(u) ⊆ V denotes the set of vertices of the sub-tree rooted at u, i.e., the set of vertices that are descendants of u.

Next, we proceed to define a family of well-structured tree graphs that constitute one of the primary technical tools used in our analysis.

Definition 5. A Hierarchical Separation Tree (HST) is a weighted tree T(V, E, w) such that (i) for any node u and any of its children v ∈ cld(u), the edge e = (u, v) admits weight w_e = 2^{lev(v)}, and (ii) the tree is balanced, namely lev(u) = 0 for all leaves u ∈ L(T).

In their seminal works, [10] and later [24] showed that HSTs can approximately preserve the distances of any graph G(V, E, w) within a logarithmic level of distortion.

Theorem 2. For any graph G(V, E, w) with |V| = n and diameter D, there exists a polynomial-time randomized algorithm that, given as input G, produces an HST T with height h(T) ≤ ⌈log D⌉ such that:
1. L(T) = V, meaning that the leaves of T correspond to the vertices of G.
2. For any u, v ∈ V, d_G(u, v) ≤ d_T(u, v) and E[d_T(u, v)] ≤ O(log n) · d_G(u, v).

Theorem 2 states that any weighted graph G(V, E, w) can be embedded into an HST T with O(log n)-distortion. This means that the distance d_G(u, v) between any pair of vertices u, v ∈ V can be approximated by their respective distance d_T(u, v) in T within an (expected) factor of O(log n).

Remark 2. We note that, traditionally, HSTs are neither balanced nor required to have weights that are specifically powers of 2. However, we can transform any general HST into our specific definition, and this has been accounted for in the statement of the above theorem. The details are deferred to Section B of the Appendix.
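The following minimal sketch illustrates Definition 5 on a small hypothetical HST: the distance between two leaves is obtained by summing the 2^{lev} edge weights on their paths to the lowest common ancestor. The tree below is an arbitrary toy example, not an embedding produced by Theorem 2.

```python
# A sketch of Definition 5 on a hypothetical 3-level binary HST: the edge from a vertex v
# to its parent has weight 2^lev(v), all leaves sit at level 0, and the distance between
# two leaves is the total edge weight on their paths to the lowest common ancestor.
parent = {"r": None, "a": "r", "b": "r", "u1": "a", "u2": "a", "u3": "b", "u4": "b"}
level  = {"r": 2, "a": 1, "b": 1, "u1": 0, "u2": 0, "u3": 0, "u4": 0}

def d_T(u, v):
    """Tree distance: repeatedly lift the deeper endpoint, accumulating 2^lev edge weights."""
    dist = 0
    while u != v:
        if level[u] <= level[v]:
            dist += 2 ** level[u]
            u = parent[u]
        else:
            dist += 2 ** level[v]
            v = parent[v]
    return dist

# Siblings are at distance 2 * (2^1 - 1) = 2; leaves in different halves are at
# distance 2 * (2^2 - 1) = 6, which is also the diameter D_T of this HST.
print(d_T("u1", "u2"), d_T("u1", "u4"))
```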
3 Overview of our approach

In this section we present the key steps of our approach towards designing the O(log n)-regret online learning algorithm for Problem 1. Our approach can be summarized in the following three pillars:

1. In Section 3.1 we introduce a fractional relaxation of Problem 1 in the special case of HSTs (Problem 2). Problem 2 is an artificial problem in which the learner can place fractional amounts of facility on the leaves of an HST so as to fractionally serve the arriving clients. Since the optimal static solution of Problem 2 lower bounds the optimal static solution of Problem 1 in the special case of HSTs, the first step of our approach is to design an O(1)-regret algorithm for Problem 2.

2. In Section 3.2 we present the formal guarantees of a novel randomized rounding scheme, called Cut&Round, that is client-oblivious and converts any fractional solution of Problem 2 into an actual placement of k facilities on the leaves of the HST with only an O(1) overhead in the connection and the moving cost.

3. In Section 3.3 we present how the fractional algorithm for Problem 2, together with the Cut&Round rounding, naturally leads to an O(1)-regret online learning algorithm for Problem 1 in the special case of HSTs (Algorithm 1). Our main algorithm, presented in Algorithm 2, then consists of running Algorithm 1 on an O(log n)-distortion HST embedding of the input graph.

3.1 A Fractional Relaxation for HSTs

In this section we introduce a fractional relaxation of Problem 1, called Fractional k-Clustering with Moving Costs on HSTs (Problem 2). Fix any HST T(V, E, w) (in this section, V denotes the nodes of the HST). We begin by presenting a fractional extension of placing k facilities on the leaves of T.

Definition 6. The set of fractional facility placements FP(T) consists of all vectors y ∈ ℝ^{|V|} such that:
1. y_v ∈ [0, 1] for all leaves v ∈ L(T);
2. y_v = Σ_{u ∈ cld(v)} y_u for all non-leaves v ∉ L(T);
3. Σ_{v ∈ L(T)} y_v = k, i.e., the total amount of facility on the leaves equals k.

For a leaf vertex v ∈ L(T), y_v simply denotes the fractional amount of facility placed on it. For a non-leaf vertex v ∉ L(T), y_v denotes the total amount of facility placed on the leaves of the sub-tree T(v). Thus, any integral vector y ∈ FP(T) ∩ ℕ^{|V|} corresponds to a placement of k facilities on the leaves of T. In Definitions 7 and 8 we extend the notions of connection and moving cost to fractional facility placements. In the special case of integral facility placements, Definitions 7 and 8 respectively collapse to Definitions 1 and 2 (a formal proof is given in Claims 1 and 2 of Section C of the Appendix).

Definition 7. The fractional connection cost of a set of clients R ⊆ L(T) under y ∈ FP(T) is defined as
\[
f_R(y) := \sum_{j \in R} \sum_{v \in P(j, r)} 2^{\mathrm{lev}(v)+1} \cdot \max\left(0,\, 1 - y_v\right),
\]
where P(j, r) denotes the set of vertices on the (unique) path from the leaf j ∈ L(T) to the root r.

Definition 8. The fractional moving cost between any y, y' ∈ FP(T) is defined as
\[
\|y - y'\|_{T} := \gamma \cdot \sum_{v \in V(T)} 2^{\mathrm{lev}(v)} \cdot |y_v - y'_v|.
\]

We are now ready to present our fractional generalization of Problem 1 in the special case of HSTs.

Problem 2 (Fractional k-Clustering with Moving Costs on HSTs). Fix any HST T. At each round t = 1, . . . , T:
1. The learner selects a vector y_t ∈ FP(T).
2. The adversary selects a set of clients R_t ⊆ L(T).
3. The learner suffers cost f_{R_t}(y_t) + ||y_t − y_{t−1}||_T.

In Section 4, we develop and present an O(1)-regret algorithm for Problem 2 (Algorithm 3). Below, we state its formal regret guarantee, established in Theorem 3.
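As a concrete illustration of Definitions 7 and 8, the sketch below evaluates the fractional connection cost f_R(y) and the fractional moving cost ||y − y'||_T on the same toy HST used earlier; the fractional placements are hypothetical members of FP(T).

```python
# Minimal sketch of Definitions 7 and 8 on a toy 3-level HST (leaves u1..u4, inner
# nodes a, b, root r). The fractional placements below are hypothetical members of FP(T).
GAMMA = 1.0   # facility weight γ

parent = {"r": None, "a": "r", "b": "r", "u1": "a", "u2": "a", "u3": "b", "u4": "b"}
level  = {"r": 2, "a": 1, "b": 1, "u1": 0, "u2": 0, "u3": 0, "u4": 0}

def path_to_root(j):
    """Vertices on P(j, r), from the leaf j up to (and including) the root r."""
    path = [j]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def frac_connection_cost(R, y):
    """f_R(y) = sum_{j in R} sum_{v in P(j, r)} 2^(lev(v)+1) * max(0, 1 - y_v)."""
    return sum(2 ** (level[v] + 1) * max(0.0, 1.0 - y[v])
               for j in R for v in path_to_root(j))

def frac_moving_cost(y, y_prev):
    """||y - y'||_T = γ * sum_v 2^lev(v) * |y_v - y'_v|."""
    return GAMMA * sum(2 ** level[v] * abs(y[v] - y_prev[v]) for v in y)

# k = 2 facilities: y spreads them evenly over the leaves, y_prev concentrates them.
# Internal coordinates are the sums of their children, as required by FP(T).
y      = {"u1": 0.5, "u2": 0.5, "u3": 0.5, "u4": 0.5, "a": 1.0, "b": 1.0, "r": 2.0}
y_prev = {"u1": 1.0, "u2": 0.0, "u3": 1.0, "u4": 0.0, "a": 1.0, "b": 1.0, "r": 2.0}
print(frac_connection_cost(["u1", "u3"], y), frac_moving_cost(y, y_prev))
```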
Theorem 3. There exists a polynomial-time online learning algorithm for Problem 2 (Algorithm 3) such that, for any sequence R_1, . . . , R_T ⊆ L(T), its output y_1, . . . , y_T satisfies
\[
\sum_{t=1}^{T} f_{R_t}(y_t) + \sum_{t=2}^{T} \|y_t - y_{t-1}\|_{T} \;\le\; \frac{3}{2} \cdot \min_{y^* \in \mathrm{FP}(T)} \sum_{t=1}^{T} f_{R_t}(y^*) + \beta
\]
for β = O(k · |L(T)|^{3/2} · D_T · max(γ, 1)).

3.2 From Fractional to Integral Placements in HSTs

As already mentioned, the basic idea of our approach is to convert, at each round t ≥ 1, the fractional placement y_t ∈ FP(T) produced by Algorithm 3 into an integral facility placement F_t ⊆ L(T) with |F_t| = k on the leaves of the HST. In order to guarantee small regret, our rounding scheme should preserve both the connection and the moving cost of the fractional solution within constant factors for any possible set of arriving clients. In order to guarantee the latter, our rounding scheme Cut&Round (Algorithm 4) uses shared randomness across different rounds. Cut&Round is rather involved and is presented in Section 5. Below, we state its formal guarantee.

Theorem 4. There exists a linear-time deterministic algorithm, called Cut&Round (Algorithm 4), that takes as input an HST T, a fractional facility placement y ∈ FP(T) and a vector α ∈ [0, 1]^{|V|}, and outputs a placement of k facilities F ← Cut&Round(T, y, α) on the leaves of T (F ⊆ L(T) and |F| = k) such that:
1. E_{α ∼ Unif(0,1)} [C_R(F)] = f_R(y) for all client requests R ⊆ L(T).
2. E_{α ∼ Unif(0,1)} [γ · M_T(F, F')] ≤ 4 · ||y − y'||_T for all other fractional facility placements y' ∈ FP(T) and F' ← Cut&Round(T, y', α).

Item 1 of Theorem 4 establishes that, although Cut&Round is oblivious to the arriving set of clients R_t ⊆ L(T), the expected connection cost of its output equals the fractional connection cost under y_t ∈ FP(T). Item 2 of Theorem 4 states that, once the same random seed α is used in two consecutive time steps, the expected moving cost between the facility placements F_t and F_{t+1} is at most O(1) times the fractional moving cost between y_t and y_{t+1}. Both properties crucially rely on the structure of the HST and constitute one of the main technical contributions of our work.

3.3 Overall Online Learning Algorithm

We are now ready to formally introduce our main algorithm (Algorithm 2) and prove Theorem 1. First, we combine the algorithms from Theorems 3 and 4 to design an O(1)-regret algorithm for Problem 1 on HSTs (Algorithm 1). We then present how Algorithm 1 can be converted into an O(log n)-regret online learning algorithm for general graphs, using the metric embedding technique of Theorem 2, resulting in our final algorithm (Algorithm 2).

Algorithm 1: O(1)-regret for HSTs.
1: Input: A sequence R_1, . . . , R_T ⊆ L(T).
2: The learner samples α_v ∼ Unif(0, 1) for all v ∈ V(T).
3: for each round t = 1 to T do
4:   The learner places the k facilities on the leaves of the HST T based on the output F_t := Cut&Round(T, y_t, α).
5:   The learner learns R_t ⊆ L(T).
6:   The learner updates y_{t+1} ∈ FP(T) by running Algorithm 3 for Problem 2 with input R_1, . . . , R_t.
7: end for

Algorithm 2: O(log n)-regret for graphs.
1: Input: A sequence R_1, . . . , R_T ⊆ V.
2: The learner embeds G(V, E, w) into a (random) HST T with L(T) = V via the procedure of Theorem 2.
3: for each round t = 1 to T do
4:   The learner selects a facility placement F_t ⊆ V.
5:   The learner learns R_t ⊆ V.
6:   The learner updates F_{t+1} by giving R_1, . . . , R_t ⊆ L(T) as input to Algorithm 1 for T.
7: end for
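Before stating the regret guarantee of Algorithm 1, the following self-contained toy example simulates the online protocol that Algorithms 1 and 2 operate in, together with the cost bookkeeping of Definition 3, on a path graph with unit edge weights (so d_G(i, j) = |i − j| and the optimal matching between two placements pairs them in sorted order). The naive "most frequent past vertices" learner is a hypothetical stand-in used only to illustrate the protocol; it is not Algorithm 1 or Algorithm 2 and carries no regret guarantee.

```python
# Toy simulation of the online protocol of Problem 1 on a path graph with unit edge
# weights. The "most frequent past vertices" learner is a hypothetical baseline that only
# illustrates the cost bookkeeping of Definition 3; it is NOT the paper's algorithm.
from collections import Counter
from itertools import combinations

n, k, gamma = 8, 2, 1.0
requests = [[0, 1], [6, 7], [0, 7], [1, 6], [0, 1]]      # adversarial client sequence (toy)

def conn(R, F):            # Definition 1 on the path metric d_G(i, j) = |i - j|
    return sum(min(abs(j - i) for i in F) for j in R)

def move(F, F_new):        # Definition 2: on a line, matching in sorted order is optimal
    return sum(abs(a - b) for a, b in zip(sorted(F), sorted(F_new)))

counts, F_prev, total = Counter(), None, 0.0
for R_t in requests:
    F_t = [v for v, _ in counts.most_common(k)] or list(range(k))   # chosen BEFORE seeing R_t
    total += conn(R_t, F_t) + (gamma * move(F_prev, F_t) if F_prev else 0.0)
    counts.update(R_t)                                              # R_t revealed afterwards
    F_prev = F_t

best_static = min(sum(conn(R, F) for R in requests) for F in combinations(range(n), k))
print(f"online cost = {total}, best fixed placement cost = {best_static}")
```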
Theorem 5. For any sequence of client requests R_1, . . . , R_T ⊆ L(T), the sequence of facility placements F_1, . . . , F_T ⊆ L(T) produced by Algorithm 1 satisfies
\[
\sum_{t=1}^{T} \mathbb{E}\left[C_{R_t}(F_t)\right] + \gamma \sum_{t=2}^{T} \mathbb{E}\left[M_{T}(F_t, F_{t-1})\right] \;\le\; 6 \cdot \min_{|F^*| = k} \sum_{t=1}^{T} C_{R_t}(F^*) + \beta
\]
for β = O(k · |L(T)|^{3/2} · D_T · max(γ, 1)).

Theorem 5 establishes that Algorithm 1 achieves constant regret in the special case of HSTs, and its proof follows easily from Theorems 3 and 4. The proof of Theorem 1 then follows easily from Theorems 2 and 5. All the proofs are deferred to Section C of the Appendix.

4 O(1)-Regret for Fractional HST Clustering

In this section we present the O(1)-regret algorithm for Problem 2, described in Algorithm 3, and exhibit the key ideas in establishing Theorem 3. Without loss of generality, we can assume that the facility weight satisfies γ ≥ 1.⁵ Algorithm 3 is the well-known online learning algorithm Follow the Regularized Leader (FTRL) with a specific regularizer R_T(·), presented in Definition 9. Our results crucially rely on the properties of this regularizer, since it takes into account the HST structure and permits us to bound the fractional moving cost of FTRL.

⁵ If not, establishing our guarantees for γ = 1 clearly upper bounds the actual moving cost.

Definition 9. Given an HST T, the dilated entropic regularizer R_T(y) over y ∈ FP(T) is defined as
\[
R_{T}(y) := \sum_{v \neq r} 2^{\mathrm{lev}(v)} \cdot (y_v + \delta_v) \cdot \ln \frac{y_v + \delta_v}{y_{p(v)} + \delta_{p(v)}},
\]
where δ_v := (k/n) · |L(T) ∩ T(v)| and n := |L(T)|.

Algorithm 3: FTRL with dilated entropic regularization.
1: Input: An adversarial sequence R_1, . . . , R_T ⊆ L(T).
2: for t = 1 to T do
3:   The learner selects y_t ∈ FP(T).
4:   The learner suffers cost f_{R_t}(y_t) + ||y_t − y_{t−1}||_T.
5:   The learner updates y_{t+1} ← argmin_{y ∈ FP(T)} [ Σ_{s=1}^{t} f_{R_s}(y) + (γ · n · √T) · R_T(y) ].
6: end for

Algorithm 3 selects at each step t the facility placement y_t ∈ FP(T) that minimizes a weighted combination of the total fractional connection cost over the sub-sequence R_1, . . . , R_{t−1} and the regularizer R_T(y). The regularization term ensures the stability of the output, which in turn results in a bounded fractional moving cost.

Analysis of Algorithm 3. Due to space limitations, all proofs are moved to Section D of the Appendix. The primary reason for the specific selection of the regularizer in Definition 9 is that R_T(·) is strongly convex with respect to the norm || · ||_T of Definition 8, as established in Lemma 1, which is the main technical contribution of this section. We use D = D_T for the diameter of T.

Lemma 1. For any vectors y, y' ∈ FP(T),
\[
R_{T}(y') \;\ge\; R_{T}(y) + \left\langle \nabla R_{T}(y),\, y' - y \right\rangle + \frac{1}{8 k D \gamma^2} \cdot \|y' - y\|^2_{T}.
\]

The strong convexity of R_T(y) with respect to || · ||_T is crucial, since it permits us to bound the moving cost of Algorithm 3 by its fractional connection cost.

Lemma 2. For any sequence R_1, . . . , R_T ⊆ L(T), the output of Algorithm 3 satisfies
\[
\sum_{t=2}^{T} \|y_t - y_{t-1}\|_{T} \;\le\; \frac{1}{2} \sum_{t=1}^{T} f_{R_t}(y_t) + O\left(\gamma k D\right).
\]

We remark that using another regularizer R(·) that is strongly convex with respect to another norm || · || would still imply Lemma 1 with respect to || · ||. The problem, though, is that the fractional moving cost Σ_t ||y_t − y_{t−1}|| could then no longer be associated with the actual moving cost Σ_t M_T(F_t, F_{t−1}). It is for this reason that using a regularizer that is strongly convex with respect to || · ||_T is crucial. Next, by adapting the standard analysis of FTRL to our specific setting, we derive Lemma 3, establishing that Algorithm 3 admits bounded connection cost.

Lemma 3. For any sequence R_1, . . . , R_T ⊆ L(T), the output of Algorithm 3 satisfies
\[
\sum_{t=1}^{T} f_{R_t}(y_t) \;\le\; \min_{y^* \in \mathrm{FP}(T)} \sum_{t=1}^{T} f_{R_t}(y^*) + O\left(k\, n^{3/2} D \gamma\right).
\]

The proof of Theorem 3 directly follows from Lemmas 2 and 3.
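To make Definition 9 concrete, the sketch below evaluates the dilated entropic regularizer on the toy HST used in the earlier sketches; the tree and the fractional placements are hypothetical, and the computation follows the formula of Definition 9 directly.

```python
# A sketch of the dilated entropic regularizer of Definition 9 on the toy 3-level HST from
# the earlier sketches (leaves u1..u4, inner nodes a, b, root r). The placements are
# hypothetical members of FP(T); delta_v = (k/n) * |L(T) ∩ T(v)|.
import math

parent       = {"r": None, "a": "r", "b": "r", "u1": "a", "u2": "a", "u3": "b", "u4": "b"}
level        = {"r": 2, "a": 1, "b": 1, "u1": 0, "u2": 0, "u3": 0, "u4": 0}
leaves_below = {"r": 4, "a": 2, "b": 2, "u1": 1, "u2": 1, "u3": 1, "u4": 1}
k, n = 2, 4

def dilated_entropy(y):
    """R_T(y) = sum_{v != r} 2^lev(v) * (y_v + d_v) * ln((y_v + d_v) / (y_p(v) + d_p(v)))."""
    delta = {v: k * leaves_below[v] / n for v in y}
    return sum(2 ** level[v] * (y[v] + delta[v])
               * math.log((y[v] + delta[v]) / (y[parent[v]] + delta[parent[v]]))
               for v in y if parent[v] is not None)

# For this symmetric toy tree, the uniform placement (k/n facility mass on every leaf)
# minimizes R_T over FP(T); the skewed placement below evaluates to a larger value.
y_uniform = {"u1": 0.5, "u2": 0.5, "u3": 0.5, "u4": 0.5, "a": 1.0, "b": 1.0, "r": 2.0}
y_skewed  = {"u1": 1.0, "u2": 0.0, "u3": 1.0, "u4": 0.0, "a": 1.0, "b": 1.0, "r": 2.0}
print(dilated_entropy(y_uniform), dilated_entropy(y_skewed))
```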
We conclude the section by presenting how Step 5 of Algorithm 3 can be efficiently implemented, namely how to solve
\[
\min_{y \in \mathrm{FP}(T)} \Phi_t(y) := \sum_{s=1}^{t} f_{R_s}(y) + (\gamma \cdot n \cdot \sqrt{T}) \cdot R_{T}(y).
\]
Since Φ_t(y) is strongly convex and the set FP(T) is a polytope, one could use standard optimization algorithms, such as the ellipsoid method or projected gradient descent, to approximately minimize Φ_t(y) given access to a sub-gradient oracle for Φ_t(·). In Claim 11 of Section D of the Appendix, we establish that the sub-gradients of Φ_t(·) can be computed in polynomial time, and thus any of the previous methods can be used to approximately minimize Φ_t(·). In Lemma 4 we establish the intuitive fact that approximately implementing Step 5 does not affect the guarantees of Theorem 3.

Lemma 4. Let y_t be the minimizer of Φ_t(·) in FP(T) and let z_t ∈ FP(T) be any point such that Φ_t(z_t) ≤ Φ_t(y_t) + ϵ for some ϵ = O(T^{-1/2}). Then,
\[
f_{R_t}(z_t) + \|z_t - z_{t-1}\|_{T} \;\le\; f_{R_t}(y_t) + \|y_t - y_{t-1}\|_{T} + O\left(k\, n^{3/2} D \gamma\, T^{-1/2}\right).
\]

Remark 3. In our implementation of the algorithm, we approximately solve Step 5 of Algorithm 3 via Mirror Descent based on the Bregman divergence of R_T(·). This admits the same convergence rates as projected gradient descent, but the projection step can be computed in linear time with respect to the size of the HST T. We present the details of our implementation in Section C of the Appendix.

5 The Cut&Round Rounding

In this section we present our novel rounding scheme (Algorithm Cut&Round) as well as the main steps required to establish Theorem 4. To ease notation, for any real number x ≥ 0 we denote its decimal part by δ(x) := x − ⌊x⌋. We comment that our rounding scheme simply maintains and updates a distribution over the vertices of the HST, and can thus be implemented in polynomial time. Similar rounding schemes, like the one presented in [9], typically maintain a distribution over all possible facility placements, which generally cannot be implemented in polynomial time.

Algorithm 4: Cut&Round.
1: Input: An HST T, a fractional placement y ∈ FP(T) and thresholds α_v ∈ [0, 1] for all v ∈ V(T).
2: Y_r ← k
3: for levels ℓ = h(T) down to 1 do
4:   for all nodes v with lev(v) = ℓ do
5:     Y_rem ← Y_v
6:     y_rem ← y_v
7:     for all children u ∈ cld(v) do
8:       Y_u ← Alloc(y_u, Y_rem, y_rem, α_u)
9:       Y_rem ← Y_rem − Y_u
10:      y_rem ← y_rem − y_u
11:    end for
12:  end for
13: end for
14: return F := {u ∈ L(T) : Y_u = 1}

Algorithm 5: Alloc.
Input: Numbers y_u, y_rem ≥ 0, Y_rem ∈ ℕ and α_u ∈ [0, 1].
if Y_rem == ⌊y_rem⌋ then
  if δ(y_u) < δ(y_rem) then
    Y_u ← ⌊y_u⌋
  else
    Y_u ← ⌊y_u⌋ + 1[ α_u ≤ (δ(y_u) − δ(y_rem)) / (1 − δ(y_rem)) ]
  end if
else
  if δ(y_u) < δ(y_rem) then
    Y_u ← ⌊y_u⌋ + 1[ α_u ≤ δ(y_u) / δ(y_rem) ]
  else
    Y_u ← ⌊y_u⌋ + 1
  end if
end if
Return Y_u

In principle, Cut&Round (Algorithm 4) assigns to each vertex v an integer number of facilities Y_v to be placed at the leaves of its sub-tree. Notice that, due to sub-routine Alloc (Algorithm 5), Y_v either equals ⌊y_v⌋ or ⌊y_v⌋ + 1. Cut&Round initially assigns k facilities to the set of leaves that descend from the root r, which is precisely L(T). Then, it moves in decreasing level order to decide Y_v for each node v. Once Y_v is determined (Step 5), the Y_v facilities are allocated to the sub-trees of its children u ∈ cld(v) (Steps 7-10) via sub-routine Alloc using the thresholds α_u, in a manner that guarantees that Y_v = Σ_{u ∈ cld(v)} Y_u (see Section E.1 of the Appendix). This implies the feasibility of Cut&Round, as exactly k facilities are placed on the leaves of T at the end of the process. Assuming that the thresholds α_v are drawn independently from the uniform distribution on [0, 1], sub-routine Alloc (Algorithm 5) guarantees that Y_v either equals ⌊y_v⌋ or ⌊y_v⌋ + 1, while E_α[Y_v] = y_v. This is formally captured in Lemma 5 and is crucial in the proof of Theorem 4.
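Before stating Lemma 5, here is a self-contained numerical illustration of the threshold-rounding idea that Alloc is built on: a single uniform threshold turns fractional masses that sum to an integer into integers that preserve the sum, give each coordinate either its floor or its ceiling, and match the fractional values in expectation. The cumulative-rounding routine below is only in the spirit of Algorithm 5 (a standard simplification, not the paper's exact sub-routine).

```python
# Threshold (cumulative) rounding: a simplified stand-in for the idea behind Alloc.
import math, random

def threshold_round(ys, alpha):
    """Round ys (summing to an integer) to integers Y with the sum preserved, each
    Y_i in {floor(y_i), ceil(y_i)}, and E[Y_i] = y_i over alpha ~ Unif(0, 1)."""
    Y, prev = [], 0.0
    for y in ys:
        cur = prev + y
        Y.append(math.floor(cur - alpha) - math.floor(prev - alpha))
        prev = cur
    return Y

ys = [0.7, 1.4, 0.5, 0.4]               # hypothetical fractional masses, sum = 3
trials = 100_000
acc = [0.0] * len(ys)
for _ in range(trials):
    Y = threshold_round(ys, random.random())
    assert sum(Y) == 3                  # the integer budget is always preserved
    acc = [a + v for a, v in zip(acc, Y)]
print([round(a / trials, 3) for a in acc])   # empirically close to [0.7, 1.4, 0.5, 0.4]
```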
Lemma 5. Consider Algorithm 4 given as input a vector y ∈ FP(T) and random thresholds α_v ∼ Unif(0, 1). Then,
\[
Y_v = \begin{cases} \lfloor y_v \rfloor & \text{with probability } 1 - \delta(y_v), \\ \lfloor y_v \rfloor + 1 & \text{with probability } \delta(y_v). \end{cases}
\]

By coupling Lemma 5 with the HST structure we are able to establish Theorem 4. The proof is technically involved and thus deferred to Section E of the Appendix.

6 Conclusion

In this work, we designed the first polynomial-time online learning algorithm for Online k-Clustering with Moving Costs that achieves O(log n)-regret with respect to the cost of the optimal static facility placement, extending the results of Fotakis et al. [31] for the special case of γ = 0. The cornerstone of our approach was to realize and establish that O(1)-regret is achievable on HST metrics. This was achieved by designing a dilated entropic regularizer that captures the structure of the HST and combining it with the FTRL algorithm, as well as by designing a lossless (up to constant factors) rounding scheme that simultaneously works for both the connection and the moving cost. Both of these components were central to acquiring constant regret on HSTs. An interesting future direction is to investigate whether a polynomial-time online learning algorithm with O(1)-regret for the problem is theoretically possible or not. Since the O(log n) factor is inherently lost when using HST embeddings, this would require a significantly different approach from the one presented in this work. Finally, we comment that our current optimality guarantees are with respect to the optimal static facility placement. Going beyond the notion of regret, an intriguing future direction is establishing guarantees with respect to the optimal dynamic facility placement that moves facilities from round to round while suffering the corresponding moving cost.

Acknowledgements

This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_205011, by the Hasler Foundation Program: Hasler Responsible AI (project number 21043), and by an Innovation project supported by Innosuisse (contract agreement 100.960 IP-ICT).

References

[1] Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014. [2] Nir Ailon. Improved bounds for online learning over the permutahedron and other ranking polytopes. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, AISTATS 2014, 2014. [3] Soroush Alamdari and David B. Shmoys. A bicriteria approximation algorithm for the k-center and k-median problems. In Workshop on Approximation and Online Algorithms, 2017. [4] Hyung-Chan An, Ashkan Norouzi-Fard, and Ola Svensson. Dynamic facility location via exponential clocks. ACM Trans. Algorithms, 13(2):21:1 21:20, 2017. [5] Tarique Anwar, Surya Nepal, Cecile Paris, Jian Yang, Jia Wu, and Quan Z. Sheng. Tracking the evolution of clusters in social media streams. IEEE Transactions on Big Data, pages 1 15, 2022. [6] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, STOC 01, page 21 29. Association for Computing Machinery, 2001. [7] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. Comput. Syst. Sci., 2008. [8] Maria-Florina Balcan and Avrim Blum.
Approximation algorithms and online mechanisms for item pricing. In ACM Conference on Electronic Commerce, 2006. [9] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph Naor. A polylogarithmiccompetitive algorithm for the k-server problem. In Rafail Ostrovsky, editor, IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 267 276. IEEE Computer Society, 2011. [10] Yair Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. Proceedings of 37th Conference on Foundations of Computer Science, pages 184 193, 1996. [11] Avrim Blum and Carl Burch. On-line learning and the metrical task system problem. Mach. Learn., 39(1):35 58, 2000. [12] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, James R. Lee, and Aleksander Madry. kserver via multiscale entropic regularization. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 3 16. ACM, 2018. [13] Moses Charikar and Sudipto Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 99. IEEE Computer Society, 1999. [14] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, STOC 99, page 1 10. Association for Computing Machinery, 1999. [15] Moses Charikar and Shi Li. A dependent lp-rounding approach for the k-median problem. In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, volume 7391 of Lecture Notes in Computer Science, pages 194 205. Springer, 2012. [16] Vincent Cohen-Addad, Benjamin Guedj, Varun Kanade, and Guy Rom. Online k-means clustering. In Arindam Banerjee and Kenji Fukumizu, editors, The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pages 1126 1134. PMLR, 2021. [17] Aaron Cote, Adam Meyerson, and Laura J. Poplawski. Randomized k-server on hierarchical binary trees. In Cynthia Dwork, editor, Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, May 17-20, 2008, pages 227 234. ACM, 2008. [18] Bart de Keijzer and Dominik Wojtczak. Facility reallocation on the line. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, pages 188 194, 2018. [19] Sina Dehghani, Soheil Ehsani, Mohammad Taghi Hajiaghayi, Vahid Liaghat, and Saeed Seddighin. Stochastic k-server: How should uber work? In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs, pages 126:1 126:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. [20] Sina Dehghani, Mohammad Taghi Hajiaghayi, Hamid Mahini, and Saeed Seddighin. Price of competition and dueling games. ar Xiv preprint ar Xiv:1605.04004, 2016. [21] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. 
In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, 2017. [22] Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2011, 2011. [23] David Eisenstat, Claire Mathieu, and Nicolas Schabanel. Facility location in evolving metrics. In Automata, Languages, and Programming - 41st International Colloquium ICALP 2014, volume 8573 of Lecture Notes in Computer Science, pages 459 470. Springer, 2014. [24] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC 03, page 448 455, New York, NY, USA, 2003. Association for Computing Machinery. [25] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. J. Comput. Syst. Sci., 69(3):485 497, 2004. [26] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Optimistic regret minimization for extensive-form games via dilated distance-generating functions. In Neural Information Processing Systems, 2019. [27] Hendrik Fichtenberger, Silvio Lattanzi, Ashkan Norouzi-Fard, and Ola Svensson. Consistent k-clustering for general metrics. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, pages 2660 2678. SIAM, 2021. [28] Dimitris Fotakis, Loukas Kavouras, Panagiotis Kostopanagiotis, Philip Lazos, Stratis Skoulakis, and Nikos Zarifis. Reallocating multiple facilities on the line. In Proceedings of the Twenty Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pages 273 279, 2019. [29] Dimitris Fotakis, Loukas Kavouras, Grigorios Koumoutsos, Stratis Skoulakis, and Manolis Vardas. The online min-sum set cover problem. In Proc. of the 47th International Colloquium on Automata, Languages and Programming, ICALP 2020. [30] Dimitris Fotakis, Thanasis Lianeas, Georgios Piliouras, and Stratis Skoulakis. Efficient online learning of optimal rankings: Dimensionality reduction via gradient descent. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, 2020. [31] Dimitris Fotakis, Georgios Piliouras, and Stratis Skoulakis. Efficient online learning for dynamic k-clustering. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3396 3406. PMLR, 18 24 Jul 2021. [32] Takahiro Fujita, Kohei Hatano, and Eiji Takimoto. Combinatorial online prediction via metarounding. In 24th International Conference on Algorithmic Learning Theory, ALT 2013, 2013. [33] Dan Garber. Efficient online linear optimization with approximation algorithms. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2017, 2017. [34] Xiangyu Guo, Janardhan Kulkarni, Shi Li, and Jiayi Xian. Consistent k-median: Simpler, better and robust. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1135 1143. PMLR, 13 15 Apr 2021. [35] Elad Hazan. Introduction to online convex optimization. 
Co RR, abs/1909.05207, 2019. [36] Elad Hazan, Wei Hu, Yuanzhi Li, and Zhiyuan Li. Online improper learning with an approximation oracle. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. [37] Elad Hazan and Satyen Kale. Online submodular minimization. J. Mach. Learn. Res., 2012. [38] Elad Hazan, Satyen Kale, and Shai Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In 25th Annual Conference on Learning Theory, COLT 2012, 2012. [39] Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, STOC 2016, 2016. [40] David P. Helmbold, Robert E. Schapire, and M. Long. Predicting nearly as well as the best pruning of a decision tree. In Machine Learning, 1997. [41] David P. Helmbold and Manfred K. Warmuth. Learning permutations with exponential weights. In Proceedings of the 20th Annual Conference on Learning Theory, COLT 2007, 2007. [42] Nicole Immorlica, Adam Tauman Kalai, Brendan Lucier, Ankur Moitra, Andrew Postlewaite, and Moshe Tennenholtz. Dueling algorithms. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC 11, page 215 224, New York, NY, USA, 2011. Association for Computing Machinery. [43] Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 731 740. ACM, 2002. [44] Kamal Jain and Vijay V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. J. ACM, 48(2):274 296, 2001. [45] Stefanie Jegelka and Jeff A. Bilmes. Online submodular minimization for combinatorial structures. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, 2011. [46] Sham Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, STOC 2007, 2007. [47] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. In J. Comput. Syst. Sci. Springer, 2003. [48] Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In the 23rd Conference on Learning Theory, COLT 2010, 2010. [49] Elias Koutsoupias. The k-server problem. Comput. Sci. Rev., 3(2):105 118, 2009. [50] Elias Koutsoupias and Christos H. Papadimitriou. On the k-server conjecture. J. ACM, 42(5):971 983, 1995. [51] Amit Kumar. Constant factor approximation algorithm for the knapsack median problem. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 12, page 824 832. Society for Industrial and Applied Mathematics, 2012. [52] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2), 2010. [53] Silvio Lattanzi and Sergei Vassilvitskii. Consistent k-clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1975 1984. PMLR, 2017. [54] Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. SIAM J. 
Comput., 45(2):530 547, 2016. [55] Jyh-Han Lin and Jeffrey Scott Vitter. Approximation algorithms for geometric median problems. Information Processing Letters, 44(5):245 249, 1992. [56] M.E.J. Newman. The structure and function of complex networks. SIAM review, 45(2):167 256, 2003. [57] Romualdo Pastor-Satorras and Alessandro Vespignani. Epidemic Spreading in Scale-Free Networks. Physical Review Letters, 86(14):3200 3203, 2001. [58] Holakou Rahmanian and Manfred K. Warmuth. Online dynamic programming. In NIPS, 2017. [59] Juliette Stehlé, Nicolas Voirin, Alain Barrat, Ciro Cattuto, Lorenzo Isella, Jean-François Pinton, Marco Quaggiotto, Wouter Van den Broeck, Corinne Régis, Bruno Lina, and Philippe Vanhems. High-resolution measurements of face-to-face contact patterns in a primary school. PLOS ONE, 6(8), 2011. [60] Matthew J. Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In 22nd Annual Conference on Neural Information Processing Systems, NIPS 2008, 2008. [61] Daiki Suehiro, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Kiyohito Nagano. Online prediction under submodular constraints. In Algorithmic Learning Theory, ALT 2012, 2012. [62] Eiji Takimoto and Manfred K. Warmuth. Predicting nearly as well as the best pruning of a planar decision graph. In Theoretical Computer Science, 2000. [63] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. J. Mach. Learn. Res., 2003. [64] Chayant Tantipathananandh, Tanya Berger-Wolf, and David Kempe. A framework for community identification in dynamic social networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 07, page 717 726. Association for Computing Machinery, 2007. [65] David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, USA, 1st edition, 2011. [66] Shota Yasutake, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Masayuki Takeda. Online linear optimization over permutations. In Proceedings of the 22nd International Conference on Algorithms and Computation, ISAAC 2011, 2011. [67] Neal E. Young. K-medians, facility location, and the chernoff-wald bound. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 00, page 86 95. Society for Industrial and Applied Mathematics, 2000. A Further Related Work In this chapter of the appendix, we continue our discussion on the literature that relates to this work. Efficient Combinatorial Online Learning. There exists a long line of research studying efficient online learning algorithms in various combinatorial domains (e.g., selection of paths, permutations, binary search trees etc.) [40, 62, 63, 41, 7, 60, 45, 22, 42, 66, 38, 37, 1, 2, 20, 30, 29]. Another related line of work studies black-box reductions converting any α-approximation (offline) algorithm to an O(α)-regret online learning algorithm for a specific class of combinatorial optimization problems called linear optimization problems [47, 8, 46, 48, 61, 32, 39, 58, 21, 33, 36]. We remark that a key difference of our setting with the aforementioned works is that in the latter case the learner is not penalized for switching actions from round to round with an additional moving/switching cost. In the context of Problem 1 this means that γ = 0 which is exactly the setting considered by [31]. 
As a result, apart from the fact that k-median does not belong in the class of linear optimization problems, the aforementioned black-box reductions do not apply to Problem 1 since they do not account for the moving cost. The k-server Problem. Our work also relates with the rich line of literature on the k-server problem [50, 17, 49, 9, 19, 12]. In this setting there exists only 1 client at each round, while 1-lookahead is assumed, i.e. the request Rt is revealed prior to the action of the algorithm at step t. Moreover in k-server a facility must be placed in the exact position of the request, leading to a simpler combinatorial structure with respect to Problem 16. However, in the k-server problem, instead of using the benchmark of regret, the more challenging metric of competitive ratio that measures the sub-optimality with respect to the optimal dynamic solution is used. Mostly related to ours is the work of [9] providing the first poly(log n)-competitive algorithm for k-server by reducing the problem to the special case of HSTs. [9] first design a poly(log n)-competitive algorithm for a fractional version of k-server at which facilities can be fractionally placed into the vertices of the HST. They then use a randomized rounding scheme to convert the fractional solution into an integral one. The basic difference of the randomized rounding scheme of [9] with the one that we introduce in this work (Algorithm Cut&Round) is that the first provides guarantees only for the moving cost of the facilities while Cut&Round provides guarantees both for the moving cost of the facilities as well as the connection cost of the clients. Consistent k-Clustering. Another setting of clustering in the presence of unknown clients is that of Consistent k-Clustering [53, 34, 27]. In this setting, given an unknown stream of clients, a set of k facilities has to be maintained over time so that at any round t, the selected facilities form an approximately optimal solution of the sub-instance consisting of clients appeared in the time interval {1, t}. A basic difference of Consistent k-Clustering with Problem 1 is that in the first case the moving cost is not penalized as long as the number of swaps does not exceed a certain threshold (O(k)). 6Given offline access to the sequence of requests, the optimal solution for the k-server can be computed in polynomial-time while the optimal static solution of Problem 1 cannot be approximated in polynomial-time with ratio less than (1 + 2/e) even under a-priori knowledge of the request sequence (inapproximability of k-median). B Proof of Theorem 2 In this chapter of the appendix we briefly discuss the details behind Theorem 2 and show how the results of [10] and [24] hold even for the specific definition of HSTs we have considered in Definition 5. Traditionally, HSTs are not required to be balanced nor are required to have weights that are specifically powers of 2. In fact, the seminal work of [10], later improved by [24], states that there exists a randomized procedure such that for every weighted graph G(V, E, w), it constructs (in polynomial-time) a tree T such that: 1. There exists a perfect matching σ : V 7 L(T ) that maps the vertices of G to the leaves of T . 2. For any vertices i, j V , their corresponding distance on T can only increase, i.e. d G(i, j) d T (σ(i), σ(j)). 3. On expectation, distances between vertices are distorted only by a logarithmic factor, i.e. E [d T (σ(i), σ(j))] O(log |V |) d G(i, j) 4. 
The weight of any edge e = (v, u) between a vertex v V (T ) and its parent vertex u is precisely diam(G) 2 dpt(v). 5. The height of T satisfies h(T ) log (diam(G)) . The purpose of this section is to argue that one can easily transform such a tree T to match our notion of HSTs (Definition 5), while maintaining the same guarantees for the distortion of the distances. Recall that we have already assumed that the minimum edge weight of G is 1, i.e. mine E we = 1. Furthermore, we can also assume without loss of generality that the diameter of G is a power of 2; if not, simple scaling arguments suffice to transform G into such a graph by only distorting distances by a constant factor. Thus, we assume that diam(G) = 2d for some d 0. We start from the tree T that the algorithm of [24] generates. Recall that by definition, the weight of an edge e = (i, j) between some vertex i and its parent node j is 2d dpt(i). In order to balance the tree, we take each leaf vertex u L(T ) at depth dpt(u) and extend it downwards by adding new vertices until it reaches a new depth dpt (u) = d. For every new edge that we add during this process, we maintain that the weight of the edge e = (i, j) from i to its parent j is diam(G) 2 dpt(i). Let T be used to denote our modified tree. Clearly, the above construction guarantees h(T ) = d. Since by definition h(T ) log (diam(G)) = d, we know that all leaves initially lied at depth at most d, and thus by the end of the above process all leaves will lie at the same level of the tree and have depth d. Thus, we have indeed constructed a balanced tree. Furthermore, since by definition dpt(v) = h(T ) lev(v), we get that the weight of the edge e = (i, j) from i to its parent j is we = diam(G) 2lev(i) d = 2lev(i). So, the constructed tree indeed satisfies all the requirements of Definition 5 and is a valid HST (according to our definition). We will now argue that T also satisfies all items of Theorem 2. Fist of all, the height of our new tree is precisely d, and thus it is true that h(T ) log (diam(G)) . Furthermore, since we only added edges to the initial tree T , the distance between any two leaves can only increase. Thus, we get that for any vertices i, j V it holds d G(i, j) d T (i, j) d T (i, j) Finally, it remains to upper bound the expected distortion on T . Recall that by construction of [24], we know that E [d T (σ(i), σ(j))] O(log |V |) d G(i, j) Since edge lengths decrease by a factor of 2 every time we move down the tree, we know that the total length of the path we added in order to move leaf i from depth dpt(i) to depth d is precisely 1 + 2 + . . . 2dpt(i) 1 2dpt(i). This implies that any distance on T can be at most twice the corresponding distance on T , i.e. d T (σ(i), σ(j)) 2 d T (σ(i), σ(j)) which completes the proof. C Proofs of Section 3 In this chapter of the appendix we present all the omitted proofs from Section 3 concerning the basic algorithmic primitives we use in order to establish our main result in Theorem 1. Roadmap. In section C.1 we establish the connection between Problems 1 and 2 and show that our notion of fractional connection and moving cost collapses with our initial definitions in the case of integral facility placements. Then, in section C.2 we present the proof of Theorem 5 and in section C.3 we present the proof of Theorem 1. C.1 Establishing the relation between Problems 1 and 2 Fix any HST T and let FP(T ) be the corresponding set of fractional facility placements. 
In this section, we will establish that in the case of integral facility placements y FP(T ) N, the notions of fractional connection cost and fractional moving cost (formally stated in Definitions 7 and 8) collapse to the notions of actual connection and moving costs (formally stated in Definitions 1 and 2) respectively. Let y FP(T ) N be an integral facility placement. Then, by definition, for each leaf v L(T ) we have yv {0, 1} facilities that are placed on it, and the total amount of placed facilities is k, i.e. P v L(T ) yv = k. Thus, we can associate with any integral facility placement y a corresponding set F(y) = {v L(T ) : yv = 1} such that |F(y)| = k, meaning that F(y) is a valid facility placement of the leaves of the T . In Claim 1 we will establish that for any set of clients, the connection cost under F(y) is equal to the fractional connection cost under y. Then, in Claim 2 we will establish that the fractional moving cost between y and y gives us precisely the moving cost between facility placements F(y) and F(y ) on T . Claim 1. For any integral facility placement y FP(T ) N and any set of clients R L(T ), it holds that f R(y) = CR(F(y)) Proof. Fix any y FP(T ) N and any R L(T ). By definition of the connection cost (Definition 1), we have CR(F) = X j R min i F (y) d T (i, j) Let s fix a particular client that lies on some leaf j L(T ) of T . Let i = arg mini F (y) d T (i, j) be the leaf closest to j that F(y) places a facility into. Since T is an HST and distances increase by a factor of 2 as we move up the tree, it is not hard to see that i is the leaf in F(y) whose lowest common ancestor (lca) with j has the smallest level. Let l = lca(j, i ). Equivalently, l is the minimum-level vertex in P(j, r) such that yl 1. Since T is balanced, we have that the connection cost of client j under F(y) is precisely C{j}(F(y)) = 2 d T (j, l ) = 2 lev(l ) 1 X and since by integrality we have that yv = 0 for any v P(j, l ) \ {l } and yv 1 for all v P(l , r), we have C{j}(F) = 2 X v P (j,r) 2lev(v) max(0, 1 yv) Summing over all clients j R we get CR(F(y)) = X v P (j,r) 2lev(v)+1 max(0, 1 yv) = f R(y) which concludes the proof. Claim 2. For any integral facility placements y, y FP(T ) N, it holds that ||y y ||T = γ MT (F(y), F(y )) Proof. Fix any two integral facility placements y, y FP(T ) N. By definition of the moving cost (Definition 2), we have that MT (F(y), F(y )) = min σ Σ i F (y) d T (i, σ(i)) where Σ is the set of all possible matchings from the facilities in F(y) to the facilities in F(y ). In general graphs, the minimum transportation cost can have a very complicated structure and typically requires solving a minimum transportation problem in order to compute it. However, in the special case of HSTs, we are actually able to obtain a very simple expression for this quantity. Recall that in an HST T , edge weights increase by a factor of 2 every time we move up a level on the tree. Thus, it is always in out interest to move facilities between leaves whose lowest common ancestor is as low as possible. In other words, the matching σ that minimizes the transportation cost from F(y) to F(y ) can be obtained by selecting an arbitrary leaf in F(y), matching it to the leaf in F(y ) with which it shares the lowest lowest common ancestor and then repeating the process for the rest of the leaves. Now fix any vertex v V (T ). Recall that yv is equal to the number of facilities in F(y) that are placed in the descendant leaves of v (respectively for y v). 
Thus, if we apply the above (optimal) transportation plan, the number of facilities that will end up traversing the edge from v to its parent vertex is going to be precisely |yv y v|. Since the weight of this edge is by definition 2lev(v), we get that MT (F(y), F(y )) = X v V (T ) 2lev(v) |yv y v| and since ||y y ||T = γ X v V (T ) 2lev(v) |yv y v| we have proven the claim. C.2 Proof of Theorem 5 We will now formally present the proof of Theorem 5, bounding the expected total cost of Algorithm 1. Fix any sequence of clients R1, . . . , RT . Since the random seed α is selected uniformly at random (Step 3 of Algorithm 1), by Item 1 of Theorem 4 we get that E [CRt(Ft)] = f Rt(yt) Moreover since the same random seed α is used at all rounds t 1, Item 2 of Theorem 4 implies that γ E [MT (Ft+1, Ft)] 4 ||yt+1 yt||T t=1 CRt(Ft) + γ t=2 MT (Ft, Ft 1) t=1 f Rt(yt) + t=2 ||yt yt 1||T t=1 f Rt(y ) + β where the last inequality follows by Theorem 3 for β = O k |L(T )|3/2 DT max(γ, 1) . The proof is concluded by the fact that t=1 f Rt(y ) min |F |=k t=1 CRt(F ) which is established in Claim 1 of Appendix C.1, stating that for any placement of k-facilities F L(T ) there exists a corresponding y FP(T ) whose fractional connection cost is equal to F s under any client request. C.3 Proof of Theorem 1 We will now formally present the proof of Theorem 1, bounding the regret of Algorithm 2. Let T be the HST that we randomly embed our graph G(V, E, w) into. Since V = L(T ), we slightly abuse notation and use u to refer both to some vertex of G and to the corresponding leaf of T . From Theorem 5, we know that the output of Algorithm 1 satisfies t=1 CT Rt(Ft) + γ t=2 MT (Ft, Ft 1) 6 min |F |=k t=1 CT Rt(F ) + O k |L(T )|3/2 DT max(1, γ) where we use T in the connection and moving cost to indicate that all distances are measured on the HST. Here, the expectation is taken over the random choices of Algorithm 1. Next, notice that both the connection cost and the moving cost are defined as sum of distances. Thus, the results of Theorem 2 about the distance distortion from G to T clearly apply for these quantities as well, namely CG Rt(Ft) CT Rt(Ft) and E CT Rt(Ft) O(log |V |) CG Rt(Ft) MG(Ft, Ft 1) MT (Ft, Ft 1) and E [MT (Ft, Ft 1)] O(log |V |) MG(Ft, Ft 1) Thus, taking an expectation over the randomness of T , we finally get that t=1 CG Rt(Ft) + γ t=2 MG(Ft, Ft 1) O(log |V |) min |F |=k t=1 CG Rt(F ) + O k |L(T )|3/2 DT max(1, γ) Let n = |V | and D = diam(G). From the above, we get that Algorithm 2 is indeed α-regret for α = O(log n). Furthermore, we have that |L(T )| = |V | = n, and DT = 2 (2h(T ) 1) 4D since h(T ) log D . Thus, setting β = O(k n3/2 D max(1, γ)), we get that Algorithm 2 has β-additive regret, completing the proof of Theorem 1. D Analysis of FTRL (Proofs of Section 4) In this chapter of the appendix we present all the omitted proofs from Section 4 concerning our analysis of the Follow the Regularized Leader (FTRL) algorithm (Algorithm 3). To avoid repetition, from now on we fix an arbitrary HST T and use FP(T ) to denote the set of all fractional placements of k facilities on the leaves of T . We use n = |L(T )| to denote the number of leaves of T , h = h(T ) to denote its height and D = diam(T ) to denote its diameter. Since T is an HST, we know that its diameter D, i.e. the maximum distance between any two leaves, is precisely D = 2 (2h 1). To ease notation, let wv = 2lev(v). 
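Before turning to the analysis of FTRL, the following minimal Python sketch (illustrative only; the container names are ours) spells out the two quantities that drive the whole analysis on an HST: the fractional connection cost of Definition 7 and the weighted l1 moving-cost norm of Definition 8. By Claims 1 and 2, on integral placements these coincide with the actual connection and moving costs.

```python
def fractional_connection_cost(clients, root_paths, lev, y):
    """f_R(y) = sum over clients j of sum over v on the path P(j, r)
    of 2^{lev(v)+1} * max(0, 1 - y_v)."""
    return sum(
        2 ** (lev[v] + 1) * max(0.0, 1.0 - y[v])
        for j in clients
        for v in root_paths[j]       # root_paths[j]: vertices on the path from leaf j to the root
    )

def moving_cost_norm(y, y_prev, lev, gamma):
    """||y - y'||_T = gamma * sum over vertices v of 2^{lev(v)} * |y_v - y'_v|."""
    return gamma * sum(2 ** lev[v] * abs(y[v] - y_prev[v]) for v in y)
```

Here y and y_prev map each vertex to its fractional mass, lev maps each vertex to its level, and root_paths maps each leaf to the list of vertices on its root path; any representation carrying this information would do.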
For convenience, we remind the reader that our regularizer function RT : FP(T ) 7 R is defined as v =r wv (yv + δv) ln yv + δv yp(v) + δp(v) where δv = k |L(T ) T(v)|/|L(T )| is the percentage of leaves that lie on the sub-tree rooted at vertex v multiplied by k and p(v) is the parent of node v. Also, recall that for any y FP(T ) we have defined the norm ||y||T = γ X v V (T ) wv|yv| Roadmap. In Section D.1 we prove Lemma 1, namely the strong convexity of RT with respect to || ||T . Then, in Section D.2 we bound the moving cost of FTRL, proving Lemma 2. Next, in Section D.3 we bound the connection cost cost of FTRL, proving Lemma 3. Finally, in Section D.4 we account for approximation errors in the computation of the regularized leader, proving Lemma 4. D.1 Strong Convexity (Proof of Lemma 1) The objective of this section is to prove Lemma 1, specifically that for any fractional facility placements y, y FP(T ) it holds that RT (y ) RT (y) + RT (y), y y + α||y y ||2 T where α = (8k Dγ2) 1. We begin by computing the gradient of RT on any fractional facility placement y FP(T ). Claim 3. The partial derivatives of RT on any point y FP(T ) are given by 2 for v = r wv ln yv+δv yp(v)+δp(v) + wv for v L(T ) wv ln yv+δv yp(v)+δp(v) 2 for v / L(T ) {r} Proof. Clearly, RT is well-defined and differentiable on FP(T ). For any v = r, we compute the partial derivatives of RT (y) to obtain yv = wv ln yv + δv yp(v) + δp(v) v cld(u) wu yu + δu Since y FP(T ), we know yv = P u cld(v) yu and by definition, δv = P u cld(v) δu. Finally, recall that wu = wv/2 for any u cld(v). By plugging everything in we get yv = wv ln yv + δv yp(v) + δp(v) 2 1[v / L(T )] for any v = r. For the root vertex, using similar arguments we get RT (y) Now that we have calculated the gradient of RT , we can substitute it into the definition of strong convexity. Specifically, by Claim 3, Lemma 1 states that v =r wv (y v + δv) ln y v+δv y p(v)+δp(v) yv+δv yp(v)+δp(v) 1 8k Dγ2 ||y y||2 T (1) To ease the presentation, we define quantities f(y , y) = X v =r wv (y v + δv) ln y v+δv y p(v)+δp(v) yv+δv yp(v)+δp(v) h(y , y) = X v =r wv (yp(v) + δp(v)) | y v + δv y p(v) + δp(v) yv + δv yp(v) + δp(v) | We will prove that f(y , y) (1/2k D) h2(y , y) and that h(y , y) (1/2γ) ||y y||T in Claims 4 and 5 respectively. Combining these claims, equation (1) clearly holds, completing the proof of Lemma 1. Claim 4. For any y, y FP(T ), it holds that f(y , y) 1 2k D (h(y , y))2. Proof. We begin by establishing some notation. For any v = r, let µ v = wv (y p(v) + δp(v)) y v + δv y p(v) + δp(v) and µv = wv (y p(v) + δp(v)) yv + δv yp(v) + δp(v) . Then, we have that f(y , y) = X v =r µ v ln µ v µv v I µ v ln µ v µv v I µ v ln µ v µv where I = {v = r : µ v µv} and I = {v = r : µ v < µv}. By applying the log-sum inequality in both of these terms, we obtain f(y , y) ( X v I µ v) ln P v I µ v) ln P Next, observe that X v =r µ v = X v =r wv (y v + δv) = 2k (2h 1) = k D v =r µv = X v =r wv (y p(v) + δp(v)) yv + δv yp(v) + δp(v) 2 (y v + δv) X yu + δu yv + δv 2 (y v + δv) 2 2k (2h+1 2) Let B = P v I µ v and B = P v I µv. Then, we have shown that f(y , y) B ln B + (k D B ) ln k D B Our next step is to apply Pinsker s inequality to the above expression. 
Pinsker s inequality states that for any p, q (0, 1), it holds that + (1 p) ln 1 p Since B k D and B k D, we can scale everything in inequality 2 and apply Pinsker s inequality to obtain f(y , y) 2 k D (B B )2 (3) To complete the proof, we substitute v I (µ v µv) v I wv (y p(v) + δp(v)) ( y v + δv y p(v) + δp(v) yv + δv yp(v) + δp(v) ) 2 (y v + δv) X u cld(v) I (y u + δu y v + δv yu + δu 2 (y v + δv) X u cld(u) |y u + δu y v + δv yu + δu where the last equality follows from the fact that the ratio in the inner sum always sum to 1, and thus by only summing over the ones with positive difference we get half of the total sum of absolute differences. By swapping the summation order once again, we get v =r wv (y p(v) + δp(v)) | y v + δv y p(v) + δp(v) yv + δv yp(v) + δp(v) | and from inequality (3) we finally get f(y , y) 1 2k D (h(y , y))2 as desired. Claim 5. For any y, y FP(T ), it holds that ||y y||T 2γ h(y , y). Proof. To prove the claim, we first need to establish some extra notation. For any y FP(T ) and v = r, let λv(y) := yv + δv yp(v) + δp(v) Furthermore, for any vertex v and any integer i [0, h lev(v)], we use p(v, i) to denote the i-th ancestor of v on T , for example p(v, 0) = v, p(v, 1) = p(v) and p(v, h lev(v)) = r. Recall that by definition, yr = δr = k. Thus, if we telescope these terms and let mv = h lev(v) 1, we clearly have that yv + δv = 2k Πmv i=0λp(v,i)(y) which implies y v yv = 2k Πmv i=0λp(v,i)(y ) 2k Πmv i=0λp(v,i)(y) i=0 λp(v,0)(y ) . . . (λp(v,i)(y ) λp(v,i)(y)) . . . λp(v,mv)(y) y v + δv y p(v,i) + δp(v,i) (λp(v,i)(y ) λp(v,i)(y)) yp(v,i+1) + δp(v,i+1) = (y v + δv) yp(v,i+1) + δp(v,i+1) y p(v,i) + δp(v,i) (λp(v,i)(y ) λp(v,i)(y)) and from the triangular inequality |y v yv| (y v + δv) yp(v,i+1) + δp(v,i+1) y p(v,i) + δp(v,i) |λp(v,i)(y ) λp(v,i)(y)| (4) Plugging inequality (4) into the definition of norm || ||T , we get ||y y||T γ X v =r wv (y v + δv) yp(v,i+1) + δp(v,i+1) y p(v,i) + δp(v,i) |λp(v,i)(y ) λp(v,i)(y)| and by carefully exchanging the summation order, we obtain ||y y||T γ X yp(v) + δp(v) y v + δv |λv(y ) λv(y)| u T (v) wu(y u + δu) Finally, observe that P u T (v) wuy u 2wvy v. To see this, fix the sub-tree T(v) rooted at vertex v and recall that since y FP(T ), the total amount of facilities at each level is y v. Furthermore, the weights wv decrease by a factor of 2 at every level. Using the same arguments, we obtain P u T (v) wuδu 2wvδv. Combining everything, we finally get ||y y||T 2γ X v =r wv (yp(v) + δp(v)) |λv(y ) λv(y)| or equivalently, ||y y||T 2γ h(y , y). D.2 Bounding the Moving Cost (Proof of Lemma 2) In this section we will upper bound the moving cost of FTRL by its connection cost. Fix any sequence of client requests R1, R2, . . . , RT L(T ). Recall that at each step t, FTRL selects a fractional facility placement yt given by yt = arg min y FP(T ) Φt(y) where Φt(y) = Pt 1 s=1 f Rs(y) + 1 η RT (y) is the objective that FTRL minimizes over at step t for n T) 1. In this section, we prove Lemma 2, by arguing that t=2 ||yt yt 1||T 1 t=1 f Rt(yt) + η since the proof follows easily by the definitions of η and α. From Lemma 1 we already know that RT is α-strongly convex with respect to || ||T for α = (8k Dγ2) 1. Furthermore, by definition the fractional connection cost v P (j,r) 2lev(v)+1 max (0, 1 yv) is clearly convex for any client request R L(T ). Thus, it is straight-forward to argue that at any step t, the FTRL objective Φt is α η -strongly convex with respect to || ||T . 
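For reference, here is a minimal sketch (under the toy dictionary representation of the earlier sketches; not the paper's code) of the objective Φ_t that FTRL minimizes over FP(T) at step t. The first term accumulates the fractional connection costs of the requests observed so far and the second is the dilated entropic regularizer R_T scaled by 1/η; the strong convexity discussed above comes entirely from the latter.

```python
import math

def ftrl_objective(y, past_requests, eta, delta, lev, parent, root, root_paths):
    """Phi_t(y) = sum_{s < t} f_{R_s}(y) + (1/eta) * R_T(y).

    y[v]: fractional mass on vertex v; delta[v] = k * (#leaves under v) / (#leaves),
    with delta[root] = k; lev[v]: level of v; parent[v]: parent of v;
    root_paths[j]: vertices on the path from leaf j to the root."""
    # accumulated fractional connection cost of the already observed requests
    past_cost = sum(
        2 ** (lev[v] + 1) * max(0.0, 1.0 - y[v])
        for R in past_requests
        for j in R
        for v in root_paths[j]
    )
    # dilated entropic regularizer R_T(y), summed over all non-root vertices
    reg = sum(
        2 ** lev[v] * (y[v] + delta[v])
        * math.log((y[v] + delta[v]) / (y[parent[v]] + delta[parent[v]]))
        for v in y
        if v != root
    )
    return past_cost + reg / eta
```

Minimizing this objective over the polytope FP(T) is precisely the step whose approximate computation is discussed in Sections D.4 and D.5.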
Unfortunately, f R(y) is not differentiable on FP(T ), but its sub-gradients are well-defined on any y FP(T ). Thus, the strong convexity of Φt provides us with the following guarantee: Claim 6. Fix any pair of fractional facility placements y, y FP(T ) and any time step t [T]. Let gt Φt(y) be any sub-gradient of Φt at y. Then, it holds that Φt(y ) Φt(y) + gt, y y + α η ||y y ||2 T Furthermore, since by definition yt is the (unique) minimizer of Φt, the first order optimality conditions on Φt imply that there exists some g t Φt(yt) such that g t , y yt 0 for any y FP(T ). Claim 6 for y = yt, y = yt 1 and gt = g t gives us Φt(yt 1) Φt(yt) + α η ||yt yt 1||2 T Thus, we have ||yt yt 1||2 T η α Φt(yt 1) Φt(yt) α Φt 1(yt 1) + f Rt 1(yt 1) Φt 1(yt) f Rt 1(yt) α f Rt 1(yt 1) f Rt 1(yt) where for the equality we used the fact that Φt(y) = Φt 1(y)+f Rt 1(y) and for the second inequality we used the fact that yt 1 is by definition the minimizer of Φt 1. Finally, since f R(y) 0 for any client request R L(T ), we have ||yt yt 1||T r η α f Rt 1(yt 1) 2 f Rt 1(yt 1) where the last inequality follows from the Arithmetic Mean - Geometric Mean inequality. Summing over all t completes the proof of Lemma 2. D.3 Bounding the Connection Cost (Proof of Lemma 3) In this section we will upper bound the connection cost of FTRL by the connection cost of the optimal fractional facility placement in hindsight. This is a standard analysis found in many textbooks, and we present it just for the sake of completeness. Fix any sequence of client requests R1, R2, . . . , RT L(T ). Recall that at each step t, FTRL selects a fractional facility placement yt given by yt = arg min y FP(T ) Φt(y) where Φt(y) = Pt 1 s=1 f Rs(y) + 1 η RT (y) is the objective that FTRL minimizes over at step t for n T) 1. Let y be the optimal facility placement in hindsight, i.e. y = arg min y FP(T ) t=1 f Rt(y) In this section we prove Lemma 3,by arguing that t=1 f Rt(yt) t=1 f Rt(y ) + kn D η + 32kn2Dη T and then the proof follows easily by definition of η. In the standard analysis of FTRL, the following quantities are of special interest as they appear in the final regret guarantees of the algorithm: Let diam(RT ) := maxy,y FP(T ) |RT (y) RT (y )| be the diameter of the regularizer. Let Gf be an upper bound on the dual norm of the sub-gradient of the fractional connection cost for any client request, i.e. for any R L(T ) and any y FP(T ), there exists some sub-gradient g f R(y) such that ||g|| T Gf. Here, || || T denotes the dual norm of || ||T . We begin by presenting the standard analysis of FTRL and deriving an expression for the regret guarantee that depends on the above quantities. Recall that at any step t, the FTRL objective Φt doesn t include f Rt since the client request Rt is not revealed to the algorithm at the time of decision. We begin by bounding the connection cost of a theoretical algorithm that has access to this information and thus at time t can pick facility placement yt+1. Claim 7. The output of FTRL satisfies t=1 f Rt(yt+1) t=1 f Rt(y ) + diam(RT ) Proof. We have Φt(yt) = Φt 1(yt) + f Rt 1(yt) Φt 1(yt 1) + f Rt 1(yt) where the equality holds by definition of Φt and the inequality holds from the optimality of yt 1 on Φt 1. Similarly, we obtain Φt 1(yt 1) Φt 2(yt 2) + f Rt 2(yt 1) If we keep applying this rule, we finally get that s=1 f Rs(ys+1) + Φ1(y1) Furthermore, we have Φ1(y1) = RT (y1)/η and Φt(y ) Φt(yt) for all t. 
Thus, we get t=1 f Rt(yt+1) + 1 or equivalently (by substituting ΦT +1 s definition) we have t=1 f Rt(yt+1) t=1 f Rt(y ) + RT (y ) RT (y1) The claim follows from the definition of diam(RT ). Next, we proceed by bounding the increase in the connection cost that we suffer by choosing yt instead of yt+1 at time t. Claim 8. For any t 0, it holds that f Rt(yt) f Rt(yt+1) + ηG2 f/α. Proof. For any client request R L(T ), the fractional connection cost function f R(y) is clearly convex and its sub-gradients are well-defined on FP(T ). By definition of Gf, we know that there exists some sub-gradient g f Rt(yt) such that ||g|| T Gf. Using this sub-gradient, we get f Rt(yt) f Rt(yt+1) + g, yt yt+1 f Rt(yt+1) + ||g|| T ||yt yt+1||T f Rt(yt+1) + Gf ||yt yt+1||T where the first inequality is derived from the convexity of the fractional connection cost, the second inequality is an application of Holder s inequality and the third inequality is from Gf s definition. As we have already argued in section D.2, we know that for any step t, the FTRL objective Φt is α/η-strongly convex with respect to || ||T . Using the definition of strong convexity, this implies that Φt+1(yt) Φt+1(yt+1) + g, yt yt+1 + α η ||yt yt+1||2 T for any sub-gradient g Φt+1(yt+1). Furthermore, since yt+1 is the minimizer of Φt+1, we know from the first order optimality conditions that we can select g Φt+1(yt+1) such that g, y yt+1 0 for any y FP(T ). Using such a sub-gradient, we get ||yt yt+1||2 T η α Φt+1(yt) Φt+1(yt+1) α Φt(yt) + f Rt(yt) Φt(yt+1) f Rt(yt+1) α f Rt(yt) f Rt(yt+1) where we just expanded Φt+1 s definition and used the fact that yt is the minimizer of Φt. Combining everything, we finally obtain f Rt(yt) f Rt(yt+1) Gf r η α (f Rt(yt) f Rt(yt+1)) and the claim follows. We complete the analysis of FTRL by combining Claims 7 and 8 in order to obtain the following regret guarantee: Claim 9. The output of FTRL satisfies t=1 f Rt(yt) t=1 f Rt(y ) + diam(RT ) η + ηG2 f α T It remains to substitute the specific values of the parameters that appear in the regret guarantee. We have already proven in section D.1 that RT is α-strongly convex with respect to || ||T for α = (8k Dγ2) 1. Next, we provide an upper bound for the diameter of the regularizer. Claim 10. It holds that diam(RT ) kn D. Proof. Fix any y FP(T ). Be definition, we know that yv yp(v) and δv δp(v) for any v = r. Thus, the expressions inside the logarithms of the regularizer are always at most 1, which implies that RT (y) 0. Furthermore, for any α, β > 0 it holds that α β α ln (α/β). Using this inequality, we get that v =r wv (yv + δv yp(v) δp(v)) Fix any level l [0, h 1] and let Vl = {v V (T ) : lev(v) = l} denote the set of vertices of the HST at level l. Since y FP(T ), we know that P v Vl yv = k, and by definition of δ s we know that P v Vl δv = k as well. Furthermore, we know that P v Vl yp(v) n P v Vl+1 yv = n k since any vertex v can have at most n (i.e. the total number of leaves) children. Using the same argument, we have P v Vl δp(v) n P v Vl+1 yv = n k. Thus, combining everything we obtain v =r wv (yv + δv yp(v) δp(v)) v Vl 2l (yv + δv yp(v) δp(v)) l=0 2l (2k 2kn) = 2k(1 n)(2h 1) = k(1 n)D. which proves our claim. Finally, we only need to find an upper bound for Gf. We begin by computing a set of sub-gradients for the fractional connection cost function. Claim 11. Fix any client request R L(T ) and any y FP(T ). Define the vector g R,y R|V (T )| such that g R,y v = 0 if yv 1 2lev(v)+1 |T(v) R| if yv < 1 Then, g R,y f R(y), i.e. 
g R,y is a sub-gradient of f R on point y. Proof. Fix any client request R L(T ). By definition of the fractional connection cost on facility placement y FP(T ), we have v P (j,r) 2lev(v)+1 max (0, 1 yv) where P(j, r) denotes the unique path from leaf j L(T ) to the root r. This is clearly a convex function on FP(T ) and thus the sub-gradients of f R are well-defined. Fix any v V (T ). We distinguish between two cases. If yv < 1, then the partial derivative of f R(y) is well-defined and given by yv = 2lev(v)+1 |T(v) R| where T(v) is the set of vertices on the sub-tree rooted at vertex v. If yv 1, then clearly it doesn t contribute to f R(y). Using standard calculus, it is not hard to argue that in this case there exists a sub-gradient of f R(y) whose coordinate corresponding to v is 0. Thus, we have argued that that g R,y is a valid sub-gradient of f R on point y. Finally, we provide an upper bound on the dual-norm of the sub-gradients that we computed on Claim 11. Claim 12. For any y FP(T ) and any R L(T ), it holds that ||g R,v|| T 2n Proof. Recall that we have defined the moving cost norm as ||y||T = γ X v V (T ) wv yv which is basically a weighted l1-norm with weights γ wv. It is well-known that the dual of the l1-norm is the l norm. Similarly, the dual of the weighted l1-norm is a weighted l norm with inverse weights, i.e. || || = l ((γw) 1). Thus, we have ||x|| T = max v |xv| γ wv Using the calculation of the sub-gradients from Claim 11 and that R L(T ) and thus |R| n, we immediately get the claim. Claim 10 provides us with an expression for diam(RT ) and Claim 12 provides us with an expression for Gf. Plugging everything in into Claim 9, we complete the proof of Lemma 3. D.4 Incorporating approximation errors (Proof of Lemma 4) Fix any sequence of client requests R1, R2, . . . , RT L(T ). Recall that at each step t, FTRL selects a fractional facility placement yt given by yt = arg min y FP(T ) Φt(y) where Φt(y) = Pt 1 s=1 f Rs(y) + 1 η RT (y) is the objective that FTRL minimizes over at step t for Now, assume that instead of minimizing Φt(y) over FP(T ) to compute yt, we are only able to compute a fractional facility placement zt FP(T ) such that Φt(zt) Φt(yt) + ϵ for some ϵ > 0. Claim 13. For any step t, it holds that ||zt yt||T r Proof. As we have already argued in section D.2, we know that for any step t, the FTRL objective Φt is α/η-strongly convex with respect to || ||T . Combining this with the first order optimality condition for Φt on yt, we get Φt(zt) Φt(yt) + α η ||zt yt||2 T which implies that ||zt yt||T r Using Claim 13, we can easily bound both the connection and the moving cost of the approximated FTRL solutions. For the connection cost, recall that the fractional connection cost function f Rt at step t is convex, which implies that f Rt(zt) f Rt(yt) + g, zt yt for some g f Rt(zt). 
Using Holder s inequality to upper bound the inner-product and using the upper bound of Claim 12 for the dual norm of the sub-gradients of f Rt, we get that f Rt(zt) f Rt(yt) + 2n γ ||zt yt||T and finally from Claim 13 we get that f Rt(zt) f Rt(yt) + 2n For the moving cost, recall it suffices to use the triangular inequality that || ||T (as a norm) satisfies: ||zt zt 1||T ||zt yt||T + ||yt yt 1||T + ||yt 1 zt 1||T ||yt yt 1||T + 2 r The proof of Lemma 4 follows easily by plugging in η = (γ n T) 1, α = (8k Dγ2) 1 and ϵ = O(1/ D.5 Implementation of Projected Mirror Descent We conclude this section by considering the Projected Mirror Descent update step, namely y = arg min y FP(T ) [η c, y + DRT (y , y)] that takes as input a fractional facility placement y FP(T ) and returns some other y FP(T ) that minimizes a linear cost under vector c plus the Bregman Divergence between the initial and the new point under regularizer RT . Here, η > 0 is a tuning parameter that balances the dynamics between the linear cost and the Bregman Divergence. By letting c be the sub-gradient of the fractional connection cost over the observed sequence of clients, we can use this update step in order to approximate the FTRL objective; this is, in fact, the implementation we did for our experimental evaluation of Algorithm 2. In this section we will argue that the special structure of RT allows us to compute the update step in linear (to the size of the HST) time. By definition of the Bregman Divergence, we have DRT (x, y) = RT (x) RT (y) RT (y), x y Substituting everything, we get that the update step of Projected Mirror Descent can be written as y = arg min y FP(T ) F(y ) F(y ) = η X v cv y v + X v =r wv (y v + δv) ln y v + δv y p(v) + δp(v) v =r wv (yv + δv) ln yv + δv yp(v) + δp(v) wv ln yv + δv yp(v) + δp(v) 2 1[v L(T )] (y v yv) It is always the case that we update y from some y FP(T ), so we can simplify the above expression to get F(y ) = η X v cv y v + X v =r wv (y v + δv) ln y v+δv y p(v)+δp(v) yv+δv yp(v)+δp(v) 2 (1 + 1[v L(T )]) (y v yv) Recall that by definition, FP(T) is the polytope y R|V (T )| : yv = P u cld(v) yu v / L(T ) yv [0, 1] v L(T ) yr = k Since our objective is to minimize function F( ) over FP(T ), we can write down the KKT optimality conditions to obtain the following conditions about the minimizer y : y v + δv y p(v) + δp(v) = yv + δv yp(v) + δp(v) exp 1 wv (µp(v) µv ηcv) where µv is the Lagrange multiplier for constraint yv = P u cld(v) yu and µv = 0 for v L(T ). To complete our computation of y , it remains to compute the Lagrange multipliers µ. Since y FP(T ), it is not hard to verify that for any v / L(T ) it holds X y u + δu y v + δv = 1 and using the KKT optimality condition, this implies that for any v / L(T ) yu + δu yv + δv exp 1 wu (µv µu ηcu) = 1 or equivalently, since wv = 2wu for all u cld(v), yu + δu yv + δv exp µu + ηcu Thus, starting from µv = 0 on the leaves, this expression provides as a bottom-up algorithm to compute all the Lagrange multipliers µ. Using these multipliers and the KKT optimality conditions, we can then easily compute the ratios y v + δv y p(v) + δp(v) = yv + δv yp(v) + δp(v) exp 1 wv (µp(v) µv ηcv) for all vertices v = r. Finally, we can start from the root vertex r, for which we know that y r = k, and cascade these ratios downwards until we reach the leaves and we have compute all entries of y . Clearly, this is all done in linear time to the number of vertices. 
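The bottom-up/top-down procedure just described can be summarized in a short sketch. The code below is an illustrative rendering of this computation (assuming a balanced HST, so that all children of a vertex share the same edge weight w_v/2, and ignoring the [0, 1] box constraints on the leaves, exactly as in the KKT derivation above); it is not the implementation used in our experiments, and all identifiers are ours.

```python
import math

def mirror_descent_update(vertices, children, parent, lev, y, delta, c, eta, root, k):
    """One projected update y -> y' over FP(T) with regularizer R_T, following
    the KKT characterization above. vertices must list children before their
    parents; delta[root] = k; c[v] is the linear cost (sub-gradient) at v."""
    w = {v: 2 ** lev[v] for v in vertices}
    mu = {}
    # Bottom-up pass: Lagrange multipliers, starting from mu_v = 0 on the leaves.
    for v in vertices:                          # children are visited before parents
        if not children[v]:
            mu[v] = 0.0
        else:
            w_child = w[v] / 2.0                # all children of v sit one level below v
            s = sum(
                (y[u] + delta[u]) / (y[v] + delta[v])
                * math.exp(-(mu[u] + eta * c[u]) / w_child)
                for u in children[v]
            )
            mu[v] = -w_child * math.log(s)      # makes the updated ratios at v sum to 1
    # Top-down pass: cascade the updated ratios, starting from y'_r = k at the root.
    y_new = {root: float(k)}
    for v in reversed(vertices):                # parents are visited before children
        if v == root:
            continue
        ratio = (
            (y[v] + delta[v]) / (y[parent[v]] + delta[parent[v]])
            * math.exp((mu[parent[v]] - mu[v] - eta * c[v]) / w[v])
        )
        y_new[v] = ratio * (y_new[parent[v]] + delta[parent[v]]) - delta[v]
    return y_new
```

A single call touches every vertex a constant number of times, which is the linear-time behaviour claimed above.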
Intuitively, this update step can be interpreted as an application of the Multiplicative Weights Update algorithm on every parent vertex v that decides how its mass should be split to its children. We repeat this process in a bottom-up manner, and then we simply start with k facilities on the root and begin splitting them based on these ratios while moving downwards. E Analysis of Cut&Round (Proofs of Section 5) In this chapter of the appendix we present all the omitted proofs from Section 5 concerning our online rounding scheme Cut&Round. To avoid repetition, from now on we fix an arbitrary HST T and use FP(T ) to denote the set of all fractional placements of k facilities on the leaves of T . Roadmap. In section E.1, we argue about the correctness of Cut&Round; namely, we show that no matter the input, Cut&Round always returns a set of k-leaves of T where the facilities are placed. Then, in section E.2 we establish the main property of Cut&Round and prove Lemma 5. Finally, in section E.3 we analyze the expected connection cost of Cut&Round s output and prove Item 1 of Theorem 4 (Lemma 6) while in section E.4 we analyze the expected moving cost of Cut&Round s output and prove Item 2 of Theorem 4 (Lemma 7). E.1 Correctness of Cut&Round We begin by proving the correctness of Cut&Round. Fix any y FP(T ) and any set of thresholds α [0, 1]|V (T )|. Let F = Cut&Round(T , y, α). In this section, we will prove that |F| = k, i.e. we will argue that Cut&Round always returns a set of k leaves at which facilities must be placed, as it is expected to. In order to show this, we will need to analyze the Yv variables produced by Cut&Round. Claim 14. For any leaf v L(T ), it holds that Yv {0, 1}. Proof. Observe that for any v V (T ), sub-routine Alloc sets Yv to either yv or yv + 1. By definition of FP(T ), we have yv [0, 1] for each leaf v L(T ). We distinguish between two different cases. If yv [0, 1), then clearly Yv {0, 1}. If yv = 1, then δ(yv) = 0 and thus Alloc will always set Yv = yv = 1. Thus, the claim holds for all leaves v L(T ). Claim 15. Let v / L(T ) be any non-leaf vertex. Then, Yv = P u cld(v) Yu. Proof. Fix any non-leaf vertex v / L(T ). We will analyze the inner loop of Cut&Round that iterates over v s children. Initially, Cut&Round sets Yrem = Yv and yrem = yv. Then, we proceed to itteratively call Alloc, once per child vertex of v. Each time Alloc assigns some value Yu to a child vertex u cld(v), we update Yrem to Yrem Yu; thus, to prove our claim it suffices to argue that after we update the last child vertex, we have Yrem = 0. Since by definition of sub-routine Alloc we know that Yv { yv , yv + 1}, we know that initially (before any child vertex is assigned a value Yu) it holds that Yrem { yrem , yrem + 1}. In fact, a simple case analysis over the decision tree of sub-routine Alloc suffices to see that this invariant holds not only at the beginning, but even after we begin assigning values to the child vertices and update Yrem and yrem. Since y FP(T ), we know that yv = P u cld(v) yu and thus yrem = yu at the time we iterate over the last child vertex u cld(v). Furthermore, from the above discussion we know that Yrem { yu , yu + 1}. Since δ(yu) = δ(yrem), it is easy to verify that in any case Alloc sets Yu = Yrem and thus after the last update we have Yrem = 0, as desired. Proof of Correctness. Recall that by definition, the output of Cut&Round is F = {v L(T ) : Yv = 1}. Since from Claim 14 we know that Yv {0, 1} for all v L(T ), this implies that |F| = P v L(T ) Yv. 
We apply Claim 15 to the root vertex r, then again to each u cld(r) and so on until we reach the leaves. This gives us that Yr = P v L(T ) Yv and thus |F| = Yr. Since by definition Cut&Round sets Yr = k, we have proven that |F| = k as desired. E.2 Proof of Lemma 5 (Computing the Allocation Probabilities) In this section, we formally prove the main property of algorithm Cut&Round, as stated in Lemma 5. Fix any fractional facility placement y FP(T ) and let αv Unif(0, 1) be independent uniformly random thresholds for all v V (T ). Let F = Cut&Round(T , y, α) be the output of algorithm Cut&Round on this set of inputs. Recall that algorithm Cut&Round sets the variables Yv during its execution, for all v V (T ). As we have already discussed, Yv is the total number of facilities in F on the leaves of the sub-tree rooted at v, i.e. Yv = |T(v) F|. We will prove that for any v V (T ), we have Yv = yv with probability 1 δ(yv) yv + 1 with probability δ(yv) We begin by writing down the following property for sub-routine Alloc: Claim 16. Fix any fractional facility placement y FP(T ) and let αv Unif(0, 1) for all v V (T ). For any vertex u V (T ) of T , let Yu = Alloc(yu, yrem, Yrem, αu) be the number of facilities assigned to the sub-tree of u by Line 8 of Algorithm Cut&Round (Algorithm 4). Then, Pα [Yu = yu ] = 1 if Yrem = yrem and δ(yu) δ(yrem) 1 δ(yu) 1 δ(yrem) if Yrem = yrem and δ(yu) > δ(yrem) 0 if Yrem = yrem and δ(yu) > δ(yrem) δ(yrem) δ(yu) δ(yrem) if Yrem = yrem and δ(yu) δ(yrem) Proof. This claim is a direct consequence of sub-routine Alloc s description (Algorithm 5) and the fact that αv Unif(0, 1) for all v V (T ). Using this claim, we are now ready to prove Lemma 5. Proof of Lemma 5. We prove the lemma via a top-down induction on the vertices of T (decreasing level order). For the root vertex, we know that since y FP(T ) we have yr = k and also by definition of Cut&Round we have Yr = k. Thus, we get that Yr = yr = yr with probability 1 δ(yr) = 1 and the claim holds. Now, fix any non-leaf vertex v / L(T ) and assume that Yv = yv with probability 1 δ(yv) and Yv = yv + 1 with probability δ(yv). To complete our induction, we will now proceed to prove the claim for all the children vertices of v. We begin by proving the claim for the first child of vertex v, and then we will show how the same arguments extend for all its children. Let u cld(v) be the first child vertex of v that Cut&Round iterates over. Then, by definition of Cut&Round we have that Yrem = Yv and yrem = yv. Using the inductive hypothesis on v, this implies that Yrem = yrem with probability 1 δ(yrem) and Yrem = yrem + 1 with probability δ(yrem). Conditioning on the value of Yrem, we get Pα [Yu = yu ] = Pα [Yu = yu | Yrem = yrem ] (1 δ(yrem)) + Pα [Yu = yu | Yrem = yrem + 1] δ(yrem) We distinguish between two different cases based on whether δ(yu) δ(yrem) or δ(yu) > δ(yrem). In any case, we can use Claim 16 to substitute the conditional probabilities on the above expression and easily get that Pα [Yu = yu ] = 1 δ(yu) Thus, we have already proven the claim for the first child of v. However, to complete our induction, we need to prove the claim for all children of v and not just the first one. The only property that we used and holds specifically for the first child was that Yrem = yrem with probability 1 δ(yrem) and Yrem = yrem + 1 with probability δ(yrem). Let Y rem and y rem be the updated remaining facilities after the value Yu of the first child has been assigned. 
If we can prove that Y rem = y rem with probability 1 δ(y rem) and Y rem = y rem + 1 with probability δ(y rem), then we can keep applying the same argument and inductively prove the claim for all the children of v. By definition, we have that Y rem = Yrem Yu and y rem = yrem yu. Once again, we distinguish between two different cases. Let δ(yu) δ(yrem). In that case, we get that y rem = yrem yu and also that δ(y rem) = δ(yrem) δ(yu). Since we know that Yrem { yrem , yrem + 1} and Yu { yu , yu + 1}, this implies that Pα[Y rem = y rem ] = Pα[Yrem = yrem Yu = yu ] + Pα[Yrem = yrem + 1 Yu = yu + 1] Using conditional probabilities and the inductive hypothesis on the distribution of Yrem, we obtain Pα [Y rem = y rem ] = Pα [Yu = yu | Yrem = yrem ] (1 δ(yrem)) + Pα [Yu = yu + 1 | Yrem = yrem + 1] δ(yrem) Using Claim 16 to substitute the conditional probabilities, we finally get Pα [Y rem = y rem ] = 1 δ(yrem) + δ(yu) = 1 δ(y rem) as desired. Let δ(yu) > δ(yrem). In that case, we get that y rem = yrem yu 1 and also that δ(y rem) = 1 + δ(yrem) δ(yu). Since we know that Yrem { yrem , yrem + 1} and Yu { yu , yu + 1}, this implies that Pα [Y rem = y rem ] = Pα [Yrem = yrem Yu = yu + 1] Using conditional probabilities and the inductive hypothesis on the distribution of Yrem, we obtain Pα [Y rem = y rem ] = Pα [Yu = yu + 1 | Yrem = yrem ] (1 δ(yrem)) Using Claim 16 to substitute the conditional probabilities, we finally get Pα [Y rem = y rem ] = δ(yu) δ(yrem) = 1 δ(y rem) as desired. Thus, we have concluded the proof of Lemma 5. E.3 Proof of Item 1 in Theorem 4 (Bounding the Expected Connection Cost) Lemma 6. Let F = Cut&Round(y, α) where for all v V (T ), αv Unif(0, 1) independently. Then, Eα[CR(F)] = f R(y) for any R L(T ) Proof. Fix any y FP(T ) and let α [0, 1]|V (T )| be a set of thresholds such that for each v V (T ), αv is drawn independently at random from the uniform distribution, i.e. αv Unif(0, 1). Let F = Cut&Round(T , y, α). We will prove that for any set of clients R L(T ), it holds that Eα[CR(F)] = f R(y). Recall that the Yv variables set by Cut&Round denote the total number of facilities in F that are placed on the sub-tree rooted at vertex v, i.e. Yv = |F T(v)|. As argued in section E.1, we know that Y FP(T ) N, i.e. Y is a valid integral facility placement. Thus, from Claim 1 of section C.1, we know that CR(F) = f R(Y ). This implies that by definition of the fractional connection cost under client request R, we have that v P (j,r) 2lev(v)+1 max (0, 1 Yv) Thus, we get Eα[CR(F)] = X v P (j,r) 2lev(v)+1 Eα[max (0, 1 Yv)] v P (j,r) 2lev(v)+1 Pα[Yv = 0] where the first equality holds by linearity of expectation, and the second equality holds by the fact that Yv N for all v V (T ). Since Yv { yv , yv + 1}, we know that for any v V (T ), Yv can be 0 only if yv [0, 1). Furthermore, from Lemma 5, we know that in the case of uniformly random thresholds, this happens with probability precisely 1 yv. Combining these facts, we get Pα[Yv = 0] = max(0, 1 yv) and thus Eα[CR(F)] = X v P (j,r) 2lev(v)+1 max(0, 1 yv) = f R(y) concluding the proof of Lemma 6. E.4 Proof of Item 2 in Theorem 4 (Bounding the Expected Moving Cost) Lemma 7. Let F = Round&Cut(y, α) and also let F = Round&Cut(T , y , α) where αv Unif(0, 1) for all v V (T ). Then, γ Eα [MT (F, F )] 4 ||y y ||T Proof. Fix any pair of fractional facility placements y, y FP(T ) and let corresponding outputs of Cut&Round be denoted as F = Cut&Round(T , y, α) and F = Cut&Round(T , y , α). 
Observe that the same set of (uniformly random) thresholds αv is used in both cases, as this will play a crucial part in our analysis. To prove Lemma 7, we need to prove that γ Eα [MT (F, F )] 4 ||y y ||T where the expectation is taken over the value of the uniformly random thresholds αv. The proof of Lemma 7 is technically involved, and thus we will break down our approach into smaller sections to ease the presentation. We begin by proving the Lemma in the special case where the transition from y to y has a very simple structure, which we now proceed to define: Definition 10. We say that two fractional facility placements y, y FP(T ) are ϵ-neighboring if there are two leaves s, t L(T ) with least common ancestor p V (T ) such that the following hold: 1. y v = yv ϵ for all v P(s, p) \ {p}. 2. y v = yv + ϵ for all v P(t, p) \ {p}. 3. y v = yv for all other v V (T ). Furthermore, we say that y, y are strictly ϵ-neighboring if ϵ is sufficiently small to satisfy 1. ϵ δ(yv) for all v P(s, p) \ {p} with δ(yv) > 0. 2. ϵ 1 δ(yv) for all v P(t, p) \ {p} with δ(yv) > 0. Basically, if y and y are ϵ-neighboring then y is obtained by pushing ϵ-mass on y from s to t along the unique path that connects these two leaves. Furthermore, if ϵ is sufficiently small so that for any v V (T ) either yv = y v or |yv y v| 1 and at least one of the two is integral, then we say that the two fractional facility placements are strictly ϵ-neighboring. As we will shortly argue, Lemma 7 holds in the special case where y, y are strictly ϵ-neighboring. Claim 17. If y, y FP(T ) are strictly ϵ-neighboring for some ϵ 0, then γ Eα [MT (F, F )] 4 ||y y ||T . Before proving Claim 17, let us first show why it suffices to argue about the general case and prove Lemma 7. Let y, y FP(T ) be any two fractional placements. Recall that ||y y ||T captures precisely the minimum transportation cost from y to y on T . If we break down this transportation plan into small movements of masses between leaves, then we can view it as a sequence of transitions between strictly ϵ-neighboring placements. This is formalized in the following claim: Claim 18. For any y, y FP(T ), there exists a finite sequence y0, y1, . . . , ym FP(T ) of fractional facility placements with y = y0 and y = ym such that 1. yj, yj+1 are strictly ϵ-neighboring for some ϵ 0 for j = 0, 1, . . . , m 1. 2. ||y y ||T = Pm j=1 ||yj yj 1||T . We will now prove Lemma 7. Let Fj = Cut&Round(T , yj, α) be the corresponding output of Cut&Round on yj using the same (uniformly random) thresholds αv. Then, γ Eα [MT (F, F )] γ Eα j=0 MT (Fj, Fj+1) j=0 Eα [MT (Fj, Fj+1)] j=0 ||yj yj+1||T = 4 ||y y ||T In the above calculation, the first inequality holds from the fact that the minimum transportation cost satisfies the triangular inequality. The first equality holds from linearity of expectation. The second inequality holds from Claim 17 and the second equality holds from Claim 18. Thus, we have shown that proving Lemma 7 for the special case of strictly ϵ-neighboring fractional facility placements y, y suffices to prove Lemma 7 for the general case of any y, y FP(T ) and conclude this section. The rest of this section is dedicated to proving Claim 17, which is the main technical challenge towards proving Lemma 7. Proof of Claim 17. Fix any pair of strictly ϵ-neighboring fractional facility placements y, y FP(T ) and let the corresponding outputs of Cut&Round be F = Cut&Round(T , y, α) and F = Cut&Round(T , y , α). 
In section E.1 we have already shown that F, F L(T ) are valid facility placements since |F| = |F | = k. Let Y, Y FP(T ) be used to denote the corresponding integral placements, i.e. Yv := |L(T ) F| = number of facilities in F placed on the leaves of the sub-tree rooted at v Y v := |L(T ) F | = number of facilities in F placed on the leaves of the sub-tree rooted at v Recall that Y and Y are precisely the values of the Y -variables that algorithm Cut&Round sets. As shown in Claim 2 of Section E.4, we know that γ MT (F, F ) = ||Y Y ||T . Thus, in order to prove Claim 17, we need to show that Eα [||Y Y ||T ] 4 ||y y ||T Since y, y are strictly ϵ-neighboring fractional facility placements, we know that there exist two leaves s, t L(T ) with lowest common ancestor p V (T ) such that |yv y v| is ϵ among vertices on the (unique) path from s to t (excluding vertex p) and is 0 otherwise. Let L = lev(p). Then, by definition of || ||T we have ||y y ||T = X v V (T ) 2lev(v) |yv y v| = 2ϵ l=0 2l = 2ϵ (2L 1) (5) Furthermore, recall that from Lemma 5, Cut&Round rounds yv to either Yv = yv + 1 with probability δ(yv) or to yv with probability 1 δ(yv). Since ϵ is sufficiently small so that either yv = y v or |yv y v| 1 and at lowest one of the two is integral (and it is thus always rounded to itself), we get that |Yv Y v| 1 for all v V (T ). This implies that Eα [||Y Y ||T ] = Eα v V (T ) 2lev(v) |Yv Y v| v V (T ) 2lev(v) Eα [|Yv Y v|] v V (T ) 2lev(v) Pα [|Yv Y v| = 1] Let l [0, h(T )] be any level on the HST T and let Cl be used to denote the expected number of vertices at level l that are rounded to different values, i.e. Cl := Eα [|{v V (T ) : lev(v) = l and Yv = Y v}|] Then, the above imply that Eα [||Y Y ||T ] = l=0 2l Cl (6) It remains to compute Cl for all l [0, h(T )]. This is done in Claim 19, where we prove that Cl = 0 for l L (the level of s and t s lowest common ancestor) and Cl 4ϵ (L l) otherwise. Combining this claim with equations (5) and (6) immediately implies that Eα [||Y Y ||T ] 4 ||y y ||T which completes the proof of Claim 17. Claim 19. For any l L, Cl = 0. For any l < L, Cl 4ϵ (L l). Proof. Recall that for fixed thresholds αv, the output of Cut&Round is deterministic. Since L is the level of vertex p (the lowest common ancestor of leaves s, t) and by definition of strictly ϵ-neighboring placements y, y we know yv = y v for any vertex v such that lev(v) L, we immediately get that Cl = 0 for any l L. We will now proceed to analyze Cl for any l < L. We partition the set of vertices v V (T ) with lev(v) = l into three sets: A vertex v is called active if it lies on the (unique) path between leaves s and t. A vertex v is called inactive if it is not a descendant of p (the lowest common ancestor of leaves s and t). A vertex v is called affected if it is not active and is a descendant of p. Obviously, each vertex v with lev(v) = l must lie in exactly one of these sets. Inactive Vertices. We will prove that for every inactive vertex v, Pα[Yv = Y v] = 0. Since the same set of thresholds α is used to round both y and y , the output of Cut&Round is deterministic. Furthermore, if a vertex v is inactive, then we know that yv = y v and also yu = y u for any ancestor vertex of u of v (by Definition 10 of neighboring facility placements). Thus, this immediately implies that Yv = Y v with probability 1 and thus we do not need to account for inactive vertices when computing Cl. Active Vertices. We will prove that for every active vertex v, Pα[Yv = Y v] = ϵ. 
Recall that any active vertex is either an ancestor of leaf s or leaf t. We will only prove the claim in the case when v is an ancestor of t; the other case is completely analogous. A formal proof by induction is given in Claim 20, presented at the end of this section. As a direct corollary, since there are only two active vertices per level, the expected number of active vertices in level l that are rounded two different values is precisely 2ϵ. Affected Vertices. Finally, we will now analyze the affected vertices. By definition, we know that each affected vertex v will have a unique active ancestor (also counting p). We partition the set of affected vertices on level l into 2(L l 1) + 1 groups, based on their corresponding active ancestor. The main argument we need to establish is that by definition of Round&Cut, at most one vertex in each of these groups can be rounded to a different value. To see this, observe that Round&Cut is monotone, in the sense that if y v yv and also y u yu for all ancestors u of v, then (assuming the same set of thresholds is used), we know that Y v Yv. Using this fact on the vertices of a group, since all of them can either only increase or decrease, in order to maintain balance at most one of them can change, otherwise we would get a change of 2 or more on the parent node which cannot happen. Furthermore, for a specific group, if both the common active ancestor and its child u with yu = y u end up rounded to the same value, we get (from the fact that the same thresholds are used) that all the vertices in the group will be rounded to the same value. Thus, in order for a (unique) vertex in any group to change, at least one of two active vertices must change, which happens with probability at most 2ϵ. Since there are 2(L l 1) + 1 groups, we get as a corollary that the expected number of affected vertices at level l that get rounded to a different value is at most 2ϵ (2L 2l 1). Combining everything, we get that Cl 0 + 2ϵ + 2ϵ (2L 2l 1) = 4ϵ (L l). Claim 20. Let v be any active vertex that is an ancestor of t. Then, Pα[Yv = Y v] = ϵ. Proof. In fact, we will in fact prove the following stronger claim, Pα[Yv = yv and Y v = yv ] = 1 δ(yv) ϵ. Pα[Yv = yv and Y v = yv + 1] = ϵ. Pα[Yv = yv + 1 and Y v = yv ] = 0 Pα[Yv = yv + 1 and Y v = yv + 1] = δ(yv) which clearly implies Claim 20. Once again, we will prove the claim via induction, starting from the highest active ancestor of t at level l = L 1 and moving towards the leaf t at level l = 0. We begin by mentioning that for vertex p (s and t s lowest common ancestor at level L) we know for sure that Yp = Y p since yp = y p and yu = y u for any u such that lev(u) L; thus, since the same set of thresholds α is used, the execution of Cut&Round will be identical up to this point. We assume that the first child of any vertex v visited by Alloc is always the active child; this can be done without loss of generality as the order that Alloc visits the vertices hasn t played any part on our analysis yet. Base of the induction. For the base of the induction, let v be the (unique) child of p that is an ancestor of t; i.e. let v be the highest active ancestor of t. We have already mentioned that Yp = Y p with probability 1. Thus, it can either be the case that Yp = Y p = yp of Yp = Y p = yp +1. From Lemma 5 we know that the first happens with probability 1 δ(yp) and the latter with probability δ(yp). We distinguish between the following cases: Let δ(yv) < δ(yp). 
Then, if Yp = Y p = yp we know from the description of Alloc that Yv = Y v = yv with probability 1. On the other hand, if Yp = Y p = yp +1, we know that Yv = yv +1 if αv δ(yv)/δ(yp) and likewise Y v = yv +1 if αv (δ(yv)+ϵ)/(δ(yp)). Thus, by conditioning on the values of Yp and Y p, we get 1. Pα[Yv = yv and Y v = yv ] = (1 δ(yp)) 1+δ(yp) (1 δ(yv)+ϵ δ(yp) ) = 1 δ(yv) ϵ. 2. Pα[Yv = yv and Y v = yv + 1] = (1 δ(yp)) 0 + δ(yp) ( δ(yv)+ϵ δ(yp) δ(yv) δ(yp)) = ϵ. 3. Pα[Yv = yv + 1 and Y v = yv ] = (1 δ(yp)) 0 + δ(yp) 0 = 0 . 4. Pα[Yv = yv + 1 and Y v = yv + 1] = (1 δ(yp)) 0 + δ(yp) δ(yv) δ(yp) = δ(yv). Let δ(yv) δ(yp). Then, if Yp = Y p = yp + 1 we know from the description of Alloc that Yv = Y v = yv + 1 with probability 1. On the other hand, if Yp = Y p = yp , we know that Yv = yv + 1 if αv (δ(yv) δ(yp))/(1 δ(yp)) and likewise Y v = yv + 1 if αv (δ(yv) + ϵ δ(yp))/(1 δ(yp)). Thus, by conditioning on the values of Yp and Y p, we get 1. Pα[Yv = yv and Y v = yv ] = (1 δ(yp)) (1 δ(yv)+ϵ δ(yp) 1 δ(yp) ) + δ(yp) 0 = 1 δ(yv) ϵ. 2. Pα[Yv = yv and Y v = yv ] = (1 δ(yp)) ( δ(yv)+ϵ δ(yp) 1 δ(yp) δ(yv) δ(yp) 1 δ(yp) )+δ(yp) 0 = ϵ. 3. Pα[Yv = yv + 1 and Y v = yv ] = (1 δ(yp)) 0 + δ(yp) 0 = 1 δ(yv) = 0. 4. Pα[Yv = yv + 1 and Y v = yv ] = (1 δ(yp)) δ(yv) δ(yp) 1 δ(yp) + δ(yp) 1δ(yv). So in both cases, the base of the induction holds. Inductive Step. Using the exact same approach, we can prove the claim for any active ancestor u of t, assuming that the claim holds for u s father v = p(u). The only difference, is that now we can t claim that Yv = Y v with probability 1. Instead, there are three different cases that we need to consider; namely 1. Yv = Y v = yv with probability 1 ϵ δ(yv). 2. Yv = Y v = yv + 1 with probability δ(yv). 3. Yv = yv and Y v = yv + 1 with probability ϵ. where the probabilities hold from the inductive hypothesis on the parent vertex v. Next, we will need to once again consider the case of whether δ(yu) < δ(yv) or not (notice that the same relation will hold for y u and y v) and use the description of Alloc to get the assignment probabilities. Since this is a simple matter of arithmetic, the details are omitted. F Experimental Evaluation In this section we experimentally evaluate the performance of Algorithm 2 with respect to the best fixed facility placement and compare it with the respective performance of the algorithm proposed by [31]. In all the following experiments the step-size of Algorithm 3 (subroutine of Algorithm 2) is set to η := max(γ, 1) Periodically Moving Clients. We first present a simple setting to indicate the inefficiency of the online learning algorithm of [31] in handling moving costs. In this experiment the underlying graph is the 0.01-discretization of [0, 1] [0, 1]. At each round t 1, we periodically select one of four balls of radius R = 0.2 depicted in Figure 1 and then a client arrives uniformly at random on the selected ball. In Figure 1 and Table 1 we present the overall cost of Algorithm 2 and the algorithm of [31] for different values of facility-weight γ, k = 3 facilities and T = 4000 time-steps. In all cases, the facilities of Algorithm 2 eventually converge to three of the four ball-centers, which is the optimal fixed facility placement. As the experiment reveals, the algorithm of [31] admits significantly larger cost as the facility-weight increases while Algorithm 2 is robust to the increase. 
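For reproducibility, the following is a minimal sketch of how such a request sequence can be generated. The grid step, ball radius, horizon and the periodic selection follow the description above; the ball centers and the switching period are not specified numerically in the text (the centers are only depicted in Figure 1), so the values below are purely illustrative assumptions.

```python
import random

GRID_STEP = 0.01                 # the 0.01-discretization of [0, 1] x [0, 1]
RADIUS = 0.2                     # radius of the four balls
# Illustrative ball centers and switching period: the exact values are only
# shown in Figure 1 / implied by "periodically", so these are assumptions.
CENTERS = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.75), (0.75, 0.25)]
PERIOD = 100                     # rounds spent on each ball before switching

def snap(x):
    """Snap a coordinate to the 0.01 grid."""
    return round(round(x / GRID_STEP) * GRID_STEP, 2)

def client_at_round(t):
    """Select the current ball periodically and sample one client uniformly
    at random from it (rejection sampling), snapped to the grid."""
    cx, cy = CENTERS[(t // PERIOD) % len(CENTERS)]
    while True:
        dx, dy = random.uniform(-RADIUS, RADIUS), random.uniform(-RADIUS, RADIUS)
        if dx * dx + dy * dy <= RADIUS ** 2:
            x = min(max(cx + dx, 0.0), 1.0)
            y = min(max(cy + dy, 0.0), 1.0)
            return (snap(x), snap(y))

# The adversarial client sequence fed to both learners for T = 4000 rounds.
requests = [client_at_round(t) for t in range(4000)]
```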
Figure 1: We plot the evolution of the approximation ratio for Algorithm 2 (red curve) and the algorithm from [31] (blue curve) compared to the hindsight optimal facility placement, for facility weights γ = 0, γ = 1 and γ = 10. Both scales are logarithmic. The bottom-right plot depicts the facilities eventually placed by our Algorithm 2, which coincide with the optimal configuration.

Table 1: Ratio of the overall cost of both algorithms with respect to the hindsight optimal (20 runs).

Moving Clients    γ = 0            γ = 1            γ = 10
[31]              1.297 ± 0.045    1.943 ± 0.466    3.388 ± 1.335
Algorithm 2       1.083 ± 0.001    1.091 ± 0.001    1.343 ± 0.014

Real-World Datasets. We evaluate the performance of Algorithm 2 on the MNIST and CIFAR10 datasets. We randomly sample n = 10000 images and construct a graph where each image corresponds to a vertex, with edge weights given by the Euclidean distance between the respective images. At each round t, an image is sampled uniformly at random and a client arrives at the corresponding vertex. We then evaluate Algorithm 2 in this setting for T = 3000 rounds and k = 10 facilities. In Table 2 we present the ratio of the overall cost of Algorithm 2 over the cost of the fractional hindsight optimal7. As our experiments indicate, the sub-optimality of Algorithm 2 is far smaller than the theoretical O(log n) upper bound on the regret.

Table 2: The ratio of the cost of Algorithm 2 with respect to the cost of the fractional hindsight optimal facility placement (20 runs).

Algorithm 2    γ = 0           γ = 1           γ = 10
MNIST          1.118 ± 0.01    1.403 ± 0.04    1.5631 ± 0.03
CIFAR10        1.113 ± 0.01    1.189 ± 0.04    1.59 ± 0.31

Beyond unit batch sizes and random arrivals. Finally, we once again evaluate the performance of Algorithm 2 on the MNIST and CIFAR10 datasets. This time, the requests arrive in batches of size R = 10 for T = 3000 rounds. In order to go beyond the random arrival model, we first sample the R · T requested vertices uniformly at random from [n] and then order them based on their respective categories, using the lexicographical vector order to break ties. Then, we partition these requests into T batches of size R and sequentially reveal them to the algorithm as usual. As a result, all images/vertices from the first category are requested first, then the second, etc. (a minimal sketch of this construction is given at the end of this section). In Table 3 we present our experimental evaluation on the above constructed sequence. As our experiments indicate, our algorithm admits far better performance than the theoretical O(log n) guarantee, even on sequences with larger batch sizes and non-random arrivals.

Table 3: The ratio of the cost of our Algorithm with respect to the cost of the fractional hindsight optimal facility placement.

Our Algorithm    γ = 0    γ = 1    γ = 10
CIFAR10          1.050    1.048    1.051
MNIST            1.082    1.045    1.12

7The cost of the fractional hindsight optimal can be efficiently computed [31] and lower bounds the cost of the optimal facility placement. As a result, the ratios presented in Tables 2 and 3 are upper bounds on the actual ratio between Algorithm 2 and the optimal facility placement.
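As promised above, here is a minimal sketch of the batched, category-sorted request sequence used in the last experiment. Breaking ties by the full lexicographical order of the image vectors is replaced here by breaking ties on the vertex index, purely for brevity; everything else follows the construction described in the text, and all identifiers are ours.

```python
import random

def batched_category_sorted_requests(labels, R=10, T=3000, seed=0):
    """Sample R*T vertices uniformly at random from [n], order them by
    category, and split them into T batches of size R. labels[v] is the
    category of vertex/image v (e.g. an MNIST digit or CIFAR10 class).
    Ties are broken here by vertex index instead of the lexicographical
    order of the image vectors, purely to keep the sketch short."""
    rng = random.Random(seed)
    n = len(labels)
    sampled = [rng.randrange(n) for _ in range(R * T)]
    # all vertices of the first category are requested first, then the second, etc.
    sampled.sort(key=lambda v: (labels[v], v))
    return [sampled[t * R:(t + 1) * R] for t in range(T)]
```

Feeding these T batches to the learner one per round produces an arrival pattern of the same type as the one used for Table 3, where all vertices of the first category arrive before those of the second, and so on.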