# optimal_clustering_with_bandit_feedback__cc28338b.pdf Journal of Machine Learning Research 25 (2024) 1-54 Submitted 9/22; Revised 3/24; Published 7/24 Optimal Clustering with Bandit Feedback Junwen Yang junwen yang@u.nus.edu Institute of Operations Research and Analytics National University of Singapore 117602, Singapore Zixin Zhong zixinzhong@hkust-gz.edu.cn Thrust of Data Science and Analytics Hong Kong University of Science and Technology (Guangzhou) 511453, Guangzhou, Guangdong, China Vincent Y. F. Tan vtan@nus.edu.sg Department of Mathematics Department of Electrical and Computer Engineering Institute of Operations Research and Analytics National University of Singapore 119076, Singapore Editor: Chris Wiggins This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent s task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant δ. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC s performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm. Keywords: clustering, K-means, online learning, multi-armed bandits, pure exploration 1. Introduction Clustering, the task of partitioning a set of items into smaller clusters based on their commonalities, is one of the most fundamental tasks in data analysis and machine learning with a rich and diverse history (Driver and Kroeber, 1932; Cattell, 1943; Ruspini, 1969; Jain et al., 1999; Celebi et al., 2013). It has numerous applications in a wide variety of c 2024 Junwen Yang, Zixin Zhong and Vincent Y. F. Tan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1088.html. Yang, Zhong, and Tan areas including business analytics, bioinformatics, pattern recognition, and social sciences. In this era of abundance of medical data, clustering is a powerful tool to uncover the underlying patterns of unknown treatments or diseases when related systematic knowledge is underdeveloped (Mac Cuish and Mac Cuish, 2010). In commercial decision making, aiming at increasing customer satisfaction and maximizing potential benefit, marketers utilize clustering strategies to partition the inclusive business market into more narrow market segments with similar characteristics (Chaturvedi et al., 1997). Due to its enormous importance in practical applications, clustering has been studied extensively in the literature from multidisciplinary perspectives. A plethora of algorithms (e.g., K-means and spectral clustering) have been proposed for the task of clustering (see Jain et al. (1999) or Saxena et al. (2017) for comprehensive reviews). In particular, the K-means algorithm by Mac Queen (1967) and Lloyd (1982) is arguably the most ubiquitous algorithm due to its simplicity, efficiency, and empirical successes (Jain, 2010). Although modern data analysis has benefited immensely from the abundance and richness of data, there is an urgent need to develop new techniques that are adapted to the sequential and uncertain nature of data collection. In this paper, we are interested in online clustering with bandit feedback, which is an online variant of the classical offline clustering problem. With bandit feedback, the agent only observes a noisy measurement on the selected arm (or item) at each time step. However, the agent can decide which arm to pull adaptively, so as to minimize the expected number of total arm pulls it takes to correctly partition the given arm set with a given (high) probability. Two Motivating Examples. Our online clustering model captures various contemporary real-world scenarios, in which data contaminated by some degree of measurement noise become available in a sequential and adaptive fashion. We are firstly motivated by medical and public health professionals arduous battles against new viruses (e.g., COVID-19). In the face of an unknown type of virus that has different variants, let us assume that there is only one dominant variant in each sub-region of a particular state. When accurate laboratory analysis is not available especially in underdeveloped regions, how can healthcare professionals partition the virus samples into specific dominant variants based on noisy measurements of infectious patients from various sub-regions? This realistic and critical problem can be well modelled by our framework, namely online clustering with bandit feedback, where the healthcare personnel adaptively obtains independent observations of patients from the selected sub-regions and finally partitions the whole state into various groups based on the types of dominant variants. Due to the prohibitive costs in obtaining the measurements, this must be done with as few of them as possible. Our second motivation is in digital marketing in which customer feedback on certain products are collected in an online manner and always accompanied by random or systematic noise. For market segmentation, one important objective is to reduce the cost of feedback collection while maintaining a high quality of clustering, so that subsequent recommendations are suitably tailored to particular groups of consumers. See Figure 1 for a protocol of online market segmentation that our framework is able to model well. Main Contributions. Our main contributions are as follows: (i) We formulate the online clustering with bandit feedback problem in Section 3. We identify some subtleties of the framework, which may be of independent interest for Optimal Clustering with Bandit Feedback Time 1 2 3 4 5 6 7 8 9 10 11 12 Figure 1: Example involving partitioning 6 sub-groups of customers into 3 market segments with bandit feedback. Initially, customers are preliminarily divided into subgroups, treated as arms, based on some basic characteristics such as age and gender. At each time step, an algorithm chooses a sub-group to query and receives a multidimensional sample (e.g., product ratings) from a single customer within that sub-group. Based on the previously chosen sub-groups and their samples, the algorithm decides which sub-group to query next. Finally, when it is sufficiently confident of producing a partition of the sub-groups into market segments that share similar preferences, the algorithm terminates. future research. In particular, there are multiple ways that a partition of an arm set can be represented. This is further complicated by the fact that each cluster is also identified by a mean vector. We propose a precise expression for an instance of cluster bandits, and establish two equivalence relations for the partitions and bandit instances, respectively. (ii) In Section 4, we derive an instance-dependent (information-theoretic) lower bound on the expected sample complexity for the online clustering problem; this lower bound, however, involves a tricky optimization problem. By exploiting the structure of the problem and leveraging an interesting combinatorial property, we simplify the optimization to a finite convex minimax problem, which can be solved efficiently. Further analyses of the lower bound provide fundamental insights and essential tools for the design of our algorithm. (iii) In Section 5, we propose and analyze Bandit Online Clustering (or BOC). We show that it is not only computationally efficient but also asymptotically optimal in the sense that its expected sample complexity attains the lower bound as the error probability tends to zero. This result is somewhat surprising since solving the corresponding offline clustering problem exactly is NP-hard (Aloise et al., 2009; Mahajan et al., 2012). En route to overcoming this combinatorial challenge and demonstrating Yang, Zhong, and Tan desirable properties of BOC, we utilize a variant of the classical K-means algorithm in the sampling rule and propose a novel stopping rule for adaptive sequential testing. In particular, the natural stopping rule based on the generalized likelihood ratio (GLR) statistic (Chernoff, 1959; Garivier and Kaufmann, 2016) is intractable. Our workaround involves another statistic that exploits our insights on the lower bound. Finally in Section 6, we show via numerical experiments that BOC is indeed asymptotically optimal and significantly outperforms a uniform sampling strategy on both synthetic and real-world benchmark datasets. 2. Literature Review Clustering and the K-means algorithm. Although we consider an online version of the clustering problem, a certain variant of the classical (offline) K-means algorithm serves as a subroutine of BOC. Given the d-dimensional observations of M items, the K-means algorithm (Mac Queen, 1967; Lloyd, 1982) aims at partitioning the items into K disjoint clusters in order to minimize the sum of squared Euclidean distances between each item to the center of its associated cluster. If each item is associated to a certain weight, then a reasonable objective is the minimization of the weighted sum of squared Euclidean distances. Due to the prominence of the K-means algorithm, the corresponding clustering problem is also referred to as the (weighted) K-means clustering problem in the literature. This non-convex K-means clustering problem has been shown to be NP-hard even for d = 2 (Mahajan et al., 2012) or K = 2 (Aloise et al., 2009). In fact, for general d, K and M, the problem can be exactly solved in O(Md K+1) time (Inaba et al., 1994). When used as a heuristic algorithm, the performance of K-means depends to a large extent on how it is initialized. To improve the stability and the quality of the eventual solution that K-means produces, various initialization methods have been proposed (e.g., Forgy s method (Forgy, 1965), Maximin (Gonzalez, 1985), K-means++ (Arthur and Vassilvitskii, 2007), PCA-Part (Su and Dy, 2007)). See Celebi et al. (2013) for detailed comparisons on initialization methods for K-means. There have been some attempts in the literature to adapt the vanilla K-means algorithm to an online framework involving streams of incoming data (Choromanska and Monteleoni, 2012; Liberty et al., 2016; Cohen-Addad et al., 2021). In this line of works, each item only has one observation. In particular, at each time step t, the agent receives an observation of one item and has to determine its cluster index before the arrival of next observation. This is vastly different from our setting of bandit feedback, where the arm pulls are determined adaptively by the agent and the observations are stochastic. In addition, another related work by Khaleghi et al. (2012) addressed the case where every item is associated with an infinite sequence generated by one of K unknown stationary ergodic processes. At each time step t, some observations arrive, each being either a new sequence or the continuation of some previously observed sequence, and the agent needs to partition the observed sequences into K groups. Finally, Mazumdar and Saha (2017) studied clustering with noisy queries in both the adaptive and non-adaptive settings, where the agent receives a potentially incorrect answer on whether two items belong to the same cluster at each time step. To the best of our knowledge, the online clustering with bandit feedback problem (formally described in Section 3) has not been considered before. Optimal Clustering with Bandit Feedback Bandit Algorithms. The stochastic multi-armed bandit problem, originally introduced by Thompson (1933), provides a simple but powerful online learning framework. This problem has been studied extensively in recent years. While the regret minimization problem aims at maximizing the cumulative reward by balancing the trade-offbetween exploration and exploitation (Auer et al., 2002; Abbasi-Yadkori et al., 2011; Bubeck and Cesa-Bianchi, 2012; Agrawal and Goyal, 2012), the pure exploration problem focuses on efficient exploration with specific goals, e.g., best (top-k) arm identification (Even-Dar et al., 2006; Audibert et al., 2010; Karnin et al., 2013; Garivier and Kaufmann, 2016; Jun et al., 2016), and its variants in linear bandits (Soare et al., 2014; Jedra and Proutiere, 2020; Yang and Tan, 2022), and cascading bandits (Zhong et al., 2020), among others. Our online clustering task can also be viewed as a pure exploration problem although the rewards are multidimensional and the arm set has an inherent cluster structure. It is worth mentioning that Prabhu et al. (2020) introduced a framework of sequential multi-hypothesis testing with bandit feedback, which is a generalization of the odd arm identification problem (Vaidhiyan and Sundaresan, 2017). Our online clustering problem falls within this framework if each partition of the arm set is viewed as a hypothesis. However, the methodology proposed in Prabhu et al. (2020) relies heavily on a strong continuity assumption on the proportions of arm pulls (Assumption A therein). Even if one accepts continuity assumption, the total number of hypotheses (i.e., the number of possible partitions) is prohibitively large and the corresponding computation of the modified GLR is intractable (see Remark 17 for more details). Our work does away with the continuity assumption and, in fact, proves that the required continuity property holds (see Proposition 9). We refer to Lattimore and Szepesv ari (2020) for a comprehensive review on bandit algorithms. There is also a line of works that incorporates cluster structures into multi-armed bandits. In particular, Nguyen and Lauw (2014); Gentile et al. (2014), Li et al. (2016), and Carlsson et al. (2021) assumed that users can be divided into groups and the users within each group receive similar rewards for each arm. Besides, Bouneffouf et al. (2019) and Singh et al. (2020) assumed that the arm set is pre-clustered and the reward distributions of the arms within each cluster are similar. However, all the works mentioned above focus on leveraging the cluster structure to improve the performance of regret minimization, which differs from our objective, i.e., to uncover the underlying partition of the arms. Finally, Wang and Scarlett (2022) considered a min-max grouped bandits problem in which the objective is to find a sub-group (among possibly overlapping groups) whose worst arm has the highest mean reward. 3. Problem Setup and Preliminaries A Bandit Feedback Model with Cluster Structure. We consider a bandit feedback model, in which the arm set has an inherent cluster structure. In particular, the agent is given an arm set A = [M], which can be partitioned into K disjoint nonempty clusters, and the arms in the same cluster share the same d-dimensional mean vector, also referred to as the center of the corresponding cluster. Without loss of generality, we assume that K < M, otherwise there is only one possible partition. Therefore, an instance of cluster bandits can be fully characterized by a pair (c, U), where c = [c1, c2, . . . , c M] [K]M consists of the cluster indices of the arms and U = [µ(1), µ(2), . . . , µ(K)] Rd K represents the K centers Yang, Zhong, and Tan {m [M] : cm = 1}{m [M] : cm = 2} {m [M] : cm = 3} Arm set [M]: 1 2 3 4 5 6 7 8 9 10 11 12 Mean vectors: µ(1) µ(2) µ(3) If At = 2, then Xt = µ(c2) + ηt. If At+1 = 8, then Xt+1 = µ(c8) + ηt+1. Figure 2: Online clustering with bandit feedback with K = 3 and M = 12. of the clusters. Since each cluster has at least one arm, for any cluster k [K], there exists an arm m [M] such that cm = k. To reduce clutter and ease the reading, we always index the arms and the clusters by subscripts and numbers in parentheses, respectively. At each time t, the agent selects an arm At from the arm set A, and then observes an noisy measurement on the mean vector of At, i.e., Xt = µ(c At) + ηt where {ηt} t=1 Rd is a sequence of independent (noise) random variables, each following the standard d-dimensional Gaussian distribution N(0, Id). See Figure 2 for a schematic of the model. Remark 1 In this work, we assume that the noise at each time t is a standard d-dimensional Gaussian random vector. For general (e.g., non-diagonal) noise covariance matrices, one can use the Cholesky decomposition to transform the raw observations in an affine manner into ones that have an identity covariance matrix, without loss of generality. The Equivalences of Partitions and Instances. The representation of a partition c or an instance (c, U) is not unique and we can accordingly define two equivalence relations. For a permutation σ on [K], let σ(c) := [σ(c1), σ(c2), . . . , σ(c M)] and σ(U) := [µ(σ(1)), µ(σ(2)), . . . , µ(σ(K))]. Similarly, we define σ(c, U) := (σ(c), σ(U)). For two partitions c and c , if there exists a permutation σ on [K] such that c = σ(c ), then we write c c . Due to the bijectivity of permutations, there also exists another permutation σ 1 on [K] such that c = σ 1(c). Therefore, it is straightforward to verify this is indeed an equivalence relation. For two instances (c, U) and (c , U ),1 if µ(cm) = µ (c m) for all m [M], then the two instances are equivalent and we denote this as (c, U) (c , U ). Note that (c , U ) = σ(c, U) for some permutation σ indicates (c, U) (c , U ), but the reverse implication may not be true. That is to say, if (c, U) (c , U ), there may not exist a permutation σ such that (c , U ) = σ(c, U). See Example 1. 1. Throughout this paper, we denote the columns of U Rd K as {µ(k)}K k=1. The same convention applies to other forms of U, such as U, U , and U . Optimal Clustering with Bandit Feedback Example 1 Let K = 2, M = 3 and d = 1. Consider the three instances (c, U), (c , U ) and (c , U ) where c = [1, 1, 2] , c = [2, 2, 1], c = [1, 2, 2] and U = U = U = [1, 1]. Although (c, U) (c , U ) (c , U ), it holds that c c c . Online Clustering with Bandit Feedback. We consider a pure exploration task, aiming to find the unknown cluster structure c by pulling arms adaptively. In the fixedconfidence setting where a confidence level δ (0, 1) is given, the agent is required to find a correct partition cout of the arm set [M] (i.e., cout c) with a probability of at least 1 δ in the smallest number of time steps. More formally, the agent uses an online algorithm π to decide the arm At to pull at each time step t, to choose a time τδ to stop pulling arms, and to recommend cout as the partition to output eventually. Let Ft = σ(A1, X1, . . . , At, Xt) denote the σ-field generated by the past measurements up to and including time t. Thus, the online algorithm π consists of three components, namely, the sampling rule selects At, which is Ft 1-measurable; the stopping rule determines a stopping time τδ adapted to the filtration (Ft) t=1; the recommendation rule outputs a partition cout, which is Fτδ-measurable. Definition 2 For a fixed confidence level δ (0, 1), an online clustering algorithm π is said to be δ-PAC (probably approximately correct) if for all instances (c, U), Pr(τδ < ) = 1 and the probability of error Pr(cout c) δ. Our overarching goal is to design a computationally efficient online δ-PAC clustering algorithm while minimizing the expected sample complexity E[τδ]. To rule out pathological cases that might lead to infinite expected sample complexities for any algorithm, throughout this work, we only consider partitioning the instances (c, U) that satisfy the following natural property: the mean vectors for different clusters are distinct (i.e., the instances (c, U) subject to U U := { U Rd K : µ(k1) = µ(k2) for all 1 k1 < k2 K}). However, this property might not hold for general instances of cluster bandits (including the alternative instances Alt(c) that we introduce in the next section). Note that the odd arm identification problem (Vaidhiyan and Sundaresan, 2017; Karthik and Sundaresan, 2020, 2021)) is not a special case of our online clustering problem with K = 2 since we do not require the knowledge of the number of arms in each cluster. Other Notations. Let N denote the set of positive integers and R+ denote the set of non-negative real numbers. For any positive integer N, PN := {x [0, 1]N : x 1 = 1} denotes the probability simplex in RN while P+ N := {x (0, 1)N : x 1 = 1} denotes the open probability simplex in RN. For two partitions c and c , let d H(c, c ) denote the Hamming distance between c and c , i.e., d H(c, c ) := PM m=1 1{cm = c m}. For any a, b (0, 1), the binary relative entropy, which is the KL-divergence between Bernoulli distributions with means a and b, is denoted as d KL(a, b) := a log(a/b) + (1 a) log((1 a)/(1 b)). When we write i = arg mini A f(i) where f(i) is a function of i and A is a finite set of integers or a finite set of vectors of integers, we are referring to the minimum index (in lexicographic order) in the set {i A : f(i) = minj A f(j)} if it is not a singleton. Yang, Zhong, and Tan 4. Lower Bound In this section, we leverage the ubiquitous change-of-measure argument for deriving impossibility results to derive an instance-dependent lower bound on the expected sample complexity E[τδ] for the online clustering problem. The lower bound is closely related to a combinatorial optimization problem. Although the optimization in its original form appears to be intractable, we prove an interesting combinatorial property and reformulate the optimization as a finite convex minimax problem. Moreover, we further present some results on the computation and other useful properties (e.g., the continuity of the optimizer and the optimal value) of the optimization problem (and its sub-problem) embedded in the lower bound, which are fundamental and essential in our algorithm design (see Section 5). The change-of-measure argument, of which the key idea dates back to Chernoff(1959), is ubiquitous in showing various lower bounds in (and beyond) bandit problems (e.g., regret minimization (Lai and Robbins, 1985; Lattimore and Szepesv ari, 2017), pure exploration (Kaufmann et al., 2016; Vaidhiyan and Sundaresan, 2017)). Using this technique, the probabilities of the same event under different probability measures are related via the KL-divergence between the two measures. For any fixed instance (c, U), we define Alt(c) := {(c , U ) : c = c for any (c , U ) (c , U )}, which is the set of alternative instances where c is not a correct partition. In particular, we consider the probabilities of correctly identifying the cluster structures under (c, U) and the instances in Alt(c), and apply the transportation inequality (Kaufmann et al., 2016, Lemma 1). The instance-dependent lower bound on E[τδ] is presented in the following theorem; see Appendix B.1 for the proof. Theorem 3 For a fixed confidence level δ (0, 1) and instance (c, U), any δ-PAC online clustering algorithm satisfies E[τδ] d KL(δ, 1 δ)D (c, U) D (c, U) := 1 2 sup λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 1 . (1) Furthermore, lim inf δ 0 E[τδ] log(1/δ) D (c, U). (2) We refer to D (c, U) as the hardness parameter of the online clustering task in the sequel. The asymptotic version of the instance-dependent lower bound given in Equation (2) in Theorem 3 is tight in view of the expected sample complexity of the efficient algorithm we present in Section 5. Intuitively, any λ PM in Equation (1) can be understood as the proportion of arm pulls, which inspires the design of the sampling rule of a δ-PAC online clustering algorithm. The agent wishes to find the optimal proportion of arm pulls to distinguish the instance c from the most confusing alternative instances in Alt(c) (for which c is not a correct partition). Therefore, with the knowledge of the instance (c, U), the Optimal Clustering with Bandit Feedback optimization problem embedded in (1) naturally unveils the optimal sampling rule, which is the basic idea behind the design of our sampling rule in Section 5. For ease of description, we refer to the entire optimization problem and the inner infimization in (1) as Problem ( ) and Problem ( ), respectively, i.e., Problem ( ): sup λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 Problem ( ): inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2. These two optimization problems will also feature in the sampling rule and the stopping rule, respectively, of our proposed algorithm in Section 5. However, even given the full information of the instance (c, U), solving Problem ( ) is tricky although the inner objective function is a weighted sum of quadratic functions. In particular, the alternative instances in Alt(c) need to be identified by both c and U , which is rather involved as the definition of Alt(c) is combinatorial and the number of instances in it is obviously infinite. For Problem ( ), one may consider first fixing c , then optimizing over different U . We remark that this idea is only theoretically but not practically feasible since for a fixed number of clusters K, the total number of possible partitions (which is called the Stirling number of the second kind (Graham et al., 1989)) grows asymptotically as KM/K! as the number of arms M tends to infinity. Nevertheless, we show a natural and useful property of Problem ( ). Lemma 4 For any λ PM and (c, U), inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 = inf (c ,U ) Alt(c): d H(c ,c)=1 m=1 λm µ(cm) µ (c m) 2. Lemma 4 provides a useful combinatorial property of Problem ( ), which makes it possible to solve for the hardness parameter D (c, U) efficiently. Instead of considering all the alternative instances in Alt(c), Lemma 4 shows it suffices to consider the instances whose partitions have a Hamming distance of 1 from the given partition c. The complete proof of Lemma 4 is deferred to Appendix B.2, which consists of four steps (see Figure 4 therein for an illustration). In particular, we show for any instance (c , U ) Alt(c) such that d H(c , c) > 1, there exists another instance (c , U ) Alt(c) such that d H(c , c) = 1 and the objective function under (c , U ) is not larger than that under (c , U ). This desirable combinatorial property depends strongly on the specific structure of a valid partition. In fact, in Example 2 in Section 5, we will see an example where a similar combinatorial property no longer holds if the true mean vectors {µ(cm)}M m=1 in (1) are replaced by some empirical estimates which may be obtained as the algorithm proceeds. Thanks to Lemma 4, Problem ( ) turns out to be equivalent to a much simpler finite minimization problem, as shown in Proposition 5. Yang, Zhong, and Tan Proposition 5 For any λ PM and (c, U), inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 = min k,k [K]: n(k)>1,k =k w(k)w(k ) w(k)+w(k ) µ(k) µ(k ) 2 if λ P+ M 0 otherwise where w(k) := PM m=1 λm1{cm = k}, n(k) := PM m=1 1{cm = k} and w(k) := minm [M]:cm=k λm. Moreover, if λ P+ M, the infimum in Problem ( ) can be replaced with a minimum. As a corollary of our former intuition where λ represents the proportion of arm pulls, given the knowledge of the instance (c, U) and the proportion of arm pulls, Problem ( ) tells us how similar the true instance c and the most confusing alternative instances in Alt(c) are. In fact, Proposition 5 plays an essential role in the computation of the stopping rule of our method in Section 5, which succeeds in circumventing the need to solve NPhard optimization problems. In addition, Proposition 6 below asserts the continuity of the optimal value of Problem ( ), which will help to assert that the stopping rule proposed in Section 5 is asymptotically optimal. Refer to Appendices B.3 and B.4 for the proofs of Proposition 5 and Proposition 6, respectively. Proposition 6 For any fixed c, define g : PM Rd K R+ as g(λ, U) := inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2. Then g is continuous on PM U . As a consequence of Proposition 5, Proposition 7 below (proved in Appendix B.5), transforms Problem ( ) into a finite convex minimax problem, which has been studied extensively in the optimization literature (e.g., Gigola and Gomez (1990) Herrmann (1999), Gaudioso et al. (2006)). Proposition 7 For any (c, U), D (c, U) = 2 min w P+ K max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2. A by-product of Proposition 7 is that the outer supremum in Problem ( ) can be replaced with a maximum. Intuitively, the maximizer of Problem ( ) in PM represents the optimal proportion of arm pulls, which will be of considerable importance in our design of the sampling rule. In fact, there exists a bijection between the solution to the finite convex minimax problem above and Problem ( ), as shown in Proposition 8 below. Although the finite convex minimax problem is not strictly convex in w P+ K, Proposition 8 states that the solution to Problem ( ) is unique. This, together with Proposition 9 concerning the continuity of the solution to Problem ( ), guarantees the computationally efficiency and the asymptotic optimality of our sampling rule in Section 5. The proofs of Proposition 8 and Proposition 9 are deferred to Appendices B.6 and B.7, respectively. Optimal Clustering with Bandit Feedback Proposition 8 For any (c, U), the solution to arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 (3) is unique. If λ denotes the unique solution to (3) and w denotes the unique solution to arg min w P+ K max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2, then λ can be expressed in terms of w as λ m = w (cm) n(cm) for all m [M]. Proposition 9 For any fixed c, define Λ : Rd K PM as Λ(U) := arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2. Then Λ is continuous2 on U . 5. Algorithm: Bandit Online Clustering For the online clustering task with bandit feedback, we propose a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (or BOC), whose pseudocode is presented in Algorithm 1. As elucidated in Section 3, our algorithm comprises three components: the sampling rule for selecting the arms to pull (Lines 3-9), the stopping rule for determining a stopping time (Lines 11-13), and the recommendation rule for producing a final result (the clustering cout). In the following subsections, we first introduce an important subroutine called K-means Maximin. This subroutine is employed at each time step to return a guess of both the correct partition and mean vectors based on the past measurements. We then explain the methodology of our sampling rule and stopping rule, respectively, focusing on their connections with the optimization problems discussed in Section 4. Finally, from a theoretical standpoint, we demonstrate the effectiveness of BOC and prove that its expected sample complexity E[τδ] is asymptotically optimal as the confidence level δ tends to zero. 5.1 Weighted K-means with Maximin Initialization Although we only aim at producing a correct partition in the final recommendation rule as noted in Section 3, learning the K unknown mean vectors of the clusters is essential in the sampling rule as well as the stopping rule. Different from other pure exploration problems in bandits (e.g., best arm identification (Garivier and Kaufmann, 2016), odd 2. In finite-dimensional spaces, pointwise convergence and convergence in Lp norm are equivalent. Yang, Zhong, and Tan Algorithm 1 Bandit Online Clustering (BOC) Input: Number of clusters K, confidence level δ and arm set [M] 1: Sample each arm once, set t = M and initialize ˆµm(t) and Nm(t) = 1 for all m [M]. 3: if minm [M] Nm(t) max( t M/2, 0) then 4: Sample At+1 = arg minm [M] Nm(t), and (ct, Ut) (ct 1, Ut 1) Forced exploration 6: (ct, Ut) K-means Maximin(K, {ˆµm(t)}m [M], {Nm(t)}m [M]) Algorithm 2 7: Solve Proposition 8 λ (t) = arg max λ PM inf (c ,U ) Alt(ct) m=1 λm µt(ct m) µ (c m) 2 (4) 8: Sample At+1 = arg maxm [M](tλ m(t) Nm(t)) 10: t t + 1, update ˆµm(t) and Nm(t) for all m [M] 11: Compute m=1 Nm(t) ˆµm(t) µt 1(ct 1 m ) 2 and solve Proposition 5 Z2(t) = min (c ,U ) Alt(ct 1) m=1 Nm(t) µt 1(ct 1 m ) µ (c m) 2 13: until Z(t) β(δ, t) Output: τδ = t and cout = ct 1 arm identification (Vaidhiyan and Sundaresan, 2017)), in the online clustering problem, it is not straightforward to recommend an estimate of the pair (c, U) given some past measurements on the arm set. However, using maximum likelihood estimation, we will see the equivalence between the recommendation subroutine and the classical offline weighted K-means clustering problem, which has been shown to be NP-hard (Aloise et al., 2009; Mahajan et al., 2012). Given the past arm pulls and observations up to time t (i.e., A1, X1, . . . , At, Xt), the log-likelihood function of the hypothesis that the instance can be identified by the pair (c , U ) can be written as ℓ(c , U | A1, X1, . . . , At, Xt) := 1 s=1 Xs µ (c As) 2 td 2 log(2π). (5) Optimal Clustering with Bandit Feedback For any arm m [M], let Nm(t) := Pt s=1 1{As = m} and ˆµm(t) := Pt s=1 Xs1{As = m}/Nm(t) denote the number of pulls and the empirical estimate up to time t, respectively. By rearranging Equation (5), the maximum likelihood estimate of the unknown pair (c, U) can be expressed as arg min (c ,U ) m=1 Nm(t) ˆµm(t) µ (c m) 2 (6) which consists in minimizing a weighted sum of squared Euclidean distances between the empirical estimate of each arm and its associated center. Therefore, any algorithm designed for the weighted K-means clustering problem is applicable to obtain an approximate (not exact) solution to (6). We remark that although the weighted variant of the original K-means algorithm (Mac Queen, 1967; Lloyd, 1982) is an efficient heuristic for the weighted K-means clustering problem, to the best of our knowledge, there are no theoretical guarantees for finding a global minimum of this problem in general. To establish the asymptotic optimality of our online clustering method BOC, in the initialization stage of K-means, we leverage the Maximin method, which is a farthest point heuristic proposed by Gonzalez (1985). The complete pseudocode for Weighted K-means with Maximin Initialization (abbreviated as K-means Maximin) is presented in Algorithm 2. In the following, we derive some useful properties of K-means Maximin; see Appendix D.1 for the proof of Proposition 10. Proposition 10 Given an instance (c, U), if the empirical estimates of the arms {ˆµm}m [M] satisfy max m [M] ˆµm µ(cm) < 1 4 min k,k [K]:k =k µ(k) µ(k ) , then K-means Maximin will output a correct partition cout c. Furthermore, suppose that c = σ(cout) for some permutation σ on [K]. Then max k [K] µout(k) µ(σ(k)) max m [M] ˆµm µ(cm) . Remark 11 The K-means Maximin subroutine used in Algorithm 1 can be replaced by any offline clustering algorithm that meets the following conditions without affecting the subsequent results including Proposition 12, Proposition 18 and Theorem 19: (i) given an instance (c, U), if the empirical estimates of all the arms are sufficiently accurate, i.e., maxm [M] ˆµm µ(cm) is smaller than a constant ϵ that depends on the problem instance, then the clustering algorithm will output a correct partition cout, i.e., one that satisfies that c = σ(cout) for some permutation σ; (ii) moreover, for any cluster k [K], µout(k) µ(σ(k)) is not larger than maxm [M] ˆµm µ(cm) . 5.2 Sampling Rule Once the estimates of the partition and the mean vectors are obtained by the K-means Maximin subroutine based on the past measurements, our sampling rule utilizes the efficient method presented in Proposition 8 to find a plug-in approximation of the optimal oracle sampling rule. This then informs the algorithm of the selection of the next arm to pull. Yang, Zhong, and Tan Algorithm 2 Weighted K-means with Maximin Initialization (K-means Maximin) Input: Number of clusters K, empirical estimate ˆµm and weighting Nm for all m [M] 1: Choose the empirical estimate of an arbitrary arm as the first cluster center ˆµ(1) 2: for k = 2 to K do Maximin Initialization 3: Choose the empirical estimate of the arm that has the greatest Euclidean distance to the nearest existing center as the k-th center ˆµ(k): ˆµ(k) = arg max m [M] min 1 k k 1 ˆµm ˆµ(k ) 5: repeat Weighted K-means 6: Assign each arm to its closest cluster center: ˆcm = arg min k [K] ˆµm ˆµ(k) 7: Update each cluster center as the weighted mean of the empirical estimates of the arms in it: P m [M] Nmˆµm1{ˆcm = k} P m [M] Nm1{ˆcm = k} 8: until Clustering ˆc no longer changes 9: Set µout(k) = ˆµ(k) for all k [K] Output: cout = ˆc and Uout = [µout(1), µout(2), . . . , µout(K)] In particular, Algorithm 1 follows the so-called D-Tracking rule, originally proposed by Garivier and Kaufmann (2016) for best arm identification, to track the optimal sampling rule. For the purpose of ensuring the plug-in approximation λ (t) to converge to the true optimal sampling rule λ , the D-Tracking rule introduce a stage of forced exploration. At each time t, if there exists an arm whose number of pulls is not larger than the preset threshold (max( t M/2, 0)), then the agent chooses to sample that under-sampled arm. Otherwise, the agent chooses the arm according to the difference between the current proportions of arm pulls and the plug-in approximation λ (t) (as described in Line 8 of Algorithm 1). Proposition 12, proved in Appendix D.2, shows the asymptotic optimality of our sampling rule, which is a joint consequence of Propositions 8, 9 and 10. Proposition 12 Using the sampling rule as detailed in Lines 3-9 of Algorithm 1, the proportions of arm pulls in Algorithm 1 converge to the optimal oracle sampling rule λ almost surely, i.e., Pr lim t Nm(t) t = λ m for all m [M] = 1. Remark 13 We utilize the forced exploration stage in Algorithm 1 (Lines 3 and 4) to obtain coarse estimates of the means of the arms. This appears to be necessary since we do not make any assumption on the instance of cluster bandits except that the centers for different clusters are distinct. However, the clusters being distinct does not preclude them being arbitrarily Optimal Clustering with Bandit Feedback close to one another. This hinders the adoption of a more adaptive exploration stage. When the ratio between the minimal and the maximal pairwise distances among the centers is lower bounded by a known constant 0 < r 1 (see Assumption 1), an improved and more aggressive exploration method is presented in Equation (22) of Appendix C. This is based on a delicate quantitative analysis of λ (Proposition 28 therein). Experimental results using this more aggressive forced exploration procedure are also presented in Appendix C.2. 5.3 Stopping Rule As the arm sampling proceeds, the algorithm needs to determine when to stop the sampling and recommend a partition with an error probability of at most δ, namely the stopping rule. Most existing algorithms for pure exploration in the fixed-confidence setting (e.g., Garivier and Kaufmann (2016), Jedra and Proutiere (2020), Feng et al. (2021), R eda et al. (2021)) consider the Generalized Likelihood Ratio (GLR) statistic and find suitable task-specific threshold functions. This strategy dates back to Chernoff(1959). However, we show that the method based on the standard GLR is computationally intractable in the following. Let (ct , Ut ) be the maximum likelihood estimate of the unknown pair (c, U) given the past measurements up to time t. Then (ct , Ut ) is also the global minimizer to (6), i.e., (ct , Ut ) = arg min(c ,U ) PM m=1 Nm(t) ˆµm(t) µ (c m) 2. Using the definition of the loglikelihood function in (5), the logarithm of the GLR statistic (referred to as the log-GLR) for testing (ct , Ut ) against its alternative instances can be written as log -GLR = ℓ(ct , Ut | A1, X1, . . . , At, Xt) min (c ,U ) Alt(ct ) ℓ(c , U | A1, X1, . . . , At, Xt) m=1 Nm(t) ˆµm(t) µt (ct m) 2 + min (c ,U ) Alt(ct ) m=1 Nm(t) ˆµm(t) µ (c m) 2 ! There are two critical computational issues with the above expression. First, evaluating the log-GLR as Equation (7) requires the exact global minimizer to (6), which is NP-hard to find in general. More importantly, even with the knowledge of (ct , Ut ), one cannot efficiently solve the optimization problem in the second term of Equation (7) although it appears to be similar to Problem ( ) discussed in Section 4. One may conjecture that an analogous combinatorial property to Lemma 4 holds for the second term of Equation (7), which might shrink the feasible set from Alt(ct ) to the alternative instances whose partitions have a Hamming distance of exactly 1 from ct . We disprove this conjecture by showing a counterexample in Example 2. As a consequence, the only feasible approach, at least for the moment, is to check all the possible partitions in Alt(ct ), which is certainly computationally intractable. Example 2 Let K = 2, M = 4 and d = 2. At time t, suppose that the empirical estimates of the 4 arms are respectively [0, 0] , [a, 0] , [0, b] and [a, b] , where 0 < b/ 2 < a < b. Then the partition that attains the minimum in (6) is [1, 1, 2, 2] (or [2, 2, 1, 1]) whereas the partition of the most confusing alternative instances (the minimizer to the second term of Equation (7)) is [1, 2, 1, 2] (or [2, 1, 2, 1]). Obviously, their Hamming distance is 2 rather than 1. Yang, Zhong, and Tan To construct a practicable stopping rule, instead of the GLR statistic, we consider the statistic m=1 Nm(t) ˆµm(t) µt 1(ct 1 m ) 2 Z2(t) := min (c ,U ) Alt(ct 1) m=1 Nm(t) µt 1(ct 1 m ) µ (c m) 2. Note that Z(t) involves (ct 1, Ut 1), which is the estimate of the true pair (c, U) produced by the K-means Maximin subroutine based on the past measurements. In particular, (ct 1, Ut 1) is not necessarily the global minimizer to (6) and our stopping rule does not have any requirement on the quality of this estimate. For the term Z2(t), Proposition 5 can be utilized directly since the inherent optimization is equivalent to Problem ( ) after an appropriate normalization. Let ζ( ) denote the Riemann zeta function, i.e., ζ(s) = P n=1 n s. The stopping time of Algorithm 1 is defined as τδ := inf{t N : Z(t) β(δ, t)} m=1 2d log(4 + log(Nm(t))) + Md Ψ log(1/δ) Ψ(x) = min 1/2 h 1 2 2 log(4h) + log(ζ(2h)) is a threshold function inspired by the concentration results for univariate Gaussian distributions (Kaufmann and Koolen, 2021). Remark 14 The function Ψ used in the threshold β(δ, t) possesses some useful properties that can be verified in a straightforward manner or found in Kaufmann and Koolen (2021): (i) Ψ(x) = x + log(x) + o(log(x)) as x ; (ii) Ψ(x) x for all x > 0; (iii) let ψ(h) := 2 2 log(4h) + log(ζ(2h)) h and then ψ is a unimodal function on [1/2, 1]. Overall, our stopping rule is easy to implement and computationally efficient and furthermore, we will see that it is asymptotically optimal in the next subsection. The effectiveness of our stopping rule is shown in Proposition 15, which ensures that provided the algorithm stops within a finite time, the probability of recommending an incorrect partition is no more than δ. Proposition 15 The stopping rule of BOC (Algorithm 1) ensures that Pr(τδ < , cout c) δ. Optimal Clustering with Bandit Feedback The proof of Proposition 15 is deferred to Appendix D.3. To confirm that our method BOC is indeed a δ-PAC online clustering algorithm, it remains to show it terminates within a finite time almost surely. Remark 16 Our stopping rule is not only applicable to our sampling rule and the estimates of (c, U) produced by the K-means Maximin subroutine. In fact, it also applies to any other sampling rule and the estimates by any method of estimation. In Section 6, we will experimentally compare different sampling rules with the same stopping rule. Remark 17 Vaidhiyan and Sundaresan (2017) proposed a modified GLR, where the likelihood function in the numerator of the GLR statistic is replaced by an averaged likelihood function with respect to an artificial prior probability distribution, for the odd arm identification task. We remark that this method is also not practical for the online clustering task since it requires to compute one corresponding modified GLR for each possible partition. However, the total number of possible partitions is enormous (namely the Stirling number of the second kind). 5.4 Sample Complexity Analysis In this subsection, we analyze the correctness and sample complexity of our algorithm BOC (Algorithm 1). Proposition 18 verifies that BOC terminates within a finite time almost surely. Together with Proposition 15, it shows BOC is indeed a δ-PAC online clustering algorithm. Specifically, for any instance (c, U), BOC recommends a correct partition cout based on the noisy measurements on the arm set with a probability of at least 1 δ. As shown in Proposition 18 and Theorem 19 respectively, the sample complexity of BOC asymptotically matches the instance-dependent lower bound presented in Section 4, both almost surely and in expectation, as the confidence level δ tends to zero. Therefore, BOC provably achieves asymptotic optimality in terms of the expected sample complexity and, at the same time, is also computationally efficient in terms of its sampling, stopping and recommendation rules. Thus, it achieves the best of both worlds. Refer to Appendices D.4 and D.5 for the proofs of Proposition 18 and Theorem 19, respectively. Proposition 18 For any instance (c, U), Algorithm 1 ensures that Pr(τδ < ) = 1 Pr lim sup δ 0 τδ log(1/δ) D (c, U) = 1. Theorem 19 For any instance (c, U), Algorithm 1 ensures that lim sup δ 0 E[τδ] log(1/δ) D (c, U). Yang, Zhong, and Tan 6. Numerical Experiments In this section, we study the empirical performance of our algorithm BOC and compare it with two baselines, namely Uniform and Oracle. Since our stopping rule is the only computationally tractable one available and the recommendation rule is embedded in either the sampling rule or the stopping rule, we focus on evaluating the efficacy of our sampling rule in terms of the time it takes for the algorithm(s) to stop. As we discussed in Remark 16, any sampling rule and recommendation rule can be combined with our stopping rule. Therefore, for the sake of fairness in comparison, Uniform and Oracle only differ from BOC in the sampling rules and retain the other frameworks. In particular, Uniform samples the M arms in a simple round-robin fashion while Oracle samples the arms based on the optimal oracle sampling rule λ (i.e., the estimate λ (t) in Line 8 of Algorithm 1 is replaced by λ , which is calculated with the unknown true pair (c, U)). In each setting, the reported sample complexities of different methods are averaged over 256 independent trials and the corresponding standard deviations are also shown as error bars or directly in the table. Finally, we mention that the partitions we learn in all our experiments are always correct. 6.1 Synthetic Dataset: Verifying the Asymptotic Optimality of BOC To study the asymptotic behavior of the expected sample complexities of different methods, we construct three synthetic instances with varying difficulty levels, where K = 4, M = 11 and d = 3. The partitions and the first three cluster centers of all the three instances are the same, while their fourth cluster centers vary. In particular, the three instances can be expressed as follows: c = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4] µ(1) = [0, 0, 0] µ(2) = [0, 10, 0] µ(3) = [0, 0, 10] [5, 0, 0] for the easy instance, [1, 0, 0] for the moderate instance, [0.5, 0, 0] for the challenging instance. Moreover, motivated by Garivier and Kaufmann (2016), we also consider using a heuristic threshold function β(δ, t) := log((1 + log(t))d/δ), which is an approximation to the original threshold function β(δ, t) in Equation (8). Although no theoretical guarantee is available when we use β(δ, t), it seems practical (and even conservative) in view of the empirical error probabilities (which are always zero) for large δ (e.g., δ 10 1). The experimental results of the different methods with the two kinds of threshold functions for different confidence levels δ are presented in Figure 3. To better demonstrate the asymptotic behavior, we plot the empirical averaged sample complexities of the three methods as well as the instance-dependent lower bound of the expected sample complexity (see Theorem 3) with respect to log(1/δ) in the each sub-figure. From Figure 3, we have the following observations: Optimal Clustering with Bandit Feedback 0 100 200 300 400 0 (a) The easy instance with β(δ, t). 0 100 200 300 400 0 (b) The easy instance with β(δ, t). 0 100 200 300 400 0 (c) The moderate instance with β(δ, t). 0 100 200 300 400 0 (d) The moderate instance with β(δ, t). 0 100 200 300 400 0 (e) The challenging instance with β(δ, t). 0 100 200 300 400 0 (f) The challenging instance with β(δ, t). Figure 3: The empirical averaged sample complexities of the different methods with the two kinds of threshold functions for different confidence levels δ on the synthetic dataset. Yang, Zhong, and Tan Although our algorithm BOC does not require the optimal oracle sampling rule λ , and instead approximates it on the fly, the curves of BOC and Oracle in the each sub-figure are almost completely overlapping, which suggests the proportion of arm pulls for BOC converges to the distribution λ very quickly. The main observation is that as δ decreases (or equivalently, as log(1/δ) increases), the slope of the curve corresponding to the BOC algorithm in the each sub-figure is almost equal to the slope of the lower bound, which is exactly equal to D (c, U). However, the slope of the curve corresponding to Uniform is consistently larger than that of the lower bound, exemplifying its suboptimality. This suggests that the expected sample complexity of our algorithm BOC matches the instance-dependent lower bound asymptotically, corroborating our theoretical results (see the lower bound and upper bound in Theorems 3 and 19, respectively). There are unavoidable gaps between the lower bound and BOC (or, almost equivalently, Oracle), which is not an unexpected phenomenon. Although we have shown that the sampling rule resulting in λ is asymptotically optimal, we have to utilize the stopping rule to evaluate the quality of the final recommendation (i.e., whether the recommended partition has an error probability of at most δ). In addition, our bounds are only guaranteed to be tight asymptotically as δ 0. Comparing the results for the same instance with the two different threshold functions, it can be seen that the heuristic one β(δ, t) leads to lower sample complexities, especially for large δ (i.e., small log(1/δ)). Even though there are no theoretical guarantees when β(δ, t) is used in our stopping rule, this threshold appears to work well empirically. 6.2 Real-world Datasets: Confirming the Non-asymptotic Superiority of BOC To complement the experiments on synthetic data and to verify that BOC also excels in the non-asymptotic regime, we conduct experiments on the Iris and Yeast datasets (Dua and Graff, 2017), both of which are ubiquitous in offline clustering and classification tasks. Here, we perform a novel task online clustering with bandit feedback. In the Iris dataset, the number of clusters K = 3, the number of arms M = 150, and the dimension d = 4, while in the Yeast dataset, K = 10, M = 1484, and d = 8. Note that the total number of partitions grows asymptotically as KM/K! (i.e., approximately 1070 or 101477 in the Iris and Yeast datasets, respectively); hence, it is impractical to exhaustively enumerate over all partitions in these datasets. We emphasize that BOC succeeds in circumventing the need to solve any NP-hard optimization problem as a subroutine in the online clustering task. To adapt these datasets to be amenable to online clustering tasks, we choose each cluster center to be the mean of the original data points of the arms in it, and then rescale the centers so that the hardness parameter D (c, U) is equal to 2 for both datasets. Since the performances of BOC and Oracle are similar (as observed in Section 6.1) and the heuristic threshold function β(δ, t) generally achieves lower sample complexities (compared to when β(t, δ) is used), we only present the results of the two methods (namely BOC and Uniform) with β(δ, t) on the two real datasets. Optimal Clustering with Bandit Feedback Iris Dataset Yeast Dataset δ BOC Uniform BOC Uniform 10 1 886.1 55.9 1176.4 69.2 14430.5 371.3 19536.0 530.4 10 2 922.5 69.4 1208.7 64.1 14531.2 273.4 19697.4 636.2 10 3 954.6 80.0 1244.9 72.7 14589.5 175.2 19997.4 718.7 10 4 993.8 87.3 1286.2 71.9 14631.5 101.6 20220.3 679.2 10 5 1026.9 84.8 1306.0 72.0 14639.6 150.0 20467.7 591.0 10 6 1059.0 68.4 1335.7 67.2 14686.3 218.1 20564.5 499.3 10 7 1075.8 62.4 1368.5 65.5 14723.8 270.3 20686.5 338.6 10 8 1090.0 56.9 1389.4 71.7 14797.3 348.5 20733.1 240.6 10 9 1104.8 50.4 1415.6 77.1 14844.6 385.7 20762.0 132.4 10 10 1120.2 48.2 1447.0 73.3 14977.1 445.0 20766.8 107.1 Table 1: The averaged empirical sample complexities of BOC and Uniform with the heuristic threshold function β(δ, t) for different confidence levels δ on the realworld datasets. The sample complexities for different confidence levels δ are presented in Table 1.3 From the table, we see that BOC significantly outperforms the non-adaptive baseline method Uniform for all δ in terms of sample complexities. This demonstrates that BOC is able to effectively learn the clusters in an online manner given bandit feedback. In Appendix E, we consider a dataset with a much higher dimensionality d, namely, the MNIST dataset (Le Cun et al., 1998). From Table 3 therein, we observe that the same conclusions apply. 7. Conclusion In this paper, we proposed a novel online clustering with bandit feedback framework in which there is a set of arms that can be clustered into non-overlapping groups, and at each time, one arm is pulled, and a sample from the distribution it is associated with is observed. We proposed and analyzed Bandit Online Clustering (or BOC) that, as discussed in Section 5.3, overcomes some critical computational limitations that a standard and natural GLR statistic suffers from due to the combinatorial search space of partitions. In addition to its computational efficiency, we proved that BOC is asymptotically optimal in the sense that it attains an instance-dependent information-theoretic lower bound as the confidence level δ tends to zero. There are some limitations of the current model and theoretical contributions that serve as fertile avenues for future research. Firstly, in real-world applications such as recommendation systems and online market segmentation, it is often the case that the absolute correct 3. The instance-dependent lower bound is not presented in Table 1 since it is not informative for these two real-world datasets in the non-asymptotic regime. For instance, even when the confidence level δ is equal to 10 10, the lower bound is only D (c, U) log(1/δ) 46, which is much smaller than the total number of arms (i.e., 150 or 1484 in the Iris and Yeast datasets, respectively). Yang, Zhong, and Tan clustering does not have to be found; an approximate clustering, with the advantage of further computational reductions, is usually sufficient. Developing computationally efficient and statistically optimal algorithms that allow for some distortion from the optimal clustering is thus of practical and theoretical importance. Secondly, our results are asymptotic in nature; they are only tight when the confidence level δ tends to zero. As we have seen in Section 6, this results in a gap between the upper bound and the actual performance of BOC when δ is not vanishingly small. It would thus be instructive to develop non-asymptotic or refined asymptotic bounds, perhaps by leveraging the second-order results in Malyutov and Tsitovich (2001) and Li and Tan (2020). Thirdly, our methodology can be generalized to the situation that the distributions of the observations are in the multivariate exponential families, whereas it requires efforts to preserve our computational efficiency. Finally, it is worth developing bandit feedback models and algorithms for other generalizations of clustering, such as hierarchical clustering, fuzzy or soft clustering, or community detection on graphs (Abbe, 2017). Optimal Clustering with Bandit Feedback Acknowledgments We express our sincere gratitude to the reviewers and action editor for their meticulous reading and insightful comments. We are also thankful for the valuable discussions with Karthik Periyapattana Narayanaprasad and Yunlong Hou. This research work is funded by the Singapore Ministry of Education Academic Research Fund (Ac RF) Tier 2 under grant number A-8000423-00-00 and the Singapore Ministry of Education Ac RF Tier 1 under grant number A-8000980-00-00. Appendix A. Auxiliary Lemmas Lemma 20 (The Maximum Theorem (Berge, 1963; Sundaram, 1996)) Let f : S Θ R be a continuous function , and D : Θ S be a compact-valued continuous correspondence. Let f : Θ R and D : Θ S be defined by f (θ) = max{f(x, θ) : x D(θ)} and D (θ) = arg max{f(x, θ) : x D(θ)} = {x D(θ) : f(x, θ) = f (θ)}. Then f is a continuous function on Θ, and D is a compact-valued, upper hemicontinuous4 correspondence on Θ. Lemma 21 (Sundaram (1996, Theorem 9.12)) A single-valued correspondence that is hemicontinuous (whether upper or lower hemicontinuous) is continuous when viewed as a function. Conversely, every continuous function, when viewed as a single-valued correspondence, is both upper and lower hemicontinuous. Lemma 22 Let λ denote the oracle optimal sampling rule of the instance (c, U). If there exists ϵ > 0 and t0(ϵ) such that sup t t0(ϵ) max m [M] |λ m(t) λ m| ϵ, then there exists t1(ϵ) t0(ϵ) such that sup t t1(ϵ) max m [M] Furthermore, a valid choice of t1(ϵ) is t0(ϵ) Proof Our sampling rule satisfies the assumptions of Garivier and Kaufmann (2016, Lemma 17), and the existence of t1(ϵ) follows. The choice of t1(ϵ) can be obtained by working through the proof of the same lemma. 4. Hemicontinuity of a correspondence (resp. the adjective hemicontinuous) is also termed as semicontinuity (resp. semicontinuous) in some books, e.g., Sundaram (1996). Yang, Zhong, and Tan Lemma 23 For any x, y R+, 2 = max α 0 αx + α α + 1y . Proof Notice that αx + α α + 1y = (1 + α)x 1 1 + αy + (x + y). If y x, then αx + α α + 1y = (1 + α)x 1 1 + αy α= y x 1 + (x + y) = 2 xy + (x + y) If y < x, then αx + α α + 1y = (1 + α)x 1 1 + αy α=0 + (x + y) Lemma 24 For any x, y Rd and α 0, α x 2 + α α + 1 y 2 x y 2. Proof Notice that the above inequality is equivalent to (1 + α) x 2 2x y + 1 α + 1 y 2 0, which is equivalent to 1 + α x 1 α + 1 y Thus, the result obviously holds. Optimal Clustering with Bandit Feedback Lemma 25 It holds that m=1 Nm(t) ˆµm(t) µ(cm) 2 β(δ, t) where β(δ, t) is defined in Equation (8). Proof In this proof, we use [x]i to denote the ith component of the vector x Rd. Recall that at each time t, the agent selects an arm At and observes Xt = µ(c At) + ηt, where ηt follows the standard d-dimensional Gaussian distribution N(0, Id). Since each individual component of ηt independently follows the standard univariate Gaussian distribution N(0, 1), one arbitrary arm can be treated as d independent sub-arms. Equivalently, the agent selects a group of d sub-arms and observes [Xt]i = [µ(c At)]i + [ηt]i for all i [d]. Note that there are Md sub-arms in total. Then Lemma 25 follows from the concentration inequality for the empirical means of sub-arms (Kaufmann and Koolen, 2021, Theorem 9). Lemma 26 (Adapted from Garivier and Kaufmann (2016, Lemma 18)) For any two constants a > 0 and b R such that b + log 1 + log b + log 1 satisfies ax log(x) + b. Appendix B. Proofs of Section 4 B.1 Proof of Theorem 3 Proof For fixed δ (0, 1) and instance (c, U), consider any δ-PAC online clustering algorithm. We will see in Proposition 7 that D (c, U) is finite so the situation that E[τδ] is infinite is trivial. Henceforth, we assume that E[τδ] is finite. For any arm m [M], let Nm(t) denote the number of pulls of arm m up to time t. Consider an arbitrary instance (c , U ) in Alt(c). By applying the transportation inequality (Kaufmann et al., 2016, Lemma 1) and the KL-divergence for the multivariate normal distribution, we have m=1 E[Nm(τδ)] µ(cm) µ (c m) 2 d KL(δ, 1 δ). Yang, Zhong, and Tan Since the above displayed inequality holds for all instances in Alt(c) and [E[N1(τδ)], E[N2(τδ)], . . . , E[NM(τδ)]] / E[τδ] forms a probability distribution in PM, we obtain d KL(δ, 1 δ) 1 2 inf (c ,U ) Alt(c) m=1 E[Nm(τδ)] µ(cm) µ (c m) 2 2 E[τδ] inf (c ,U ) Alt(c) E[τδ] µ(cm) µ (c m) 2 2 E[τδ] sup λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 = E[τδ]D (c, U) 1. Since limδ 0 d KL(δ, 1 δ)/log(1/δ) = 1, letting δ 0 yields lim inf δ 0 E[τδ] log(1/δ) D (c, U) as desired. B.2 Proof of Lemma 4 Proof For any fixed λ PM and (c , U ), let Dist(c , U ) := m=1 λm µ(cm) µ (c m) 2. To prove Lemma 4, we only need to show for any instance (c , U ) Alt(c) such that d H(c , c) > 1, there exists another instance (c , U ) Alt(c) such that d H(c , c) = 1 and Dist(c , U ) Dist(c , U ). The proof consists of four steps and Figure 4 serves as an illustration to help understand the various constructions. Step 1 (Permute the partition of the given instance). To construct a new instance, we construct a permutation σ on [K] such that σ 1(1) = arg min k [K] µ (k) µ(1)||2 σ 1(2) = arg min k [K]\{σ 1(1)} µ (k) µ(2)||2 σ 1(3) = arg min k [K]\{σ 1(1),σ 1(2)} µ (k) µ(3)||2 σ 1(K 1) = arg min k [K]\{σ 1(1),σ 1(2),...σ 1(K 2)} µ (k) µ(K 1)||2. Optimal Clustering with Bandit Feedback 2 3 1 2 3 1 2 3 1 3 3 2 2 1 1 1 2 1 2 3 1 3 3 2 2 1 1 3 3 2 2 1 1 2 3 1 3 3 2 2 1 1 2 3 1 3 3 2 2 1 1 Figure 4: An illustration of the proof of Lemma 4. Figure 4 illustrates the construction of the sequence of instances when the number of clusters K = 3 and the number of arms (or items) M = 9. Each pair of items connected by a double arrow represents the cluster indices of one arm in the true partition c and one of the newly constructed partitions c( k) for k = 0, 1, 2, 3. After the application of the permutation σ as defined in (10) in Step 1, each of the arms can have any cluster index. However, due to the desirable property of the permutation as stated in (11), in Step 2, we are able to construct a new partition c(1) such that for any arm, its cluster index in c(1) is not larger than that of c. Next, we modify the new partition from right to left in Step 3 (see Equations (13) and (14)) and we finally return to a partition that is identical to c. Let (c(0), U(0)) = σ(c , U ) and hence we have 1 = arg min 1 k K µ(0)(k) µ(1)||2 2 = arg min 2 k K µ(0)(k) µ(2)||2 3 = arg min 3 k K µ(0)(k) µ(3)||2 K 1 = arg min K 1 k K µ(0)(k) µ(K 1)||2. Yang, Zhong, and Tan Obviously, d H(c, c(0)) 1, otherwise (c , U ) Alt(c). Step 2 (Update the clustering). Now we construct another instance (c(1), U(0)), in which c(1) m = min(cm, c(0) m ) for all m [M]. It holds that Dist(c(0), U(0)) m=1 λm µ(cm) µ(0)(c(0) m ) 2 m=1 λm µ(cm) µ(0)(c(0) m ) 21{cm c(0) m } + m=1 λm µ(cm) µ(0)(c(0) m ) 21{cm > c(0) m } m=1 λm µ(cm) µ(0)(cm) 21{cm c(0) m } + m=1 λm µ(cm) µ(0)(c(0) m ) 21{cm > c(0) m } m=1 λm µ(cm) µ(0)(c(1) m ) 2 = Dist(c(1), U(0)) where the inequality is due to our construction of (c(0), U(0)) in Equation (11). Step 3 (Construct a sequence of instances). For any (c , U ) and k, k [K], let wk,k (c ) := m=1 λm1{cm = k, c m = k }. Dist(c , U ) = k =1 wk,k (c ) µ(k) µ (k ) 2. Moreover, the minimization problem min U Dist(c , U ) has a unique solution U which is defined by its components as µ (k ) = PK k=1 wk,k (c )µ(k) PK k=1 wk,k (c ) (12) for all k [K]. Consider an instance (c(1), U(1)), in which U(1) = arg min U Dist(c(1), U ). Clearly, it holds that Dist(c(1), U(0)) Dist(c(1), U(1)). From the construction of c(1), we also know that wk,K(c(1)) = 0 for any k = K. Therefore, by (12), µ(1)(K) = µ(K). Then consider another instance (c(2), U(1)), where ( K if cm = K c(1) m otherwise. (13) Since µ(1)(K) = µ(K), Dist(c(1), U(1)) Dist(c(2), U(1)). Next, we construct (c(2), U(2)), in which U(2) = arg min U Dist(c(2), U ). From the construction of c(2), we also know that Optimal Clustering with Bandit Feedback wk,K 1(c(2)) = 0 for any k = K 1. Therefore, µ(2)(K 1) = µ(K 1) and Dist(c(2), U(2)) Dist(c(3), U(2)) where ( K 1 if cm = K 1 c(2) m otherwise. (14) Following the same method, we can then construct a sequence of instances {(c(3), U(3)), (c(4), U(3)), . . . , (c(K), U(K))}. Note that eventually (c(K), U(K)) = (c, U) since the cluster indices of c(K) and c are the same and U(K) = arg min U Dist(c(K), U ). Step 4 (Find the desired instance within the constructed sequence). Consider the whole sequence {(c(0), U(0)), (c(1), U(0)), (c(1), U(1)), (c(2), U(1)), . . . , (c(K), U(K))}. Both the distance function Dist( , ) and the Hamming distance d H(c, ) are non-increasing. Furthermore, d H(c, c(K)) = 0. If d H(c, c(0)) = 1, we can simply choose (c , U ) = (c(0), U(0)). If d H(c, c(0)) > 1, from the sequence of instances, we can select the first instance whose partition is exactly the same as c, denoted as (c( k), U( k 1)) (1 k K), i.e., c( k) = c. We also know (c( k 1), U( k 1)) satisfies that d H(c, c( k 1)) 1. Our construction of the sequence of instances is consecutive in the sense that we can modify the cluster indices of c( k 1) one by one until we get exactly the partition c( k). Therefore, we can construct another sequence of instances {(c( k 1), U( k 1)), (c( k 1), U( k 1)), (c( k 1), U( k 1)), . . . , (c( k), U( k 1))}, where the Hamming distance d H(c, ) is strictly decreasing by 1 in each step in this sequence. Besides, the distance function Dist( , ) is also nonincreasing. As a result, we can always find an instance (c , U ) in this sequence such that d H(c , c) = 1. B.3 Proof of Proposition 5 Proof Due to Lemma 4, we only need to consider the set of alternative instances whose partitions have a Hamming distance of 1 from the given partition c, i.e., {(c , U ) Alt(c) : d H(c , c) = 1}. If λ P+ M, suppose that c and c only differ in the label of arm m, which is changed from cm to k . Since c is a valid partition, we have n(cm) > 1. To minimize the objective function, for any k [K]\{k }, µ (k) can be exactly set to be µ(k). Therefore, the objective function can be simplified to λm µ(cm) µ (k ) 2 + w(k ) µ(k ) µ (k ) 2, which shows the mean vector of cluster k should be chosen as µ (k ) = λmµ(cm)+w(k )µ(k ) λm+w(k ) , a weighted sum of µ(cm) and µ(k ). Yang, Zhong, and Tan Altogether, the optimal value can be derived as follows: inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 = inf (c ,U ) Alt(c): d H(c ,c)=1 m=1 λm µ(cm) µ (c m) 2 = inf m [M],k [K]: n(cm)>1,k =cm λmw(k ) λm + w(k ) µ(cm) µ(k ) 2 (15) = min m [M],k [K]: n(cm)>1,k =cm λmw(k ) λm + w(k ) µ(cm) µ(k ) 2 (16) = min k,k [K]: n(k)>1,k =k w(k)w(k ) w(k) + w(k ) µ(k) µ(k ) 2 (17) where Equation (15) follows from our choice of the mean vectors above and Equation (17) follows from the fact that for any y, z > 0, f(x) = xy x+z is increasing on (0, ). As a consequence, the infimum in Problem ( ) can be replaced with a minimum since the infimum can be attained by some (c , U ) Alt(c). If λ PM \ P+ M, suppose that λm = 0 for some m [M]. Notice that the objective function is always non-negative. We consider two situations. First, if n(cm) > 1, we will construct an instance (c , U ) Alt(c) such that the objective function is zero. In particular, we can change the label of the arm m from cm to any k = cm, and keep all the mean vectors the same. Then the objective function for the new instance is exactly zero, which shows inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 = 0. Second, if n(cm) = 1, we will construct an instance (c , U ) Alt(c) such that the objective function is arbitrarily close to zero. In particular, we can change the label of any other arm m (subject to n(cm ) > 1) from cm to cm. Thus the objective function can be simplified to λm µ(cm ) µ (cm) 2. For the mean vector of cm, note that we cannot set µ (cm) to be exactly µ(cm ) otherwise (c , U ) / Alt(c). However, we can let µ (cm) be arbitrarily close to µ(cm ), which yields inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 = 0. This completes the proof of Proposition 5. Optimal Clustering with Bandit Feedback Remark 27 In fact, from the above proof, we know more about the optimal solution if λ P+ M. Let (k , k ) := arg min k,k [K]: n(k)>1,k =k w(k)w(k ) w(k) + w(k ) µ(k) µ(k ) 2. Then there exists (c , U ), where ( k if m = arg minm [M]:cm=k λm cm otherwise ( w(k )µ(k )+w(k )µ(k ) w(k )+w(k ) if m = k µ(k) otherwise (c , U ) arg min (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2. B.4 Proof of Proposition 6 Proof First, we define f1 : R+ R+ R+ as f1(x, y) := x+y if x + y = 0 0 otherwise which is continuous on R+ R+. Next, we will prove that for all (λ, U) PM U , g(λ, U) as defined in Proposition 6 can be written as g(λ, U) = min m [M],k [K]: n(cm)>1,k =cm m=1 λ m1{c m = k} µ(cm) µ(k) 2. (18) For any λ P+ M, Equation (18) follows directly from Equation (16) in the proof of Proposition 5. For any λ PM \ P+ M, it suffices to find m [M], k [K] subject to n(cm) > 1 and k = cm such that f1 λm, PM m=1 λ m1{c m = k} = 0. Suppose that λ ˆm = 0 for some ˆm [M] and we will consider two situations. If n(c ˆm) > 1, we can simply choose m = ˆm and any k = c ˆm. If n(c ˆm) = 1, we can choose any m subject to n(cm) > 1, and k = c ˆm. Altogether, Equation (18) holds for all (λ, U) PM U . Finally, for any m [M], k [K] such that n(cm) > 1 and k = cm, we consider the function f2(m, k; λ, U) := f1 m=1 λ m1{c m = k} µ(cm) µ(k) 2 Yang, Zhong, and Tan so that g(λ, U) = min m [M],k [K]: n(cm)>1,k =cm f2(m, k; λ, U). Since (λ, U) 7 f2(m, k; λ, U) is continuous on PM U for fixed m and k and the finite minimum operation preserves continuity, (λ, U) 7 g(λ, U) is also continuous on PM U . B.5 Proof of Proposition 7 Proof For any (c, U), D (c, U) can be written as: ( 1 2 sup λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 ) 1 ( 1 2 max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 ) 1 1 2 max λ P+ M min k,k [K]: n(k)>1,k =k w(k)w(k ) w(k) + w(k ) µ(k) µ(k ) 2 = 2 min λ P+ M max k,k [K]: n(k)>1,k =k 1 w(k) + 1 w(k ) µ(k) µ(k ) 2. where Equation (19) follows from the continuity of g(λ, U) as shown in Proposition 6 and the compactness of PM, and Equation (20) follows from Proposition 5 and the non-negativity of D (c, U). Notice that both {w(k)}K k=1 and { w(k)}K k=1 depend on λ. For any k and fixed w(k), w(k) = minm [M]:cm=k λm is maximized if and only if for all m [M] such that cm = k, λm are equal. Therefore, we can solve the outer minimization on a smaller probability simplex P+ K. This yields D (c, U) = 2 min w P+ K max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2 as desired. B.6 Proof of Proposition 8 Proof According to the proof of Proposition 7, there exists a bijective map (which is specified in the proof of Proposition 7 as well as the statement of Proposition 8) between the solution(s) to arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 Optimal Clustering with Bandit Feedback and the solution(s) to arg min w P+ K max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2. (21) Therefore, it suffices to show the solution to (21) is unique. We will show this by contradiction as follows. Suppose that w and w are different solutions to (21) and w (ˆk) = w (ˆk) for some ˆk [K]. For any k, k [K] such that n(k) > 1 and k = k, consider the function f1(k, k ; w, U) := 2 n(k) w(k) + 1 w(k ) µ(k) µ(k ) 2. Since µ(k) µ(k ) 2 > 0 by the definition of U , f1(k, k ; w, U) is convex in w. In particular, f1(k, k ; w, U) is strictly convex in the pair (w(k), w(k )). Besides, since the pointwise maximum operation preserves convexity, f2(w, U) := max k,k [K]: n(k)>1,k =k f1(k, k ; w, U) is also convex in w. Therefore, w := 1 2(w + w ) is also a solution to (21) and D (c, U) = f2(w , U) = f2(w , U) = f2(w , U). We claim that for any k, k [K] such that n(k) > 1 and k = k, if ˆk {k, k }, then f1(k, k ; w , U) < D (c, U). Otherwise, using the fact w (ˆk) = w (ˆk) and the strict convexity of f1(k, k ; w , U) in the pair (w (k), w (k )), D (c, U) = f1(k, k ; w , U) 2 f1(k, k ; w , U) + f1(k, k ; w , U) 2 (f2(w , U) + f2(w , U)) which leads to a contradiction. Therefore, our claim holds. In view of this claim, we can identify a w P+ K such that f2(w , U) < D (c, U), which results in a contradiction to the optimality of D (c, U). In particular, such a w can be defined through its components as ( w (k) (K 1)ϵ if k = ˆk w (k) + ϵ otherwise Yang, Zhong, and Tan with a sufficiently small ϵ > 0 such that f1(k, k ; w , U) < D (c, U) for all k, k [K] such that n(k) > 1, k = k and ˆk {k, k }. The existence of such a sufficiently small ϵ > 0 is guaranteed by the continuity of f1(k, k ; w , U) in the pair (w (k), w (k )). In addition, for any k, k [K] such that n(k) > 1, k = k and ˆk {k, k }, f1(k, k ; w , U) = 2 n(k) w (k) + 1 w (k ) µ(k) µ(k ) 2 w (k) + 1 w (k ) µ(k) µ(k ) 2 = f1(k, k ; w , U) In view of the fact that under both cases (ˆk {k, k } and ˆk / {k, k }), we have f1(k, k ; w , U) < D (c, U), we conclude that f2(w , U) = max k,k [K]: n(k)>1,k =k f1(k, k ; w , U) < D (c, U) which contradicts the fact that D (c, U) is the optimal value. Altogether, the solution to (21) is unique and hence the solution to arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 is also unique. B.7 Proof of Proposition 9 Proof Recalling Proposition 6, g(λ, U) = inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 is continuous on PM U . Since PM does not depend on U and is compact, by Lemma 20, Λ(U) = arg max λ PM g(λ, U) is an upper hemicontinuous correspondence on U . According to Proposition 8, Λ(U) is single-valued. Consequently, by Lemma 21, Λ is continuous on U . Optimal Clustering with Bandit Feedback Appendix C. On the Forced Exploration Stage of Algorithm 1 In this appendix, we study the situation that the ratio between the minimal and the maximal pairwise distances among the cluster centers is bounded, and propose a novel exploration method, which is more adaptive than the original forced exploration used in Algorithm 1. In particular, we assume that (c, U), the instance of cluster bandits to be partitioned, satisfies Assumption 1. Assumption 1 The ratio between the minimal and the maximal pairwise distances among the K centers of the clusters {µ(k)}k [K] is lower-bounded by a known constant r (0, 1], i.e., mink,k [K]:k =k µ(k) µ(k ) maxk,k [K]:k =k µ(k) µ(k ) r. We start from a quantitative result on the optimal oracle sampling rule λ , which is proved in Appendix C.1. Proposition 28 Under Assumption 1, λ , which is the unique solution to arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2, satisfies λ m r2 2M for all m [M]. Basically, Proposition 28 shows that the proportions of arm pulls for the optimal sampling rule λ are uniformly lower-bounded by a positive constant. Therefore, at each time step t, we no longer need to check whether there exists an arm whose number of pulls is not larger than a preset threshold. Once we obtain the plug-in approximation λ (t) with the estimate produced by K-means Maximin, we project it onto the constrained probability simplex PM( r2 2M ) := {λ [ r2 2M , 1]M : λ 1 = 1}. Then instead of the original plug-in approximation λ (t) which may result in under-sampling, we can track the optimal sampling rule based on this projection (which is denoted as λ (t)). In summary, the sampling rule (Line 8 of Algorithm 1) becomes At+1 = arg max m [M] t λ m(t) Nm(t) (22) λ (t) = Proj PM( r2 2M )(λ (t)) = arg min 2M ) λ λ (t) . We refer to the improved sampling rule in Equation (22) as the linear-exploration sampling rule henceforth. Compared to the original one proposed in Section 5.2, the linearexploration sampling rule circumvents the forced exploration stage and always maintains a linear minimal exploration rate, which ensures that the plug-in approximation λ (t) converges to the oracle optimal sampling rule λ (almost surely). Moreover, since Proposition 28 indicates that the optimal sampling rule always lies in the constrained probability Yang, Zhong, and Tan simplex PM( r2 2M ) under Assumption 1, the projection λ (t) also converges to λ (almost surely). Therefore, asymptotic optimality is preserved. Besides, one can easily see that our subsequent results on sample complexity (Proposition 18 and Theorem 19) also hold. However, since the forced exploration stage in the original sampling rule of BOC might take effect even if the plug-in approximation λ (t) is close to the optimal sampling rule λ , the linear-exploration sampling rule may result in better empirical performance in the non-asymptotic regime, as demonstrated in the numerical results in Appendix C.2. Finally, we remark that the linear-exploration sampling rule requires extra knowledge of the pairwise distances among the cluster centers. This is unlikely to be available in most practical applications and hence we adopt an arguably less elegant forced exploration strategy in the main text. C.1 Proof of Proposition 28 Proof By Proposition 8, it suffices to show that w , which is the unique solution to arg min w P+ K max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2, satisfies w (k) r2n(k) 2M for all k [K]. In the following, we will show this result by contradiction. Suppose that there exists ˆk [K] such that w (ˆk) < r2n(ˆk) 2M . For ease of notation, we denote ( ) := min w P+ K max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2 = max k,k [K]: n(k)>1,k =k w (k) + 1 w (k ) µ(k) µ(k ) 2. If n(ˆk) = 1, then ( ) max k [K]: n(k)>1,k =ˆk n(k) w (k) + 1 µ(k) µ(ˆk) 2 w (ˆk) min k,k [K]:k =k µ(k) µ(k ) 2 r2 min k,k [K]:k =k µ(k) µ(k ) 2. Optimal Clustering with Bandit Feedback If n(ˆk) > 1, then ( ) max k [K]:k =ˆk w (ˆk) + 1 w (k ) µ(ˆk) µ(k ) 2 w (ˆk) min k,k [K]:k =k µ(k) µ(k ) 2 r2 min k,k [K]:k =k µ(k) µ(k ) 2. Therefore, in both cases, we have r2 min k,k [K]:k =k µ(k) µ(k ) 2. (23) On the other hand, we construct w RK such that w(k) = n(k) M for all k [K]. Since P k [K] n(k) = M, w is a valid element in P+ K. Thus, under Assumption 1, it holds that ( ) max k,k [K]: n(k)>1,k =k w(k) + 1 w(k ) µ(k) µ(k ) 2 = max k,k [K]: n(k)>1,k =k M + M n(k ) µ(k) µ(k ) 2 2M max k,k [K]:k =k µ(k) µ(k ) 2 r2 min k,k [K]:k =k µ(k) µ(k ) 2 which contradicts Inequality (23). The proof of Proposition 28 is thus completed. C.2 Numerical Experiments on the Linear-Exploration Sampling Rule In this section, we numerically compare the more aggressive linear-exploration sampling rule described in Equation (22) to the the original sampling rule proposed in Section 5.2. We experiment this on the challenging synthetic instance in Equation (9) and the Iris dataset as these datasets behave differently. For the challenging synthetic instance, we rescale the cluster centers such that its hardness parameter D (c, U) is equal to that of the Iris dataset. For the linear-exploration sampling rule, the exact information of the ratio between the minimal and the maximal pairwise distances is given to the algorithm as an input. The empirical stopping times for various (non-vanishing) failure probabilities δ are presented in Table 2. The behaviors on both datasets are different. The linear-exploration rule outperforms the original one for the rescaled challenging synthetic instance, while there are no obvious gaps between the two sampling rules for the Iris dataset. This is because for the rescaled challenging instance, there exists some arms that require a few arm pulls Yang, Zhong, and Tan Rescaled Challenging Instance Iris Dataset δ Original LE Original LE 10 1 102.9 29.5 78.9 20.3 886.1 55.9 886.8 47.5 10 2 119.0 28.9 86.7 21.3 922.5 69.4 920.7 71.8 10 3 132.3 29.4 95.9 22.3 954.6 80.0 953.4 77.8 10 4 142.1 31.5 102.8 23.6 993.8 87.3 990.1 88.8 10 5 158.3 32.7 114.7 25.4 1026.9 84.8 1023.6 83.7 10 6 165.8 32.3 120.1 26.5 1059.0 68.4 1055.3 71.8 10 7 176.1 31.2 128.9 27.1 1075.8 62.4 1079.1 58.8 10 8 189.1 35.3 135.4 26.7 1090.0 56.9 1092.2 54.1 10 9 199.6 32.8 143.3 27.3 1104.8 50.4 1109.6 48.7 10 10 209.7 33.8 152.2 29.6 1120.2 48.2 1120.0 49.7 Table 2: The averaged empirical sample complexities of BOC with the original forcedexploration sampling rule described in Algorithm 1 (and indicated by Original in the table) or the linear-exploration Sampling Rule (indicated by LE) with the heuristic threshold function β(δ, t) for different confidence levels δ. to learn their means sufficiently accurately. However, due to the forced exploration rule, they are pulled more often than necessary and this results in the inefficiency. On the other hand, this phenomenon is not observed for the Iris dataset and both Algorithm 1 and the linear-exploration rule work well in ensuring that the approximate proportions of arm pulls (λ (t) or λ (t)) converge to the optimal one dictated by the lower bound. Appendix D. Proofs of Section 5 D.1 Proof of Proposition 10 Proof The proof consists of three steps, using the triangle inequality frequently. First, we show that the K arms chosen in Maximin Initialization are selected from K disjoint clusters. Let m1, m2, . . . , m K denote the arms chosen in Maximin Initialization in order. Assuming that we have already taken the empirical estimates of k arms as the cluster centers, consider A k, the set of the arms that shares the same true cluster index with at least one of the existing centers. For any arm m A k, the Euclidean distance to the nearest existing center can be upper bounded as follows: min 1 k k ˆµm ˆµ(k) = min 1 k k ˆµm ˆµmk min 1 k k ( ˆµm µ(cm ) + µ(cm ) µ(cmk) + µ(cmk) ˆµmk ) 2 max m [M] ˆµm µ(cm) + µ(cm ) µ(cmk) = 2 max m [M] ˆµm µ(cm) + min 1 k k µ(cm ) µ(cmk) Optimal Clustering with Bandit Feedback = 2 max m [M] ˆµm µ(cm) where the last equality results from the fact that at least one of the existing centers shares the same true cluster index as that of arm m . On the other hand, for any arm m [M] \ A k, the Euclidean distance to the nearest existing center can be lower bounded as follows: min 1 k k ˆµm ˆµ(k) = min 1 k k ˆµm ˆµmk min 1 k k ( ˆµm µ(cm ) + µ(cm ) µ(cmk) µ(cmk) ˆµmk ) 2 max m [M] ˆµm µ(cm) + µ(cm ) µ(cmk) = 2 max m [M] ˆµm µ(cm) + min 1 k k µ(cm ) µ(cmk) 2 max m [M] ˆµm µ(cm) + min k,k [K]:k =k µ(k) µ(k ) where the last inequality follows from the fact that none of the existing centers shares the same true cluster index with arm m . The above upper bound and lower bound, together with the constraint on the accuracy of the empirical estimates, show that the k-th center must come from [M] \ A k. Therefore, Maximin Initialization succeeds in choosing K centers from K disjoint clusters. Second, we prove that after the first step (Line 6) in the first iteration of Weighted K-means, ˆc is a correct partition, i.e., ˆc c. For any arm m [M], we can always find exactly one arm (denoted as m k) from m1, m2, . . . , m K that shares the same true cluster index since m1, m2, . . . , m K are selected from K disjoint clusters. The distance between ˆµm and ˆµ( k) can be upper bounded as follows: ˆµm ˆµ( k) ˆµm µ(cm ) + µ(cm ) ˆµ( k) = ˆµm µ(cm ) + µ(cm k) ˆµm k 2 max m [M] ˆµm µ(cm) . However, with respect to the remaining cluster centers, we have min k [K]:k = k ˆµm ˆµ(k) min k [K]:k = k ( ˆµm µ(cm ) + µ(cm ) µ(cmk) µ(cmk) ˆµ(k) ) = min k [K]:k = k ( ˆµm µ(cm ) + µ(cm ) µ(cmk) µ(cmk) ˆµmk ) min k [K]:k = k 2 max m [M] ˆµm µ(cm) + µ(cm ) µ(cmk) = 2 max m [M] ˆµm µ(cm) + min k [K]:k = k µ(cm k) µ(cmk) 2 max m [M] ˆµm µ(cm) + min k,k [K]:k =k µ(k) µ(k ) > 2 max m [M] ˆµm µ(cm) + 4 max m [M] ˆµm µ(cm) Yang, Zhong, and Tan = 2 max m [M] ˆµm µ(cm) ˆµm ˆµ( k) . Hence, ˆcm = arg min k [K] ˆµm ˆµ(k) = k. Since the arm m is arbitrary, the above argument shows those arms that share the same true cluster index still share the same cluster index in ˆc. Besides, there are K disjoint non-empty clusters in ˆc. Therefore, ˆc is a correct partition. Finally, we prove that the partition ˆc no longer changes after the first iteration of Weighted K-means, which we term as the Clustering is stabilized. We need to show after the update (Line 7) of ˆµ(k), arg mink [K] ˆµm ˆµ(k) still returns ˆcm for any arm m [M]. Since ˆc c, there exists a permutation σ on [K] such that c = σ(ˆc). For any cluster k [K], ˆµ( k) µ(σ( k)) = P m [M] Nmˆµm1{ˆcm = k} P m [M] Nm1{ˆcm = k} µ(σ( k)) P m [M] Nm1{cm = σ( k)}(ˆµm µ(σ( k))) P m [M] Nm1{cm = σ( k)} P m [M] Nm1{cm = σ( k)}(ˆµm µ(cm)) P m [M] Nm1{cm = σ( k)} P m [M] Nm1{cm = σ( k)} ˆµm µ(cm) P m [M] Nm1{cm = σ( k)} max m [M] ˆµm µ(cm) . (24) Then for any arm m [M], the distance between ˆµm and ˆµ(ˆcm ) can be upper bounded as follows ˆµm ˆµ(ˆcm ) ˆµm µ(cm ) + µ(cm ) ˆµ(ˆcm ) = ˆµm µ(cm ) + µ(σ(ˆcm )) ˆµ(ˆcm ) 2 max m [M] ˆµm µ(cm) while the minimal distance between ˆµm and other centers can be lower bounded as follows min k [K]:k =ˆcm ˆµm ˆµ(k) min k [K]:k =ˆcm ( ˆµm µ(cm ) + µ(cm ) µ(σ(k)) µ(σ(k)) ˆµ(k) ) = min k [K]:k =ˆcm ( ˆµm µ(cm ) + µ(σ(ˆcm )) µ(σ(k)) µ(σ(k)) ˆµ(k) ) min k [K]:k =ˆcm 2 max m [M] ˆµm µ(cm) + µ(σ(ˆcm )) µ(σ(k)) Optimal Clustering with Bandit Feedback = 2 max m [M] ˆµm µ(cm) + min k [K]:k =ˆcm µ(σ(ˆcm )) µ(σ(k)) 2 max m [M] ˆµm µ(cm) + min k,k [K]:k =k µ(k) µ(k ) > 2 max m [M] ˆµm µ(cm) + 4 max m [M] ˆµm µ(cm) = 2 max m [M] ˆµm µ(cm) ˆµm ˆµ(ˆcm ) . Therefore, arg mink [K] ˆµm ˆµ(k) remains equal to ˆcm . Since the arm m is arbitrary, the partition is proved to be stabilized after the first iteration of Weighted K-means. Moreover, ˆµ(k) for all k [K] will stay the same once the partition stabilizes. Hence, by Equation (24), max k [K] ˆµ(k) µ(σ(k)) max m [M] ˆµm µ(cm) . Now the proof of Proposition 10 is completed. D.2 Proof of Proposition 12 Proof Due to the forced exploration in Algorithm 1 and the strong law of large numbers, for all m [M], ˆµm(t) converges almost surely to µ(cm), i.e., ˆµm(t) a.s. µ(cm), as t tends to infinity. In the following, we condition on the event E = n lim t ˆµm(t) = µ(cm) for all m [M] o which has probability 1. Note that in finite-dimensional spaces, pointwise convergence and convergence in Euclidean norm are equivalent. Therefore, by Proposition 10, we know that for sufficiently large t, Algorithm 2 will output a correct partition ct such that ct = σt(c) for some permutation σt on [K]; moreover, µt(σt(k)) µ(k) for all k [K]. Therefore, Alt(ct) = Alt(c) for sufficiently large t, and σt(Ut) U. Hence, for sufficiently large t, λ (t) = arg max λ PM inf (c ,U ) Alt(ct) m=1 λm µt(ct m) µ (c m) 2 = arg max λ PM inf (c ,U ) Alt(c) m=1 λm µt(σt(cm)) µ (c m) 2. By Proposition 9, λ (t) arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 Yang, Zhong, and Tan Consequently, λ (t) converges pointwisely to λ . That is to say, for any ϵ > 0, there exists t0(ϵ) such that sup t>t0(ϵ) max m [M] |λ m(t) λ m| ϵ 3(M 1). By Lemma 22, there further exists t1(ϵ) t0(ϵ) such that sup t>t1(ϵ) max m [M] This is also equivalent to lim t Nm(t) t = λ m for all m [M] as desired. D.3 Proof of Proposition 15 Proof Recall the definitions of Z(t), Z1(t), and Z2(t) from Section 5.3 or Algorithm 1. We can then bound the error probability as follows: Pr(τδ < , cout c) Pr( t N : ct 1 c, Z(t) β(δ, t)) = Pr t N : ct 1 c, 1 αZ1(t) + α α + 1Z2(t) β(δ, t) (25) αZ1(t) + α α + 1 m=1 Nm(t) µt 1(ct 1 m ) µ(cm) 2 ! m=1 Nm(t) ˆµm(t) µ(cm) 2 β(δ, t) Line (25) follows from Lemma 23. Line (26) follows from the fact that if ct 1 c, then (c, U) Alt(ct 1) and hence m=1 Nm(t) µt 1(ct 1 m ) µ(cm) 2. Optimal Clustering with Bandit Feedback Line (27) follows from Lemma 24 and thus αZ1(t) + α α + 1 m=1 Nm(t) µt 1(ct 1 m ) µ(cm) 2 ! m=1 Nm(t) α ˆµm(t) µt 1(ct 1 m ) 2 + α α + 1 µt 1(ct 1 m ) µ(cm) 2 ! m=1 Nm(t) ˆµm(t) µ(cm) 2 ! m=1 Nm(t) ˆµm(t) µ(cm) 2. Line (28) follows from Lemma 25. D.4 Proof of Proposition 18 Proof Similarly to the proof of Proposition 12, in the following, we condition on the event E = n lim t ˆµm(t) = µ(cm) for all m [M] o which has probability 1. Note that in finite-dimensional spaces, pointwise convergence and convergence in Euclidean norm are equivalent. By Proposition 10, there exists t1 > 0 such that for all t t1, Algorithm 2 will output a correct partition ct = σt(c) for some permutation σt on [K]; moreover, µt(σt(k)) µ(k) for all k [K]. Therefore, for all m [M], µt 1(ct 1 m ) µ(cm). Accordingly, on the event E, as t tends to infinity, ˆµm(t) µt 1(ct 1 m ) 2 0 for all m [M]. Since Nm(t)/t is uniformly bounded by 1 for all m [M], we have t ˆµm(t) µt 1(ct 1 m ) 2 0. Now we consider Z2(t)/t. By Proposition 12 and its proof, conditioned on the event E, Nm(t)/t λ m for all m [M]. When t > t1, ct 1 is always correct (i.e., ct 1 c) and hence Alt(ct 1) = Alt(c). Thus, for t > t1, t = min (c ,U ) Alt(ct 1) t µt 1(ct 1 m ) µ (c m) 2 = min (c ,U ) Alt(c) t µt 1(σt 1(cm)) µ (c m) 2. Yang, Zhong, and Tan By Proposition 6, as t tends to infinity, t min (c ,U ) Alt(c) m=1 λ m µ(cm) µ (c m) 2 = 2D (c, U) 1. Consequently, !2 D (c, U) 1. So for any 0 < ϵ < 1, there exists t2 > t1 such that for all t t2, t (1 ϵ)D (c, U) 1. Now consider β(δ, t). Since Nm(t) t for m [M], it holds that there exists t3 > 0 such that for all t t3, M X m=1 2d log(4 + log(Nm(t))) log(t) which implies that β(δ, t) log(t) + Md Ψ log(1/δ) Altogether, we have τδ = inf{t N : Z(t) β(δ, t)} max{t2, t3} inf t N : t(1 ϵ)D (c, U) 1 log(t) + Md Ψ log(1/δ) which shows τδ is finite conditioned on E. Since Pr(E) = 1, we have Pr(τδ < ) = 1. For ease of notation, let a := (1 ϵ)D (c, U) 1 and b := Md Ψ log(1/δ) Md . For suffi- ciently small δ, since b log(1/δ), b + log 1 a > 0. Thus, by Lemma 26, τδ max t2, t3, 1 + log b + log 1 Note that t2 and t3 do not depend on δ and Ψ(x) = x + log(x) + o(log(x)) as x . Thus, lim sup δ 0 τδ log(1/δ) lim sup δ 0 1 a log(1/δ) + log b + log 1 a = (1 ϵ) 1D (c, U). Optimal Clustering with Bandit Feedback Since the above inequality holds for any 0 < ϵ < 1, by letting ϵ 0+, lim sup δ 0 τδ log(1/δ) D (c, U). D.5 Proof of Theorem 19 Proof Let 0 < ϵ < 1. First, we consider the sampling rule. According to Proposition 10 and Proposition 9, there exists a function ξ : (0, 1) 0, 1 4 min k,k [K]:k =k µ(k) µ(k ) such that limϵ 0+ ξ(ϵ) = 0 and for any t > 0, if max m [M] ˆµm(t) µ(cm) ξ(ϵ), (i) Algorithm 2 outputs a correct partition ct such that ct = σt(c) for some permutation σt on [K]; (ii) Ut satisfies that max k [K] µt(σt(k)) µ(k) ξ(ϵ); (iii) λ (t) satisfies that max m [M] |λ m(t) λ m| ϵ. Point (iii) is, in fact, a consequence of the continuity of λ (t) (see Proposition 9): lim σt(Ut) U λ (t) = lim σt(Ut) U arg max λ PM inf (c ,U ) Alt(ct) m=1 λm µt(ct m) µ (c m) 2 = lim σt(Ut) U arg max λ PM inf (c ,U ) Alt(c) m=1 λm µt(σt(cm)) µ (c m) 2 = arg max λ PM inf (c ,U ) Alt(c) m=1 λm µ(cm) µ (c m) 2 In addition, let T M/ϵ3 N and define the event max m [M] ˆµm(t) µ(cm) ξ(ϵ) . Yang, Zhong, and Tan Conditioned on the event ET (ϵ), by Lemma 22 and the definition of ξ(ϵ), for all t T, Now we introduce D ϵ(c, U) 1, which is an ϵ-approximation of D (c, U) 1, and defined as D ϵ(c, U) 1 := 1 2 inf ˆλ, ˆU min (c ,U ) Alt(c) m=1 ˆλm ˆµ(cm) µ (c m) 2 where the infimum is over all ˆλ and ˆU such that ˆλm λ m 3(M 1)ϵ, max k [K] ˆµ(k) µ(k) ξ(ϵ). By the continuity of ϵ 7 D ϵ(c, U) 1 at 0 (which is a consequence of Proposition 6), lim ϵ 0+ D ϵ(c, U) 1 = 1 2 min (c ,U ) Alt(c) m=1 λ m µ(cm) µ (c m) 2 = D (c, U) 1. Therefore, conditioned on the event ET (ϵ), for all t > T, Z2(t)/t 2D ϵ(c, U) 1. Concerning Z1(t), conditioned on the event ET (ϵ), for all t > T, using the inequality in Lemma 24 (with α = 1), t ˆµm(t) µt 1(ct 1 m ) 2 t ˆµm(t) µ(cm) 2 + µ(cm) µt 1(ct 1 m ) 2 t ˆµm(t) µ(cm) 2 + µ(cm) µt 1(σt 1(cm)) 2 max m [M] ˆµm(t) µ(cm) + max k [K] µt 1(σt 1(k)) µ(k) Altogether, conditioned on ET (ϵ), for all t > T, D ϵ(c, U) 1 p Optimal Clustering with Bandit Feedback Then consider β(δ, t). Since Nm(t) t for m [M], it holds that there exists t2 > 0 such that for all t t2, M X m=1 2d log(4 + log(Nm(t))) log(t) which implies that β(δ, t) log(t) + Md Ψ log(1/δ) For ease of notation, let a := p D ϵ(c, U) 1 p 2 and b := Md Ψ log(1/δ) Consequently, conditioned on ET (ϵ), τδ = inf{t N : Z(t) β(δ, t)} max{T + 1, t2} inf {t N : at log(t) + b} . Let t3 := max{ M/ϵ3 + 1, t2} inf {t N : at log(t) + b}. Then for all T + 1 t3, ET (ϵ) {τδ T + 1} and hence Pr(τδ > T + 1) Pr(Ec T (ϵ)). Therefore, E[τδ] can be upper bounded as follows: t=0 Pr(τδ > t) t=0 Pr(τδ > t) + t=t3 Pr(τδ > t) T=t3 1 Pr(Ec T (ϵ)) T= M/ϵ3 Pr(Ec T (ϵ)). Since limϵ 0+ a = D (c, U) 1, a > 0 for sufficiently small ϵ. Together with b log(1/δ), we have b + log 1 a > 0 for sufficiently small δ. Thus, by Lemma 26, E[τδ] max M + log b + log 1 T= M/ϵ3 Pr(Ec T (ϵ)). For the moment, let us assume that the final term P T= M/ϵ3 Pr(Ec T (ϵ)) is finite. Note that M/ϵ3 , t2 and P T= M/ϵ3 Pr(Ec T (ϵ)) do not depend on δ and Ψ(x) = x + log(x) + o(log(x)) as x . Thus, lim sup δ 0 E[τδ] log(1/δ) lim sup δ 0 1 a log(1/δ) + log b + log 1 D ϵ(c, U) 1 p Yang, Zhong, and Tan Since the above inequality holds for any ϵ > 0 such that a > 0, by letting ϵ 0+, lim sup δ 0 τδ log(1/δ) D (c, U). It remains to show that P T= M/ϵ3 Pr(Ec T (ϵ)) is finite. Since max m [M] ˆµm(t) µ(cm) > ξ(ϵ) , by a union bound, we have Pr(Ec T (ϵ)) m [M] Pr( ˆµm(t) µ(cm) > ξ(ϵ)) m [M] Pr i [d] : |[ˆµm(t) µ(cm)]i| > ξ(ϵ) i [d] Pr |[ˆµm(t) µ(cm)]i| > ξ(ϵ) where we use [x]i to denote the ith component of the vector x Rd. Considering sufficiently large T, for any t Tϵ3 , Nm(t) t/2 for all m [M]. Then, by Hoeffding s inequality for sub-Gaussian random variables, Pr(Ec T (ϵ)) 2Md t= Tϵ3 exp ξ(ϵ)2 x= Tϵ3 1 exp ξ(ϵ)2 x Since for any constant c > 0, the infinite sum P n=1 n exp( c n) is convergent, P T= M/ϵ3 Pr(Ec T (ϵ)) is also convergent. This completes the proof of Theorem 19. Appendix E. Additional Numerical Results on a Higher-Dimensional Dataset In this appendix, we extend our numerical experiments to the MNIST dataset (Le Cun et al., 1998), which comprises data points of higher dimensions. Specifically, we conduct Optimal Clustering with Bandit Feedback d BOC Uniform 10 2932.0 52.4 6991.8 120.8 50 8782.6 110.8 16418.7 173.8 100 16663.2 162.8 31270.8 273.5 200 32840.5 138.8 61543.7 275.3 400 65414.2 209.6 122508.3 369.9 Table 3: The averaged empirical sample complexities of BOC and Uniform for δ = 0.01 on the MNIST dataset. our online clustering task as follows. First, we randomly sample 500 images of digits from the dataset, vectorize them, and, akin to Sheikh et al. (2020), we employ Principal Component Analysis (PCA) to project the data onto different number of dimensions. Subsequently, we preprocess the dataset consistently with our approach to the Iris and Yeast datasets (see Section 6). This yields the number of clusters K = 10 and the number of arms M = 500, with the problem dimension d set to be in {10, 50, 100, 200, 400}. Note that the original dimensionality of the data points in the MNIST dataset is 784 = 28 28. The sample complexities for confidence level δ = 0.01 are presented in Table 3. It is evident from the table that BOC outperforms the non-adaptive baseline method, Uniform, by a factor of 2 in terms of their sample complexities, underscoring the advantage of employing an adaptive sampling approach, even on this dataset with large dimensionality. Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011. Emmanuel Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:6446 6531, 2017. Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory (COLT), pages 39 1. JMLR Workshop and Conference Proceedings, 2012. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245 248, 2009. D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027 1035, 2007. Jean-Yves Audibert, S ebastien Bubeck, and R emi Munos. Best arm identification in multiarmed bandits. In Conference on Learning Theory (COLT), pages 41 53, 2010. Yang, Zhong, and Tan Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235 256, 2002. C. Berge. Topological Spaces: Including a Treatment of Multi-valued Functions, Vector Spaces, and Convexity. Oliver & Boyd, 1963. Djallel Bouneffouf, Srinivasan Parthasarathy, Horst Samulowitz, and Martin Wistuba. Optimal exploitation of clustering and history information in multi-armed bandit. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2016 2022, 2019. S ebastien Bubeck and Nicol o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends R in Machine Learning, 5(1): 1 122, 2012. Emil Carlsson, Devdatt Dubhashi, and Fredrik D. Johansson. Thompson sampling for bandits with clustered arms. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, pages 2212 2218, 2021. Raymond B. Cattell. The description of personality: Basic traits resolved into clusters. The Journal of Abnormal and Social Psychology, 38(4):476, 1943. M Emre Celebi, Hassan A Kingravi, and Patricio A Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200 210, 2013. Anil Chaturvedi, J Douglas Carroll, Paul E Green, and John A Rotondo. A featurebased approach to market segmentation via overlapping k-centroids clustering. Journal of Marketing Research, 34(3):370 377, 1997. Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755 770, 1959. Anna Choromanska and Claire Monteleoni. Online clustering with experts. In International Conference on Artificial Intelligence and Statistics, pages 227 235. PMLR, 2012. Vincent Cohen-Addad, Benjamin Guedj, Varun Kanade, and Guy Rom. Online k-means clustering. In International Conference on Artificial Intelligence and Statistics, pages 1126 1134. PMLR, 2021. Harold Edson Driver and Alfred Louis Kroeber. Quantitative Expression of Cultural Relationships, volume 31. Berkeley: University of California Press, 1932. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:// archive.ics.uci.edu/ml. Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6), 2006. Optimal Clustering with Bandit Feedback Yifan Feng, Rene Caldentey, and Christopher Thomas Ryan. Robust learning of consumer preferences. Operations Research, 2021. Edward W. Forgy. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21:768 769, 1965. Aur elien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998 1027. PMLR, 2016. Manlio Gaudioso, Giovanni Giallombardo, and Giovanna Miglionico. An incremental method for solving convex finite min-max problems. Mathematics of Operations Research, 31(1):173 187, 2006. Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757 765. PMLR, 2014. Cristina Gigola and Susana Gomez. A regularization method for solving the finite convex min-max problem. SIAM Journal on Numerical Analysis, 27(6):1621 1634, 1990. Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293 306, 1985. Ronald L. Graham, Donald E. Knuth, Oren Patashnik, and Stanley Liu. Concrete mathematics: A foundation for computer science. Computers in Physics, 3(5):106 107, 1989. Jeffrey W. Herrmann. A genetic algorithm for minimax optimization problems. In Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406), volume 2, pages 1099 1103. IEEE, 1999. Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry, pages 332 339, 1994. Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31 (8):651 666, 2010. Anil K. Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: A review. ACM Computing Surveys (CSUR), 31(3):264 323, 1999. Yassir Jedra and Alexandre Proutiere. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007 10017, 2020. Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In International Conference on Artificial Intelligence and Statistics, pages 139 148. PMLR, 2016. Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238 1246. PMLR, 2013. Yang, Zhong, and Tan P. N. Karthik and Rajesh Sundaresan. Learning to detect an odd Markov arm. IEEE Transactions on Information Theory, 66(7):4324 4348, 2020. P. N. Karthik and Rajesh Sundaresan. Detecting an odd restless Markov arm with a trembling hand. IEEE Transactions on Information Theory, 67(8):5230 5258, 2021. Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22 (246):1 44, 2021. Emilie Kaufmann, Olivier Capp e, and Aur elien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17 (1):1 42, 2016. Azadeh Khaleghi, Daniil Ryabko, J er emie Mary, and Philippe Preux. Online clustering of processes. In International Conference on Artificial Intelligence and Statistics, pages 601 609. PMLR, 2012. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 22, 1985. Tor Lattimore and Csaba Szepesv ari. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 728 737. PMLR, 2017. Tor Lattimore and Csaba Szepesv ari. Bandit Algorithms. Cambridge University Press, 2020. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539 548, 2016. Yonglong Li and Vicent Y. F. Tan. Second-order asymptotics of sequential hypothesis testing. IEEE Transactions on Information Theory, 66(11):7222 7230, 2020. Edo Liberty, Ram Sriharsha, and Maxim Sviridenko. An algorithm for online k-means clustering. In 2016 Proceedings of the eighteenth workshop on algorithm engineering and experiments (ALENEX), pages 81 89. SIAM, 2016. Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129 137, 1982. John David Mac Cuish and Norah E. Mac Cuish. Clustering in Bioinformatics and Drug Discovery. CRC Press, 2010. James Mac Queen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281 297. Oakland, CA, USA, 1967. Optimal Clustering with Bandit Feedback Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. Theoretical Computer Science, 442:13 21, 2012. Mikhail B. Malyutov and Ivan I. Tsitovich. Second-Order Optimal Sequential Tests, volume 51, chapter 7, pages 201 213. Atkinson A., Bogacka B., Zhigljavsky A. (eds) Optimum Design, Springer, Boston, MA., 2001. Arya Mazumdar and Barna Saha. Clustering with noisy queries. Advances in Neural Information Processing Systems, 30, 2017. Trong T. Nguyen and Hady W. Lauw. Dynamic clustering of contextual multi-armed bandits. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1959 1962, 2014. Gayathri R. Prabhu, Srikrishna Bhashyam, Aditya Gopalan, and Rajesh Sundaresan. Sequential multi-hypothesis testing in multi-armed bandit problems: An approach for asymptotic optimality. ar Xiv preprint ar Xiv:2007.12961, 2020. Cl emence R eda, Andrea Tirinzoni, and R emy Degenne. Dealing with misspecification in fixed-confidence linear top-m identification. Advances in Neural Information Processing Systems, 34, 2021. Enrique H. Ruspini. A new approach to clustering. Information and Control, 15(1):22 32, 1969. Amit Saxena, Mukesh Prasad, Akshansh Gupta, Neha Bharill, Om Prakash Patel, Aruna Tiwari, Meng Joo Er, Weiping Ding, and Chin-Teng Lin. A review of clustering techniques and developments. Neurocomputing, 267:664 681, 2017. Ruksar Sheikh, Mayank Patel, and Amit Sinhal. Recognizing MNIST Handwritten Data Set Using PCA and LDA, pages 169 177. In book: International Conference on Artificial Intelligence: Advances and Applications 2019, Feb 2020. Rahul Singh, Fang Liu, Yin Sun, and Ness Shroff. Multi-armed bandits with dependent arms. ar Xiv preprint ar Xiv:2010.09478, 2020. Marta Soare, Alessandro Lazaric, and R emi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems 27, 2014. Ting Su and Jennifer G Dy. In search of deterministic methods for initializing k-means and Gaussian mixture clustering. Intelligent Data Analysis, 11(4):319 338, 2007. Rangarajan K. Sundaram. A First Course in Optimization Theory. Cambridge University Press, 1996. William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Nidhin Koshy Vaidhiyan and Rajesh Sundaresan. Learning to detect an oddball target. IEEE Transactions on Information Theory, 64(2):831 852, 2017. Yang, Zhong, and Tan Zhenlin Wang and Jonathan Scarlett. Max-min grouped bandits. In Proc. of the 36th AAAI Conference on Artificial Intelligence. AAAI Press, 2022. Junwen Yang and Vincent Y. F. Tan. Minimax optimal fixed-budget best arm identification in linear bandits. Advances in Neural Information Processing Systems, 35:12253 12266, 2022. Zixin Zhong, Wang Chi Cheung, and Vincent Y. F. Tan. Best arm identification for cascading bandits in the fixed confidence setting. In International Conference on Machine Learning, pages 11481 11491. PMLR, 2020.