Gaussian Process Bandits for Top-k Recommendations

Mohit Yadav (University of Massachusetts Amherst, ymohit@cs.umass.edu), Daniel Sheldon (University of Massachusetts Amherst, sheldon@cs.umass.edu), Cameron Musco (University of Massachusetts Amherst, cmusco@cs.umass.edu)

Abstract: Algorithms that utilize bandit feedback to optimize top-k recommendations are vital for online marketplaces, search engines, and content platforms. However, the combinatorial nature of this problem poses a significant challenge: the number of possible ordered top-k recommendations from n items grows exponentially with k. As a result, previous work often relies on restrictive assumptions about the reward or bandit feedback models, such as assuming that the feedback discloses rewards for each recommended item rather than a single scalar feedback for the entire set of top-k recommendations. We introduce a novel contextual bandit algorithm for top-k recommendations, leveraging a Gaussian process with a Kendall kernel to model the reward function. Our algorithm requires only scalar feedback on the full top-k recommendation and does not impose restrictive assumptions on the reward structure. Theoretical analysis confirms that the proposed algorithm achieves sub-linear regret in the number of rounds and arms. Additionally, empirical results using a bandit simulator demonstrate that the proposed algorithm outperforms other baselines across various scenarios.

1 Introduction

The top-k recommendation problem involves providing a ranked list of k items, such as news articles or products, from a pool of n items [34, 13]. Online algorithms must adapt to dynamic user preferences, making bandit algorithms, which learn from limited feedback, a natural fit [1]. Developing such algorithms is challenging because of this limited feedback and the need for computational efficiency in real-time recommendation environments.
Recent research on user interfaces for recommendations highlights that the overall layout of the recommendation page is crucial for user appeal, as modern UI designs have evolved from simple dropdown lists to complex, visually engaging layouts [17, 13, 18]. Consequently, bandit algorithms must jointly select and display all top-k items, rather than simply choosing the k most relevant items and ordering them by decreasing user relevance [31]. The joint consideration of top-k items makes the number of arms (possible actions for the bandit algorithm) combinatorially large, i.e., $\Theta(n^k)$. Previous research on bandit algorithms often imposes strict assumptions on feedback models [30, 21]. For instance, semi-bandit feedback provides a scalar reward for each of the top k items, thus decomposing the combinatorial feedback into item-level feedback. However, this type of feedback is frequently unavailable [32]. Another common feedback model is cascade browsing [16], which assumes that users examine items in a predetermined order and stop browsing once a desirable item is found; it offers item-specific scalar feedback but fails to capture potential non-linear interactions among items [26]. Figure 1 illustrates the limitations of the cascade model in capturing user interactions within modern top-k recommendation interfaces. These limitations motivate us to adopt a more general full-bandit feedback setting, where only a single scalar value is provided for the entire top-k set of recommendations [24].

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: A snapshot from Etsy showcases Father's Day shopping recommendations. The lack of an obvious linear search order challenges the assumptions of the cascade model.
Additionally, the proximity and arrangement of items are likely to influence clicks, indicating complex interaction patterns and further supporting the need for full-bandit feedback that makes no assumptions about how users interact with recommended items.

Table 1: Compute and memory analysis for the proposed GP-TopK bandit algorithm. Rows represent different costs: total compute and memory of the GP-TopK algorithm for T rounds, time for a matrix-vector multiplication (mvm) with the kernel matrix $K_{X_t}$ in the t-th round, and time to update $K_{X_t}$. Columns represent different approaches: the kernel approach, which uses full kernel matrices, and our novel feature approach, which performs the same operations through feature expansions and scales more efficiently with respect to T. The symbols c, k, and T denote the context embedding size, the number of recommended items, and the number of rounds, respectively.

  Task             Kernel approach    Feature approach
  compute          O(T^3)             O(c k^2 T^2)
  memory           O(T^2)             O(c k^2 T)
  mvm(K_Xt)        O(t^2)             O(c k^2 t)
  update K_Xt      O((c + k^2) t)     O(c k^2)

Beyond feedback assumptions, prior bandit algorithms typically require the reward structure to be decomposable into scalar values for individual items to prevent a combinatorial explosion of arms, which is not always feasible. For example, modern e-commerce platforms value and track metrics such as diversity and fairness [1], which cannot be captured by focusing solely on individual items [15]. This necessitates algorithms for full-bandit feedback settings that operate without specific assumptions about the objectives or reward structures [24]. This work introduces a bandit algorithm that uses Gaussian processes (GPs) to model rewards under full-bandit feedback (i.e., a single scalar value). GPs are selected for their flexibility in accommodating feedback across discrete, continuous, and mixed domains, such as continuous contexts and discrete rankings [33].
Additionally, unlike parametric models that must be re-optimized to incorporate accumulated feedback from previous rounds, GP updates are computationally efficient, involving only data updates [24]. Although GP inference may face computational limits, we develop efficient inference methods tailored to our proposed algorithm. A further challenge in designing GP-based bandit algorithms for top-k recommendations is constructing expressive positive-definite kernels that capture similarities between top-k recommendations [9]. This work mitigates both the computational and the modeling challenges, as illustrated in the following sections. Broadly speaking, GPs have been previously explored for bandit algorithms [27, 19]. Krause et al. [14] employed GPs for contextual bandits in continuous domains; we focus on the discrete domain of top-k recommendations. Vanchinathan et al. [28] used GPs with a position-based feedback model, and Wang et al. [31] used GPs with semi-bandit feedback for recommending top-k items. In contrast, our work does not rely on a specific reward model or feedback assumption, and develops an efficient GP-based bandit algorithm for top-k recommendations.

1.1 Contributions

Our primary contribution is the GP-TopK algorithm, a contextual bandit algorithm for recommending top-k items. This algorithm operates in a full-bandit feedback setting without relying on assumptions about the reward, making it broadly applicable compared to prior work. We leverage GPs with variants of the Kendall kernel [12] to model the reward function and optimize the upper confidence bound (UCB) [27] acquisition function for selecting the next arm. Additionally, we introduce a novel weighted convolutional Kendall kernel for top-k recommendations that addresses pathologies arising when existing variants of the Kendall kernel are applied to top-k recommendations. Our second key contribution is enhancing the scalability of the GP-TopK algorithm for longer time horizons.
Initially, the computational cost of top-k recommendations with the GP-TopK algorithm is $O(T^4)$ for $T$ rounds. We first reduce this to $O(T^3)$ by leveraging iterative algorithms from numerical linear algebra [25]. Next, we derive sparse feature representations for the novel weighted convolutional Kendall kernel, further reducing the compute requirement from $O(T^4)$ to $O(T^2)$ and the memory requirement from $O(T^2)$ to $O(T)$. Table 1 summarizes these improvements in time and memory, including their dependence on other parameters. We provide a theoretical analysis showing that GP-TopK's regret is sub-linear in $T$, benefiting from the feature representations of the Kendall kernels introduced in this work. Specifically, we establish an upper bound on regret that is nearly quadratic in $n$, significantly improving over the naive $\Theta(n^k)$ bound for top-k recommendations without feature representations [27]. Finally, we empirically validate GP-TopK's regret on real-world datasets, demonstrating improvement over baseline methods.

1.2 Organization

The remainder of this paper is organized as follows: Section 2 introduces Kendall kernels for full and top-k rankings, including the novel weighted convolutional Kendall kernel. Section 3 presents faster matrix-vector multiplication algorithms for Kendall kernels, enhancing the efficiency of the proposed bandit algorithm, which is detailed along with its regret analysis in Section 4. Finally, Sections 5 and 6 present empirical results and concluding discussion, respectively.

2 Kendall Kernels for Full and Top-k Rankings

This section introduces Kendall kernels and their extensions for top-k recommendations, forming the foundation of our approach. We first establish key notation and then present Sections 2.1 and 2.2, which introduce Kendall kernels for full rankings and top-k rankings, respectively.

Notation: Let $[n] = \{1, 2, \ldots, n\}$, with $\pi$ representing a top-k ranking, i.e., an ordered tuple of $k$ distinct elements of $[n]$. For a full ranking ($k = n$), we use $\sigma$, and we denote the set of all possible top-k rankings by $\Pi_k$, with cardinality $|\Pi_k| = \Theta(n^k)$. To capture ranking positions, the vector $p^\sigma \in \mathbb{R}^n$ corresponds to a full ranking $\sigma$, with entry $p^\sigma_i$ giving the rank of item $i$. For top-k rankings, $p^\pi \in \mathbb{R}^n$ is constructed similarly, by arbitrarily assigning distinct ranks to items not in the top k. For relative ranks, the indicator functions $p^\sigma_{i<j}$ and $p^\sigma_{i>j}$ denote whether item $i$ is ranked before or after item $j$ in $\sigma$; $p^\pi_{i<j}$ and $p^\pi_{i>j}$ are the analogous indicators for top-k rankings.

2.1 Kendall Kernels for Full Rankings

Jiao et al. [9] showed that the Kendall tau rank correlation coefficient [12] is a positive definite (p.d.) kernel for full rankings, which we refer to as the standard Kendall (SK) kernel. The weighted Kendall (WK) kernel generalizes the SK kernel by differentially weighting item pairs [10]. Specifically, the SK and WK kernels for full rankings $\sigma_1, \sigma_2$ are defined as:

$$k_{sk}(\sigma_1, \sigma_2) := \frac{1}{\binom{n}{2}} \sum_{i<j} \big(p^{\sigma_1}_{i<j} - p^{\sigma_1}_{i>j}\big)\big(p^{\sigma_2}_{i<j} - p^{\sigma_2}_{i>j}\big),$$

$$k_{wk}(\sigma_1, \sigma_2) := \frac{1}{\binom{n}{2}} \sum_{i<j} w_s(p^{\sigma_1}_i, p^{\sigma_1}_j)\, w_s(p^{\sigma_2}_i, p^{\sigma_2}_j)\, \big(p^{\sigma_1}_{i<j} - p^{\sigma_1}_{i>j}\big)\big(p^{\sigma_2}_{i<j} - p^{\sigma_2}_{i>j}\big),$$

where $w_s$ is the symmetric weighting function in product-symmetric weights. Then, $\phi_{wk}$ is a linear feature vector for the weighted Kendall kernel with product-symmetric weights $w_{ps}$. Using Claim 2, the linear feature vector for the WK kernel can be extended to the WK top-k ranking kernel by utilizing the structure of product-symmetric weights, which allows weights to be set to zero for items outside of the top-k rankings, as described in Section 2.2. Such a feature vector for the top-k ranking kernel is sparse; specifically, $\phi_{wk}(\pi)$ contains only $O(k^2)$ non-zero entries due to the WK kernel's focus on item pairs within the top k. Consequently, the runtime for mvm($K_{X_t}$) with the WK kernel matrix is reduced to $O(k^2 t)$.

Claim 3.
Let $\phi_{wck}(\pi) : \Pi_k \to \mathbb{R}^{\binom{n}{2}}$ be a vector indexed by unique item pairs $(i, j)$, given as:

$$\phi^{wck}_{i,j}(\pi) := \frac{1}{\sqrt{\binom{n}{2}}}\, w^{wck}_{i,j}(\pi)\, \big(p^\pi_{i<j} - p^\pi_{i>j}\big),$$

where $w^{wck}_{i,j}(\pi)$ is determined as follows:

$$w^{wck}_{i,j}(\pi) = \begin{cases} w_s(p^\pi_i, p^\pi_j) & \text{if } p^\pi_i \in [k] \text{ and } p^\pi_j \in [k], \\ w_s(p^\pi_i, \cdot) & \text{else if } p^\pi_i \in [k] \text{ and } p^\pi_j \notin [k], \\ w_s(p^\pi_j, \cdot) & \text{else if } p^\pi_i \notin [k] \text{ and } p^\pi_j \in [k], \\ 0 & \text{otherwise}, \end{cases}$$

where $w_s$ denotes symmetric weights and $w_s(\ell, \cdot) = \frac{1}{n-k} \sum_{j=k+1}^{n} w_s(\ell, j)$. Then, the vector $\phi_{wck}$ is a linear feature vector for the WCK kernel $k_{wck}$.

By uniformly setting $w_s(\cdot, \cdot) \equiv 1$ in the definitions above, $\phi^{wck}_{i,j}(\pi)$ specializes to a linear feature vector for the CK kernel. Moving forward, we focus on deriving a sparse feature vector for the WCK kernel, enabling fast MVMs with the WCK kernel, which includes the CK kernel as a special case. Notably, any convolutional kernel inherits linear features from its constituent kernel. Specifically, $\sum_{\sigma \in B_\pi} \phi^{wk}_{i,j}(\sigma)$ forms a feature vector for the WCK kernel, which follows from Equation 4. However, computing this feature vector explicitly is computationally challenging, as it requires summing over all $\sigma \in B_\pi$, an exponential number of terms, i.e., $\Theta(n^k)$. In response to this challenge, Claim 3 shows that the summation can be computed analytically and provides explicit linear feature vectors for the WCK and CK kernels. It also shows that $\phi_{wck}$ has only $O(k^2 + 2nk)$ non-zero entries among its $O(n^2)$ total entries. Consequently, mvm($K_{X_t}$) for the WCK kernel requires $O((k^2 + 2nk)\, t)$ operations, improving from $O(t^2)$ to linear in $t$. However, this introduces a dependence on $n$, the number of items, which poses a serious limitation and is beneficial only when $n \ll t$. In the following theorem, we leverage redundancy in $\phi_{wck}$ to eliminate this dependence on $n$, leading to the following main theorem about mvm($K_{X_t}$).

Theorem 1.
For the WCK kernel with product-symmetric weights $w_{ps}$, the computational complexity of multiplying the kernel matrix $K_{X_t}$ with any admissible vector is $O(k^2 t)$, i.e., mvm($K_{X_t}$) $= O(k^2 t)$, where $X_t$ is an arbitrary set of $t$ top-k rankings.

Appendix A provides the proof in two steps. First, we utilize the values of $\phi_{wck}$ from Claim 3 and categorize $\phi_{wck}(\pi_1)^T \phi_{wck}(\pi_2)$ based on item pairs, as summarized in Table 4. Next, we show that only five combinations yield non-zero values, i.e., $\phi_{wck}(\pi_1)^T \phi_{wck}(\pi_2) = \sum_{i=1}^{5} s_i(\pi_1, \pi_2)$. Each term $s_i(\pi_1, \pi_2)$ is a dot product of vectors $\phi_{a_i}(\pi_1)^T \phi_{b_i}(\pi_2)$, each containing at most $O(k^2)$ non-zero entries. Thus, for the WCK and CK kernels, mvm($K_{X_t}$) $= O(k^2 t)$, since these vectors across all five terms include only $O(k^2)$ non-zero entries. Consequently, Theorem 1 demonstrates that employing these vector representations for top-k rankings reduces the cost of MVMs from $O(t^2)$ to $O(k^2 t)$.

4 Proposed GP-TopK Bandit Algorithm

In this section, we begin by formally defining the top-k recommendation problem within a bandit framework and introduce a generic contextual bandit algorithm, detailed in Algorithm 1. We then explain how the components of the algorithm are instantiated using the proposed GP approach, followed by an analysis of its computational complexity and cumulative regret. Let $T$ denote the number of rounds. Contexts are represented in a finite $c$-dimensional space, i.e., $C \subseteq \mathbb{R}^c$. In the $t$-th round, we receive a context $c_t \in C$ and select a top-k ranking $\pi_t \in \Pi_k$. Subsequently, a noisy reward $y_t = \hat f(c_t, \pi_t) + \epsilon_t$ is observed, where $\hat f$ is the true reward function and $\epsilon_t$ is round-independent noise. The regret is defined as $r_t := \max_{\pi^* \in \Pi_k} \hat f(c_t, \pi^*) - \hat f(c_t, \pi_t)$, with cumulative regret $R_T := \sum_{t=1}^{T} r_t$. The accumulated data at the $t$-th round is $D_t = \{(c_i, \pi_i, y_i)\}_{i=1}^{t}$. Below, Algorithm 1 provides a generic schematic of the bandit algorithm.
Algorithm 1: Contextual Bandit Algorithm for Top-k Recommendations
Input: Total rounds T, initial reward model M0, and acquisition function AF.
1: for t = 1, . . . , T do
2:   Observe a context c_t from the context space C.
3:   Select a top-k ranking π_t that maximizes AF(M_{t-1}(c_t, π)) for the context c_t.
4:   Obtain the scalar reward y_t.
5:   Update the reward model M_t using the accumulated feedback D_t.
6: end for

We aim to design the components of Algorithm 1 to minimize cumulative regret while ensuring computational efficiency. This requires two key components: (a) a reward model M_t that estimates the reward for any context and top-k ranking using the accumulated data D_t, and (b) an acquisition function AF for selecting π_t given the reward model M_t and the observed context c_t.

Reward model M and acquisition function AF. The proposed GP-TopK bandit algorithm leverages GP regression to model the reward function over the domain of contexts and top-k rankings; Section B.1 reviews GP regression for completeness. Essentially, the reward model M maintains a distribution over functions $f$, i.e., $f \sim \mathcal{N}(0, k(\cdot, \cdot))$, where $k$ is a product kernel over both contexts and top-k rankings ($C \times \Pi_k$). Specifically, the kernel function $k$ is defined as:

$$k((c_1, \pi_1), (c_2, \pi_2)) := k_c(c_1, c_2)\, k_r(\pi_1, \pi_2), \qquad (6)$$

where $k_c(c_1, c_2) = c_1^T c_2$ is the dot-product kernel and $k_r$ is a kernel for top-k rankings; we use the variants of the Kendall kernel from Section 2. Updating the reward model M_t at the t-th round involves only adding new data points to the GP regression, which is computationally inexpensive compared to the fine-tuning steps required by parametric models to incorporate the latest feedback. We use UCB as the acquisition function, balancing exploration and exploitation by selecting actions that maximize the upper confidence bound of the estimated reward [27].
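To make the pieces above concrete, the following sketch instantiates the arm-selection step (Line 3 of Algorithm 1) with a GP posterior and the UCB rule. It is a simplified illustration rather than the paper's implementation: `kendall_topk` is an unweighted stand-in for the Kendall kernels of Section 2, all names are hypothetical, and the posterior is computed by dense inversion rather than iterative solvers.

```python
import numpy as np
from itertools import combinations

def kendall_topk(pi1, pi2):
    """Unweighted Kendall-style similarity between two top-k rankings:
    +1 for every item pair ordered the same way in both, -1 for every
    disagreement, counted over pairs of items shown in both rankings.
    A simplified stand-in for the paper's Kendall kernel variants."""
    pos1 = {v: i for i, v in enumerate(pi1)}
    pos2 = {v: i for i, v in enumerate(pi2)}
    shared = [v for v in pi1 if v in pos2]
    score = sum(1 if (pos1[a] < pos1[b]) == (pos2[a] < pos2[b]) else -1
                for a, b in combinations(shared, 2))
    k = len(pi1)
    return score / (k * (k - 1) / 2)

def product_kernel(x1, x2):
    """Equation (6): dot-product context kernel times a ranking kernel."""
    (c1, p1), (c2, p2) = x1, x2
    return float(np.dot(c1, c2)) * kendall_topk(p1, p2)

def ucb_select(X, y, context, arms, beta=4.0, noise=0.1):
    """Line 3 of Algorithm 1: return the arm maximizing the GP-UCB
    acquisition mu + sqrt(beta) * sigma, given accumulated data (X, y)."""
    if not X:
        return arms[0]
    K = np.array([[product_kernel(a, b) for b in X] for a in X])
    A = np.linalg.inv(K + noise**2 * np.eye(len(X)))
    alpha = A @ np.array(y)
    best_arm, best_val = None, -np.inf
    for pi in arms:
        x = (context, pi)
        kx = np.array([product_kernel(x, b) for b in X])
        mu = kx @ alpha                                   # posterior mean
        var = max(product_kernel(x, x) - kx @ A @ kx, 0)  # posterior variance
        val = mu + np.sqrt(beta) * np.sqrt(var)
        if val > best_val:
            best_arm, best_val = pi, val
    return best_arm
```

Each round then appends $(c_t, \pi_t, y_t)$ to (X, y); the model "update" of Line 5 is just that append.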
The UCB acquisition function is $AF(M_t(c_t, \pi)) := \mu_{f|D}((c_t, \pi)) + \beta^{1/2} \sigma_{f|D}((c_t, \pi))$, where $\sigma_{f|D}((c_t, \pi)) = \sqrt{k_{f|D}((c_t, \pi), (c_t, \pi))}$ and $\beta$ controls the trade-off between exploration and exploitation. Here, $\mu_{f|D}$ and $k_{f|D}$ are the GP posterior mean and covariance functions, as detailed in Section B.1. At the $t$-th round, the algorithm selects the top-k ranking $\pi \in \Pi_k$ that maximizes $AF(M_t(c_t, \pi))$, which is performed using local search [19], as detailed further in Appendix B.

Computational complexity. The GP-TopK bandit algorithm requires no compute for model updates: updating M_t in Line 5 of Algorithm 1 amounts to appending to the list of accumulated feedback data D_t. GP-TopK relies on local search to optimize AF, so its computational demands stem solely from AF evaluations within the local search. As shown in Section B.1, computing the GP variance term $\sigma_{f|D}((c_t, \pi))$ for evaluating AF involves solving $(K_{X_t} + \sigma^2 I)^{-1} v$ for a vector $v$, where $X_t = [(c_1, \pi_1), \ldots, (c_t, \pi_t)]$. Naively, this operation requires $O(t^3)$ time per round, amounting to $O(T^4)$ in total over $T$ rounds. Iterative algorithms, however, can expedite the process by leveraging fast MVMs with kernel matrices, as discussed in Section 3. Theorem 2 below formalizes the computational demands of the GP-TopK algorithm.

Theorem 2. Assuming a fixed number of iterations for the iterative algorithms, the total computational time for running the GP-TopK bandit algorithm for $T$ rounds of top-k recommendations, using the contextual product kernel (Equation 6), is $O(k^2 c\, \ell\, T^2)$. This applies to the WK, CK, and WCK top-k ranking kernels, where $\ell$ is the number of local search evaluations.

The proof of Theorem 2, provided in Appendix B, demonstrates the efficiency gains from combining feature representations with iterative algorithms, reducing computational time from $O(T^4)$ to $O(T^2)$.
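As a concrete sketch of the iterative approach (an illustration with hypothetical names, not the paper's code), the snippet below solves $(K_{X_t} + \sigma^2 I)x = v$ with conjugate gradients while never forming $K_{X_t}$: the kernel matrix is represented implicitly as $\Phi\Phi^T$, where the rows of $\Phi$ would be the sparse ranking features of Section 2 (here a placeholder dense matrix). Each CG iteration costs one MVM, i.e., time proportional to the non-zeros of $\Phi$ rather than $O(t^2)$.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def posterior_solve(Phi, v, noise=0.1):
    """Solve (K + noise^2 I) x = v with conjugate gradients (CG), where
    K = Phi @ Phi.T and each row of Phi is the feature vector of one
    context-ranking pair. K is never materialized: CG only needs the
    matrix-vector product u -> Phi @ (Phi.T @ u) + noise**2 * u."""
    t = Phi.shape[0]
    op = LinearOperator(
        (t, t), matvec=lambda u: Phi @ (Phi.T @ u) + noise**2 * u)
    x, info = cg(op, v)        # K + noise^2 I is SPD, so CG applies
    if info != 0:
        raise RuntimeError("CG did not converge")
    return x
```

In GP-TopK, this solve is repeated each round, which is where the overall reduction in total running time comes from.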
This is a substantial improvement, as even a single MVM per round with the full kernel matrix $K_{X_t}$ would require $O(T^3)$ total compute time. Additionally, the theorem shows that the running time of the GP-TopK algorithm does not explicitly depend on the number of items $n$.

Regret analysis. The cumulative regret is $R_T = \sum_{t=1}^{T} \big( \max_{\pi^* \in \Pi_k} \hat f(c_t, \pi^*) - \hat f(c_t, \pi_t) \big)$, where $\pi_t$ is the ranking chosen at round $t$. Optimizing cumulative regret for top-k recommendations is challenging, as it requires learning the context-arm relationship and matching the best possible mapping. To bound cumulative regret, regularity assumptions are essential, as noted in prior work [27, 14]. We consider the following two assumptions, either of which suffices; below, $X := C \times \Pi_k$.

Assumption 1. $X$ is finite, meaning that only finitely many contexts are considered ($|C| < \infty$), and the reward function $\hat f$ is sampled from the GP prior with noise variance $\xi^2$.

Assumption 2. $X$ is arbitrary and the reward function $\hat f$ has bounded RKHS norm for the kernel $k$, i.e., $\|f\|_k \le B$. The reward noises $\epsilon_t$ form an arbitrary martingale difference sequence (i.e., reward noise does not systematically depend on its past values) and are uniformly bounded by $\xi$.

The following theorem establishes the regret bound for the GP-TopK algorithm under Assumption 1 or 2.

Theorem 3. If either Assumption 1 or 2 holds, setting $\beta_t$ as $2 \log(|C|\, |\Pi_k|\, t^2 \pi^2 / 6\delta)$ and $2B^2 + 300\, \gamma_t \ln^3(t/\delta)$ respectively, the cumulative regret $R_T$ of the GP-TopK bandit algorithm for top-k recommendations can, with probability at least $1 - \delta$, be bounded by $O^*\big(n \sqrt{C_1 T c\, (\log|C| + k + \log(T^2\pi^2/6\delta))}\big)$ under Assumption 1, and $O^*\big(n \sqrt{C_1 (2B^2 c + 300\, n^2 c^2 \ln^3(T/\delta))\, T}\big)$ under Assumption 2. Here, $C_1 = 8/\log(1 + \xi^{-2})$, and $O^*$ suppresses logarithmic factors in $n$, $k$, and $T$.

Appendix B.4 provides the proof, leveraging the insight that $\log\det(I + \xi^{-2} K_{X_T})$ for any set $X_T$ can be effectively bounded using the finite-dimensional feature vectors introduced in this work.
Specifically, Proposition 2 utilizes the feature vectors from Section 2. Building on Proposition 2, Theorem 3 establishes that the cumulative regret of the GP-TopK bandit algorithm grows sub-linearly in $T$ with high probability under both assumptions. Furthermore, this result underscores the importance of using top-k ranking kernels, which improve the asymptotic order in terms of $n$ by factors of $n^{k/2-1}$ and $n^{k-1}$ under Assumptions 1 and 2, respectively, compared to Srinivas et al. [27]. This improvement is substantial even for small values of $k$, such as $k = 6$, as shown in Table 3.

Table 3: Comparison with Srinivas et al. (2010) on regret bounds for the bandit algorithm under both assumptions. Definitions of notations are provided in the main text.

  Assumption 1 -- Srinivas et al. (2010): $n^{k/2}\sqrt{C_1 T c\, (\log|C| + k + \log T^2\pi^2)}$; Proposed GP-TopK: $n\sqrt{C_1 T c\, (\log|C| + k + \log T^2\pi^2)}$
  Assumption 2 -- Srinivas et al. (2010): $\sqrt{C_1 T c\, (2B^2 + 300\, n^k c \ln^3 T)}$; This work: $\sqrt{C_1 T c\, (2B^2 + 300\, n^2 c \ln^3 T)}$

5 Experiments

This section empirically evaluates the proposed GP-TopK bandit algorithms for top-k recommendations using a simulation based on the MovieLens dataset [4]. Reliance on simulation for evaluating bandit algorithms is prevalent in the literature; it stems from the difficulty of conducting online evaluations in real-world bandit scenarios, especially when the arms are combinatorial [28]. Next, we provide details of the simulation setup and the reward settings considered, followed by results on empirical regret for small and large numbers of arms, respectively.

Simulation setup and reward settings. The bandit simulation setup follows the framework outlined by Jeunen et al. [8], utilizing real-world datasets of user-item interactions. Specifically, we train user and item embeddings using a collaborative filtering approach [6]. The user embeddings are accessed by the bandit algorithms as context embeddings, while the item embeddings remain hidden.
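As an illustrative sketch of this embedding step (not the exact procedure of [6], which additionally weights implicit feedback by confidence; all names here are hypothetical), one can obtain such embeddings by alternating ridge regressions on a binary interaction matrix:

```python
import numpy as np

def train_embeddings(R, dim=8, iters=20, reg=0.1, seed=0):
    """Alternating least squares on a user-item interaction matrix R
    (shape: users x items), a simplified stand-in for implicit-feedback
    collaborative filtering. Returns user embeddings U and item
    embeddings V such that R is approximated by U @ V.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    V = rng.normal(scale=0.1, size=(n_items, dim))
    reg_eye = reg * np.eye(dim)
    for _ in range(iters):
        # Ridge solve for user factors with item factors fixed, then vice versa.
        U = R @ V @ np.linalg.inv(V.T @ V + reg_eye)
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg_eye)
    return U, V
```

In the simulator, U would then be exposed to the bandit algorithm as contexts, while V stays hidden inside the reward function.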
In the non-contextual setup, the first user from the dataset is chosen as a fixed context throughout the bandit algorithm run, allowing us to use the same reward functions as in the contextual setup. To set up the reward functions, we use a similarity function $s(c, \theta) := \varsigma(a\, (c^T \theta) - b)$ between user and item embeddings, where $a$ and $b$ are scale and shift scalars, respectively. The sigmoid function $\varsigma$ maps similarity scores to the range $(0, 1)$, enhancing the interpretability of the reward signal [31]. We set $a = 6$ and $b = 0.3$ to fully utilize the range of the similarity function, as assessed by evaluating its value for many arms. We define two preliminary reward functions based on the similarity function $s$. The first is the DCG metric, $\hat f_{dcg}(c, \pi) = \sum_{i=1}^{k} \frac{1}{\log_2(i+1)}\, s(c, \theta_{\pi_i})$, where $c$ and $\theta_{\pi_i}$ denote the context and item embeddings, respectively. The second is the diversity measure, $\hat f_{div}(\pi) = \frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} \theta_{\pi_j}^T \theta_{\pi_i}$. These metrics quantify the relevance and diversity of top-k recommendations, respectively. We use these functions in two contextual reward settings. The first uses normalized DCG (n-DCG), $\hat f_{ndcg}(c, \pi) = \hat f_{dcg}(c, \pi) / \max_{\pi'} \hat f_{dcg}(c, \pi')$ [7]. The second combines $\hat f_{ndcg}$ and $\hat f_{div}$ as $\hat f_{ndcgdiv}(c, \pi) = \lambda \hat f_{ndcg}(c, \pi) + (1 - \lambda) \hat f_{div}(\pi)$, evaluating the aggregate effect of relevance and diversity. We set $\lambda = 0.75$ to emphasize relevance over diversity.

Evaluation for small arm space. This section presents empirical results for the cumulative regret of bandit algorithms with a limited number of arms. Specifically, with $n = 20$ and $k = 3$, there are 6,840 top-k rankings, allowing an exhaustive search to optimize the acquisition function. All bandit algorithms run in batch mode, updating every five rounds. We consider both reward settings in contextual and non-contextual scenarios, using a subset of five users for the contextual setting.
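The reward settings above can be reproduced in a few lines. The sketch below uses the same $a = 6$, $b = 0.3$ defaults; the function names are hypothetical, and the n-DCG normalization (a max over rankings) is omitted:

```python
import numpy as np

def similarity(c, theta, a=6.0, b=0.3):
    """s(c, theta) = sigmoid(a * (c^T theta) - b), mapping scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(a * np.dot(c, theta) - b)))

def dcg_reward(c, item_embs):
    """DCG-style reward: position-discounted sum of user-item similarities;
    the item at (0-based) display position i is discounted by 1/log2(i + 2)."""
    return sum(similarity(c, theta) / np.log2(i + 2)
               for i, theta in enumerate(item_embs))

def diversity_reward(item_embs):
    """Average pairwise inner product of the displayed items' embeddings,
    as in f_div (for unit-norm embeddings, smaller means more diverse)."""
    th = np.stack(item_embs)
    return float((th @ th.T).mean())

def combined_reward(c, item_embs, lam=0.75):
    """lambda * relevance + (1 - lambda) * diversity (un-normalized DCG)."""
    return lam * dcg_reward(c, item_embs) + (1 - lam) * diversity_reward(item_embs)
```

Note that the simulator only ever returns the single scalar `combined_reward` (or its n-DCG-only variant) for a full top-k ranking, matching the full-bandit feedback setting.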
Several baselines are included to assess the benefits of the ranking (Kendall) kernels. Section C gives the remaining hyper-parameter configurations and describes the other baseline bandit algorithms.

Figure 2: Comparative evaluation of bandit algorithms: the cumulative regret $R_T$ over $T$ rounds is shown; lower values indicate better performance. Plots (a) and (b) represent non-contextual settings for n-DCG ($\hat f_{ndcg}$) and n-DCG + diversity ($\hat f_{ndcgdiv}$) rewards, respectively. Plots (c) and (d) show results for contextual settings with five users using the same rewards. The y-axis for (a) and (b) is on the left, and for (c) and (d) on the right. The GP-TopK algorithm with Kendall kernels, especially the weighted convolutional Kendall (WCK) kernel, outperforms the others. Details on the other algorithms are in the text. Results are averaged over six trials.

The Random algorithm recommends any k items uniformly at random. The ϵ-greedy algorithm alternates between recommending a random top-k ranking with probability ϵ and choosing the top-k ranking with the highest observed mean reward. In contextual settings, ϵ-greedy differentiates arms for each unique context. Similarly, MAB-UCB treats each ranking as an independent arm, equivalent to using a delta kernel for the GP together with the UCB acquisition function; in contextual scenarios, MAB-UCB also treats arms distinctly per context. Each variant of the top-k ranking kernel yields one variant of the proposed GP-TopK algorithm, namely WK, CK, and WCK. Figure 2 presents the empirical cumulative regret for the baselines and the proposed GP-TopK algorithms. In all cases, across both reward settings and in both contextual and non-contextual setups, the variants of the proposed GP-TopK algorithm outperform the baselines that do not use Kendall kernels, highlighting the significance of top-k ranking kernels for full-bandit feedback.
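The ϵ-greedy baseline described above admits a minimal sketch (non-contextual, arms as hashable tuples; the reward function stands in for the simulator):

```python
import random
from collections import defaultdict

def epsilon_greedy(T, arms, reward_fn, eps=0.1, seed=0):
    """epsilon-greedy over top-k rankings: with probability eps recommend a
    random ranking (explore); otherwise recommend the ranking with the
    highest observed mean reward (exploit)."""
    rng = random.Random(seed)
    totals = defaultdict(float)   # sum of observed rewards per arm
    counts = defaultdict(int)     # number of pulls per arm
    picks = []
    for _ in range(T):
        if rng.random() < eps or not counts:
            arm = rng.choice(arms)
        else:
            arm = max(counts, key=lambda a: totals[a] / counts[a])
        r = reward_fn(arm)
        totals[arm] += r
        counts[arm] += 1
        picks.append(arm)
    return picks
```

Because this baseline shares no information across arms, every one of the $\Theta(n^k)$ rankings must be tried separately, which is precisely the weakness the ranking kernels address.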
Specifically, the CK and WCK kernels significantly outperform the WK kernel in terms of converged regret values, with the WCK kernel further improving on the CK variant.

Figure 3: Comparative evaluation of bandit algorithms (ϵ-greedy, MAB, WK, CK, WCK) for large arm spaces, with more than $1.1 \times 10^5$ arms in the left plot and more than $1.1 \times 10^{10}$ arms in the right plot. Cumulative regret with respect to the rounds of the bandit algorithm is depicted. Results are averaged over six trials. In both settings, the WCK approach outperforms the other baselines. For more details, see the textual description.

Evaluation for large arm space. We evaluate bandit algorithms in a large arm space with $n = 50$ and $k = 3$ or $k = 5$, resulting in more than $1.1 \times 10^5$ and $1.1 \times 10^{10}$ possible top-k rankings, respectively. We use local search and focus on the n-DCG reward; the remaining configuration is consistent with the small arm space setup. The local search uses 10 restarts and 5 steps in each search direction, starting with 1000 initial candidates. Figure 3 shows that the regret of the GP-TopK variants remains consistently lower even with a large arm space, despite the use of local search. The WCK approach significantly outperforms CK, especially for $k = 5$, as illustrated in the right plot of Figure 3. Additional empirical results on the effectiveness of local search in a large arm space and for other rewards are given in Appendix C.

6 Discussion

This work develops a contextual bandit algorithm for top-k recommendations using Gaussian processes with Kendall kernels in a full-bandit feedback setting, without restrictive assumptions about feedback or reward models. Gaussian processes provide computationally efficient model updates for accumulated feedback data, although inference can be challenging. We address this by deriving features for Kendall kernels tailored to top-k rankings, resulting in a faster inference algorithm that reduces complexity from $O(T^4)$ to $O(T^2)$.
While demonstrated here for the product kernel between contexts and top-k rankings, these computational improvements extend naturally to other kernel types, such as additive kernels. Additionally, we address limitations of known variants and propose a more expressive Kendall kernel for top-k recommendations. Finally, we provide both theoretical and empirical results demonstrating the improved performance of the proposed GP-TopK algorithm.

Future directions and limitations. This work opens several research avenues. Efficient matrix-vector multiplication with Kendall kernel matrices can enable faster bandit algorithms with various acquisition functions, such as Thompson sampling and expected improvement. Exploring other kernels, like Mallows kernels, for top-k rankings and developing efficient algorithms for them is an intriguing direction, especially since the effectiveness of our algorithm depends on the function space induced by the RKHS of the underlying kernel. Assessing how well these kernels approximate various reward functions for top-k recommendations would provide valuable insights. Exploring other bandit problem settings, such as stochastic item availability or delayed feedback, would enhance the applicability of this work to more complex scenarios. Extending the finite-dimensional GP framework to other acquisition functions using local search is another promising direction. One limitation of our regret analysis is that it does not account for approximations in the arm selection step due to local search [20]. This limitation is common in continuous domains, where optimizing acquisition functions often involves non-convex optimization [27].

Impact. This research advances bandit algorithms for top-k item recommendations. By improving recommendation efficiency and accuracy, our algorithms can enhance user experiences across platforms, promoting content relevancy and engagement.
However, they may reinforce implicit biases in training data, limiting content diversity and entrenching prejudices. Therefore, monitoring over time is essential when deploying these algorithms in real-world environments.

References

[1] Róbert Busa-Fekete, Balázs Szörényi, Paul Weng, and Shie Mannor. Multi-objective bandits: Optimizing the generalized Gini index. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 625–634. PMLR, 2017.
[2] Aryan Deshwal, Syrine Belakaria, Janardhan Rao Doppa, and Dae Hyun Kim. Bayesian optimization over permutation spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6515–6523, 2022.
[3] Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 7576–7586, 2018.
[4] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5:1–19, 2015.
[5] David Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California Santa Cruz, 1999.
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 8th IEEE International Conference on Data Mining (ICDM), pages 263–272. IEEE, 2008.
[7] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20:422–446, 2002.
[8] Olivier Jeunen and Bart Goethals. Top-k contextual bandits with equity of exposure. In Proceedings of the 15th ACM Conference on Recommender Systems (RecSys), pages 310–320, 2021.
[9] Yunlong Jiao and Jean-Philippe Vert. The Kendall and Mallows kernels for permutations.
In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1935–1944. PMLR, 2015.
[10] Yunlong Jiao and Jean-Philippe Vert. The weighted Kendall and high-order kernels for permutations. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 2314–2322. PMLR, 2018.
[11] Michael N. Katehakis and Arthur F. Veinott. The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research, 12:262–268, 1987.
[12] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[13] Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22:441–504, 2011.
[14] Andreas Krause and Cheng Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NeurIPS) 24, pages 2447–2455, 2011.
[15] Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems – a survey. Knowledge-Based Systems, 123:154–162, 2017.
[16] Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 767–776. PMLR, 2015.
[17] E. Lex, Dominik Kowald, Paul Seitlinger, Thi Ngoc Trang Tran, Alexander Felfernig, and Markus Schedl. Psychology-informed recommender systems. Foundations and Trends in Information Retrieval, 15:134–242, 2021.
[18] Zeyang Liu, Yiqun Liu, Ke Zhou, Min Zhang, and Shaoping Ma. Influence of vertical result in web search examination. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 193–202, 2015.
[19] Changyong Oh, Jakub Tomczak, Efstratios Gavves, and Max Welling.
Combinatorial Bayesian optimization using the graph Cartesian product. In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.
[20] My Phan, Yasin Abbasi-Yadkori, and Justin Domke. Thompson sampling and approximate inference. In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.
[21] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469, 2014.
[22] Carl Edward Rasmussen. Evaluation of Gaussian processes and other methods for non-linear regression. PhD thesis, University of Toronto, Toronto, Canada, 1997.
[23] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004.
[24] Idan Rejwan and Yishay Mansour. Top-k combinatorial bandits with full-bandit feedback. In Algorithmic Learning Theory, pages 752–776. PMLR, 2020.
[25] Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
[26] Dong-Her Shih, Kuan-Chu Lu, and Po-Yuan Shih. Exploring shopper's browsing behavior and attention level with an EEG biosensor cap. Brain Sciences, 9:301, 2019.
[27] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 1015–1022, 2010.
[28] Hastagiri P. Vanchinathan, Isidor Nikolic, Fabio De Bona, and Andreas Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys), pages 225–232, 2014.
[29] Richard S. Varga. Geršgorin and His Circles. Springer, 2004.
[30] Siwei Wang and Wei Chen. Thompson sampling for combinatorial semi-bandits.
In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 5114–5122, 2018.
[31] Yingfei Wang, Hua Ouyang, Chu Wang, Jianhui Chen, Tsvetan Asamov, and Yi Chang. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, pages 2746–2753, 2017.
[32] Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. Beyond ranking: Optimizing whole-page presentation. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 103–112, 2016.
[33] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems (NeurIPS) 9, pages 514–520, 1996.
[34] Dong Xin, Jiawei Han, and Kevin Chen-Chuan Chang. Progressive and selective merge: Computing top-k with ad-hoc ranking functions. In ACM SIGMOD Conference, pages 103–114, 2007.

A Kendall Kernels for Full and Top-k Rankings: Omitted Details

This section includes the proofs that were omitted from Section 2, presented in the following order. In Section A.1, we present proofs for Claims 2 and 3, which concern the feature representations of Kendall kernels. In Section A.2, we provide Algorithm 2 and a proof of its correctness for computing the WCK kernel in O(k^2) time, thereby proving Claim 1. Additionally, we extend this proof to cover the correctness of Algorithm 3, which computes the CK kernel in O(k log k) time and was initially introduced by Jiao and Vert [9]; the original paper presented the algorithm without a formal proof of correctness, a gap we fill in this section. Section A.3 details the proof of Theorem 1, concerning matrix-vector multiplication with the Kendall kernel matrix for top-k rankings. This proof builds on Algorithm 2 for computing the WCK kernel for top-k rankings.
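Before the formal proofs, the feature-map identity behind Claim 2 can be checked numerically on full rankings. The sketch below is our own illustration (the function names are not from the paper): it builds a pairwise Kendall feature vector and verifies that, with uniform weights, the feature inner product reduces to Kendall's tau, while a product-symmetric weight ws(a, b) = u(a)·u(b) stays within [-1, 1].

```python
import itertools
import math
import random

def feature_vector(ranks, ws):
    """phi_{i,j}(sigma) = ws(p_i, p_j) * sign(p_i - p_j) / sqrt(C(n, 2)),
    one entry per unordered item pair (i, j); ranks[i] is the rank of item i."""
    n = len(ranks)
    norm = math.sqrt(math.comb(n, 2))
    sign = lambda x: (x > 0) - (x < 0)
    return [ws(ranks[i], ranks[j]) * sign(ranks[i] - ranks[j]) / norm
            for i, j in itertools.combinations(range(n), 2)]

def weighted_kendall(r1, r2, ws):
    """Weighted Kendall kernel evaluated as a feature inner product (Claim 2)."""
    f1, f2 = feature_vector(r1, ws), feature_vector(r2, ws)
    return sum(a * b for a, b in zip(f1, f2))

def kendall_tau(r1, r2):
    """Direct definition: (concordant - discordant) / C(n, 2)."""
    n = len(r1)
    c = 0
    for i, j in itertools.combinations(range(n), 2):
        c += 1 if (r1[i] > r1[j]) == (r2[i] > r2[j]) else -1
    return c / math.comb(n, 2)

random.seed(0)
n = 7
r1 = random.sample(range(1, n + 1), n)  # rank vectors of two random permutations
r2 = random.sample(range(1, n + 1), n)

# Uniform weights ws == 1: the feature inner product must equal Kendall's tau.
assert abs(weighted_kendall(r1, r2, lambda a, b: 1.0) - kendall_tau(r1, r2)) < 1e-12

# A product-symmetric weight, e.g. hyperbolic u(l) = 1/l.
u = lambda l: 1.0 / l
k_w = weighted_kendall(r1, r2, lambda a, b: u(a) * u(b))
assert -1.0 <= k_w <= 1.0  # Cauchy-Schwarz, since ws <= 1 here
print("tau =", kendall_tau(r1, r2), " weighted kernel =", k_w)
```

With uniform weights the self-kernel is exactly 1, matching the normalization by C(n, 2).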
A.1 Feature Representation for Kendall Kernels for Top-k Rankings

This section revisits the claims regarding the feature representations of the weighted Kendall kernel and the weighted convolutional Kendall kernel, and provides the proofs of these claims.

Claim 2. Let ϕ^{wk}(σ) : Π_n → R^{\binom{n}{2}} be a vector indexed by unique item pairs (i, j), defined as

ϕ^{wk}_{i,j}(σ) := (1/√\binom{n}{2}) w_s(p^σ_i, p^σ_j) s^σ_{ij},  where s^σ_{ij} := 1_{p^σ_i < p^σ_j} − 1_{p^σ_i > p^σ_j},

and w_s is the symmetric weighting function in the product-symmetric weights. Then ϕ^{wk} is a linear feature vector for the weighted Kendall kernel with product-symmetric weights w_ps.

Proof. Following the definition of a linear feature representation, we need to prove that k^{wk}(σ1, σ2) = ϕ^{wk}(σ1)^T ϕ^{wk}(σ2) for the product-symmetric weight kernel as given in Equation 5. Expanding the inner product over item pairs,

ϕ^{wk}(σ1)^T ϕ^{wk}(σ2) = (1/\binom{n}{2}) Σ_{i<j} w_s(p^{σ1}_i, p^{σ1}_j) w_s(p^{σ2}_i, p^{σ2}_j) s^{σ1}_{ij} s^{σ2}_{ij},

which matches k^{wk}(σ1, σ2) as given in Equation 2 with product-symmetric weights, completing the proof.

Claim 3. Let ϕ^{wck}(π) : Π_k → R^{\binom{n}{2}} be a vector indexed by unique item pairs (i, j), defined as ϕ^{wck}_{i,j}(π) := (1/√\binom{n}{2}) w^{wck}_{i,j}(π) s^π_{ij} (with the convention that items outside the top-k are ranked below all top-k items, so s^π_{ij} is well defined), where w^{wck}_{i,j}(π) is determined as follows:

w^{wck}_{i,j}(π) = w_s(p^π_i, p^π_j) if p^π_i ∈ [k] and p^π_j ∈ [k];  w_s(p^π_i, ·) if p^π_i ∈ [k] and p^π_j ∉ [k];  w_s(p^π_j, ·) if p^π_i ∉ [k] and p^π_j ∈ [k];  0 otherwise,

where w_s denotes the symmetric weights and w_s(ℓ, ·) := (1/(n−k)) Σ^n_{j=k+1} w_s(ℓ, j). Then the vector ϕ^{wck} is a linear feature vector for the WCK kernel k^{wck}. By uniformly setting w_s(·, ·) ≡ 1 in the definitions above, ϕ^{wck}_{i,j}(π) specializes to a linear feature vector for the CK kernel.

Proof. The main idea is to leverage the feature representation of the weighted Kendall kernel for a full ranking together with the linearity of the convolution operation. It is already established that k^{wk}(σ1, σ2) = ϕ^{wk}(σ1)^T ϕ^{wk}(σ2), as demonstrated in Claim 2.
Recall that the WCK kernel requires a double summation over pairs of rankings from B_{π1} and B_{π2}, the sets of full rankings consistent with the respective top-k rankings, as described in Equation 4. We simplify the WCK kernel as follows:

k^{wck}(π1, π2) = (1/|B_{π1}|)(1/|B_{π2}|) Σ_{σ1 ∈ B_{π1}} Σ_{σ2 ∈ B_{π2}} ϕ^{wk}(σ1)^T ϕ^{wk}(σ2)
= [ (1/|B_{π1}|) Σ_{σ1 ∈ B_{π1}} ϕ^{wk}(σ1) ]^T [ (1/|B_{π2}|) Σ_{σ2 ∈ B_{π2}} ϕ^{wk}(σ2) ] =: ϕ^{wck}(π1)^T ϕ^{wck}(π2).

This simplification reveals that the feature representation ϕ^{wck} for the WCK kernel is a \binom{n}{2}-dimensional vector indexed by unique pairs of items (i, j), much like ϕ^{wk}. However, the summation defining each entry runs over an exponentially large number of consistent rankings. We therefore derive the analytical value of each entry by examining four cases, depending on whether the two items fall within the top-k ranking. Throughout, write s^σ_{ij} := 1_{p^σ_i < p^σ_j} − 1_{p^σ_i > p^σ_j}.

In Case 1, we examine the scenario where both items i and j are within the top-k ranking π, i.e., both elements are ranked among the top-k positions.

Case 1: p^π_i ∈ [k] and p^π_j ∈ [k].

ϕ^{wck}_{i,j}(π) = (1/|B_π|) Σ_{σ ∈ B_π} (1/√\binom{n}{2}) w_s(p^σ_i, p^σ_j) s^σ_{ij}
= (1/|B_π|) (1/√\binom{n}{2}) w_s(p^π_i, p^π_j) Σ_{σ ∈ B_π} s^σ_{ij}
= (1/|B_π|) (1/√\binom{n}{2}) w_s(p^π_i, p^π_j) |B_π| s^π_{ij}
= (1/√\binom{n}{2}) w_s(p^π_i, p^π_j) s^π_{ij}. (8)

The simplification in the second and third lines follows from the fact that, for any full ranking σ ∈ B_π consistent with the top-k ranking π, the relative ranks and weights of items i and j remain unchanged, given p^π_i ∈ [k] and p^π_j ∈ [k].

Case 2: p^π_i ∈ [k] and p^π_j ∉ [k]. Here s^σ_{ij} = 1 − 0 = 1 for every σ ∈ B_π, since item i is ranked in the top-k while item j is not, so

ϕ^{wck}_{i,j}(π) = (1/|B_π|) (1/√\binom{n}{2}) Σ_{σ ∈ B_π} w_s(p^π_i, p^σ_j).

Next, every possible consistent ranking is considered jointly while fixing the rank of item j to a specific position outside the top-k, which leaves (n − k − 1)! rankings for each such position.
Given that |B_π| = (n − k)!, we can refine the above expression as follows:

ϕ^{wck}_{i,j}(π) = (1/|B_π|) (1/√\binom{n}{2}) Σ^n_{l=k+1} w_s(p^π_i, l) (n − k − 1)!
= (1/√\binom{n}{2}) (1/(n−k)) Σ^n_{l=k+1} w_s(p^π_i, l) = (1/√\binom{n}{2}) w_s(p^π_i, ·). (9)

In Case 3, we analyze when item i is not in the top-k ranking while item j is.

Case 3: p^π_i ∉ [k] and p^π_j ∈ [k]. The simplification follows analogously to Case 2, the only change being s^σ_{ij} = −1 instead of 1 (where s^σ_{ij} := 1_{p^σ_i < p^σ_j} − 1_{p^σ_i > p^σ_j}). Thus, by symmetry between i and j, we have:

ϕ^{wck}_{i,j}(π) = −(1/√\binom{n}{2}) w_s(·, p^π_j) = −(1/√\binom{n}{2}) w_s(p^π_j, ·) (using the symmetry of w_s). (10)

Lastly, in Case 4, we analyze when neither item i nor item j is in the top-k ranking.

Case 4: p^π_i ∉ [k] and p^π_j ∉ [k].

ϕ^{wck}_{i,j}(π) = (1/|B_π|) Σ_{σ ∈ B_π} ϕ^{wk}_{i,j}(σ) = (1/|B_π|) (1/√\binom{n}{2}) Σ_{σ ∈ B_π} w_s(p^σ_i, p^σ_j) s^σ_{ij} = 0 (by symmetry). (11)

The result of zero arises from symmetry. Since neither item is in the top-k ranking, i and j are treated symmetrically in the summation over all rankings in B_π. For any ranking σ with p^σ_i = l and p^σ_j = m, there exists a corresponding ranking σ′ in which only the items i and j are swapped. Jointly, these two rankings yield w_s(l, m) s^σ_{ij} and −w_s(l, m) s^σ_{ij}; since w_s is symmetric, the overall contribution from each such pair of rankings is zero. Hence, the entire summation nets to zero.

Thus, combining the results from Equations 8, 9, 10, and 11 validates Claim 3, i.e., ϕ^{wck}_{i,j}(π) = (1/√\binom{n}{2}) w^{wck}_{i,j}(π) s^π_{ij} for every unique pair of items. From Case 4, O((n − k)^2) entries vanish, leaving at most O(k^2 + 2nk) non-zero entries.

A.2 Algorithms for Computing Kendall Kernels for Top-k Rankings

In this section, we provide and prove Algorithms 2 and 3 for the weighted convolutional Kendall kernel and the convolutional Kendall kernel, as previously discussed in Section 2.
Section A.2.1 validates both the correctness and the computational complexity of Algorithm 2, as stated earlier in Claim 1. Following this, Section A.2.2 revisits Algorithm 3, initially introduced by Jiao and Vert [10]. The original publication presented the algorithm without a formal proof of correctness, which we provide in Section A.2.2.

A.2.1 Efficiently Computing the Weighted Convolutional Kendall Kernel

This section provides a proof of Claim 1, establishing the efficiency and correctness of Algorithm 2 for computing the weighted convolutional Kendall kernel, as specified in Equation 4, with a focus on its computational complexity.

Claim 1. The weighted convolutional Kendall kernel (Equation 4) with product-symmetric rank weights (Equation 5) can be computed in O(k^2) time.

Proof. The claim is proven through Algorithm 2, for which we establish correctness and show that its computation requirement is O(k^2). The essence of the proof centers on the feature representation of the WCK kernel, ϕ^{wck}, as outlined in Claim 3. The feature vectors ϕ^{wck} reside in a \binom{n}{2}-dimensional space, indexed by pairs of items. Our approach is to show that Algorithm 2 computes the right-hand side of the equation k^{wck}(π1, π2) = ϕ^{wck}(π1)^T ϕ^{wck}(π2), i.e., the summation over item pairs k^{wck}(π1, π2) = Σ_{l<m} ϕ^{wck}_{l,m}(π1) ϕ^{wck}_{l,m}(π2), which the algorithm organizes into the five partial sums s_1, ..., s_5 according to whether the items l and m belong to the intersection I1 ∩ I2, to exactly one of the item sets, or to neither. For the analysis below, σ1 denotes the ranking of the items of I1 under π1 and τ1 the ranking of the items of I1 ∩ I2 induced by π1, so that the relative orders within I1 ∩ I2 agree, i.e., p^{σ1}_i > p^{σ1}_j iff p^{τ1}_i > p^{τ1}_j. Analogously, σ2 and τ2 are constructed from the set I2 and ranking π2.

Algorithm 2: Computing the Weighted Convolutional Kendall Kernel
Input: Two top-k rankings π1, π2 ∈ Π_k; a rank weighting function w_s : [n] × [n] → R and its one-dimensional marginals w_s(ℓ, ·) = (1/(n−k)) Σ^n_{j=k+1} w_s(ℓ, j).
Output: The weighted convolutional Kendall kernel k^{wck}(π1, π2).
Let I1 and I2 be the sets of items in rankings π1 and π2, respectively.
1: if |I1 ∩ I2| ≥ 2 then
2:   s1(π1, π2) = (1/\binom{n}{2}) Σ_{l<m: l,m ∈ I1∩I2} w_s(p^{π1}_l, p^{π1}_m) w_s(p^{π2}_l, p^{π2}_m) s^{π1}_{lm} s^{π2}_{lm}
3: end if
4: if |I1 ∩ I2| ≥ 1 and |I1 \ I2| ≥ 1 then
5:   s2(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1∩I2, m ∈ I1\I2} w_s(p^{π1}_l, p^{π1}_m) w_s(p^{π2}_l, ·) s^{π1}_{lm}
6: end if
7: if |I1 ∩ I2| ≥ 1 and |I2 \ I1| ≥ 1 then
8:   s3(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1∩I2, m ∈ I2\I1} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, p^{π2}_m) s^{π2}_{lm}
9: end if
10: if |I1 \ I2| ≥ 1 and |I2 \ I1| ≥ 1 then
11:   s4(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1\I2, m ∈ I2\I1} w_s(p^{π1}_l, ·) w_s(p^{π2}_m, ·)
12: end if
13: if |I1 ∩ I2| ≥ 1 and |[n] \ (I1 ∪ I2)| ≥ 1 then
14:   s5(π1, π2) = (1/\binom{n}{2}) (n − |I1 ∪ I2|) Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·)
15: end if
16: k^{wck}(π1, π2) = s1(π1, π2) + s2(π1, π2) + s3(π1, π2) + s4(π1, π2) + s5(π1, π2)

Here s^π_{lm} := 1_{p^π_l < p^π_m} − 1_{p^π_l > p^π_m}, and any s_i whose condition fails is set to 0.

Case 1: The pair (l, m) falls within the union I1 ∪ I2, leading to three distinct sub-cases. Below, we derive the s_i term for each sub-case, as summarized in Table 4.

Case 1-a: Both items are in I1 ∩ I2, i.e., both l and m belong to I1 ∩ I2. Using Claim 3 regarding the feature vector ϕ^{wck}, s1 simplifies to:

s1(π1, π2) = Σ_{l<m: l,m ∈ I1∩I2} (1/√\binom{n}{2}) w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} · (1/√\binom{n}{2}) w_s(p^{π2}_l, p^{π2}_m) s^{π2}_{lm}. (12)

Case 1-b-i: l ∈ I1 ∩ I2 and m ∈ I1 \ I2 (or the reverse). Since m ∉ I2, Claim 3 gives ϕ^{wck}_{l,m}(π2) = (1/√\binom{n}{2}) w_s(p^{π2}_l, ·) with s^{π2}_{lm} = 1, while ϕ^{wck}_{l,m}(π1) = (1/√\binom{n}{2}) w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm}. The sum over such unordered pairs splits into two partial sums, u and v, according to which of the two items lies in the intersection; exchanging l and m in v, pushing the resulting sign into the indicator functions, and using the symmetry of w_s makes the summand of v identical to that of u. The two partial sums thus combine to simplify s2 as follows:

s2(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1∩I2, m ∈ I1\I2} w_s(p^{π1}_l, p^{π1}_m) w_s(p^{π2}_l, ·) s^{π1}_{lm}. (13)

Analogously, in Case 1-b-ii (l ∈ I1 ∩ I2 and m ∈ I2 \ I1), we deduce the corresponding term s3 through symmetry:

s3(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1∩I2, m ∈ I2\I1} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, p^{π2}_m) s^{π2}_{lm}. (14)

Case 1-c: Both items are outside I1 ∩ I2; specifically, l ∈ I1 \ I2 and m ∈ I2 \ I1, or the reverse. As in Case 1-b-i, we divide s4 into partial sums u and v, and calculate u under the condition that l ∈ I1 \ I2 and m ∈ I2 \ I1.
For the CK kernel (w_s ≡ 1), the partial sums for Case 1-b-i reduce to counting pairs: u = Σ_{l ∈ I1∩I2, m ∈ I1\I2} 1_{p^{π1}_l < p^{π1}_m} and v = Σ_{l ∈ I1∩I2, m ∈ I1\I2} 1_{p^{π1}_l > p^{π1}_m}.

Next, we examine the terms u and v in detail, starting with u. The term u counts, for each l ∈ I1 ∩ I2, the items of I1 \ I2 that are ranked after the l-th item. There are k − σ1(l) items following the l-th item in π1, and within the intersection I1 ∩ I2 the number of items following l is |I1 ∩ I2| − τ1(l). Therefore, (k − σ1(l)) − (|I1 ∩ I2| − τ1(l)) counts the elements of I1 \ I2 positioned after l. Combining the calculations for both terms u and v, the s2 term for the CK kernel simplifies to:

s2(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1∩I2} [2(σ1(l) − τ1(l)) − k + |I1 ∩ I2|]. (18)

Using the symmetry between Case 1-b-i and Case 1-b-ii, we can simplify s3 for the CK kernel as follows:

s3(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1∩I2} [2(σ2(l) − τ2(l)) − k + |I1 ∩ I2|]. (19)

Simplifying the s4 and s5 terms. We simplify the s4 and s5 terms for the CK kernel starting from Equations 15 and 16, respectively, using w_s ≡ 1 (so that w_s(ℓ, ·) = 1):

s4(π1, π2) = (1/\binom{n}{2}) Σ_{l ∈ I1\I2, m ∈ I2\I1} w_s(p^{π1}_l, ·) w_s(p^{π2}_m, ·) = |I1 \ I2| · |I2 \ I1| / \binom{n}{2}, (20)

s5(π1, π2) = (1/\binom{n}{2}) (n − |I1 ∪ I2|) Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·) = |I1 ∩ I2| · |[n] \ (I1 ∪ I2)| / \binom{n}{2}. (21)

We have now obtained all the simplified s_i terms for the CK kernel in Equations 17, 18, 19, 20, and 21. Combining these terms gives k^{ck}(π1, π2) = Σ^5_{i=1} s_i(π1, π2), where each term s_i precisely matches its corresponding expression in Algorithm 3. This completes the proof of the correctness of Algorithm 3. Regarding its time complexity, each term s_i sums at most k^2 quantities, and each quantity can be computed in O(1) time. Therefore, the time required for Algorithm 3 to compute the CK kernel is O(k^2).
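For tiny n and k, the convolutional definition itself is tractable by brute force, which gives an independent check of the linearity identity used in Claim 3: averaging the full-ranking feature vectors over all consistent completions yields the same value as averaging pairwise kernels. The sketch below is our own illustrative code with simplified unweighted (CK-style) features, not the paper's algorithm.

```python
import itertools
import math

def sign(x):
    return (x > 0) - (x < 0)

def phi(order, n):
    """Unweighted Kendall feature vector of a full ranking (given as an item order)."""
    rank = {item: pos + 1 for pos, item in enumerate(order)}
    norm = math.sqrt(math.comb(n, 2))
    return [sign(rank[i] - rank[j]) / norm
            for i, j in itertools.combinations(range(n), 2)]

def consistent_full_rankings(topk, n):
    """All full orders whose first k positions agree with the top-k ranking."""
    rest = [i for i in range(n) if i not in topk]
    return [list(topk) + list(tail) for tail in itertools.permutations(rest)]

def ck_bruteforce(pi1, pi2, n):
    """CK kernel: average Kendall kernel over all pairs of consistent completions."""
    B1 = [phi(o, n) for o in consistent_full_rankings(pi1, n)]
    B2 = [phi(o, n) for o in consistent_full_rankings(pi2, n)]
    dots = [sum(a * b for a, b in zip(u, v)) for u in B1 for v in B2]
    return sum(dots) / len(dots)

def mean_phi(pi, n):
    """Average feature vector over consistent completions (Claim 3's route)."""
    B = [phi(o, n) for o in consistent_full_rankings(pi, n)]
    return [sum(col) / len(B) for col in zip(*B)]

n = 6
pi1, pi2 = (0, 3), (3, 1)  # two top-2 rankings over items 0..5
direct = ck_bruteforce(pi1, pi2, n)
f1, f2 = mean_phi(pi1, n), mean_phi(pi2, n)
# By linearity, the averaged features reproduce the double-average kernel exactly.
assert abs(direct - sum(a * b for a, b in zip(f1, f2))) < 1e-12
print("CK kernel:", direct)
```

The brute force costs ((n−k)!)^2 kernel evaluations, which is exactly the blow-up the O(k^2) algorithm avoids.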
A.3 Fast Matrix-Vector Multiplication with the Kendall Kernel Matrix on Top-k Rankings

This section revisits Theorem 1, concerning the fast matrix-vector multiplication time for the Kendall kernel matrix on top-k rankings. Specifically, we aim to eliminate mvm(K_X)'s dependence on the number of items n and to obtain linear dependence on the number of rounds T, as claimed in Theorem 1.

Theorem 1. For the WCK kernel with product-symmetric weights w_ps, the computational complexity of multiplying the kernel matrix K_{X_t} with any admissible vector is O(k^2 t), i.e., mvm(K_{X_t}) = O(k^2 t), where X_t is any arbitrary set of t top-k rankings.

Proof. The cornerstone of this proof is the computation of the WCK kernel as delineated in Algorithm 2, which requires only O(k^2) computation. For brevity, we write X for X_T; the proof holds for any X_t, i.e., any value of t, not just T. As suggested previously, we will demonstrate the factorization K_X = (Φ^a_X)^T Φ^b_X, where both matrices Φ^a_X and Φ^b_X have columns with only O(k^2) non-zero entries. Consequently, the computational complexity of matrix-vector multiplication, mvm(K_X), is O(k^2 T). From Algorithm 2, we know that each entry of the kernel matrix, k(π1, π2), can be expressed as the sum Σ^5_{i=1} s_i(π1, π2). Provided each s_i(π1, π2) equals ϕ^{ai}(π1)^T ϕ^{bi}(π2) for suitable vectors ϕ^{ai} and ϕ^{bi}, we can express K_X as (Φ^a_X)^T Φ^b_X, where the ith row of (Φ^a_X)^T and the jth column of Φ^b_X are [ϕ^{a1}(π_i)^T, ..., ϕ^{a5}(π_i)^T] and [ϕ^{b1}(π_j); ...; ϕ^{b5}(π_j)], respectively. Therefore, the overall mvm complexity is characterized by the sparsity of the vectors ϕ^{ai} and ϕ^{bi}, as formalized in the claim presented below.

Claim 5. Consider a kernel matrix K_X corresponding to any set X of cardinality T.
Each entry of K_X, denoted k(x1, x2), is defined by the sum Σ^5_{i=1} s_i(x1, x2), where each s_i(x1, x2) is the dot product ϕ^{ai}(x1)^T ϕ^{bi}(x2), and ϕ^{ai} and ϕ^{bi} are vectors with O(z) non-zero entries. Given this structure, the matrix-vector multiplication complexity for K_X is O(zT), i.e., mvm(K_X) = O(zT).

Proof. We demonstrate this by concentrating on the kth entry of the output vector K_X v, for any arbitrary vector v:

(K_X v)_k = Σ_j K_X(k, j) v_j = Σ_j Σ^5_{i=1} s_i(π_k, π_j) v_j = Σ^5_{i=1} Σ_j ϕ^{ai}(π_k)^T ϕ^{bi}(π_j) v_j = Σ^5_{i=1} ϕ^{ai}(π_k)^T ( Σ_j ϕ^{bi}(π_j) v_j ).

Given that, for all i, ϕ^{bi}(π_j) possesses only O(z) non-zero entries for any π_j, the vectors w_i := Σ_j ϕ^{bi}(π_j) v_j can be computed once, for all output entries, in O(zT) operations. Applying a similar rationale to ϕ^{ai}, each entry (K_X v)_k = Σ_i ϕ^{ai}(π_k)^T w_i then demands only O(z) additional operations. Extending this argument to all T entries of the output vector, computing K_X v requires only O(zT) computation.

Utilizing Claim 5, it suffices to complete the proof by showing that there exist vectors ϕ^{ai} and ϕ^{bi}, each with only O(k^2) non-zero elements, corresponding to each s_i as specified in Algorithm 2, such that s_i(π1, π2) = ϕ^{ai}(π1)^T ϕ^{bi}(π2). We next establish such vectors for all s_i terms, starting with the s1 term.

Showing s1(π1, π2) = ϕ^{a1}(π1)^T ϕ^{b1}(π2) for sparse ϕ^{a1}(π1) and ϕ^{b1}(π2). We begin by manipulating s1 as defined in Equation 12; for brevity, scalar factors are omitted in what follows. Writing s^π_{lm} := 1_{p^π_l < p^π_m} − 1_{p^π_l > p^π_m},

s1(π1, π2) = Σ_{l<m} [ w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} 1_{p^{π1}_l, p^{π1}_m ∈ [k]} ] · [ w_s(p^{π2}_l, p^{π2}_m) s^{π2}_{lm} 1_{p^{π2}_l, p^{π2}_m ∈ [k]} ] =: ϕ^{a1}(π1)^T ϕ^{b1}(π2), (22)

where ϕ^{a1}_{l,m}(π1) and ϕ^{b1}_{l,m}(π2) denote the first and second bracketed factors, respectively. Both ϕ^{a1} and ϕ^{b1} are sparse by design, taking non-zero values only when both l and m appear in the respective top-k rankings.
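The factored multiplication route of Claim 5 is easy to exercise numerically: compute K v as Φa^T (Φb v) without ever materializing the T × T kernel matrix. The sketch below uses random column-sparse feature matrices as stand-ins (our own illustration, not the actual Kendall features); with sparse storage, the parenthesized route costs O(zT) per product.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, z = 100, 2000, 20  # T rankings, feature dim ~ C(n,2), z non-zeros per column

def sparse_columns(T, d, z):
    """A d x T feature matrix whose columns each have exactly z non-zero entries."""
    Phi = np.zeros((d, T))
    for j in range(T):
        idx = rng.choice(d, size=z, replace=False)
        Phi[idx, j] = rng.standard_normal(z)
    return Phi

Phi_a, Phi_b = sparse_columns(T, d, z), sparse_columns(T, d, z)
v = rng.standard_normal(T)

# Claim 5's route: K v = Phi_a^T (Phi_b v), never forming the T x T matrix K.
fast = Phi_a.T @ (Phi_b @ v)

# Dense check on this small instance.
K = Phi_a.T @ Phi_b
assert np.allclose(fast, K @ v)
print(fast[:3])
```

Forming K explicitly costs O(z T^2) here, while the factored product touches each stored non-zero only once per multiplication.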
The derivation above demonstrates the existence of sparse vectors for the s1 term. Next, we establish the same for the s2 and s3 terms.

Showing sparse vectors for s2 and s3. We begin by manipulating s2 as defined in Equation 13, again ignoring its scalar factor, and we will exploit the symmetry between the s2 and s3 terms:

s2(π1, π2) = Σ_{l ∈ I1∩I2} w_s(p^{π2}_l, ·) Σ_{m ∈ I1\I2} w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm}
= Σ_{l ∈ I1∩I2} w_s(p^{π2}_l, ·) ( Σ_{m ∈ I1} w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} − Σ_{m ∈ I1∩I2} w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} )
= Σ_{l ∈ [n]} [ 1_{p^{π1}_l ∈ [k]} Σ_{m ∈ I1} w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} ] (=: ϕ^{a21}_l(π1)) · [ 1_{p^{π2}_l ∈ [k]} w_s(p^{π2}_l, ·) ] (=: ϕ^{b21}_l(π2)) − Σ_{l,m ∈ I1∩I2} w_s(p^{π2}_l, ·) w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} (23)
= ϕ^{a21}(π1)^T ϕ^{b21}(π2) − Σ_{l,m ∈ [n]} [ w_s(p^{π1}_l, p^{π1}_m) s^{π1}_{lm} 1_{p^{π1}_l, p^{π1}_m ∈ [k]} ] (=: ϕ^{a22}_{l,m}(π1)) · [ w_s(p^{π2}_l, ·) 1_{p^{π2}_l, p^{π2}_m ∈ [k]} ] (=: ϕ^{b22}_{l,m}(π2)) (24)
= ϕ^{a21}(π1)^T ϕ^{b21}(π2) − ϕ^{a22}(π1)^T ϕ^{b22}(π2)
= [ϕ^{a21}(π1); −ϕ^{a22}(π1)]^T (=: ϕ^{a2}(π1)^T) [ϕ^{b21}(π2); ϕ^{b22}(π2)] (=: ϕ^{b2}(π2)) = ϕ^{a2}(π1)^T ϕ^{b2}(π2). (25)

Equation 25 demonstrates the existence of the vectors ϕ^{a2} and ϕ^{b2} for the s2 term. The vectors ϕ^{a21} and ϕ^{a22}, possessing O(k) and O(k^2) non-zero entries respectively, are defined in Equations 23 and 24; consequently, the ϕ^{a2} vector has O(k^2) non-zero entries. Similarly, it can be shown that ϕ^{b2} contains O(k^2) non-zero entries, thus fulfilling the proof requirements for the s2 term. For the s3 term, we observe the symmetry s3(π1, π2) = s2(π2, π1), which makes it trivial to satisfy the requirements, as highlighted by the following equation:

s3(π1, π2) = s2(π2, π1) = ϕ^{a2}(π2)^T ϕ^{b2}(π1) = [ϕ^{b2}(π1)]^T (=: ϕ^{a3}(π1)^T) [ϕ^{a2}(π2)] (=: ϕ^{b3}(π2)) = ϕ^{a3}(π1)^T ϕ^{b3}(π2). (26)

Showing sparse vectors s4(π1, π2) = ϕ^{4a}(π1)^T ϕ^{4b}(π2). We begin by manipulating the s4 term without its scalar factor, as defined in Equation 15.
s4(π1, π2) = Σ_{l ∈ I1\I2} w_s(p^{π1}_l, ·) Σ_{m ∈ I2\I1} w_s(p^{π2}_m, ·)
= Σ_{l ∈ I1\I2} w_s(p^{π1}_l, ·) ( Σ_{m ∈ I2} w_s(p^{π2}_m, ·) − Σ_{m ∈ I1∩I2} w_s(p^{π2}_m, ·) ).

Observing that w̄ := Σ_{m ∈ I2} w_s(p^{π2}_m, ·) = Σ^k_{ℓ=1} w_s(ℓ, ·) is a constant that does not depend on π2 (the ranks appearing in any top-k ranking are exactly 1, ..., k), and that likewise Σ_{l ∈ I1} w_s(p^{π1}_l, ·) = w̄, we can further simplify s4 as follows:

s4(π1, π2) = ( Σ_{l ∈ I1} w_s(p^{π1}_l, ·) − Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) ) ( w̄ − Σ_{m ∈ I1∩I2} w_s(p^{π2}_m, ·) )
= w̄ · w̄ − w̄ Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) − w̄ Σ_{m ∈ I1∩I2} w_s(p^{π2}_m, ·) + Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) Σ_{m ∈ I1∩I2} w_s(p^{π2}_m, ·). (27)

Next, we simplify the second and third terms of Equation 27:

w̄ Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) = Σ_{l ∈ [n]} [ 1_{p^{π1}_l ∈ [k]} w_s(p^{π1}_l, ·) ] (=: ϕ^{4a1}_l(π1)) · [ 1_{p^{π2}_l ∈ [k]} w̄ ] (=: ϕ^{4b1}_l(π2)) = ϕ^{4a1}(π1)^T ϕ^{4b1}(π2),
w̄ Σ_{m ∈ I1∩I2} w_s(p^{π2}_m, ·) = Σ_{m ∈ [n]} [ 1_{p^{π1}_m ∈ [k]} w̄ ] (=: ϕ^{4a2}_m(π1)) · [ 1_{p^{π2}_m ∈ [k]} w_s(p^{π2}_m, ·) ] (=: ϕ^{4b2}_m(π2)) = ϕ^{4a2}(π1)^T ϕ^{4b2}(π2). (28)

Next, we simplify the fourth and last term in Equation 27 as follows:

Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) Σ_{m ∈ I1∩I2} w_s(p^{π2}_m, ·) = Σ_{l,m ∈ [n]} [ w_s(p^{π1}_l, ·) 1_{p^{π1}_l, p^{π1}_m ∈ [k]} ] (=: ϕ^{4a3}_{l,m}(π1)) · [ w_s(p^{π2}_m, ·) 1_{p^{π2}_l, p^{π2}_m ∈ [k]} ] (=: ϕ^{4b3}_{l,m}(π2)) = ϕ^{4a3}(π1)^T ϕ^{4b3}(π2). (29)

Combining the results from Equations 27, 28, and 29, we obtain:

s4(π1, π2) = [w̄; −ϕ^{4a1}(π1); −ϕ^{4a2}(π1); ϕ^{4a3}(π1)]^T (=: ϕ^{4a}(π1)^T) [w̄; ϕ^{4b1}(π2); ϕ^{4b2}(π2); ϕ^{4b3}(π2)] (=: ϕ^{4b}(π2)) = ϕ^{4a}(π1)^T ϕ^{4b}(π2). (30)

Equation 30 shows that ϕ^{4a} and ϕ^{4b} each consist of components with only O(k^2) non-zero entries in total, thus fulfilling the requirements for the s4 term. Next, we focus on the s5 term.

Showing sparse vectors s5(π1, π2) = ϕ^{5a}(π1)^T ϕ^{5b}(π2). We begin by examining the s5 term, excluding its scalar component, as outlined in Equation 16.
s5(π1, π2) = (n − |I1 ∪ I2|) Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·)
= (n − (2k − |I1 ∩ I2|)) Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·)
= (n − 2k) Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·) + |I1 ∩ I2| Σ_{l ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·) (31)
= Σ_{l ∈ [n]} [ √(n − 2k) w_s(p^{π1}_l, ·) 1_{p^{π1}_l ∈ [k]} ] (=: ϕ^{5a1}_l(π1)) · [ √(n − 2k) w_s(p^{π2}_l, ·) 1_{p^{π2}_l ∈ [k]} ] (=: ϕ^{5b1}_l(π2)) + Σ_{l,m ∈ I1∩I2} w_s(p^{π1}_l, ·) w_s(p^{π2}_l, ·)
= ϕ^{5a1}(π1)^T ϕ^{5b1}(π2) + Σ_{l,m ∈ [n]} [ w_s(p^{π1}_l, ·) 1_{p^{π1}_l, p^{π1}_m ∈ [k]} ] (=: ϕ^{5a2}_{l,m}(π1)) · [ w_s(p^{π2}_l, ·) 1_{p^{π2}_l, p^{π2}_m ∈ [k]} ] (=: ϕ^{5b2}_{l,m}(π2)) (32)
= ϕ^{5a1}(π1)^T ϕ^{5b1}(π2) + ϕ^{5a2}(π1)^T ϕ^{5b2}(π2)
= [ϕ^{5a1}(π1); ϕ^{5a2}(π1)]^T (=: ϕ^{5a}(π1)^T) [ϕ^{5b1}(π2); ϕ^{5b2}(π2)] (=: ϕ^{5b}(π2)) = ϕ^{5a}(π1)^T ϕ^{5b}(π2). (33)

The equation shows that s5(π1, π2) = ϕ^{5a}(π1)^T ϕ^{5b}(π2), where both ϕ^{5a} and ϕ^{5b} have components with at most O(k^2) non-zero entries, as indicated in Equations 31 and 32. This completes the proof requirements for the s5 term.

By combining the results from Equations 22, 25, 26, 30, and 33, we have demonstrated the existence of vectors ϕ^{ai} and ϕ^{bi}, each containing only O(k^2) non-zero elements, such that s_i(π1, π2) = ϕ^{ai}(π1)^T ϕ^{bi}(π2) for each i ∈ {1, 2, 3, 4, 5}. In conjunction with Claim 5, this completes the proof.

B Proposed GP-TopK Bandit Algorithm: Omitted Details

This section includes the proofs that were omitted from Section 4, presented in the following order. Section B.1 gives a brief overview of Gaussian process regression for an arbitrary domain. Section B.2 summarizes the omitted details of the local search utilized for optimizing the UCB function. Section B.3 provides the omitted proof of Theorem 2, concerning the overall running time of the bandit algorithm. Section B.4 provides the proof of Theorem 3, concerning the regret analysis of the proposed bandit algorithm.
B.1 Gaussian Process Regression

In GP regression [22], the training data are modeled as noisy measurements of a random function f drawn from a GP prior, denoted f ~ GP(0, k(·, ·)), where k : X × X → R is a kernel function over any domain X. The observed training pairs (x_i, y_i) are collected as X = [x_1, ..., x_T] and y = [y_1, ..., y_T] ∈ R^T, where, for an input x_i, the observed value is modeled as y_i = f(x_i) + ϵ_i, with ϵ_i ~ N(0, σ²). The kernel matrix on the data is K_X = [k(x_i, x_j)]^T_{i,j=1} ∈ R^{T×T}. The posterior mean µ_{f|D} and variance σ²_{f|D} functions for GPs are:

µ_{f|D}(x) := k_x^T z, (34)
σ²_{f|D}(x) := k(x, x) − k_x^T (K_X + σ²I)^{-1} k_x, (35)

where k_x ∈ R^T has as its ith entry k(x, x_i), z = (K_X + σ²I)^{-1} y, and I is the identity matrix. For GP regression on an arbitrary domain X, the kernel function must be a p.d. kernel [23]. Naive approaches rely on the Cholesky decomposition of the matrix K_X + σ²I, which takes Θ(T³) time [23]. To circumvent the Θ(T³) runtime, recent works use iterative algorithms such as the conjugate gradient algorithm, which facilitate GP inference by exploiting fast kernel matrix-vector multiplication (MVM) algorithms, i.e., v ↦ K_X v [3]. In practice, these methods yield highly accurate approximations of the GP posterior functions with complexity Θ(pT²) for p iterations of the conjugate gradient algorithm, since mvm(K_X) = T² in general, where mvm(M) denotes the operation count for multiplying the matrix M by a vector; p ≪ T proves efficient in practical applications [3].

B.2 Optimizing the UCB Acquisition Function via Local Search

Optimizing the acquisition function (AF), i.e., the UCB function, poses a significant challenge due to the enormous size of Π_k. Drawing inspiration from prior research on Bayesian optimization within combinatorial spaces, we employ a breadth-first local search (BFLS) to optimize the UCB acquisition function [2, 19]. The BFLS begins with the selection of several random top-k rankings.
Subsequently, each top-k ranking is compared with the UCB values of its neighboring rankings, and the search proceeds to the neighbor with the highest UCB value. The neighbors of a top-k ranking include all its permutations and the permutations of modified top-k rankings obtained by swapping one item with any of the remaining items. For any top-k ranking, there are (n − k)·k! + k! neighbors, which is often manageable since k is typically small, e.g., k ≤ 6. This search continues until no neighboring top-k ranking with a higher value is discovered. Although BFLS is a local search, the initial random selection and multiple restart points help it evade local optima, a strategy that previous studies have corroborated [19].

B.3 Assessing GP-TopK Compute Requirements

Theorem 2. Assuming a fixed number of iterations required by the iterative algorithms, the total computational time for running the GP-TopK bandit algorithm for T rounds of top-k recommendations, using the contextual product kernel (Equation 6), is O(k² c ℓ T²). This applies to the WK, CK, and WCK top-k ranking kernels, where ℓ is the number of local search evaluations for selecting the next arm in every round.

Proof. The proof follows straightforwardly by combining the results presented in Table 1, which summarizes the time complexities of each step of computing the UCB using both the feature and kernel approaches. It is important to emphasize that iterative algorithms improve the computational complexity from O(T⁴) to O(T³), and can further reduce it to O(T²) when used with the feature approach. The results presented in Table 1 can be validated through straightforward observations and by leveraging findings from Section 2. Specifically, Section 2 proves the mvm(K_X) row explicitly. For the "compute K_{X_t}" row, the complexity of the kernel approaches is deduced from Algorithms 2 and 3.
For the feature approaches, the "compute K_{X_t}" row is inferred from the sparsity of the feature representations as stated in Claim 3. Lastly, the memory row follows for the kernel approach by counting its entries, and for the feature approach from the sparsity of the feature representations.

B.4 Regret Analysis

In this section, we revisit Theorem 3 and provide its proof. The proofs build on the work of Krause and Ong [14], delivering results for bounding the contextual regret in the context of the top-k ranking problem. To set the stage for our regret analysis, we first define the critical quantity, the maximum mutual information γ_t:

γ_t := max_{X ⊆ X: |X| = t} I(y_X; f),  I(y_X; f) = H(y_X) − H(y_X | f),

where I(y_X; f) quantifies the reduction in uncertainty (measured in terms of differential Shannon entropy) about f achieved by revealing y_X [27]. In the Gaussian observation case, the entropy can be computed in closed form, H(N(µ, Σ)) = (1/2) log|2πeΣ|, so that

I(y_X; f) = (1/2) log|I + ξ^{-2} K_X|,

where K_X = [k(x, x′)]_{x,x′ ∈ X} is the Gram matrix of k evaluated on the set X ⊆ X. For the contextual bandit algorithm, X represents the contexts and arms considered until round t. Before proving Theorem 3, we align the results of Krause and Ong [14] with our notation for consistency. Furthermore, we modify β_t to accommodate embeddings encompassing negative values, aligning with the fact that contextual embeddings may exhibit negative dimensions.

Proposition 1 (Theorem 1, [14]). Let δ ∈ (0, 1), and let the unknown reward function f be sampled from the known GP prior with known noise variance σ². Suppose one of the following holds:
1. Assumption 1 holds and set β_t = 2 log(|X| t² π² / 6δ).
2. Assumption 2 holds and set β_t = 2B² + 300 γ_t ln³(t/δ).
Then the cumulative regret R_T of the contextual GP bandit algorithm with the UCB acquisition function is bounded by Õ(√(C1 T γ_T β_T)) with high probability.
Precisely, $$\Pr\left\{ R_T \le \sqrt{C_1 T \gamma_T \beta_T} + 2 \;\; \forall\, T \ge 1 \right\} \ge 1 - \delta,$$ where $C_1 = 8/\log(1 + \sigma^{-2})$ and the notation $\tilde{O}$ hides logarithmic factors in $n$, $1/\delta$, and $T$. Proposition 1 shows that the regret $R_T$ of the contextual GP bandit algorithm with the UCB acquisition function is bounded, with high probability, by $\tilde{O}(\sqrt{C_1 T \gamma_T \beta_T})$. To establish the order of $R_T$, we must first bound the order of $\gamma_T \beta_T$. We begin by examining the $\gamma_T$ term in the following proposition. Proposition 2. Under the assumptions of Theorem 3, $\gamma_T = O(n^2 c \log(n^2 T) + c \log T)$, which simplifies to $\tilde{O}(n^2 c)$, where $\tilde{O}$ omits logarithmic factors in $n$ and $T$. Proof. For the GP bandit algorithm with the UCB acquisition function, $\gamma_T \le C \log\left|I + \sigma^{-2} K_{X_T}\right|$, where $C = (1/2)\,(1 - 1/e)^{-1}$ and $K_{X_T}$ is the kernel matrix computed over the contexts and arms across $T$ rounds [27, 14]. Precisely, $K_{X_T}$ is calculated using the contextual kernel defined in Equation 6, applied to the contexts and top-k rankings from the feedback data $\mathcal{D}_t$, corresponding to Line 6 of the generic contextual bandit Algorithm 1. Next, we exploit the fact that the contextual kernel is a product kernel. Consequently, the maximum mutual information of the joint kernel, $\gamma_T$, can be upper bounded by $c\,(\gamma^\pi_T + \log T)$, where $c$ denotes the dimensionality of the contexts and $\gamma^\pi_T$ is the maximum information gain in the non-contextual setting [14]. Specifically, $\gamma^\pi_T$ is computed analogously but restricted to top-k rankings: $\gamma^\pi_T \le C \log\left|I + \sigma^{-2} K_{X^\pi_T}\right|$, where $K_{X^\pi_T}$ is calculated exclusively with the top-k kernels on the top-k rankings selected by the bandit algorithm, and $X^\pi_T$ denotes those selected rankings, i.e., excluding the contexts from the collected feedback.
Recalling the formulation of the top-k ranking kernels, we have $K_{X^\pi_T} = \Phi_{X^\pi_T}^\top \Phi_{X^\pi_T}$, where $\Phi_{X^\pi_T} \in \mathbb{R}^{\binom{n}{2} \times T}$ comprises the feature columns of the top-k ranking kernels, as elucidated in Section A. Using the Weinstein-Aronszajn identity, $\gamma^\pi_T \le C \log\left|I + \sigma^{-2} \Phi_{X^\pi_T} \Phi_{X^\pi_T}^\top\right|$. Further, $\gamma^\pi_T \le C \sum_{i=1}^{\binom{n}{2}} \log\left(1 + \sigma^{-2} \lambda_i\right)$, where $\lambda_i$ is an eigenvalue of $\Phi_{X^\pi_T} \Phi_{X^\pi_T}^\top$. By the Gershgorin circle theorem, which bounds all eigenvalues of a matrix by the maximum absolute row sum, we conclude that $\gamma^\pi_T = O(n^2 \log(n^2 T))$, since all columns of $\Phi_{X^\pi_T}$ have bounded norms by Claims 2 and 3, i.e., $\|\phi(\pi)\|_2^2 \le 1$ [29]. Combining $\gamma^\pi_T = O(n^2 \log(n^2 T))$ with the contextual product kernel yields $\gamma_T = O(n^2 c \log(n^2 T) + c \log T)$, the claimed bound. Next, we build on Propositions 1 and 2 to prove the main theorem on the regret of the proposed GP-TopK bandit algorithm for top-k recommendations. Theorem 3. If Assumption 1 or 2 holds, and $\beta_t$ is set to $2\log(|\mathcal{C}|\,|\Pi_k|\, t^2 \pi^2 / 6\delta)$ or $2B^2 + 300\,\gamma_t \ln^3(t/\delta)$, respectively, then the cumulative regret $R_T$ of the GP-TopK bandit algorithm for top-k recommendations is bounded, with probability at least $1 - \delta$, by $\tilde{O}\big(n \sqrt{C_1 T c\,(\log|\mathcal{C}| + k + \log(T^2\pi^2/6\delta))}\big)$ under Assumption 1, and $\tilde{O}\big(n \sqrt{C_1 (2B^2 c + 300\, n^2 c^2 \ln^3(T/\delta))\, T}\big)$ under Assumption 2. Here, $C_1 = 8/\log(1 + \xi^{-2})$, and $\tilde{O}$ excludes logarithmic factors in $n$, $k$, and $T$. Proof. We prove the theorem for the two cases separately. For Assumption 1. Since $|\mathcal{C}|$ is finite, set $\beta_T = 2\log(|\mathcal{C}|\,|\Pi_k|\, T^2\pi^2/6\delta)$. We first bound $\beta_T$ as follows: $$\beta_T = 2\log(|\mathcal{C}|\,|\Pi_k|\, T^2\pi^2/6\delta) = O\left(\log|\mathcal{C}| + \log|\Pi_k| + \log(T^2\pi^2/6\delta)\right).$$ Since $\binom{n}{k} \le n^k$ and $k! \le k^k$, we have $\log|\Pi_k| = \log\big(\binom{n}{k}\, k!\big) \le \log(n^k k^k) = O(k \log(nk))$, which implies $\beta_T = O\left(\log|\mathcal{C}| + k \log(nk) + \log(T^2\pi^2/6\delta)\right)$.
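The Weinstein-Aronszajn step used in the proof of Proposition 2 above, $\log|I + \sigma^{-2}\Phi^\top\Phi| = \log|I + \sigma^{-2}\Phi\Phi^\top|$, can be sanity-checked numerically on random features; the dimensions below are arbitrary stand-ins for $\binom{n}{2}$ and $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 10, 4                      # stand-ins for C(n, 2) and T
Phi = rng.normal(size=(d, t))     # one feature column per round
sigma2 = 0.05 ** 2

# Weinstein-Aronszajn: |I_t + Phi^T Phi / s| == |I_d + Phi Phi^T / s|
lhs = np.linalg.slogdet(np.eye(t) + Phi.T @ Phi / sigma2)[1]
rhs = np.linalg.slogdet(np.eye(d) + Phi @ Phi.T / sigma2)[1]
assert abs(lhs - rhs) < 1e-6
```

The identity lets the log-determinant over $T$ rounds be traded for one over the (fixed) feature dimension, which is what drives the eigenvalue argument in the proof.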
Combining this with Proposition 2, we have: $$O(\gamma_T \beta_T) = O\left(\left(n^2 c \log(n^2 T) + c \log T\right)\left(\log|\mathcal{C}| + k \log(nk) + \log(T^2\pi^2/6\delta)\right)\right)$$ $$= O\left(n^2 c \log(n^2 T)\left(\log|\mathcal{C}| + k \log(nk) + \log(T^2\pi^2/6\delta)\right)\right) \quad \text{(ignoring the } c \log T \text{ term)}$$ $$= \tilde{O}\left(n^2 c\left(\log|\mathcal{C}| + k + \log(T^2\pi^2/6\delta)\right)\right).$$ Thus, we obtain the asserted bound on the regret: $$O\left(\sqrt{C_1 T \gamma_T \beta_T}\right) = \tilde{O}\left(n \sqrt{C_1 T c\,(\log|\mathcal{C}| + k + \log(T^2\pi^2/6\delta))}\right).$$ For Assumption 2. Given $\|f\|_k \le B$ and $\beta_t = 2B^2 + 300\,\gamma_t \ln^3(t/\delta)$, we first bound $\beta_T$ using Proposition 2: $$\beta_T = 2B^2 + 300\,\gamma_T \ln^3(T/\delta) = 2B^2 + 300\left(n^2 c \log(n^2 T) + c \log T\right)\ln^3(T/\delta).$$ Using this, we have: $$O\left(\sqrt{C_1 T \gamma_T \beta_T}\right) = O\left(\sqrt{C_1 T \gamma_T\left(2B^2 + 300\,\gamma_T \ln^3(T/\delta)\right)}\right)$$ $$= \tilde{O}\left(\sqrt{C_1 T\, n^2 c \log(n^2 T)\left(2B^2 + 300\, n^2 c \log(n^2 T) \ln^3(T/\delta)\right)}\right)$$ $$= \tilde{O}\left(n \sqrt{C_1 T c\,(2B^2 + 300\, n^2 c \ln^3(T/\delta))}\right).$$ Comparison with Srinivas et al. (2010). Using the identity kernel for top-k rankings, we can construct a finite-dimensional feature map for the contextual kernel and apply Theorem 5 of Srinivas et al. (2010). Given that $\gamma_T = O(n^k c \log T)$, the regret bounds under both assumptions follow. For instance, under Assumption 2: $$O\left(\sqrt{C_1 T \gamma_T \beta_T}\right) = O\left(\sqrt{C_1 T \gamma_T\left(2B^2 + 300\,\gamma_T \ln^3(T/\delta)\right)}\right)$$ $$= \tilde{O}\left(\sqrt{C_1 T\,(n^k c \log T)\left(2B^2 + 300\,(n^k c \log T)\ln^3(T/\delta)\right)}\right)$$ $$= \tilde{O}\left(n^{k/2} \sqrt{C_1 T c\,(2B^2 + 300\, n^k c \ln^3(T/\delta))}\right).$$ The analysis for Assumption 1 is analogous and, combined with Proposition 1, yields the regret bounds reported in Table 3. C Experiments: Omitted Details This section presents details omitted from the main body of the text. C.1 Compute resources We utilized multiple NVIDIA Tesla M40 GPUs with 40 GB RAM on our in-house cluster for our experiments. The experiments in Section 5 required approximately 5 GPU-hours for the small arm space and 24 GPU-hours per iteration for the large arm space. We conducted about 50 to 100 iterations throughout the project. The results reported in Section C.3 required the same computational resources as the large-arm-space experiments.
Figure 4: Local search results for optimizing combinatorial objectives in $\Pi_k$ for $n = 50$ and $k = 6$; for details, see the textual description. The left plot (a) shows how many times out of 100 trials the local search recovers the exact maximizer $\pi^*$, and the right plot (b) shows the average objective value of the returned maximizer. These results indicate that the local search used in this work is effective. C.2 Bandit Simulation and Hyper-parameter Configurations: Omitted Details To set up the simulation, we utilized embeddings trained on the MovieLens dataset using a collaborative filtering approach [6]. We consider the 1M variant of the MovieLens dataset, which contains 1 million ratings from 6040 users for 3677 items. Specifically, we train user embeddings $c_u$ and item embeddings $\theta_i$ such that a user's attraction to an item is captured by the inner product of the user embedding with the item embedding. Both context and item embeddings, i.e., $c_u$ and $\theta_i$, are 5-dimensional, chosen based on 5-fold performance on this dataset. The reward in our experiments is contaminated with zero-mean Gaussian noise with standard deviation 0.05. For the $\epsilon$-greedy baselines, we considered various values of $\epsilon$, specifically $\epsilon \in \{0.01, 0.05, 0.1\}$; outcomes are presented for the best-performing configuration. For the MAB-UCB baseline, the algorithm uses an upper confidence score $\mathrm{ucb}(i) = \mu_i + \beta_{\mathrm{mab}}\sqrt{\log(n)/n_i}$ [11]. Here, $\mu_i$ represents the average reward of arm $i$, $n$ denotes the total number of rounds, and $n_i$ the number of times arm $i$ has been played. $\beta_{\mathrm{mab}}$ is a hyper-parameter; we evaluate $\beta_{\mathrm{mab}} \in \{0.1, 0.25, 0.5\}$ and report results for the best-performing configuration. For the parameters of the proposed GP-TopK bandit algorithms, we set $\beta_t = \beta_{\mathrm{gp}} \log(|\mathcal{X}|\, t^2 \pi^2)$ with $\beta_{\mathrm{gp}} \in \{0.05, 0.1, 0.5\}$, reporting results for the value that yields the best performance.
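For concreteness, the MAB-UCB baseline's score described above can be sketched as follows; the exploration term $\sqrt{\log(n)/n_i}$ is our reading of the formula, with the quantities as defined in the text.

```python
import math

def ucb_score(mean_reward, total_rounds, pulls, beta_mab):
    """Upper confidence score: empirical mean plus a beta-scaled
    exploration bonus that shrinks as the arm is pulled more often."""
    return mean_reward + beta_mab * math.sqrt(math.log(total_rounds) / pulls)

# A rarely pulled arm receives a larger bonus than a frequently pulled one.
rare = ucb_score(0.5, total_rounds=100, pulls=2, beta_mab=0.25)
frequent = ucb_score(0.5, total_rounds=100, pulls=50, beta_mab=0.25)
assert rare > frequent
```

The hyper-parameter `beta_mab` trades off exploration against exploitation, which is why it is tuned over a small grid in the experiments.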
The choice of $\beta_t$ is informed by prior work on GP bandits [27]. The selection of $\sigma$ for all variants is determined by optimizing the log-likelihood of the observed feedback every 10 rounds, considering values in the set {0.01, 0.05, 0.1}. C.3 Additional results Local search results for optimizing combinatorial objectives in $\Pi_k$ for $n = 50$ and $k = 6$. Specifically, we solve $\hat{\pi} = \arg\max_{\pi \in \Pi_k} \phi_r(\pi)^\top \phi_r(\pi^*)$, where $\phi_r(\pi)$ represents the feature vector of the Kendall kernels on top-k rankings. Notably, for this optimization problem the optimal value is known to be 1, attained only at $\pi^*$. Figure 4 shows results for this optimization problem applied to the WK, CK, and WCK kernels. Reward results for the large arm space with the nDCG + diversity reward. Similar to Figure 3, a large setup with $n = 50$ for $k = 3$ and $k = 6$ is considered. For $k = 6$, there are over $1.1 \times 10^{10}$ possible top-k rankings. Given the vastness of this arm space, computing the optimal arm for the diversity reward is not straightforward; we therefore focus on reporting the cumulative reward in Figure 5. We implement this setup using local search in batch mode, updating every 5 rounds and considering a substantial horizon of $T = 100$ rounds. Specifically, we use 5 restarts, 5 steps in every search direction, and 1000 initial candidates. Figure 5 shows that the WCK approach demonstrates superior performance, continuing to learn effectively even after extensive rounds. Figure 5: Comparative evaluation of the bandit algorithms (MAB, WK, CK, WCK) for large arm spaces with the nDCG + diversity reward, with over $1.1 \times 10^5$ arms in the left plot and over $1.1 \times 10^{10}$ arms in the right plot. Cumulative reward is plotted against the rounds of the bandit algorithm; results are averaged over 6 trials. In both settings, the WCK approach outperforms the other baselines. For more details, see the textual description.
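The recovery objective of Section C.3 above attains its optimum of 1 only at $\pi^*$ when the feature vectors are unit-normalized. The sketch below illustrates this with the classic Kendall embedding of full rankings, a simplification of the paper's sparse top-k variants; the function names are ours.

```python
import itertools
import math

def kendall_feature(ranking):
    """Kendall embedding of a full ranking: one +/-1 entry per item pair
    encoding their relative order, scaled to unit Euclidean norm."""
    pos = {item: i for i, item in enumerate(ranking)}
    n = len(ranking)
    m = n * (n - 1) // 2
    return [
        (1.0 if pos[a] < pos[b] else -1.0) / math.sqrt(m)
        for a, b in itertools.combinations(range(n), 2)
    ]

def kendall_kernel(p, q):
    return sum(u * v for u, v in zip(kendall_feature(p), kendall_feature(q)))

pi_star = (2, 0, 1, 3)
# The inner-product objective equals 1 exactly at pi_star ...
assert abs(kendall_kernel(pi_star, pi_star) - 1.0) < 1e-12
# ... and is strictly smaller for every other ranking.
assert all(
    kendall_kernel(p, pi_star) < 1.0
    for p in itertools.permutations(range(4)) if p != pi_star
)
```

This uniqueness of the optimum is what makes exact recovery of $\pi^*$ a meaningful success criterion for the local search in Figure 4.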
NeurIPS Paper Checklist 1. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: Section 1 summarizes both the contributions and the scope of this work. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 6 reflects on the limitations of this work. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Proofs of all Claims and Theorems are provided in the Appendix. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Section 5 provides the necessary details of the bandit simulator and the experimental setups considered in this work. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Our code can be accessed using this hyper-link. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Section 5 provides experimental details, and a few remaining details are given in Appendix C. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes] Justification: All figures reported in this work include error bars. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Section C provides relevant information about compute resources. 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Section 6 reflects on the impact of this work. 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: This work does not release any such resource or asset. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: This work releases only code, with instructions for its usage. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] 15.
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA]