No-Regret Online Prediction with Strategic Experts

Omid Sadeghi, Maryam Fazel
University of Washington, Seattle, WA 98195
{omids, mfazel}@uw.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

We study a generalization of the online binary prediction with expert advice framework where, at each round, the learner is allowed to pick $m \ge 1$ experts from a pool of $K$ experts, and the overall utility is a modular or submodular function of the chosen experts. We focus on the setting in which experts act strategically and aim to maximize their influence on the algorithm's predictions by potentially misreporting their beliefs about the events. Among others, this setting finds applications in forecasting competitions where the learner seeks not only to make predictions by aggregating different forecasters but also to rank them according to their relative performance. Our goal is to design algorithms that satisfy the following two requirements: 1) incentive-compatible: incentivize the experts to report their beliefs truthfully, and 2) no-regret: achieve sublinear regret with respect to the true beliefs of the best fixed set of $m$ experts in hindsight. Prior works have studied this framework when $m = 1$ and provided incentive-compatible no-regret algorithms for the problem. We first show that a simple reduction of our problem to the $m = 1$ setting is neither efficient nor effective. Then, we provide algorithms that utilize the specific structure of the utility functions to achieve the two desired goals.

1 Introduction

Learning from a constant flow of information is one of the most prominent challenges in machine learning. In particular, online learning requires the learner to iteratively make decisions, and at the time of making each decision, the outcome associated with it is unknown to the learner. The experts problem is perhaps the most well-known problem in online learning [1, 2, 3, 4]. In this problem, the learner aims to make predictions about a sequence of $T$ binary events. To do so, the learner has access to the advice of $K$ experts, each of whom has an internal belief about the likelihood of each event. At each round $t \in [T]$, the learner has to choose one among the advice of the $K$ experts; upon making her choice, the $t$-th binary event is realized and a loss bounded between zero and one is revealed. The goal of the learner is to have no regret, i.e., to perform as well as the best fixed expert in hindsight.

In many applications, however, the experts are strategic and wish to be selected by the learner as often as possible. To this end, they may strategically misreport their beliefs about the events. For instance, FiveThirtyEight (https://fivethirtyeight.com/) aggregates different pollsters according to their past performance to make a single prediction for elections and sports matches. To do so, FiveThirtyEight maintains publicly available pollster ratings (https://projects.fivethirtyeight.com/pollster-ratings/). A low rating can harm a pollster's credibility and adversely impact their future revenue opportunities. Therefore, instead of maximizing their expected performance by reporting their predictions truthfully, the pollsters may decide to take risks and report more extreme beliefs to climb higher on the leaderboard. It is therefore important to design algorithms that not only achieve no regret but also motivate the experts to report their true beliefs (incentive compatibility). Otherwise, the quality of the learner's predictions may be harmed.
As we mention in Section 1.1, all previous works on this topic have focused on the standard experts problem, where the goal is to choose a single expert among the $K$ experts. In the offline setting, this is equivalent to a forecasting competition in which only the single highest-ranked forecaster wins and receives prizes. However, in many applications, a set of top-performing forecasters are awarded perks and benefits. For instance, in the Good Judgment Project, a recent geopolitical forecasting tournament, the top 2% of forecasters were given "superforecaster" status and received benefits such as paid conference travel and employment opportunities [5]. Similarly, in the latest edition of Kaggle's annual machine learning competition to predict the match outcomes of the NCAA March Madness college basketball tournament (called March Machine Learning Mania 2023, https://www.kaggle.com/competitions/march-machine-learning-mania-2023/), the top 8 forecasters on the leaderboard received monetary prizes. Variants of the $m$-experts problem have been studied previously in [6, 7, 8, 9]; however, all of these works focused only on providing no-regret algorithms for the problem, and incentive compatibility considerations were not taken into account. To the best of our knowledge, this is the first work that focuses on the strategic $m$-experts problem, where the experts may misreport their true beliefs to increase their chances of being chosen by the learner.

For the setting with modular utilities, perhaps the simplest approach to learning well compared to the best fixed set of $m$ experts while maintaining incentive compatibility is to run the incentive-compatible WSU algorithm of [10] for the standard experts problem (the setting with $m = 1$) over the set of $\binom{K}{m}$ meta-experts, where each meta-expert corresponds to one of the sets of size $m$. This approach has two major drawbacks: 1) there are exponentially many meta-experts, so maintaining a weight for each meta-expert and running the WSU algorithm is computationally expensive, and 2) the dependence of the regret bound on $m$ is suboptimal. Therefore, for our setting, it is preferable to design algorithms that are tailored to the $m$-experts problem.
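To see the scale of the first drawback, the short snippet below (our own illustration; the values of $K$ and $m$ are arbitrary) counts the meta-experts the reduction would have to maintain, together with the $(\frac{Ke}{m})^m$ upper bound used later in the analysis.

```python
from math import comb, e

# Number of size-m subsets that the naive reduction must track as meta-experts.
K, m = 100, 5                   # illustrative problem size, not from the paper
print(comb(K, m))               # 75287520 weights to store and update per round
print(int((K * e / m) ** m))    # the (Ke/m)^m upper bound on binom(K, m)
```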
1.1 Related work

Prior works have studied the experts problem under incentive compatibility considerations for two feedback models. In the full information setting, the learner observes the reported prediction of every expert at each round. In the partial information setting, however, the learner is restricted to choosing a single expert at each round and does not observe the predictions of the other experts. [11] considered algorithms that maintain weights over the experts and choose experts according to these weights. They assumed that experts' incentives are affected only by the unnormalized weights of the algorithm over the experts. However, since the probability of an expert being chosen by the learner equals her normalized weight, this assumption might not be suitable. Later on, [10] made the assumption that at each round $t \in [T]$, incentives are tied to the expert's normalized weight (i.e., the probability of being chosen at round $t+1$) and studied this problem under both feedback models. They proposed the WSU and WSU-UX algorithms for the full information and partial information settings, respectively, where both algorithms are incentive-compatible, and they obtained $O(\sqrt{T \ln K})$ and $O(T^{2/3}(K \ln K)^{1/3})$ regret bounds for the two algorithms. Then, [12] considered non-myopic strategic experts, where the goal of each expert is to maximize a conic combination of the probabilities of being chosen in all subsequent rounds (not just the very next round). They showed that the well-known Follow the Regularized Leader (FTRL) algorithm with the negative entropy regularizer obtains a regret bound of $O(\sqrt{T \ln K})$ while being $\Theta(\frac{1}{\sqrt{T}})$-approximately incentive-compatible, i.e., it is a strictly dominated strategy for any expert to make reports $\Theta(\frac{1}{\sqrt{T}})$ distant from their true beliefs.

For the non-strategic $m$-experts problem with modular utilities, [7] proposed the Component Hedge (CH) algorithm and obtained a regret bound of $O\big(\sqrt{\ell^* m \ln\frac{K}{m}} + m \ln\frac{K}{m}\big)$, where $\ell^*$ is the cumulative loss of the best fixed set in hindsight. They also gave a matching lower bound for this problem. [8] studied the FTPL algorithm with the Gaussian noise distribution and provided an $O\big(m\sqrt{T \ln\frac{K}{m}}\big)$ regret bound for this setting. For the non-strategic $m$-experts problem with submodular utility functions, [13] proposed the online distorted greedy algorithm (with the dual averaging algorithm as the algorithm $A_i$ for $i = 1, \dots, m$), whose regret bound is $O\big(\sqrt{mT \ln\frac{K}{m}}\big)$. More recently, [9] studied the $m$-experts problem under various choices of the utility function (sum-reward, max-reward, pairwise-reward and monotone reward). In particular, for the setting with modular utilities (sum-reward), they proposed an algorithm that matches the optimal regret bound of the CH algorithm while being computationally more efficient.

1.2 Contributions

In this paper, we focus on a generalization of the experts problem (called the $m$-experts problem) where, at each round, instead of picking a single expert, we are allowed to pick $m \ge 1$ experts, and our utility is either a modular or a submodular function of the chosen experts. In particular, at round $t \in [T]$ and for a set of experts $S_t \subseteq [K]$, the utility function is defined as $f_t(S_t) = 1 - \frac{1}{|S_t|}\sum_{i \in S_t} \ell_{i,t}$ in the modular case and $f_t(S_t) = 1 - \prod_{i \in S_t} \ell_{i,t}$ in the submodular case, where $\ell_{i,t} \in [0,1]$ is the loss of expert $i$ at round $t$. The goal is to design algorithms that perform as well as the best fixed set of $m$ experts in hindsight (no-regret) and incentivize the experts to report their beliefs about the events truthfully (incentive-compatible).

Towards this goal, we build upon the study of the Follow the Perturbed Leader (FTPL) algorithm for the $m$-experts problem with modular utility functions by [8] and derive a sufficient condition on the perturbation distribution that guarantees approximate incentive compatibility. Furthermore, we show how this condition is related to the commonly used bounded hazard rate assumption on the noise distribution. In particular, we show that while FTPL with Gaussian perturbations is not incentive-compatible, choosing the Laplace or hyperbolic noise distribution guarantees approximate incentive compatibility. Moreover, inspired by Algorithm 1 of [13] for online monotone submodular maximization subject to a matroid constraint, we first introduce a simpler algorithm (called the online distorted greedy algorithm) for the special case of cardinality constraints. This algorithm utilizes $m$ incentive-compatible algorithms for the standard experts problem (i.e., the $m = 1$ setting) and outputs their combined predictions. We provide $(1 - \frac{c}{e})$-regret bounds for the algorithm, where $c \in [0,1]$ is the average curvature of the submodular utility functions.
Therefore, when the algorithm is applied to the setting where the utility functions are modular (i.e., $c = 0$), the approximation ratio is 1. For submodular utility functions, the algorithm achieves the optimal $1 - \frac{c}{e}$ approximation ratio. Finally, we validate our theoretical results through experiments on data gathered from a forecasting competition run by FiveThirtyEight in which forecasters make predictions about the match outcomes of the recent 2022–2023 National Football League (NFL) season.

2 Preliminaries

Notation. The set $\{1, 2, \dots, n\}$ is denoted by $[n]$. For vectors $x_t, y \in \mathbb{R}^n$, $x_{i,t}$ and $y_i$ denote their $i$-th entries, respectively. Similarly, for a matrix $A \in \mathbb{R}^{n \times n}$, we use $A_{i,j}$ to indicate the entry in the matrix's $i$-th row and $j$-th column. The inner product of two vectors $x, y \in \mathbb{R}^n$ is denoted by either $\langle x, y \rangle$ or $x^\top y$. For a set function $f$, we use $f(j \mid A)$ to denote $f(A \cup \{j\}) - f(A)$.

A set function $f$ defined over the ground set $V$ is monotone if for all $A \subseteq B \subseteq V$, $f(A) \le f(B)$ holds. $f$ is called submodular if for all $A \subseteq B \subseteq V$ and $j \notin B$, $f(j \mid A) \ge f(j \mid B)$ holds. In other words, the marginal gain of adding the item $j$ decreases as the base set gets larger. This property is known as the diminishing returns property. If equality holds in the above inequality for all $A$, $B$ and $j$, the function is called modular. As mentioned in Section 1.2, at round $t \in [T]$ and for a set of experts $S \subseteq [K]$, the submodular utility function is defined as $f_t(S) = 1 - \prod_{i \in S} \ell_{i,t}$, where $\ell_{i,t} \in [0,1]$ is the loss of expert $i$ at round $t$. To show that this function is submodular, note that for $A \subseteq B$ and $j \notin B$ we have:
$f_t(A \cup \{j\}) - f_t(A) = (1 - \ell_{j,t}) \prod_{i \in A} \ell_{i,t} \ge (1 - \ell_{j,t}) \prod_{i \in B} \ell_{i,t} = f_t(B \cup \{j\}) - f_t(B),$
where the inequality follows from $\ell_{i,t} \in [0,1]$ for $i \in B \setminus A$.

For a normalized monotone submodular set function $f$ (i.e., $f(\emptyset) = 0$), the curvature $c_f$ is defined as [14]:
$c_f = 1 - \min_{j \in V} \frac{f(j \mid V \setminus \{j\})}{f(\{j\})}.$
It is easy to see that $c_f \in [0,1]$ always holds: $c_f \le 1$ is due to the monotonicity of $f$, and $c_f \ge 0$ follows from $f$ being submodular. Curvature characterizes how submodular the function is. If $c_f = 0$, the function is modular, and larger values of $c_f$ correspond to the function exhibiting a stronger diminishing returns structure.

3 m-experts problem

We introduce the $m$-experts problem in this section. In this problem, there are $K$ experts available, and each expert makes probabilistic predictions about a sequence of $T$ binary outcomes. At round $t \in [T]$, expert $i \in [K]$ has a private belief $b_{i,t} \in [0,1]$ about the outcome $r_t \in \{0,1\}$, where $r_t$ and $\{b_{i,t}\}_{i=1}^K$ are chosen arbitrarily and potentially adversarially. Expert $i$ reports $p_{i,t} \in [0,1]$ as her prediction to the learner. Then, the learner chooses a set $S_t$ containing $m$ of the experts. Upon committing to this action, the outcome $r_t$ is revealed, and expert $i$ incurs a loss of $\ell_{i,t} = \ell(b_{i,t}, r_t)$, where $\ell: [0,1] \times \{0,1\} \to [0,1]$ is a bounded loss function. In this paper, we focus on the quadratic loss function, defined as $\ell(b, r) = (b - r)^2$. The utility of the learner at round $t$ is one of the following:

Modular utility function: $f_t(S_t) = 1 - \frac{1}{|S_t|}\sum_{i \in S_t} \ell_{i,t} = 1 - \frac{1}{|S_t|}\sum_{i \in S_t} \ell(b_{i,t}, r_t)$.

Submodular utility function: $f_t(S_t) = 1 - \prod_{i \in S_t} \ell_{i,t} = 1 - \prod_{i \in S_t} \ell(b_{i,t}, r_t)$.

It is easy to see that $f_t$ is monotone in both cases and that $f_t(S_t) \in [0,1]$ holds. Note that the utility at each round is defined with respect to the true beliefs of the chosen experts rather than their reported beliefs.
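As a concrete reference for these definitions, the following sketch (our own illustration; the function and variable names are not from the paper) evaluates the quadratic loss and the two utility functions on a toy round.

```python
import numpy as np

def quadratic_loss(belief: float, outcome: int) -> float:
    """Quadratic loss l(b, r) = (b - r)^2, bounded in [0, 1]."""
    return (belief - outcome) ** 2

def modular_utility(losses) -> float:
    """f_t(S) = 1 - (1/|S|) * sum of the losses of the chosen experts."""
    return 1.0 - float(np.mean(losses))

def submodular_utility(losses) -> float:
    """f_t(S) = 1 - product of the losses of the chosen experts."""
    return 1.0 - float(np.prod(losses))

# Toy round: three chosen experts, realized outcome r_t = 1.
beliefs = [0.9, 0.6, 0.2]
losses = [quadratic_loss(b, 1) for b in beliefs]   # [0.01, 0.16, 0.64]
print(modular_utility(losses))      # 0.73
print(submodular_utility(losses))   # ~0.999: one accurate expert already suffices
```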
The goal of the learner is twofold:

1) Minimize the $\alpha$-regret, defined as
$\alpha\text{-}R_T = \mathbb{E}\Big[\alpha \max_{S \subseteq [K]: |S| = m} \sum_{t=1}^T f_t(S) - \sum_{t=1}^T f_t(S_t)\Big],$
where the expectation is taken with respect to the potential randomness of the algorithm. For the modular utility function, $\alpha = 1$, and for the submodular setting, we set $\alpha = 1 - \frac{c_f}{e}$ (where $f = \sum_{t=1}^T f_t$), which is the optimal approximation ratio for any algorithm making polynomially many queries to the objective function.

2) Incentivize experts to report their private beliefs truthfully. To be precise, at each round $t \in [T]$, each expert $i \in [K]$ acts strategically to maximize their probability of being chosen at round $t+1$, and the learner's algorithm is called incentive-compatible if expert $i$ maximizes this probability by reporting $p_{i,t} = b_{i,t}$. We define the incentive compatibility property below.

Definition 1. An online learning algorithm is incentive-compatible if, for every $t \in [T]$, every expert $i \in [K]$ with belief $b_{i,t}$, every report $p_{i,t}$, all reports of the other experts $p_{-i,t}$, and every history of reports $\{p_{j,s}\}_{j \in [K], s < t}$ and outcomes $\{r_s\}_{s < t}$, the expert's expected probability of being chosen at round $t+1$ (where the expectation is over $r_t \sim \mathrm{Bernoulli}(b_{i,t})$ and any randomness of the algorithm) is maximized by reporting $p_{i,t} = b_{i,t}$.

As discussed in Section 1, a natural first attempt for the modular setting is to run the WSU algorithm of [10] over the $\binom{K}{m}$ meta-experts corresponding to the sets of size $m$.

Theorem 1. For the $m$-experts problem with modular utility functions, running the WSU algorithm of [10] over the $\binom{K}{m}$ meta-experts achieves a regret bound of $O\big(\sqrt{mT \ln\frac{Ke}{m}}\big)$.

Both the exponential number of weights and the extra dependence on $m$ in this bound motivate the algorithms developed in the rest of the paper.

4 Follow the perturbed leader (FTPL)

In this section, we focus on the setting with modular utility functions. At each round $t \in [T]$, the Follow the Perturbed Leader (FTPL) algorithm draws i.i.d. perturbations $\gamma_{i,t} \sim \mathcal{D}$ for the experts $i \in [K]$ and selects the set $S_t$ consisting of the $m$ experts with the smallest perturbed cumulative reported losses $\sum_{s=1}^{t-1} \ell(p_{i,s}, r_s) + \eta \gamma_{i,t}$, where $\eta > 0$ is a step size. The choice of the perturbation distribution $\mathcal{D}$ turns out to be crucial for incentive compatibility. We consider symmetric zero-mean distributions whose probability density function can be written as $f_{\mathcal{D}}(z) \propto \exp(-\nu(z))$ for a differentiable function $\nu$, and we impose the following condition.

Condition 1. There exists a constant $B > 0$ such that $|\nu'(z)| \le B$ for all $z \in \mathbb{R}$.

We have $\nu(z) = |z|$ for the Laplace distribution; therefore, $\nu'(z) = \mathrm{sign}(z)$ and $B = 1$. For the symmetric hyperbolic distribution, $\nu(z) = \sqrt{1 + z^2}$ holds, so $|\nu'(z)| = \frac{|z|}{\sqrt{1 + z^2}} \le 1$ and $B = 1$.

Condition 1 is closely related to a boundedness assumption on the hazard rate of the perturbation distribution. We first define the hazard rate below.

Definition 2. The hazard rate of $\mathcal{D}$ at $z \in \mathbb{R}$ is defined as $\mathrm{haz}_{\mathcal{D}}(z) = \frac{f_{\mathcal{D}}(z)}{1 - F_{\mathcal{D}}(z)}$, where $f_{\mathcal{D}}$ and $F_{\mathcal{D}}$ are the probability density function (pdf) and the cumulative distribution function (cdf) of the noise distribution $\mathcal{D}$. The maximum hazard rate of $\mathcal{D}$ is $\mathrm{haz}_{\mathcal{D}} = \sup_{z \in \mathbb{R}} \mathrm{haz}_{\mathcal{D}}(z)$.

The hazard rate is a statistical tool used in survival analysis that measures how fast the tail of a distribution decays. The theorem below shows the connection between Condition 1 and the bounded hazard rate assumption.

Theorem 2. If Condition 1 holds for the perturbation distribution $\mathcal{D}$ with the constant $B > 0$, we have $\mathrm{haz}_{\mathcal{D}} \le B$.

However, there are distributions with bounded hazard rates for which $\max_z |\nu'(z)|$ is unbounded (i.e., Condition 1 does not hold). For instance, consider the standard Gumbel distribution. In this case, $\nu(z) = z + \exp(-z)$. Therefore, we have $\nu'(z) = 1 - \exp(-z)$, so $|\nu'(z)| \to \infty$ as $z \to -\infty$. Therefore, Condition 1 is strictly stronger than the bounded hazard rate assumption on the noise distribution $\mathcal{D}$.

We now show how Condition 1 guarantees an approximate notion of incentive compatibility for FTPL.

Theorem 3. For the FTPL algorithm with a noise distribution satisfying Condition 1 with a constant $B > 0$, at each round $t \in [T]$ and for each expert $i \in [K]$, the optimal report from the expert's perspective, $p^*_{i,t}$, is at most $\frac{2B}{\eta - 2B}$ away from her belief $b_{i,t}$, i.e., the following holds:
$|p^*_{i,t} - b_{i,t}| \le \frac{2B}{\eta - 2B}.$

Note that while we focused on the incentive structure in which, at each round $t \in [T]$, experts wish to maximize their probability of being chosen at round $t+1$, the same argument applies to the more general setting where the goal is to maximize a conic combination of the probabilities of being chosen at all subsequent rounds $s > t$. Therefore, FTPL is approximately incentive-compatible with respect to this more general incentive structure as well.
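The sketch below (our own; the function name and interface are illustrative) implements one selection step of the FTPL rule described above with standard Laplace perturbations, for which $\nu(z) = |z|$ and Condition 1 holds with $B = 1$.

```python
import numpy as np

def ftpl_select(cum_losses: np.ndarray, m: int, eta: float,
                rng: np.random.Generator) -> np.ndarray:
    """Return the indices of the m experts with the smallest perturbed
    cumulative losses  L_i + eta * gamma_i,  where gamma_i ~ Laplace(0, 1).

    For the standard Laplace density f(z) proportional to exp(-|z|) we have
    nu(z) = |z|, so |nu'(z)| = 1 and Condition 1 holds with B = 1.
    """
    gamma = rng.laplace(loc=0.0, scale=1.0, size=cum_losses.shape)
    perturbed = cum_losses + eta * gamma
    return np.argsort(perturbed)[:m]

# Toy usage: K = 6 experts, pick m = 2, after a few rounds of quadratic losses.
rng = np.random.default_rng(0)
cum_losses = np.array([3.1, 0.7, 1.9, 0.9, 2.5, 1.2])
print(ftpl_select(cum_losses, m=2, eta=2.0, rng=rng))
```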
Theorem 3 allows us to bound the regret of the FTPL algorithm with respect to the true beliefs of the experts. First, note that FTPL obtains the following bound with respect to the reported beliefs of the experts.

Theorem 4. For the FTPL algorithm with a noise distribution $\mathcal{D}$ satisfying Condition 1 with the constant $B > 0$, if we set $\eta = \sqrt{\frac{BT}{\ln\frac{K}{m}}}$, the following holds:
$\mathbb{E}\Big[\frac{1}{m}\sum_{t=1}^T \sum_{i \in S_t} \ell(p_{i,t}, r_t)\Big] - \min_{S: |S| = m} \frac{1}{m}\sum_{t=1}^T \sum_{j \in S} \ell(p_{j,t}, r_t) \le O\Big(\sqrt{BT \ln\frac{K}{m}}\Big).$

Using the result of Theorem 3, we have $|p_{i,t} - b_{i,t}| = |p^*_{i,t} - b_{i,t}| \le \frac{2B}{\eta - 2B}$. Moreover, one can easily show that the quadratic loss function is 2-Lipschitz in its first argument. Therefore, for all $t \in [T]$ and $i \in [K]$, we have:
$|\ell(p_{i,t}, r_t) - \ell(b_{i,t}, r_t)| \le \frac{4B}{\eta - 2B}.$
Putting the above results together, we can obtain the following regret bound for the FTPL algorithm with respect to the true beliefs:
$\mathbb{E}[1\text{-}R_T] = \mathbb{E}\Big[\frac{1}{m}\sum_{t=1}^T \sum_{i \in S_t} \ell(b_{i,t}, r_t)\Big] - \min_{S: |S| = m} \frac{1}{m}\sum_{t=1}^T \sum_{j \in S} \ell(b_{j,t}, r_t) \le O\Big(\sqrt{BT \ln\frac{K}{m}}\Big) + 4TC,$
where $C = \frac{2B}{\eta - 2B}$ is the bound from Theorem 3. Given that $\eta = \sqrt{\frac{BT}{\ln\frac{K}{m}}}$ in Theorem 4, the expected regret bound is $O\big(\sqrt{BT \ln\frac{K}{m}}\big)$. This result is summarized in the following theorem.

Theorem 5. For the FTPL algorithm with a noise distribution $\mathcal{D}$ satisfying Condition 1 with the constant $B > 0$, if we set $\eta = \sqrt{\frac{BT}{\ln\frac{K}{m}}}$, the regret bound with respect to the true beliefs of the experts is $O\big(\sqrt{BT \ln\frac{K}{m}}\big)$.

In order to ensure approximate incentive compatibility, the probability density function $f$ of the noise distribution needs to be such that $\frac{f(z)}{f(z+1)}$ does not grow to infinity for very large $z$. One way to enforce this is via a Lipschitzness assumption on $\ln f$: Condition 1 implies that $\ln f$ is $B$-Lipschitz. This is why smaller values of $B$ lead to better approximate incentive compatibility, which in turn results in smaller regret bounds (given that the term $4TC$ appears in the regret bound, where $C$ is the bound on the approximate incentive compatibility derived in Theorem 3).

We can use the FTPL algorithm to obtain results for the partial information setting as well. [20] showed that if the hazard rate of the noise distribution is bounded by $B$, applying the FTPL algorithm to the partial information setting for the 1-expert problem leads to an $O(\sqrt{BKT \ln K})$ regret bound. Using the result of Theorem 2, we know that if Condition 1 holds, the hazard rate is bounded. Therefore, if the noise distribution satisfies Condition 1, FTPL applied to the $m$-experts problem in the partial information setting is approximately incentive-compatible and achieves a sublinear regret bound as well.

5 Online distorted greedy algorithm

In this section, we study the setting where the utility function is submodular. In this case, we have $f_t(S_t) = 1 - \prod_{i \in S_t} \ell_{i,t} = 1 - \prod_{i \in S_t} \ell(b_{i,t}, r_t)$. The problem in this setting can be written as an online monotone submodular maximization problem subject to a cardinality constraint of size $m$. [21] proposed the online greedy algorithm for this problem, whose $(1 - \frac{1}{e})$-regret is bounded by $O(\sqrt{mT \ln K})$. The algorithm works as follows: there are $m$ instantiations $A_1, \dots, A_m$ of no-regret algorithms for the 1-expert problem. At each round $t \in [T]$, $A_i$ selects an expert $v_{i,t} \in [K]$ and the set $S_t = \{v_{1,t}, \dots, v_{m,t}\}$ is selected. $A_i$ then observes the reward $f_t(v_{i,t} \mid S_{i-1,t})$, where $S_{j,t} = \{v_{1,t}, \dots, v_{j,t}\}$ and $S_{0,t} = \emptyset$.

Inspired by Algorithm 1 of [13] for online monotone submodular maximization subject to a matroid constraint, we propose the online distorted greedy algorithm (Algorithm 1) for the special case of a cardinality constraint. The algorithm is similar to the online greedy algorithm of [21] discussed above. However, in the online distorted greedy algorithm, after choosing the set $S_t$ and observing the function $f_t$, we first compute the modular lower bound $h_t$ defined as $h_t(S) = \sum_{i \in S} f_t(i \mid [K] \setminus \{i\})$ and set $g_t = f_t - h_t$. Note that $g_t$ is monotone submodular as well. The reward of $A_i$ for choosing $v_{i,t}$ at round $t$ is $(1 - \frac{1}{m})^{m-i} g_t(v_{i,t} \mid S_{i-1,t}) + h_t(v_{i,t})$ (in case $v_{i,t} \in S_{i-1,t}$, we repeatedly take samples from the weight distribution of $A_i$ over the experts until we observe an expert not in the set $S_{i-1,t}$).
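The following sketch (our own illustration; helper names are not from the paper) computes, for the submodular utility $f_t(S) = 1 - \prod_{i \in S}\ell_{i,t}$, the marginal gain, the modular lower bound $h_t$, and the distorted reward that is fed back to sub-algorithm $A_i$.

```python
import numpy as np

def marginal_gain(losses, j, base):
    """f_t(j | A) for f_t(S) = 1 - prod_{i in S} losses[i]:
    adding expert j removes a (1 - losses[j]) fraction of the remaining product."""
    prod_base = float(np.prod(losses[list(base)])) if len(base) > 0 else 1.0
    return (1.0 - losses[j]) * prod_base

def distorted_reward(losses, j, base, i, m):
    """Reward fed to sub-algorithm A_i by the online distorted greedy algorithm:
    (1 - 1/m)^(m - i) * g_t(j | base) + h_t({j}),  where g_t = f_t - h_t and
    h_t({j}) = f_t(j | [K] \\ {j})."""
    K = len(losses)
    h_j = marginal_gain(losses, j, [k for k in range(K) if k != j])
    g_gain = marginal_gain(losses, j, base) - h_j
    return (1.0 - 1.0 / m) ** (m - i) * g_gain + h_j

# Toy usage: K = 4 experts with realized quadratic losses at round t.
losses = np.array([0.2, 0.5, 0.9, 0.4])
print(distorted_reward(losses, j=3, base=[0], i=1, m=2))
```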
This technique was first introduced by [22] for the corresponding offline problem, and it allows us to obtain $(1 - \frac{c_f}{e})$-regret bounds (where $f = \sum_{t=1}^T f_t$) with the optimal approximation ratio (optimality was shown by [23]), instead of the $(1 - \frac{1}{e})$-regret bounds of the online greedy algorithm. One particular choice for $\{A_i\}_{i=1}^m$ is the WSU algorithm of [10]. We summarize the result for this choice in the theorem below.

Theorem 6. For all $i \in [m]$, let $A_i$ be an instantiation of the WSU algorithm of [10] for the 1-expert problem, and denote $f = \sum_{t=1}^T f_t$. The online distorted greedy algorithm applied to the $m$-experts problem obtains the following regret bound:
$\Big(1 - \frac{c_f}{e}\Big)\sum_{t=1}^T f_t(\mathrm{OPT}) - \mathbb{E}\Big[\sum_{t=1}^T f_t(S_t)\Big] \le \sum_{i=1}^m R^{(i)}_T,$
where $\mathrm{OPT}$ is the best fixed set of size $m$ in hindsight and $R^{(i)}_T$ is the regret of algorithm $A_i$. If $c_f = 0$, the algorithm is incentive-compatible.

If $c_f \in (0, 1]$, we can use the online greedy algorithm of [21] instead to ensure incentive compatibility while maintaining similar bounds for the $(1 - \frac{1}{e})$-regret. [10] provided an $O(\sqrt{T \ln K})$ regret bound for the WSU algorithm. If we plug this bound into the result of Theorem 6, the regret bound of the online distorted greedy algorithm is $O(m\sqrt{T \ln K})$. However, this bound depends linearly on $m$, which is suboptimal. To remedy this issue, we first provide an adaptive regret bound for the WSU algorithm below.

Theorem 7. The regret of the WSU algorithm of [10] is bounded by $O\big(\sqrt{|L_T| \ln K} + \ln K\big)$, where $|L_T|$ is the cumulative absolute loss of the algorithm.

Note that the bound in Theorem 7 adapts to the hardness of the problem. In the worst case, we have $|L_T| = T$ and recover the $O(\sqrt{T \ln K})$ bound proved in [10]. However, for smaller values of $|L_T|$, the bound in Theorem 7 improves upon that of [10]. For the non-strategic setting with modular utilities, [7] proposed the Component Hedge (CH) algorithm and obtained a regret bound of $O\big(\sqrt{\ell^* m \ln\frac{K}{m}} + m \ln\frac{K}{m}\big)$, where $\ell^*$ is the cumulative loss of the best fixed set in hindsight. They also gave a matching lower bound for this problem. Applying the same analysis as in Theorem 7 to the naive approach with $\binom{K}{m}$ meta-experts, we can show that the regret bound of WSU matches the aforementioned lower bound.

We can use the above adaptive regret bound to improve the regret bound of the online distorted greedy algorithm. First, note that at round $t \in [T]$, the sum of the absolute values of the losses incurred by $\{A_i\}_{i=1}^m$ is bounded as follows:
$\sum_{i=1}^m \Big|\Big(1 - \frac{1}{m}\Big)^{m-i} g_t(v_{i,t} \mid S_{i-1,t}) + h_t(v_{i,t})\Big| \le \sum_{i=1}^m \underbrace{\big(g_t(v_{i,t} \mid S_{i-1,t}) + h_t(v_{i,t})\big)}_{= f_t(v_{i,t} \mid S_{i-1,t})} = f_t(S_t) \le 1.$
Therefore, if we denote the cumulative absolute loss incurred by $A_i$ by $|L^{(i)}_T|$, we have $\sum_{i=1}^m |L^{(i)}_T| \le T$.

Algorithm 1 Online distorted greedy algorithm
Initialization: Initialize $m$ instances $A_1, \dots, A_m$ of online algorithms for the 1-expert problem.
for $t = 1, \dots, T$ do
  for $i = 1, \dots, m$ do
    $A_i$ chooses the expert $v_{i,t}$ and $S_{i,t} = \{v_{1,t}, \dots, v_{i,t}\}$.
  end for
  Play the set $S_t = S_{m,t} = \{v_{1,t}, \dots, v_{m,t}\}$ and observe $f_t$.
  Compute the modular function $h_t(S) = \sum_{i \in S} f_t(i \mid [K] \setminus \{i\})$ and set $g_t = f_t - h_t$.
  for $i = 1, \dots, m$ do
    Feed back the cost $-\big[(1 - \frac{1}{m})^{m-i} g_t(v_{i,t} \mid S_{i-1,t}) + h_t(v_{i,t})\big]$ to $A_i$.
  end for
end for

Using the result of Theorem 7, the regret bound of the online distorted greedy algorithm is $O\big(\sum_{i=1}^m \sqrt{|L^{(i)}_T| \ln K} + m \ln K\big)$. Subject to $\sum_{i=1}^m |L^{(i)}_T| \le T$, the sum $\sum_{i=1}^m \sqrt{|L^{(i)}_T|}$ is maximized when $|L^{(i)}_T| = \frac{T}{m}$ for all $i \in [m]$. Thus, in the worst case, the expected $(1 - \frac{c_f}{e})$-regret bound is $O(\sqrt{mT \ln K} + m \ln K)$.
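For completeness, here is a compact sketch (with a function signature of our own choosing) of the multiplicative WSU update used by each instance $A_i$, following the update rule recalled in the proofs of Theorems 1 and 7.

```python
import numpy as np

def wsu_update(weights: np.ndarray, losses: np.ndarray, eta: float) -> np.ndarray:
    """One WSU step: pi_{i,t+1} = pi_{i,t} * (1 - eta * (l_{i,t} - <pi_t, l_t>)).
    The relative losses sum to zero under pi_t, so the weights stay normalized."""
    relative = losses - weights @ losses      # L_{i,t} = l_{i,t} - <pi_t, l_t>
    return weights * (1.0 - eta * relative)

# Toy usage: K = 4 experts, uniform start, quadratic losses at one round.
pi = np.full(4, 0.25)
losses = np.array([0.04, 0.36, 0.81, 0.16])
pi = wsu_update(pi, losses, eta=0.3)
print(pi, pi.sum())   # still a probability vector (sums to 1)
```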
While we focused on submodular utility functions in this section, we can also apply the online distorted greedy algorithm to the setting with modular utilities. In this case, we have $c_f = 0$, and therefore the algorithm is incentive-compatible and its 1-regret is bounded by $O(\sqrt{mT \ln K} + m \ln K)$. Unlike the FTPL algorithm, which is only approximately incentive-compatible, the online distorted greedy algorithm applied to modular utility functions is incentive-compatible. However, this comes at the price of an extra $\sqrt{m}$ factor in the regret bound.

6 Experiments

In this section, we evaluate the performance of our proposed algorithms for modular utility functions on a publicly available dataset from a FiveThirtyEight forecasting competition (https://github.com/fivethirtyeight/nfl-elo-game) in which forecasters make predictions about the match outcomes of the 2022–2023 National Football League (NFL) season. Before each match, FiveThirtyEight provides information on the past performance of the two opposing teams. Forecasters observe this information and make probabilistic predictions about the likelihood of each team winning the match. Considering that there are 284 different matches in the dataset, we set $T = 284$. Out of the 9982 forecasters who participated in this competition, only 274 made predictions for every single match. We consider two cases: $K = 20$ and $K = 100$. To reduce variance, for each case, we sample 5 groups of $K$ forecasters from the 274 and run FTPL and Online Distorted Greedy (ODG) 10 times. We set $m = 5$. Given that FTPL is only approximately incentive-compatible and, according to the result of Theorem 3, the reported beliefs can be up to $\frac{2B}{\eta - 2B}$ away from the true beliefs, we add a uniformly random value in the range $[-\frac{2B}{\eta - 2B}, \frac{2B}{\eta - 2B}]$ to the true beliefs to model this fact. We use the standard Laplace distribution as the perturbation for FTPL; hence, we set $B = 1$. For both algorithms, the step size $\eta$ is chosen according to our theoretical results.

In Figure 1, we plot the average regret
$\frac{1}{t}\,\mathbb{E}\Big[\max_{S \subseteq [K]: |S| = m} \sum_{\tau=1}^t f_\tau(S) - \sum_{\tau=1}^t f_\tau(S_\tau)\Big]$
of the two algorithms over time (along with error bands corresponding to the 20th and 80th percentiles), together with that of the FiveThirtyEight aggregated predictions, for the $K = 20$ and $K = 100$ settings. Note that while our proposed algorithms choose $m$ predictions at each round $t \in [T]$, the FiveThirtyEight aggregated prediction is a single scalar value. The plots suggest that while the regret of all three algorithms converges to zero as $t$ gets larger, both of our proposed algorithms have superior performance compared to that of the FiveThirtyEight predictions.

Figure 1: Running average of regret over time for (a) K = 20 and (b) K = 100.
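For the modular utility, the running average regret plotted in Figure 1 can be computed as in the sketch below (our own illustration): the best fixed set of size $m$ is simply the set of $m$ experts with the smallest cumulative loss.

```python
import numpy as np

def average_regret_curve(losses: np.ndarray, chosen_sets, m: int) -> np.ndarray:
    """losses: (T, K) true quadratic losses; chosen_sets: length-T list of index arrays.
    Returns, for each t, (1/t) * [ max_{|S|=m} sum_{tau<=t} f_tau(S) - sum_{tau<=t} f_tau(S_tau) ]
    with the modular utility f_tau(S) = 1 - mean(losses[tau, S])."""
    T, K = losses.shape
    cum_alg, curve = 0.0, np.empty(T)
    cum_losses = np.zeros(K)
    for t in range(T):
        cum_alg += 1.0 - losses[t, chosen_sets[t]].mean()
        cum_losses += losses[t]
        best = np.sort(cum_losses)[:m]           # m smallest cumulative losses
        best_utility = (t + 1) - best.mean()     # sum_{tau<=t} f_tau(S*) for that set
        curve[t] = (best_utility - cum_alg) / (t + 1)
    return curve

# Toy usage with random losses and a fixed (arbitrary) choice each round.
rng = np.random.default_rng(1)
losses = rng.random((50, 10))
sets = [np.arange(3)] * 50
print(average_regret_curve(losses, sets, m=3)[-1])
```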
7 Conclusion and future directions

In this paper, we studied the $m$-experts problem, a generalization of the standard binary prediction with expert advice problem where, at each round $t \in [T]$: 1) the algorithm is allowed to pick $m \ge 1$ experts and its utility is a modular or submodular function of the chosen experts, and 2) the experts are strategic and may misreport their true beliefs about the $t$-th event to increase their probability of being chosen at the next round (round $t+1$). The goal is to design algorithms that incentivize experts to report truthfully (i.e., incentive-compatible) and obtain sublinear regret bounds with respect to the true beliefs of the experts (i.e., no-regret). We proposed two algorithmic frameworks for this problem. In Section 4, we introduced the Follow the Perturbed Leader (FTPL) algorithm. Under a certain condition on the noise distribution (Condition 1), this algorithm is approximately incentive-compatible and achieves sublinear regret bounds for modular utility functions. Moreover, in Section 5, we proposed the online distorted greedy algorithm, which applies to both modular and submodular utility functions. This algorithm is incentive-compatible, but its regret bound is slightly worse than that of FTPL.

This work could be extended in several interesting directions. First, none of the algorithms discussed here or in prior works have taken into account the properties of the quadratic loss function. In particular, this loss function is exp-concave, and [4] showed that for exp-concave loss functions in the 1-expert problem, the regret bound can be improved to $O(\ln K)$ using the Hedge algorithm, albeit without the incentive compatibility property. Designing incentive-compatible algorithms with similarly improved regret bounds for the 1-expert and $m$-experts problems remains open. In order to obtain the $O(\ln K)$ regret bound for the 1-expert problem (with squared loss) using the Hedge algorithm, the algorithm makes a single prediction $\sum_{i=1}^K \pi_{i,t} p_{i,t}$ at round $t \in [T]$, and its loss is $(\sum_{i=1}^K \pi_{i,t} p_{i,t} - r_t)^2$. In other words, choosing an expert $i \in [K]$ with probability $\pi_{i,t}$ at round $t$ is not good enough to obtain the improved $O(\ln K)$ regret bound. Moving on to the $m$-experts problem, the main challenge for obtaining regret bounds better than $O(\sqrt{T})$ is to decide how to aggregate the $K$ predictions into $m$ scalar values. Second, while we focused on the particular choice of quadratic loss functions, the setting could be extended to other loss functions as well. It is not clear to what extent our results hold when moving beyond the quadratic loss function. Finally, [16] introduced a framework for augmenting online algorithms for various online problems with predictions or pieces of advice. An interesting future research direction is to extend this setting to the case where the predictions are given by strategic experts and to study incentive compatibility guarantees for online problems beyond the $m$-experts problem.

Acknowledgments and Disclosure of Funding

This work was supported in part by the following grants: NSF TRIPODS II 2023166, NSF CIF 2212261, NSF CIF 2007036, NSF AI Inst 2112085.

References

[1] Volodimir G Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990. Association for Computing Machinery, Inc.
[2] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.
[3] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[4] Jyrki Kivinen and Manfred K Warmuth. Averaging expert predictions. In European Conference on Computational Learning Theory, pages 153–167. Springer, 1999.
[5] Philip E Tetlock and Dan Gardner. Superforecasting: The Art and Science of Prediction. Random House, 2016.
[6] Manfred K Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.
[7] Wouter M Koolen, Manfred K Warmuth, Jyrki Kivinen, et al. Hedging structured concepts. In COLT, pages 93–105. Citeseer, 2010.
[8] Alon Cohen and Tamir Hazan. Following the perturbed leader for online structured learning.
In International Conference on Machine Learning, pages 1034–1042. PMLR, 2015.
[9] Samrat Mukhopadhyay, Sourav Sahoo, and Abhishek Sinha. k-experts: online policies and fundamental limits. In International Conference on Artificial Intelligence and Statistics, pages 342–365. PMLR, 2022.
[10] Rupert Freeman, David M Pennock, Chara Podimata, and Jennifer Wortman Vaughan. No-regret and incentive-compatible online learning. arXiv preprint arXiv:2002.08837, 2020.
[11] Tim Roughgarden and Okke Schrijvers. Online prediction with selfish experts. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1300–1310. Curran Associates, Inc., 2017.
[12] Rafael Frongillo, Robert Gomez, Anish Thilagar, and Bo Waggoner. Efficient competitions and online learning with strategic forecasters. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 479–496, 2021.
[13] Nicholas Harvey, Christopher Liaw, and Tasuku Soma. Improved algorithms for online submodular maximization via first-order regret bounds. Advances in Neural Information Processing Systems, 33:123–133, 2020.
[14] Michele Conforti and Gérard Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.
[15] Mark D Reid and Robert C Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 897–904, 2009.
[16] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. Journal of the ACM (JACM), 68(4):1–25, 2021.
[17] Nicolas S Lambert, John Langford, Jennifer Wortman, Yiling Chen, Daniel Reeves, Yoav Shoham, and David M Pennock. Self-financed wagering mechanisms for forecasting. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 170–179, 2008.
[18] Nicolas S Lambert, John Langford, Jennifer Wortman Vaughan, Yiling Chen, Daniel M Reeves, Yoav Shoham, and David M Pennock. An axiomatic characterization of wagering mechanisms. Journal of Economic Theory, 156:389–416, 2015.
[19] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[20] Jacob D Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. Advances in Neural Information Processing Systems, 28, 2015.
[21] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. Advances in Neural Information Processing Systems, 21, 2008.
[22] Moran Feldman. Guess free maximization of submodular and linear sums. Algorithmica, 83(3):853–878, 2021.
[23] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Mathematics of Operations Research, 42(4):1197–1218, 2017.

A Missing proofs

A.1 Proof of Theorem 1

At time $t = 1$, we set $\pi_{S,1} = \frac{1}{\binom{K}{m}}$ for each set $S$ with $|S| = m$ and update these weights as follows:
$\pi_{S,t+1} = \pi_{S,t}(1 - \eta L_{S,t}), \quad \text{where } L_{S,t} = \ell_{S,t} - \sum_{S': |S'| = m} \pi_{S',t}\,\ell_{S',t} \text{ and } \ell_{S,t} = \frac{1}{m}\sum_{i \in S} \ell_{i,t}.$
If we denote the optimal set by $S^*$, we can write:
$1 \ge \pi_{S^*,T+1} = \frac{1}{\binom{K}{m}} \prod_{t=1}^T (1 - \eta L_{S^*,t}).$
Taking the natural logarithm of both sides, we have:
$0 \ge -\ln\binom{K}{m} + \sum_{t=1}^T \ln(1 - \eta L_{S^*,t}).$
We can use the inequality $\ln(1 - x) \ge -x - x^2$ for $x \le \frac{1}{2}$ (we choose $\eta$ later such that this inequality holds) to obtain:
$0 \ge -\ln\binom{K}{m} - \eta\sum_{t=1}^T L_{S^*,t} - \eta^2\sum_{t=1}^T L^2_{S^*,t}.$
Rearranging the terms and dividing both sides by $\eta$, we have:
$-\sum_{t=1}^T L_{S^*,t} \le \frac{\ln\binom{K}{m}}{\eta} + \eta\sum_{t=1}^T L^2_{S^*,t}.$
Using the fact that $R_T = -\sum_{t=1}^T L_{S^*,t}$, the inequality $\binom{K}{m} \le (\frac{Ke}{m})^m$, and $|L_{S,t}| \le 1$ for all $S, t$, we can write:
$R_T \le \frac{m\ln(\frac{Ke}{m})}{\eta} + \eta T.$
Setting $\eta = \sqrt{\frac{m\ln(\frac{Ke}{m})}{T}}$, we obtain the regret bound $O\big(\sqrt{mT\ln\frac{Ke}{m}}\big)$. Note that the assumption $\eta|L_{S^*,t}| \le \frac{1}{2}$ (used in the proof) holds if $T \ge 4m\ln(\frac{Ke}{m})$. Therefore, we assume $T$ is large enough to satisfy this inequality.

A.2 Proof of Theorem 2

We show that if $B = \max_z |\nu'(z)|$, the hazard rate of the distribution is bounded above by $B$ as well. To see this, fix $x \ge 0$. Note that
$\frac{f_{\mathcal{D}}(-x)}{1 - F_{\mathcal{D}}(-x)} = \frac{f_{\mathcal{D}}(x)}{1 - F_{\mathcal{D}}(-x)} \le \frac{f_{\mathcal{D}}(x)}{1 - F_{\mathcal{D}}(x)}$
for a symmetric zero-mean distribution, and therefore we only need to bound the hazard rate at $x > 0$ to bound $\mathrm{haz}_{\mathcal{D}}$. We can write
$\frac{f_{\mathcal{D}}(x)}{1 - F_{\mathcal{D}}(x)} = \frac{f_{\mathcal{D}}(x)}{\int_x^\infty f_{\mathcal{D}}(z)\,dz} = \frac{\exp(-\nu(x))}{\int_x^\infty \exp(-\nu(z))\,dz} = \frac{1}{\int_x^\infty \exp(\nu(x) - \nu(z))\,dz} \overset{(a)}{=} \frac{1}{\int_x^\infty \exp((x - z)\nu'(z_x))\,dz} \overset{(b)}{\le} \frac{1}{\int_x^\infty \exp(B(x - z))\,dz} = \frac{1}{1/B} = B,$
where we have used the mean value theorem in (a), and $z_x$ lies on the line segment between $x$ and $z$. Also, note that since $x, z > 0$, $z_x > 0$ holds as well, and therefore $\nu'(z_x) > 0$. We have used this fact along with $x - z < 0$ to obtain (b). Therefore, we have $\mathrm{haz}_{\mathcal{D}} \le B$.

A.3 Proof of Theorem 3

Let us fix a round $t \in [T]$ and an expert $i \in [K]$. For $j \ne i$, denote by $x^{(t)}_{j,0} = \sum_{s=1}^{t-1} \ell_{j,s} + p^2_{j,t} + \eta\gamma_{j,t}$ the total loss of expert $j$ up to round $t$ plus noise if $r_t = 0$. Similarly, we define $x^{(t)}_{j,1} = \sum_{s=1}^{t-1} \ell_{j,s} + (1 - p_{j,t})^2 + \eta\gamma_{j,t}$. Define $X^{(t)}_0$ ($X^{(t)}_1$) as the $m$-th smallest value in $\{x^{(t)}_{j,0}\}_{j \ne i}$ ($\{x^{(t)}_{j,1}\}_{j \ne i}$). Note that $|X^{(t)}_0 - X^{(t)}_1| \le 1$ holds because for each $j \ne i$, we have $|x^{(t)}_{j,0} - x^{(t)}_{j,1}| \le 1$. Also, let $L = \sum_{s=1}^{t-1} \ell_{i,s}$. If $r_t = 0$, expert $i$ is chosen at round $t+1$ if and only if $L + p^2_{i,t} + \eta\gamma_{i,t} \le X^{(t)}_0$. Similarly, for the case $r_t = 1$, expert $i$ is chosen if and only if $L + (1 - p_{i,t})^2 + \eta\gamma_{i,t} \le X^{(t)}_1$. Rearranging the terms, we can write:
$\eta\gamma_{i,t} + L - X^{(t)}_0 \le -p^2_{i,t} \quad \text{if } r_t = 0, \qquad \eta\gamma_{i,t} + L - X^{(t)}_1 \le -(1 - p_{i,t})^2 \quad \text{if } r_t = 1.$
Given that $f(\gamma_{i,t}) \propto \exp(-\nu(\gamma_{i,t}))$, if we define $Y_0 = \eta\gamma_{i,t} + L - X^{(t)}_0$ and $Y_1 = \eta\gamma_{i,t} + L - X^{(t)}_1$, we have:
$f_0(Y_0) \propto \exp\Big(-\nu\Big(\frac{Y_0 - (L - X^{(t)}_0)}{\eta}\Big)\Big), \qquad f_1(Y_1) \propto \exp\Big(-\nu\Big(\frac{Y_1 - (L - X^{(t)}_1)}{\eta}\Big)\Big).$
Therefore, if $F_0$ and $F_1$ denote the corresponding cdfs, we can write the probability of expert $i$ being chosen at round $t+1$ as $F_0(-p^2_{i,t})$ and $F_1(-(1 - p_{i,t})^2)$ for the cases $r_t = 0$ and $r_t = 1$, respectively. Putting the above results together, we can write the expected utility of expert $i$ (according to her belief $b_{i,t}$) at round $t$ as follows:
$\mathbb{E}_{r_t \sim \mathrm{Bernoulli}(b_{i,t})}[U_{i,t}] = (1 - b_{i,t})F_0(-p^2_{i,t}) + b_{i,t}F_1(-(1 - p_{i,t})^2).$
Taking the derivative and setting it to zero, we have:
$\frac{d}{dp_{i,t}}\mathbb{E}_{r_t \sim \mathrm{Bernoulli}(b_{i,t})}[U_{i,t}] = -2p_{i,t}(1 - b_{i,t})f_0(-p^2_{i,t}) + 2(1 - p_{i,t})b_{i,t}f_1(-(1 - p_{i,t})^2) = 0.$
Dividing by $2f_1(-(1 - p_{i,t})^2)$, this is equivalent to
$-p_{i,t}(1 - b_{i,t})\frac{f_0(-p^2_{i,t})}{f_1(-(1 - p_{i,t})^2)} + (1 - p_{i,t})b_{i,t} = 0,$
$-p_{i,t}(1 - b_{i,t})\exp\Big(\nu\Big(\frac{-(1 - p_{i,t})^2 - (L - X^{(t)}_1)}{\eta}\Big) - \nu\Big(\frac{-p^2_{i,t} - (L - X^{(t)}_0)}{\eta}\Big)\Big) + (1 - p_{i,t})b_{i,t} = 0.$
Denote the exponential term above by $A$, i.e.,
$A = \exp\Big(\nu\Big(\frac{-(1 - p_{i,t})^2 - (L - X^{(t)}_1)}{\eta}\Big) - \nu\Big(\frac{-p^2_{i,t} - (L - X^{(t)}_0)}{\eta}\Big)\Big).$
Ideally, we want $A$ to be as close to 1 as possible, because if $A = 1$, then $p^*_{i,t} = b_{i,t}$ and the algorithm would be incentive-compatible. In general, we have:
$p^*_{i,t} = \frac{b_{i,t}}{b_{i,t} + (1 - b_{i,t})A}.$
Note that this is not a closed-form solution for $p^*_{i,t}$, because $A$ is also a function of $p_{i,t}$. We can observe that for $p_{i,t} < p^*_{i,t}$, the derivative of the utility function is positive, and for $p_{i,t} > p^*_{i,t}$, the derivative is negative.
We can bound $A$ as follows. Let $g(u) = f\big(\frac{-p^2_{i,t} - a - u(a' - a)}{\eta}\big)$, where $a = L - X^{(t)}_1 + 1 - 2p_{i,t}$, $a' = L - X^{(t)}_0$, and $f(z) = \exp(-\nu(z))$, so that $\ln A = \ln g(1) - \ln g(0)$. Taking the derivative of $g$ with respect to $u$, we obtain:
$g'(u) = \frac{a' - a}{\eta}\,\nu'\Big(\frac{-p^2_{i,t} - a - u(a' - a)}{\eta}\Big)\,f\Big(\frac{-p^2_{i,t} - a - u(a' - a)}{\eta}\Big).$
Therefore, we can write:
$\ln A = \ln f\Big(\frac{-p^2_{i,t} - a'}{\eta}\Big) - \ln f\Big(\frac{-p^2_{i,t} - a}{\eta}\Big) = \ln g(1) - \ln g(0) = \int_0^1 \frac{a' - a}{\eta}\,\nu'\Big(\frac{-p^2_{i,t} - a - u(a' - a)}{\eta}\Big)\,du.$
We have $a' - a = X^{(t)}_1 - X^{(t)}_0 + 2p_{i,t} - 1$; therefore, $-2 \le a' - a \le 2$ holds. Moreover, we have $-B \le \nu'\big(\frac{-p^2_{i,t} - a - u(a' - a)}{\eta}\big) \le B$ due to Condition 1. Putting the above results together, the following holds:
$-\frac{2B}{\eta} \le \ln A \le \frac{2B}{\eta}.$
Therefore, $A \in [\exp(-\frac{2B}{\eta}), \exp(\frac{2B}{\eta})]$. Let $h(p_{i,t}) = p_{i,t} - p^*_{i,t} = p_{i,t} - \frac{b_{i,t}}{b_{i,t} + (1 - b_{i,t})A}$. Since $p^*_{i,t} \in [0,1]$, we have $h(0) < 0$ and $h(1) > 0$. Taking the derivative of $h$ with respect to $p_{i,t}$, we have:
$\frac{dh}{dp_{i,t}} = 1 + \frac{b_{i,t}(1 - b_{i,t})A\Big(\frac{2(1 - p_{i,t})}{\eta}\nu'\big(\frac{-(1 - p_{i,t})^2 - (L - X^{(t)}_1)}{\eta}\big) + \frac{2p_{i,t}}{\eta}\nu'\big(\frac{-p^2_{i,t} - (L - X^{(t)}_0)}{\eta}\big)\Big)}{(b_{i,t} + (1 - b_{i,t})A)^2}.$
Using Condition 1, we have $\frac{dh}{dp_{i,t}} \ge 1 - \frac{2Bb_{i,t}(1 - b_{i,t})A}{\eta(b_{i,t} + (1 - b_{i,t})A)^2}$. Therefore, we can write:
$\frac{dh}{dp_{i,t}} \ge 1 - \frac{2Bb_{i,t}(1 - b_{i,t})A}{\eta(b_{i,t} + (1 - b_{i,t})A)^2} = 1 - \frac{2Bb_{i,t}(1 - b_{i,t})A}{\eta\big(b^2_{i,t} + (1 - b_{i,t})^2A^2 + 2b_{i,t}(1 - b_{i,t})A\big)} \ge 1 - \frac{2Bb_{i,t}(1 - b_{i,t})A}{2\eta b_{i,t}(1 - b_{i,t})A} = 1 - \frac{B}{\eta} > 0.$
We choose $\eta$ later such that $\eta > B$ for the last inequality to hold. So $h$ is strictly increasing, there is exactly one solution $p^*_{i,t}$, and the derivative of the expected utility is positive below it and negative above it. If we replace $A$ with a larger quantity in the expression for $p^*_{i,t}$, the value decreases, and vice versa. Therefore, we can write:
$p^*_{i,t} \le \frac{b_{i,t}}{b_{i,t} + (1 - b_{i,t})\exp(-\frac{2B}{\eta})} = \frac{b_{i,t}}{\exp(-\frac{2B}{\eta}) + b_{i,t}(1 - \exp(-\frac{2B}{\eta}))} \le \frac{b_{i,t}}{1 - \frac{2B}{\eta}} = \frac{\eta b_{i,t}}{\eta - 2B} = b_{i,t} + \frac{2Bb_{i,t}}{\eta - 2B} \le b_{i,t} + \frac{2B}{\eta - 2B}.$
Similarly, we can lower bound $p^*_{i,t}$ as follows:
$p^*_{i,t} \ge \frac{b_{i,t}}{b_{i,t} + (1 - b_{i,t})\exp(\frac{2B}{\eta})} = \frac{b_{i,t}}{\exp(\frac{2B}{\eta}) - b_{i,t}(\exp(\frac{2B}{\eta}) - 1)} \ge b_{i,t}\exp\Big(-\frac{2B}{\eta}\Big) \ge b_{i,t}\Big(1 - \frac{2B}{\eta}\Big) \ge b_{i,t} - \frac{2B}{\eta - 2B}.$
Putting the above results together, we conclude that $|p^*_{i,t} - b_{i,t}| \le \frac{2B}{\eta - 2B}$ holds.

A.4 Proof of Theorem 4

Let $\eta > 0$ and let $\mathcal{X}$ be the set of the $\binom{K}{m}$ indicator vectors of the feasible sets of size $m$ for this problem. Denote by $L_t \in \mathbb{R}^K$ the partial sum of losses of all experts before round $t$, i.e., $[L_t]_i = \sum_{s=1}^{t-1} \ell_{i,s}$. The update rule of FTPL at round $t \in [T]$ is $\pi_t = \arg\min_{x \in \mathcal{X}} \langle x, L_t + \eta\gamma_t \rangle$. First, we analyze the expected regret of the algorithm. Define $\phi_t(\theta) = \mathbb{E}_{\gamma_t \sim \mathcal{D}^K}[\min_{x \in \mathcal{X}} \langle x, \theta + \eta\gamma_t \rangle]$. Then we have $\nabla\phi_t(L_t) = \mathbb{E}_{\gamma_t \sim \mathcal{D}^K}[\arg\min_{x \in \mathcal{X}} \langle x, L_t + \eta\gamma_t \rangle] = \mathbb{E}_{\gamma_t \sim \mathcal{D}^K}[\pi_t]$ and $\langle \nabla\phi_t(L_t), \ell_t \rangle = \mathbb{E}_{\gamma_t \sim \mathcal{D}^K}[\langle \pi_t, \ell_t \rangle]$. Using the Taylor expansion of $\phi_t$, we can write:
$\phi_t(L_{t+1}) = \phi_t(L_t) + \langle \nabla\phi_t(L_t), \ell_t \rangle + \frac{1}{2}\langle \ell_t, \nabla^2\phi_t(\tilde{L}_t)\ell_t \rangle = \phi_t(L_t) + \mathbb{E}_{\gamma_t \sim \mathcal{D}^K}[\langle \pi_t, \ell_t \rangle] + \frac{1}{2}\langle \ell_t, \nabla^2\phi_t(\tilde{L}_t)\ell_t \rangle,$
where $\tilde{L}_t$ lies on the line segment between $L_t$ and $L_{t+1} = L_t + \ell_t$. By definition, $\phi_t$ is a minimum of linear functions and is therefore concave. Thus, $\nabla^2\phi_t(\tilde{L}_t)$ is negative semidefinite. Note that since $\gamma_t$ simply consists of $K$ i.i.d. samples of the noise distribution for all $t \in [T]$, we have $\phi_t(\theta) = \phi(\theta) = \mathbb{E}_{\gamma \sim \mathcal{D}^K}[\min_{x \in \mathcal{X}} \langle x, \theta + \eta\gamma \rangle]$. Taking the sum over $t \in [T]$ and rearranging the terms, we obtain:
$\mathbb{E}\Big[\sum_{t=1}^T \langle \pi_t, \ell_t \rangle\Big] = \phi(L_{T+1}) - \phi(\underbrace{L_1}_{=0}) - \frac{1}{2}\sum_{t=1}^T \langle \ell_t, \nabla^2\phi_t(\tilde{L}_t)\ell_t \rangle.$
Using Jensen's inequality, we can write:
$\phi(L_{T+1}) = \mathbb{E}_{\gamma \sim \mathcal{D}^K}\big[\min_{x \in \mathcal{X}} \langle x, L_{T+1} + \eta\gamma \rangle\big] \le \min_{x \in \mathcal{X}} \mathbb{E}_{\gamma \sim \mathcal{D}^K}[\langle x, L_{T+1} + \eta\gamma \rangle] = \min_{x \in \mathcal{X}} \langle x, L_{T+1} \rangle.$
Therefore, we can rearrange the terms and bound the expected regret as follows:
$\mathbb{E}\Big[\frac{1}{m}\sum_{t=1}^T \langle \pi_t, \ell_t \rangle\Big] - \frac{1}{m}\min_{x \in \mathcal{X}} \langle x, L_{T+1} \rangle \le -\frac{1}{m}\phi(0) - \frac{1}{2m}\sum_{t=1}^T \langle \ell_t, \nabla^2\phi_t(\tilde{L}_t)\ell_t \rangle.$
To bound the first term on the right-hand side, we can use the fact that $\mathcal{D}$ is symmetric to write:
$\phi(0) = \mathbb{E}_{\gamma \sim \mathcal{D}^K}\big[\min_{x \in \mathcal{X}} \langle x, \eta\gamma \rangle\big] = -\eta\,\mathbb{E}_{\gamma \sim \mathcal{D}^K}\big[\max_{x \in \mathcal{X}} \langle x, -\gamma \rangle\big] = -\eta\,\mathbb{E}_{\gamma \sim \mathcal{D}^K}\big[\max_{x \in \mathcal{X}} \langle x, \gamma \rangle\big].$
Now, we move on to bound the second term in the regret bound. Given that the losses of the experts at each round $t \in [T]$ are bounded between 0 and 1, we have $\|\ell_t\|_\infty \le 1$.
Therefore, we have:
$\langle \ell_t, -\nabla^2\phi_t(\tilde{L}_t)\ell_t \rangle \le \sum_{i,j \in [K]} |\nabla^2_{i,j}\phi_t(\tilde{L}_t)|.$
By definition, if we denote $\hat{\pi}(\theta) = \arg\min_{x \in \mathcal{X}} \langle x, \theta \rangle$, we have:
$\nabla^2_{i,j}\phi(\tilde{L}_t) = \frac{1}{\eta}\,\mathbb{E}_{\gamma \sim \mathcal{D}^K}\Big[\hat{\pi}(\tilde{L}_t + \eta\gamma)_i\,\frac{d\nu(\gamma_j)}{d\gamma_j}\Big].$
Given the concavity of $\phi$, the diagonal entries of the Hessian $\nabla^2\phi$ are non-positive. We can also use $\mathbf{1}^\top\hat{\pi}(\theta) = m$ to show that the off-diagonal entries are non-negative and that each row or column of the Hessian sums up to 0. Therefore, we have $\sum_{i,j \in [K]} |\nabla^2_{i,j}\phi(\tilde{L}_t)| = -2\sum_{i=1}^K \nabla^2_{i,i}\phi(\tilde{L}_t)$. Putting the above results together, we can bound the regret as follows:
$\mathbb{E}[R_T] \le \frac{\eta}{m}\,\mathbb{E}_{\gamma \sim \mathcal{D}^K}\big[\max_{x \in \mathcal{X}} \langle x, \gamma \rangle\big] + \frac{1}{m\eta}\sum_{t=1}^T \sum_{i=1}^K \mathbb{E}_{\gamma \sim \mathcal{D}^K}\Big[\hat{\pi}(\tilde{L}_t + \eta\gamma)_i\,\Big|\frac{d\nu(\gamma_i)}{d\gamma_i}\Big|\Big].$
We can bound the first term as follows:
$\mathbb{E}_{\gamma \sim \mathcal{D}^K}\big[\max_{x \in \mathcal{X}} \langle x, \gamma \rangle\big] \le \inf_{s > 0} \frac{1}{s}\ln\sum_{x \in \mathcal{X}} \mathbb{E}_{\gamma \sim \mathcal{D}^K}[\exp(s\langle x, \gamma \rangle)] = \inf_{s > 0} \frac{1}{s}\ln\sum_{x \in \mathcal{X}} \prod_{i=1}^K \mathbb{E}_{\gamma_i \sim \mathcal{D}}[\exp(sx_i\gamma_i)] = \inf_{s > 0} \frac{1}{s}\ln\Big(|\mathcal{X}|\,\big(\mathbb{E}_{\gamma_1 \sim \mathcal{D}}[\exp(s\gamma_1)]\big)^m\Big) = \inf_{s > 0} \Big(\frac{\ln|\mathcal{X}|}{s} + \frac{m}{s}\ln\mathbb{E}_{\gamma_1 \sim \mathcal{D}}[\exp(s\gamma_1)]\Big) \le O\Big(m\ln\frac{K}{m}\Big),$
where the last step uses $|\mathcal{X}| = \binom{K}{m} \le (\frac{Ke}{m})^m$ and the fact that the moment generating function $\mathbb{E}_{\gamma_1 \sim \mathcal{D}}[\exp(s\gamma_1)]$ is finite for a suitable constant $s$ under Condition 1. To bound the second term in the regret bound, we can use Condition 1 to write:
$\sum_{i=1}^K \mathbb{E}_{\gamma \sim \mathcal{D}^K}\Big[\hat{\pi}(\tilde{L}_t + \eta\gamma)_i\,\Big|\frac{d\nu(\gamma_i)}{d\gamma_i}\Big|\Big] \le B\sum_{i=1}^K \mathbb{E}_{\gamma \sim \mathcal{D}^K}[\hat{\pi}(\tilde{L}_t + \eta\gamma)_i] = B\,\mathbb{E}_{\gamma \sim \mathcal{D}^K}\Big[\sum_{i=1}^K \hat{\pi}(\tilde{L}_t + \eta\gamma)_i\Big] = Bm.$
Putting the above results together, we have:
$\mathbb{E}[R_T] \le O\Big(\eta\ln\frac{K}{m} + \frac{BT}{\eta}\Big).$
Therefore, if we set $\eta = \sqrt{\frac{BT}{\ln\frac{K}{m}}}$, the regret bound is $O\big(\sqrt{BT\ln\frac{K}{m}}\big)$. In the proof of Theorem 3, we assumed that $\eta$ is chosen such that $\eta > B$; hence, we assume $T > B\ln\frac{K}{m}$.

A.5 Proof of Theorem 6

For $i \in \{0, 1, \dots, m\}$ and $t \in [T]$, define:
$\phi_{i,t}(S) = \Big(1 - \frac{1}{m}\Big)^{m-i} g_t(S) + h_t(S).$
For all $i \in [m]$, we can write
$\phi_{i,t}(S_{i,t}) - \phi_{i-1,t}(S_{i-1,t}) = \Big(1 - \frac{1}{m}\Big)^{m-i} g_t(S_{i,t}) + h_t(S_{i,t}) - \Big(1 - \frac{1}{m}\Big)^{m-(i-1)} g_t(S_{i-1,t}) - h_t(S_{i-1,t}) = \Big(1 - \frac{1}{m}\Big)^{m-i}\big(g_t(S_{i,t}) - g_t(S_{i-1,t})\big) + h_t(v_{i,t}) + \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i} g_t(S_{i-1,t}).$
Taking the sum over $t \in [T]$, we obtain:
$\sum_{t=1}^T \big(\phi_{i,t}(S_{i,t}) - \phi_{i-1,t}(S_{i-1,t})\big) = \Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T g_t(v_{i,t} \mid S_{i-1,t}) + \sum_{t=1}^T h_t(v_{i,t}) + \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T g_t(S_{i-1,t}).$
If the regret of $A_i$ is bounded above by $R^{(i)}_T$ and the optimal benchmark solution is $\mathrm{OPT} = \{v^*_1, \dots, v^*_m\}$, then for all $j \in [m]$ we can write:
$\sum_{t=1}^T \Big[\Big(1 - \frac{1}{m}\Big)^{m-i} g_t(v^*_j \mid S_{i-1,t}) + h_t(v^*_j)\Big] - \sum_{t=1}^T \Big[\Big(1 - \frac{1}{m}\Big)^{m-i} g_t(v_{i,t} \mid S_{i-1,t}) + h_t(v_{i,t})\Big] \le R^{(i)}_T.$
Averaging over $j \in [m]$ and putting the above inequalities together, we have:
$\sum_{t=1}^T \big(\phi_{i,t}(S_{i,t}) - \phi_{i-1,t}(S_{i-1,t})\big) \ge \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T \sum_{j=1}^m g_t(v^*_j \mid S_{i-1,t}) + \frac{1}{m}\sum_{t=1}^T \sum_{j=1}^m h_t(v^*_j) + \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T g_t(S_{i-1,t}) - R^{(i)}_T.$
We can use the submodularity and monotonicity of $g_t$ to write:
$g_t(\mathrm{OPT}) - g_t(S_{i-1,t}) \le g_t(\mathrm{OPT} \cup S_{i-1,t}) - g_t(S_{i-1,t}) = \sum_{j=1}^m g_t\big(v^*_j \mid S_{i-1,t} \cup \{v^*_1, \dots, v^*_{j-1}\}\big) \le \sum_{j=1}^m g_t(v^*_j \mid S_{i-1,t}).$
We can combine the last two inequalities to write:
$\sum_{t=1}^T \big(\phi_{i,t}(S_{i,t}) - \phi_{i-1,t}(S_{i-1,t})\big) \ge \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T \big(g_t(\mathrm{OPT}) - g_t(S_{i-1,t})\big) + \frac{1}{m}\sum_{t=1}^T \sum_{j=1}^m h_t(v^*_j) + \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T g_t(S_{i-1,t}) - R^{(i)}_T = \frac{1}{m}\Big(1 - \frac{1}{m}\Big)^{m-i}\sum_{t=1}^T g_t(\mathrm{OPT}) + \frac{1}{m}\sum_{t=1}^T h_t(\mathrm{OPT}) - R^{(i)}_T.$
Taking the sum over $i \in [m]$ and using $\frac{1}{m}\sum_{i=1}^m (1 - \frac{1}{m})^{m-i} = 1 - (1 - \frac{1}{m})^m \ge 1 - \frac{1}{e}$, we have:
$\sum_{t=1}^T \sum_{i=1}^m \big(\phi_{i,t}(S_{i,t}) - \phi_{i-1,t}(S_{i-1,t})\big) \ge \Big(1 - \frac{1}{e}\Big)\sum_{t=1}^T g_t(\mathrm{OPT}) + \sum_{t=1}^T h_t(\mathrm{OPT}) - \sum_{i=1}^m R^{(i)}_T.$
On the other hand, we have:
$\sum_{t=1}^T \sum_{i=1}^m \big(\phi_{i,t}(S_{i,t}) - \phi_{i-1,t}(S_{i-1,t})\big) = \sum_{t=1}^T \big(\phi_{m,t}(S_{m,t}) - \phi_{0,t}(S_{0,t})\big) = \sum_{t=1}^T \big(g_t(S_t) + h_t(S_t)\big) = \sum_{t=1}^T f_t(S_t).$
Combining the last two inequalities, we obtain:
$\sum_{t=1}^T f_t(S_t) \ge \Big(1 - \frac{1}{e}\Big)\sum_{t=1}^T g_t(\mathrm{OPT}) + \sum_{t=1}^T h_t(\mathrm{OPT}) - \sum_{i=1}^m R^{(i)}_T = \Big(1 - \frac{1}{e}\Big)\sum_{t=1}^T f_t(\mathrm{OPT}) + \frac{1}{e}\sum_{t=1}^T h_t(\mathrm{OPT}) - \sum_{i=1}^m R^{(i)}_T \ge \Big(1 - \frac{1}{e}\Big)\sum_{t=1}^T f_t(\mathrm{OPT}) + \frac{1 - c_f}{e}\sum_{t=1}^T f_t(\mathrm{OPT}) - \sum_{i=1}^m R^{(i)}_T = \Big(1 - \frac{c_f}{e}\Big)\sum_{t=1}^T f_t(\mathrm{OPT}) - \sum_{i=1}^m R^{(i)}_T,$
where $f = \sum_{t=1}^T f_t$ and we used $\sum_{t=1}^T h_t(\mathrm{OPT}) \ge (1 - c_f)\sum_{t=1}^T f_t(\mathrm{OPT})$, which follows from the definition of the curvature $c_f$ and the submodularity of $f$. Rearranging the terms, we obtain the $(1 - \frac{c_f}{e})$-regret bound of the online distorted greedy algorithm:
$\Big(1 - \frac{c_f}{e}\Big)\sum_{t=1}^T f_t(\mathrm{OPT}) - \sum_{t=1}^T f_t(S_t) \le \sum_{i=1}^m R^{(i)}_T.$
To see why the online greedy algorithm is incentive-compatible, note that at round $t \in [T]$, the loss of expert $j \in [K]$ fed to the instance $A_i$, $i \in [m]$, is:
$-f_t(j \mid S_{i-1,t}) = \prod_{k \in S_{i-1,t}} \ell_{k,t}\,(\ell_{j,t} - 1).$
If we use $\pi^{(i)}_{j,t+1}$ to denote the probability of expert $j$ being chosen at round $t+1$ by $A_i$, we can write:
$\pi^{(i)}_{j,t+1} = \pi^{(i)}_{j,t}\Big(1 - \eta\prod_{k \in S_{i-1,t}} \ell_{k,t}\Big(\ell_{j,t} - \sum_{s=1}^K \pi^{(i)}_{s,t}\ell_{s,t}\Big)\Big).$
$\pi^{(i)}_{j,t+1}$ is linear in $\ell_{j,t}$, and expert $j$ does not have control over the term $\prod_{k \in S_{i-1,t}} \ell_{k,t}$ (because $S_{i-1,t}$ is unknown to the expert). Therefore, to maximize $\pi^{(i)}_{j,t+1}$, expert $j$ can only aim to minimize $\mathbb{E}_{r_t \sim \mathrm{Bern}(b_{j,t})}\big[\ell_{j,t} - \sum_{s=1}^K \pi^{(i)}_{s,t}\ell_{s,t}\big]$. Given that the quadratic loss function is proper, we can conclude that $A_i$ is incentive-compatible. Moreover, since the online greedy algorithm simply outputs the predictions of $\{A_i\}_{i=1}^m$, this algorithm is incentive-compatible as well. For the case with $c_f = 0$, the loss of expert $j \in [K]$ for each instance $A_i$, $i \in [m]$, in both the online greedy algorithm and the online distorted greedy algorithm is simply $\frac{1}{m}(\ell_{j,t} - 1)$, and the same argument holds.

A.6 Proof of Theorem 7

First, assume that the losses are non-negative. Using the update rule of the algorithm, we can write:
$1 \ge \pi_{i^*,T+1} = \pi_{i^*,1}\prod_{t=1}^T (1 - \eta L_{i^*,t}) = \frac{1}{K}\prod_{t: L_{i^*,t} \ge 0} (1 - \eta L_{i^*,t})\prod_{t: L_{i^*,t} < 0} (1 - \eta L_{i^*,t}).$
We can use the inequalities $1 - \eta x \ge (1 - \eta)^x$ for $x \in [0,1]$ and $1 - \eta x \ge (1 + \eta)^{-x}$ for $x \in [-1,0]$ to write:
$1 \ge \frac{1}{K}\prod_{t: L_{i^*,t} \ge 0} (1 - \eta)^{L_{i^*,t}}\prod_{t: L_{i^*,t} < 0} (1 + \eta)^{-L_{i^*,t}},$
$0 \ge -\ln K + \sum_{t: L_{i^*,t} \ge 0} L_{i^*,t}\ln(1 - \eta) - \sum_{t: L_{i^*,t} < 0} L_{i^*,t}\ln(1 + \eta).$
Given the inequalities $\ln(1 - x) \ge -x - x^2$ and $\ln(1 + x) \ge x - x^2$ for $x \le \frac{1}{2}$, we have:
$0 \ge -\ln K + (-\eta - \eta^2)\sum_{t: L_{i^*,t} \ge 0} L_{i^*,t} - (\eta - \eta^2)\sum_{t: L_{i^*,t} < 0} L_{i^*,t},$
$0 \ge -\ln K - \eta\sum_{t=1}^T L_{i^*,t} - \eta^2\sum_{t=1}^T |L_{i^*,t}|,$
$-\sum_{t=1}^T L_{i^*,t} \le \frac{\ln K}{\eta} + \eta\sum_{t=1}^T |L_{i^*,t}|.$
Given that $L_{i^*,t} = \ell_{i^*,t} - \pi_t^\top\ell_t$, we can write:
$R_T = -\sum_{t=1}^T L_{i^*,t} \le \frac{\ln K}{\eta} + \eta\sum_{t=1}^T |\ell_{i^*,t} - \pi_t^\top\ell_t| \le \frac{\ln K}{\eta} + \eta\Big(\sum_{t=1}^T |\ell_{i^*,t}| + \sum_{t=1}^T |\pi_t^\top\ell_t|\Big) = \frac{\ln K}{\eta} + \eta\big(|L^*_T| + |L_T|\big),$
where $|L^*_T|$ and $|L_T|$ are the cumulative absolute losses of the benchmark and the algorithm, respectively. Therefore, if we set $\eta = \min\{1, \sqrt{\frac{\ln K}{|L_T| + |L^*_T|}}\}$, we obtain an $O\big(\sqrt{\max\{|L_T|, |L^*_T|\}\ln K} + \ln K\big)$ regret bound. Given that $R_T = L_T - L^*_T$, if $|L_T| = L_T < L^*_T = |L^*_T|$, the regret is negative, and if $|L_T| = L_T \ge L^*_T = |L^*_T|$, the regret bound is $O\big(\sqrt{|L_T|\ln K} + \ln K\big)$. Therefore, the regret bound $O\big(\sqrt{|L_T|\ln K} + \ln K\big)$ always holds.