# featurebased_online_bilateral_trade__bff39e7f.pdf

Published as a conference paper at ICLR 2025

FEATURE-BASED ONLINE BILATERAL TRADE

Solenne Gaucher1 Martino Bernasconi2 Matteo Castiglioni3 Andrea Celli2

Vianney Perchet4,5

1 CMAP, École polytechnique, IPP 2 Bocconi University 3 Politecnico di Milano 4 Criteo AI Lab 5 Fairplay team, CREST, ENSAE, IPP {martino.bernasconi,andrea.celli2}@unibocconi.it, matteo.castiglioni@polimi.it, vianney.perchet@normalesup.org, solenne.gaucher@polytechnique.edu

Bilateral trade models the problem of facilitating trades between a seller and a buyer having private valuations for the item being sold. In the online version of the problem, the learner faces a new seller and buyer at each time step, and has to post a price for each of the two parties without any knowledge of their valuations. We consider a scenario where, at each time step, before posting prices the learner observes a context vector containing information about the features of the item for sale. The valuations of both the seller and the buyer follow an unknown linear function of the context. In this setting, the learner could leverage previous transactions in an attempt to estimate private valuations. We characterize the regret regimes of different settings, taking as a baseline the best context-dependent prices in hindsight. First, in the setting in which the learner has two-bit feedback and strong budget balance constraints, we propose an algorithm with O(log T) regret. Then, we study the same set-up with noisy valuations, providing a tight e O(T 2/3) regret upper bound. Finally, we show that loosening budget balance constraints allows the learner to operate under more restrictive feedback. Specifically, we show how to address the one-bit, global budget balance setting through a reduction from the two-bit, strong budget balance setup. This established a fundamental trade-off between the quality of the feedback and the strictness of the budget constraints.

1 INTRODUCTION

Bilateral trade models scenarios in which a seller and a buyer, both having a private valuation for a good, are interested in trading it in order to maximize their respective utilities (Vickrey, 1961; Myerson & Satterthwaite, 1983). We study the online bilateral trade problem introduced by Cesa Bianchi et al. (2021). At each time t, a new seller and buyer arrive, with private valuations st and bt, respectively. The seller s valuation st is the lowest price they are willing to accept for the item. Analogously, the buyer s valuation bt represents the highest price they are willing to pay for the item. The learner, without any knowledge about the private valuations at the current time t, posts two (possibly randomized) prices: pt to the seller and qt to the buyer. A trade happens when st pt and qt bt, so both agents agree to trade. The gain from trade (GFT) for a pair of prices (p, q) at time t is GFTt(p, q) := I{st p}I{q bt}(bt st). (1)

This represents the net utility gain generated by the trade. Following online bilateral trade literature (see, e.g., Cesa-Bianchi et al. (2021; 2023); Bernasconi et al. (2024)), we assume the learner aims to maximize trade gain or equivalently minimize regret relative to the best hindsight policy.1 A key challenge in online bilateral trade is the inherently limited feedback: under two-bit feedback, the learner receives feedback (I{st pt}, I{bt qt}), and under one-bit feedback the learner only observes whether the trade occurred, that is I{st pt} I{bt qt}. Both provide far less informa-

1We note that gain from trade and social welfare (i.e., the sum of the utilities of both agents if the trade occurs, or the seller s utility if it does not) result in the same notion of regret.

Published as a conference paper at ICLR 2025

tion than traditional bandit feedback, since the learner cannot even reconstruct the gain from trade received for the prices it posted.

In the standard model of bilateral trade, the platform lacks information about the seller, buyer, or the item being sold. However, this scenario is unrealistic in practice, where some information is usually available to the learner. In contexts like online marketplaces, where products are often differentiated, the learner can observe product features upfront and base pricing decisions on them, using past trade data to estimate feature values and inform future pricing.

We introduce the feature-based online bilateral trade problem, in which the learner observes a feature vector xt Rd before posting prices for round t. Building on the traditional feature-based dynamic pricing framework (see, e.g., Cohen et al. (2020); Javanmard (2017); Javanmard & Nazerzadeh (2019); Keskin & Zeevi (2014); Xu & Wang (2021)), we study the setting in which private valuations are of the form x t θ + ξt, where θ Rd is an unknown vector denoting the importance of each feature, and ξt is an i.i.d. noise term. Following the set-up of Cohen et al. (2020), feature vectors xt are chosen adversarially by nature. This ensures that our solution is robust to scenarios where features are correlated and where the set of relevant features evolves over time.

1.1 OUR CONTRIBUTIONS

In this paper, we introduce the feature-based online bilateral trade model, and characterize the regret for various scenarios with adversarially generated feature vectors.

First, we focus on the scenario in which the learner has two-bit feedback and strong budget balance constraints (the learner has to set the same price to the seller and to the buyer, i.e., pt = qt at each round t). As a preliminary warm-up, we begin by considering the deterministic setup where, at each t, the seller s valuation is st = x t θs and the buyer s valuation is bt = x t θb. We show that in this case, it is possible to adapt techniques proposed by Cohen et al. (2020) in the context of feature-based dynamic pricing. The main difference here is that we need to maintain separate ellipsoidal uncertainty sets for the seller and the buyer, respectively. Our analysis yields a regret of order O(log T) (Proposition 1).

Two-bit, noisy valuations. Then, we consider the case of noisy valuations in which st = x t θs +ξs t and bt = x t θb + ξb t, where ξs t and ξb t are i.i.d. noise terms independent from xt, with bounded support and densities. We start by decomposing the expected gain-from-trade into components that can be individually estimated using two-bit feedback (Lemma 1). Then, we describe an exploreor-commit (EOC) algorithm with regret e O(T 3/4) (Theorem 1) (this algorithm will be employed in the subsequent reduction to the one-bit setting). Finally, we devise a SCOUTING BANDITS WITH INFORMATION POOLING algorithm, which makes a more efficient use of the information collected during the learning phase. This algorithms achieves a regret e O(T 2/3)(Theorem 2), which is minimax optimal up to poly-logarithmic factors and dependence on d. Indeed, there exists a matching Ω(T 2/3) lower bound for the stochastic bilateral trade problem without features in the case of independent valuations with bounded densities and support Cesa-Bianchi et al. (2021).

One-bit reduction. Finally, we provide a general black-box reduction that demonstrates how the difficulty of maintaining a per-round budget balance can be traded for the ability to operate under more demanding feedback conditions. In particular, given a two-bit explore-or-commit algorithm guaranteeing strong budget balance and sublinear regret, we show that it is possible to construct a no-regret algorithm that works under one-bit feedback, and is global budget balanced (Theorem 3). The one-bit regret guarantees are dependent on a natural measure of the social welfare generated by the market.

1.2 RELATED WORKS

Bilateral trade. In the offline setting, Myerson & Satterthwaite (1983) showed the existence of instances where a fully efficient mechanism that satisfies incentive compatibility, individual rationality, and budget balance does not exist. Subsequent research focused on finding approximately efficient mechanisms in the Bayesian setting Blumrosen & Dobzinski (2014); Kang et al. (2022); Mc Afee (2008); Blumrosen & Mizrahi (2016); Brustle et al. (2017); Deng et al. (2022); Fei (2022).

Published as a conference paper at ICLR 2025

Online bilateral trade. Cesa-Bianchi et al. Cesa-Bianchi et al. (2021); Cesa-Bianchi et al. (2024) study the case in which valuations are drawn i.i.d. from some fixed unknown distribution and the learner has to enforce strong budget balance. They provide sublinear regret guarantees in the fullfeedback setting, and under partial feedback when valuations are i.i.d. samples from a smooth distribution, independently for the seller and the buyer. If the learner is only required to enforce weak budget balance (i.e., pt qt for each t), then Azar et al. (2022) provide an algorithm achieving a tight sublinear 2-regret when the sequence of valuation is generated by an oblivious adversary. Cesa-Bianchi et al. (2023) show that sublinear regret can be achieved beyond the i.i.d. stochastic setting under a σ-smooth adversary model. Bernasconi et al. (2024) show that sublinear regret can be achieved in the fully adversarial setting if the learner enforces global budget balance constraints (i.e., the constraint has to hold over the entire time horizon). A related setting to online bilateral trade is online brokerage (Boli c et al., 2024). This model has some key structural differences from the standard online bilateral trade framework by Cesa-Bianchi et al. (2021). For instance, traders can act as both buyers and sellers depending on market conditions, and their valuations are i.i.d. from the same unknown distribution. Bachoc et al. (2024) recently studied a contextual brokerage model where traders valuations are zero-mean perturbations of a market price that is linear in the feature vector.

Feature-based pricing. The one-dimensional version of the problem was introduced by Kleinberg & Leighton (2003), while Amin et al. (2014) introduced the problem in the contextual, multidimensional set-up under i.i.d. contexts. Cohen et al. (2020) study a model with adversarial contexts for which they provide e O(d2 log T) regret guarantees in the noiseless setting, improving over the e O(

T) guarantees obtainable with general-purpose contextual bandits algorithms Agarwal et al. (2014). Moreover, they provide e O(T 2/3) regret guarantees for scenarios with adversarial contexts and additive noise generated from a known sub-Gaussian distribution. Further improvements in the achievable rates were later provided by Lobel et al. (2018); Leme & Schneider (2018); Liu et al. (2021). Further instantiations of the feature-based pricing model can be found in the survey of Den Boer (2015).

2 PRELIMINARIES

For any k 1, we compactly denote the set {1, 2, . . . , k} as Jk K. We denote by LJ(H) the Löwner John ellipsoid of a convex body H (i.e., the minimal volume ellipsoid that contains H). For a positive definite matrix M Rd d and x Rd, we denote

x Mx by x M. In our proof sketches, we write AT BT if there exists a problem-dependent constant c such that AT c BT .

Learning protocol. At each round t, a new buyer and a seller arrive, characterized by private valuations bt and st, respectively. After observing a feature vector xt Rd, the learner posts two prices: pt to the seller and qt to the buyer. The trade happens if both agents accept the proposed prices, that is st pt and qt bt. When the trade happens the learner is awarded the gain from trade GFTt(pt, qt), defined as per Equation (1). Feature vectors xt are generated by an oblivious adversary.

Valuations. We consider two models to describe how features impact on the seller s and buyer s valuations. In the first setting, valuations are a deterministic function of the context. At each round t JTK, the valuations of the seller and of the buyer are given by

st = x t θs and bt = x t θb, (2)

respectively, where θs, θb Rd are unknown to the learner. We assume that both the contexts xt and the parameters θb and θs are bounded. Assumption 1. There exist A, B 0 such that max( θs , θb ) A, and xt B.

In the second setting, we consider noisy valuations. At each round t JTK, the valuations are given by bt = x t θb + ξb t and st = x t θs + ξs t , (3) where ξb t and ξs t are i.i.d. centered noise terms independent from the context xt, with densities f b and f s, respectively. We denote by F s the c.d.f. of ξs t , and by Db the demand function of ξb t: for all x R, Db(x) := P ξb t x . We make the following assumption on the densities of ξs t and ξb t.

Published as a conference paper at ICLR 2025

Algorithm 1 ELLIPSOIDPRICING FOR BILATERAL TRADE (EP-BT)

1: Input parameter ϵ > 0, bound A. 2: Initialize Ks 1, Kb 1 d-dimensional ball of radius A, Es 1 = LJ(Ks 1) and Eb 1 = LJ(Kb 1). 3: for t JTK do 4: Set st = minθ Es t x t θ and st = maxθ Es t x t θ and compute (bt, bt) analogously 5: if st < bt then post price pt = (st + bt)/2 6: else if st st ϵ then 7: Post price pt = (st + st)/2 8: Update Ks t+1 according to the seller s feedback, set Es t+1 = LJ(Ks t+1) 9: else if bt bt ϵ then 10: Post price pt = (bt + bt)/2 11: Update Kb t+1 according to the seller s feedback, set Eb t+1 = LJ(Kb t+1) 12: else post price pt = (st + bt)/2

Assumption 2. Both ξs t and ξb t are supported in [ C, C], and have densities bounded by L.

We observe that under Assumption 1 and 2, the valuations of the seller and the buyer are bounded in [ P, P], where P := C + AB.2 The bounded densities assumption is standard in repeated bilateral trade problems with stochastic valuations, and there exists linear regret lower bounds for the case in which this assumption is lifted (Cesa-Bianchi et al., 2024).

Feedback models. After posting prices (pt, qt), the learner does not observe directly the GFT or the valuations, but receives some feedback zt about the transaction. We focus on two feedback models: (i) Two-bit feedback, where both the buyer and the seller reveal their willingness to accept the prices offered by the learner (i.e., zt is composed by the two bits (I{st pt}, I{qt bt})); (ii) One-bit feedback: the learner only observes whether the trade happened or not (i.e., zt = I{st pt} I{qt bt}). Both feedback models have the desirable property of revealing minimal information about agent s private valuations.

Regret. Our objective is to develop dynamic policies that perform well in terms of minimizing regret. In the deterministic setting, the worst-case regret for learning algorithm A is defined as

RT (A) := max (θs,θb) [ A,A],x [ C,C]T X

x t (θb θs) + GFTt(pt, qt) .

In the noisy model, we compare against a benchmark that maximizes the expected gain from trade. Let EGFT(x, p, q) := E [GFTt(p, q) | xt = x]. The pseudo-regret for the noisy setting is defined as

RT (A) := X

t JT K max (p,q) [ P,P ]2

EGFT(xt, p, q) X

t JT K EGFT(xt, pt, qt).

A property that directly follows by definition is that, for any feature vector, there exists a pair of identical prices that maximize the expected gain from trade. To simplify the notation, we omit the argument q of GFTt and EGFT if the same price is posted to both agents.

3 WARM-UP: TWO-BIT FEEDBACK, STRONG BUDGET BALANCE, NO NOISE

In this section, we consider the simplest set-up in terms of the information available to the learner, by focusing on the setting with two-bit feedback and deterministic valuations following Equation (2). We adapt the ELLIPSOIDPRICING algorithm, originally proposed by Cohen et al. (2020) for featurebased dynamic pricing, to our bilateral trade problem. The key distinction lies in managing separate uncertainty sets for the seller and the buyer, while carefully setting prices consistent with both estimations. In this set-up, we can update these uncertainty sets separately. Algorithm 1 describes the main step of ELLIPSOIDPRICING FOR BILATERAL TRADE (EP-BT). Proposition 1. Assume that the seller s and buyer s valuations follow Equation (2). Under Assumption 1, Algorithm 1 with ϵ = ABd2/T has regret bounded by RT 10ABd2 log 20(d + 1)Td 2 .

2For simplicity, we work within [ P, P]. If non-negative prices are required, a translation suffices.

Published as a conference paper at ICLR 2025

Algorithm 2 ESTIMATION SUBROUTINES

1: procedure EST-PAR(T par) Update estimate of parameters θb, θs

2: Draw price pt U([ P, P]) and post (pt, pt) 3: V P l T par xlx l + Id 4: bθs 2PV 1 P

l T par I{pl sl} 1

2 xl 5: bθb 2PV 1P

l T par I{pl bl} 1

6: procedure EST-INT(T int) Update estimate of integrals I, J 7: Draw price pt U([ P, P]) and post (pt, pt) 8: for k K do 9: b Jk 2P |T int| P

l T int I n sl pl kϵ + x l bθso .

10: b Ik 2P |T int| P l T int I n kϵ + x l bθb pl bl o

11: procedure EST-F(T F k , grid point k K) Update estimate of F s

12: Post price pt = x t bθs + kϵ. Then update b F k P

l T F k I {sl pl}/|T F k |

13: procedure EST-D(T D k , grid point k K) Update estimate of Db

14: Post price pt = x t bθb + kϵ. Then update b Dk P

l T D k I {bl pl}/|T D k |

We observe that the algorithm explores only whenever the estimations are not precise enough. In the proof (Appendix B), we show that this happens in at most O(log T) rounds.

4 NOISY VALUATIONS WITH TWO-BIT FEEDBACK

In this section, we consider bilateral trade with two-bit feedback and noisy valuations. We introduce key components that we exploit in the analysis, before integrating them into an explore-or-commit framework which ensures a regret e O(T 3/4). Finally, we show a careful re-design of the last phase of the algorithm yields a tight regret upper bound of e O(T 2/3).

We rely on the following decomposition lemma, which extends Lemma 1 of Cesa-Bianchi et al. (2024) to contextual settings. The proof can be found in Appendix E.1.

Lemma 1. Under Assumptions 1 and 2, the expected gain from trade for x at price p is given by

EGFT(x, p) = I(δb)F s(δs) + J(δs)Db(δb), (4)

where δs := p x t θs, δb := p x t θb, I(δ) := R C δ Db(u)du, and J(δ) := R δ C F s(u)du.

Lemma 1 emphasizes that the expected gain from trade for a price p depends on the pair of price increments (δs, δb), or equivalently, on the pair (δs, t), where the difference in average valuations t is defined as t := x t (θb θs). This shows that items with the same t have the same optimal seller s price increment δ t : if p = x t θs + δ t maximizes EGFT(xt, ), and if x t (θb θs) = x t (θb θs), then p = x t θs + δ t maximizes EGFT(xt , ). However, knowing the optimal increment δ t for a specific t is not sufficient to determine the optimal increment for a different t (see Appendix F). Thus, finding the optimal price increment across various t values might require us to precise estimate the reward function over a broad range of increments. This is a notable departure from the non-contextual stochastic bilateral trade problem discussed in Cesa-Bianchi et al. (2024), where precise estimation of the reward function around the optimal price suffices. Since accurately estimating functions such as F s, Db, I, and J across a wide range of arguments complicates the problem, the regret might be higher than the e O(T 2/3) rate achieved in Cesa-Bianchi et al. (2024). Surprisingly, we show that with careful coordination of the various estimation procedures, the optimal rate for the non-contextual case can still be attained.

Published as a conference paper at ICLR 2025

Algorithm 3 EXPLORE-OR-COMMIT FOR BILATERAL TRADE (EOC-BT)

1: Input: parameter µ, length of estimation phases Tint, TFD, discretization error ϵ, confidence δ. 2: Initialize: K = 2P/ϵ + 3, K J K, KK, T par=T int=T F k =T D k = for all k K, V = Id. 3: while t T do 4: if xt V 1 > µ then T par T par {t}, EST-PAR(T par) 5: else if |T int| < Tint then T int T int {t}, EST-INT(T int) 6: else if for some k K, |T F k | < TFD or |T D k | < TFD then 7: if |T F k | < TFD then T F k T F k {t}, EST-F(T F k , k) 8: else if |T D k | < TFD then T D k T D k {t}, EST-D(T D k , k)

9: else Exploitation phase

10: At n (k, k ) K2 : k = j kϵ x t bθb t bθs t ϵ 1ko

11: Choose (kt, k t) arg max(k,k ) At b F k b Ik + b Dk b Jk and post price pt = x t bθs + ktϵ

4.1 BUILDING BLOCKS: SUBROUTINES FOR LEARNING PARAMETERS

We first present the sub-routines to estimate the parameters θs and θb, and the functions F s, Db, I and J used in the EGFT decomposition of Lemma 1. These sub-routines serve as the base components for the subsequent algorithms.

The first sub-routine, EST-PAR, estimates the parameters θs and θb. It uses the fact that, if pt U([ P, P]), then 2P (I{pt st} 1/2) is an unbiased estimate of x t θs (similarly for θb). Then, we can rely on classical results to estimate the parameters (see Lattimore & Szepesvári (2020, Chapter 19)). The second sub-routine, EST-INT, estimates the integrals I(δ) and J(δ) over a grid of price increments {kϵ : k K}. The third and fourth sub-routines, EST-F and EST-D, provide estimates of the c.d.f. F s and the demand function Db at kϵ for a given increment level k K, respectively.

4.2 EXPLORE-OR-COMMIT FRAMEWORK

First, we consider a natural Explore-Or-Commit (EOC) algorithm which uses each round either to compute estimates for the terms in Equation (4), or to greedily play the empirical best action. This strategy is described in Algorithm 3. Our algorithm is not a standard Explore-Then-Commit algorithm. Indeed, it resorts to the estimation subroutine EST-PAR (Line 4) whenever the estimation is not good enough and not only in the first rounds. The other estimation subroutines are executed until a certain number of updates is reached. We show that the number of necessary exploration rounds is upper bounded by e O(T 3/4). This will be useful in section Section 5 to handle one-bit feedback. Finally, in Lines 10 and 11 the algorithm selects the best price increments according to the current estimates (i.e., commit rounds), and posts price pt build accordingly to both agents. Theorem Theorem 1 bounds the regret of EOC-BT. Its proof is postponed to Appendix C, where we provide explicit bounds on e T and e C.

Theorem 1. Set ϵ = log(T )

T 1/4 , δ = T(74 + 32Pϵ 1) 1, µ = ϵ P q

d log 1+B2T

Tint = 8P 2 log(1/δ)

ϵ2 , and TFD = 2 log(1/δ)

ϵ2 . Then, under Assumptions 1 and 2, Algorithm 3 verifies RT e CT 3/4 log(T)1/4

with probability at least 1 T 1 when T e T, where e T and e C are constants depending respectively on A, B, and C, and on A, B, C, and d.

Sketch of the proof. For some ϵ > 0 to be chosen later, let K = J P/ϵ, P/ϵK. Using standard arguments Abbasi-Yadkori et al. (2011); Carpentier et al. (2020), we can show that e O(ϵ 2) samples are sufficient for EST-PAR to ensure that, with high probability, for all t / T par the errors |x t (θs bθs t )| and |x t (θb bθb t)| are smaller than ϵ. When this happens, classical concentration arguments ensure that e O(ϵ 2) samples are sufficient for EST-INT to estimate I(kϵ) and J(kϵ) with precision

Published as a conference paper at ICLR 2025

Algorithm 4 SCOUTING BANDIT WITH INFORMATION POOLING (SBIP)

1: Input: parameter µ > 0, length of scouting phase Tint, discretization size ϵ, confidence δ > 0. 2: Initialize: T par = T int = , V = Id, K = 2P/ϵ + 3, K = J K, KK, eϵ = (12PL + 7)ϵ, 3: N k,s = N k,b = b F k = b Dk = 0 for all k K. Let β : N n 7 p

2 log(δ 1)/n 4: while t T do 5: if xt V 1 > µ then 6: T par T par {t}, EST-PAR(T par) 7: else if |T int| < Tint then 8: T int T int {t}, EST-INT(T int) 9: else Run Successive Elimination 10: for (k, k ) At := n (k, k ) K2 : k = j x t bθs t bθb t + kϵ ϵ 1ko do

11: UCBt(k, k ) b Ik b F k t + b Jk b Dk t + eϵ + 2P β(N k,s) + β(N k ,b)

12: LCBt(k, k ) b Ik b F k t + b Jk b Dk t eϵ 2P β(N k,s) + β(N k ,b)

13: Kt := (k, k ) At : UCBt(k, k ) max(l,l ) At LCBt(l, l )

14: Choose (kt, k t) arg min(k,k ) Kt min{N k,s, N k ,b} and post price pt = x t bθs t + ktϵ 15: Observe feedback and update:

b F kt N kt,s b F kt + I {st pt}

N kt,s + 1 and b Dk t N k t,b b Dk t + I {bt pt} N k t,b + 1 ,

16: N kt,s N kt,s + 1 and N k t,b N k t,b + 1

ϵ uniformly for k K. By contrast, e O(|K|ϵ 2) samples are necessary for EST-F and EST-F to estimate F s(kϵ) and Db(kϵ) with precision ϵ uniformly for k K. Thus, with high probability, after e O(|K|ϵ 2) rounds of exploration, the expected gain from trade for prices x t bθs t + kϵ is estimated with precision O(ϵ) uniformly for k K. Now, Assumption 2 ensures that the reward function is Lipschitz-continuous, so the discretization error is also O(ϵ). By choosing ϵ = T 1/4, we balance the estimation and discretization errors with the regret of the exploration phase, thereby obtaining a regret e O(T 3/4).

4.3 CLOSING THE GAP: SCOUTING BANDITS WITH INFORMATION POOLING

The e O(T 3/4) regret of the EOC strategy is mainly due to the cost of estimating the c.d.f. F s and demand function Db. To bypass this costly estimation, an alternative approach is to first estimate θb, θs, I, and J, then run independent scouting bandit algorithms (as described in Cesa-Bianchi et al. (2024)) for each (rounded) value of t = x t (θb θs). As shown in Appendix H, this strategy also achieves e O(T 3/4) regret. This regret rate is higher than the rate e O(T 2/3) achieved in non-contextual stochastic bilateral trade. Surprisingly, by combining the strengths of both strategies, we achieve a regret of e O(T 2/3). This reveals that the contextual stochastic bilateral trade problem is no more difficult, in terms of regret, than its non-contextual counterpart, although the comparison is against a stronger benchmark.

To achieve this optimal regret rate, we design a scouting bandit strategy with information pooling across different values of t, presented in Algorithm 4. Information pooling is related to crosslearning, developed for bandits with graph feedback Balseiro et al. (2019). In particular, after having estimated x t θs, x t θb, I and J, we run a successive elimination algorithm for each value of t. Then, to estimate the reward of a price corresponding to increments δs and δb, we use the feedback from rounds where these increments have been selected across all values of t. Theorem 2. Under Assumptions 1 and 2, Algorithm 4 with ϵ = (d2 log(T)2/T) 1/3, δ = (38 + 16P

ϵ )(T + 1)2 1, µ = ϵ P q

d log 1+B2T

δ + A 1, and Tint = 8P 2 log(1/δ)

ϵ2 verifies

RT e C(d log(T)T)2/3

with probability at least 1 1/T when T e T, where e T and e C are constants depending respectively on A, B, and C, and on A, B, C, and d.

Published as a conference paper at ICLR 2025

The regret rate of Algorithm 4 matches the lower bound Ω(T 2/3) established by Cesa-Bianchi et al. (2021) for the simpler problem of non-contextual stochastic bilateral trade under similar assumptions (i.e., independent valuations with bounded densities and support). Thus, aside from the constant e C, the dependence on the dimension d, and logarithmic factors, Algorithm 4 achieves minimax-optimal regret. We outline the proof of this result below, with the full formal proof as well as bounds on e T and e C deferred to Appendix D.

Sketch of the proof. Let r(x, p) := maxp EGFT(x, p ) EGFT(x, p), and Tpar = |T par|. The regret is bounded by 2PTpar + 2PTint + P

t/ T par T int r(xt, pt). The elliptical potential Lemma (Carpentier et al., 2020) implies that Tpar d2 log(T)2ϵ 2, so Tpar + Tint d2 log(T)2ϵ 2. By using concentration arguments and exploiting the Lipschitz continuity of the reward, we can show that, for all t / T par T int and (k, k ) At, [LCBt(k, k ), UCBt(k, k )] is a valid confidence interval for EGFT(xt, x t bθs + kϵ). Now, we want to show that r(xt, pt) e O p

1/Nkt,s + p

1/Nk t,b + ϵ . To prove this claim, we define (k t , k t ) arg max(k,k ) At EGFT(xt, x t bθs + kϵ). The Lipschitz-continuity of the reward ensures that the discretization error is of order O(ϵ). Moreover, [LCBt(kt, k t), UCBt(kt, k t)] and [LCBt(k t , k t ), UCBt(k t , k t )] are confidence intervals for EGFT(xt, pt) and EGFT(xt, x t bθs + k t ϵ), respectively, and LCBt(k t , k t ) UCBt(kt, k t) by Line 13. Then, it holds that EGFT(xt, x t bθs + k t ϵ) EGFT(xt, pt) is of order UCBt(kt, k t) LCBt(kt, k t)+UCBt(k t , k t ) LCBt(k t , k t ). Our choice of (kt, k t) ensures that min{N kt,s, N k t,b} min{N k t ,s, N k t ,s} (Line 14). Then, by definition of UCB and LCB, and since the discretization is at most O(ϵ), we get r(xt, pt) e O p

1/Nkt,s + p

1/Nk t,b + ϵ .

To bound the regret of the Successive Elimination phase, we consider separately the rounds (indexed by T s k ) where the bound is dominated by p

1/Nk,s and kt = k, and the rounds (indexed by T b k ) where it is dominated by p

1/N k ,b and k t = k . Then, we decompose the regret of this phase as P

t/ T par T int r(xt, pt) = P

t T s k r(xt, pt) + P

t T b k r(xt, pt)). Choosing ra = 2 a and ra ϵ, we consider the decreasing sequence of intervals (ra)a a. The previous result implies that if N k,s [ e O(1/r2 a), e O(1/r2 a+1)], then r(xt, pt) ra. Thus,

t T s k r(xt, pt) e O r2 1 + X

1 a a 1 ra r2 a+1 r2 a + ra|T s k | .

This sum is of order Ts kϵ+log(1/δ)/ϵ. We sum over (k, k ) K, and use the fact that P k K Ts k+Tb k T while |K| ϵ 1, to conclude that RT = e O(T 2/3).

5 TRADING OFF BUDGET BALANCE CONSTRAINTS FOR FEEDBACK

In this section, we show that it is possible to relax the budget balance constraints in order to handle scenarios with limited feedback. In particular, we show that any EOC-like algorithm for two-bit feedback (for instance, Algorithms 1 and 3) can be suitably adapted to handle settings with one-bit feedback. To achieve this, we resort to the notion of global budget balance recently introduced by Bernasconi et al. (2024) and we show that budget can be exploited to compensate for the lack of feedback. The notion of global budget balance requires the learner to be budget balanced only overall (i.e., over the whole time horizon). In particular, let PROFITt(p, q) := I{st p}I{q bt}(q p) be the profit extracted by the learner at time t by posting prices (p, q). Then, the learning algorithm is global budget balanced if the following inequality holds: PT t=1 PROFIT(pt, qt) 0.

Given an EOC-like algorithm for the two-bit strong budget balance set-up, the idea is to simulate its exploration rounds by doing separate updates for the seller and the buyer, each using a single bit of feedback. In particular, each time in which the two-bit algorithm uses the seller s feedback I{st p}, the one-bit algorithm needs to actively collect that information by posting (p, P) instead of (p, p). Notice that, even under one-bit feedback, if the learner posts price (p, P) it is able to observe I{st p} since the buyer is always going to accept the trade. Analogously, when the two-bit algorithm would employ I{bt p}, the one-bit algorithm has to collect that information by

Published as a conference paper at ICLR 2025

posting (P, p). For instance, instead of executing EST-PAR, the one-bit algorithm runs two exploration phases: one in which it updates only bθs (Line 4 of EST-PAR) by posting (pt, P), and one in which it updates only bθb by posting (P, pt ). The same happens for the other estimation subroutines.

The one-bit algorithm simulates finer-grained feedback by posting pairs of prices (P, p), (p, P), resulting in a negative GFT. By enforcing global budget balance, we allow the learning algorithm to set prices that are not budget balanced individually, as long as the overall budget balance is maintained. To do that, we add an initial budget-collection phase, allowing the algorithm to accumulate enough profit for subsequent exploration rounds. In the following, we demonstrate how sufficient profit can be accumulated during the budget-collection phase without incurring excessive regret relative to GFT. This allows us to provide sublinear regret guarantees under one-bit feedback, with rates depending on some natural characteristics of the underlying market. This approach is specifically suited for EOC-like algorithms, which require a limited number of learning rounds before starting to set greedy prices without requiring further feedback. By contrast, algorithms such as Algorithm 4 rely on two-bit feedback in each round to continuously refine their estimates of D and F. Accumulating sufficient budget to emulate two-bit feedback throughout the entire time horizon would yield linear regret, rendering the reduction scheme inapplicable.

Collecting budget. Let B be the budget required by the one-bit algorithm to cover the rounds in which it must post either (P, p) or (p, P) to collect information (the value of B will be set later). Therefore, the budget collection phase ends as soon as P

t PROFIT(pt, qt) B. We denote by τ the random variable indicating the last round of the budget-collection phase. Moreover, let T B := JτK be the random set of rounds designated for budget collection. For each round t T B, the algorithm samples a price pt uniformly from [ P, P], and a value it uniformly from [0, log T]. Then, it posts a pair of prices (pt, qt), where qt = pt + 2 it. We start by providing a lower bound to the per-round expected profit obtained in this way.

Lemma 2. For each round t T B such that bt st, it holds: E[PROFITt(pt, qt)] (bt st)2

T , where the expectation is with respect to the choice of (pt, qt).

Our guarantees for the one-bit feedback case will depend on instance-dependent properties of the market. We measure this by looking at the per-round GFT that be extracted on sufficiently long windows of time.

Definition 1. Given a sequence of valuations {(st, bt)}T t=1, α is defined as

α := min t log T

Intuitively, α measures how easily the learner can accumulate budget. The budget-collection phase will have at least length of order log T since τ B/2P, and B = Ω(log T) even in the setting without noise. Then, we can show that the profit accumulated by time τ is at least a factor of 1/ log T of the cumulative GFT up to that point, assuming all trades up to τ had occurred.

Lemma 3. For τ log T, it holds with probability at least 1 1/T that

t JτK PROFITt(pt, qt) α 8P log(T)

t JτK [bt st]+ s

4P 2 log(T) X

t JτK [bt st]+ 2.

Now let A be a two-bit algorithm such that, with probability at least 1 1/T, it requires at most T E

exploration rounds, and has regret for the commit phase of R(2 bit) T . We set the overall budget to be collected to B = max 2048P 4α 2 log3 T, 2PT E .3 (5)

The first argument is used in the following proof to establish an upper bound on concentration terms, while the second accounts for the exploration costs over T E rounds. Then, the resulting one-bit learning algorithm has the following guarantees.

3With probability at most 1/T, the budget B may be insufficient for T E exploration rounds. In this case, we halt exploration and begin posting arbitrary strong budget-balanced prices.

Published as a conference paper at ICLR 2025

Theorem 3. Given the two-bit algorithm A, the corresponding one-bit learning algorithm satisfies global budget balance and, with probability at least 1 1/T, has regret

R(1 bit) T O α 3T E log4 T + R(2) T .

The guarantees above rely on the specific characteristics of the market, represented by the parameter α. When α = Ω(1) the market is well-behaved, meaning that trade opportunities arise frequently enough. This occurs, for example, when buyers are willing to pay more than the seller s asking price, even if only by a small margin. When α = Ω(1), the guarantees of Theorem 3 can be effectively leveraged. For instance, under the set-up of Section 4, we can use Algorithm 3 as the two-bid algorithm A. Then, B 4PT E = 4P 2P(|T par T +1|+Tint+2|K|TFD), which is of order e O(P 4d T 3/4). Therefore, the final regret bound in the noisy, one-bit setting with global budget balance is of order e O(T 3/4). A similar argument shows that in the noiseless setting of Section 3 the regret is of order polylog(T).

When the market is not well-behaved, we can still establish sublinear regret guarantees under one-bit feedback and global budget balance. Indeed, through steps analogous to those of Lemma 3, we can show that P

t JτK PROFITt(pt, qt) Ω τ 1 P

t JτK([bt st)]+ 2 e O(

Moreover, we can exploit the fact that we can decompose the one-bit regret as

R(1 bit) T P

t JτK[bt st]+ + p

16P 2τ log(T) + 2T E + R(2) T (see proof of Theorem 3).

Then, if we take the budget B (or, equivalently, T E) to be log T, the expression above yields P

t JτK[bt st]+ e O(τ 1/2T 1/4). In the worst case we have τ = T (i.e., the market does not

allow to collect enough budget), which yields a regret of order e O(T 3/4) in the deterministic set-up of Section 3. By setting B T 3/4 we achieve a regret of e O(T 7/8) in the noisy set-up of Section 4. Clearly, the rates are worse compared to the case where α = Ω(1), due to the difficulty in accumulating sufficient budget to cover the cost of posting non-budget balanced prices.

ACKNOWLEDGMENTS

MB and MC are partially supported by the FAIR (Future Artificial Intelligence Research) project PE0000013, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, investment 1.3, line on Artificial Intelligence). MC is also partially supported by the EU Horizon project ELIAS (European Lighthouse of AI for Sustainability, No. 101120237). AC is partially supported by ERC grant (Project 101165466 PLA-STEER). VP acknowledges support from the French National Research Agency (ANR) under grant number ANR-19-CE230026 as well as from the grant Investissements d Avenir (Lab Ex Ecodec/ANR-11-LABX-0047). SG gratefully acknowledges funding from the Jacques Hadamard Foundation (FMJH).

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, 2011.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638 1646. PMLR, 2014.

Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Repeated contextual auctions with strategic buyers. Advances in Neural Information Processing Systems, 27, 2014.

Published as a conference paper at ICLR 2025

Yossi Azar, Amos Fiat, and Federico Fusco. An α-regret analysis of adversarial bilateral trade. In Advances in Neural Information Processing Systems, 2022.

François Bachoc, Tommaso Cesari, and Roberto Colomboni. A contextual online learning theory of brokerage. ar Xiv preprint ar Xiv:2407.01566, 2024.

Santiago Balseiro, Negin Golrezaei, Mohammad Mahdian, Vahab Mirrokni, and Jon Schneider. Contextual bandits with cross-learning. Advances in Neural Information Processing Systems, 32, 2019.

Martino Bernasconi, Matteo Castiglioni, Andrea Celli, and Federico Fusco. No-regret learning in bilateral trade via global budget balance. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pp. 247 258, 2024.

Liad Blumrosen and Shahar Dobzinski. Reallocation mechanisms. In Proceedings of the ACM Conference on Economics and Computation, pp. 617. ACM, 2014.

Liad Blumrosen and Yehonatan Mizrahi. Approximating gains-from-trade in bilateral trading. In WINE, volume 10123 of Lecture Notes in Computer Science, pp. 400 413. Springer, 2016.

Natasa Boli c, Tommaso Cesari, and Roberto Colomboni. An online learning theory of brokerage. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 216 224, 2024.

Johannes Brustle, Yang Cai, Fa Wu, and Mingfei Zhao. Approximating gains from trade in two-sided markets via simple mechanisms. In Proceedings of the 2017 ACM Conference on Economics and Computation, pp. 589 590, 2017.

Alexandra Carpentier, Claire Vernade, and Yasin Abbasi-Yadkori. The elliptical potential lemma revisited. ar Xiv preprint ar Xiv:2010.10182, 2020.

Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, and Stefano Leonardi. A regret analysis of bilateral trade. In Proceedings of the ACM Conference on Economics and Computation, pp. 289 309. ACM, 2021.

Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, and Stefano Leonardi. Repeated bilateral trade against a smoothed adversary. In COLT, volume 195 of Proceedings of Machine Learning Research, pp. 1095 1130. PMLR, 2023.

Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, and Stefano Leonardi. Bilateral trade: A regret minimization perspective. Mathematics of Operations Research, 49(1): 171 203, 2024.

Maxime C Cohen, Ilan Lobel, and Renato Paes Leme. Feature-based dynamic pricing. Management Science, 66(11):4921 4943, 2020.

Arnoud V Den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in operations research and management science, 20(1):1 18, 2015.

Yuan Deng, Jieming Mao, Balasubramanian Sivan, and Kangning Wang. Approximately efficient bilateral trade. In STOC, pp. 718 721. ACM, 2022.

Yumou Fei. Improved approximation to first-best gains-from-trade. In International Conference on Web and Internet Economics, pp. 204 218. Springer, 2022.

Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization. Springer Science & Business Media, 1993.

Adel Javanmard. Perishability of data: dynamic pricing under varying-coefficient models. Journal of Machine Learning Research, 18(53):1 31, 2017.

Adel Javanmard and Hamid Nazerzadeh. Dynamic pricing in high-dimensions. Journal of Machine Learning Research, 20(9):1 49, 2019.

Published as a conference paper at ICLR 2025

Zi Yang Kang, Francisco Pernice, and Jan Vondrák. Fixed-price approximations in bilateral trade. In SODA, pp. 2964 2985. SIAM, 2022.

N Bora Keskin and Assaf Zeevi. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations research, 62(5):1142 1167, 2014.

R. Kleinberg and T. Leighton. The value of knowing a demand curve: bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 594 605, 2003.

Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.

Renato Paes Leme and Jon Schneider. Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 268 282. IEEE, 2018.

Allen Liu, Renato Paes Leme, and Jon Schneider. Optimal contextual pricing and extensions. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1059 1078. SIAM, 2021.

Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346 1361, 2018.

R Preston Mc Afee. The gains from trade under fixed price mechanisms. Applied economics research bulletin, 1(1):1 10, 2008.

Roger B Myerson and Mark A Satterthwaite. Efficient mechanisms for bilateral trading. Journal of economic theory, 29(2):265 281, 1983.

William Vickrey. Counterspeculation, auctions, and competitive sealed tenders. The Journal of finance, 16(1):8 37, 1961.

Jianyu Xu and Yu-Xiang Wang. Logarithmic regret in feature-based dynamic pricing. Advances in Neural Information Processing Systems, 34:13898 13910, 2021.

Published as a conference paper at ICLR 2025

A BASIC FACTS ABOUT ELLIPSOIDS

An ellipsoid E with center c Rd and positive definite matrix M Rd d is defined as

E(M, c) := x Rd : x c M 1 1 .

Given an hyperplane H(a, c) := {x Rd : a (x c) 0}, the minimal volume ellipsoid containing K = E(M, c) H(a, c) can be computed in closed form as follows. The new center is

c = c 1 d + 1 1

and the new positive definite matrix is

M 2 d + 1 Maa M

Then, the Löwner-John ellipsoid LJ(K) of the half ellipsoid K is the ellipsoid E(M , c ).

Going from E(M, c) to E(M , c ) the volume shrinks by the following amount:

vol(E(M , c )) e 1/2dvol(E(M, c)).

An in-depth analysis of these results can be found in Grötschel et al. (1993).

B PROOF OF PROPOSITION 1

Before proving Proposition 1, we begin by provide a high-level view of Algorithm 1. This algorithm maintains two ellipsoidal uncertainty sets Es t and Eb t (see Appendix A for some basic facts about ellipsoids). The key idea is that, at each update, the volume of an uncertainty shrinks fast enough to yield a good estimate of the true parameter in a small number of iterations. The problems of computing the maximum and minimum possible valuations of the seller and buyer (Line 4) admit a simple closed-form solution (see, e.g., , Grötschel et al. (1993, Chapter 3)).

At each round t, given the current uncertainty sets we compute the maximum and minimum possible valuations of the seller for the current product as

st = minθ Es t x t θ and st = maxθ Es t x t θ. (6)

The maximum and minimum possible valuations of the buyer, denoted by bt and bt, are computed analogously. All of these optimization problems admit a simple closed form solution (see, e.g., , Grötschel et al. (1993, Chapter 3)).

If the smallest possible valuation for the buyer bt is above the highest possible valuation for the seller st (Line 5), the algorithm can post any price between these two values, ensuring that the trade will occur. If st and st are far apart (Line 6), the algorithm performs a binary search step (a.k.a. explore step), and posts a price pt which is halfway between the seller s minimum and the maximum possible valuations. Then, we update the seller s uncertainty set as follows: if the seller accepts the sale, it implies that x t θs pt, and we define the half-ellipsoid Ks t+1 = Es t {θ : x t θs pt}. Otherwise, we set Ks t+1 = Es t {θ : x t θs > pt}. Finally, we round this half-ellipsoid by replacing it by its Löwner-John ellipsoid, i.e., the ellipsoid containing Ks t+1 with the smallest volume. We proceed similarly for the buyer (Line 9). If the highest and lowest possible valuations of both the buyer and seller are close to each other, and the buyer s minimum valuation is lower than the seller s maximum valuation, then the difference of valuations between the two parties is small, resulting in a negligible gain from trade. Therefore, in this case, we can safely post an arbitrary price (Line 12).

Proposition 1. Assume that the seller s and buyer s valuations follow Equation (2). Under Assumption 1, Algorithm 1 with ϵ = ABd2/T has regret bounded by RT 10ABd2 log 20(d + 1)Td 2 .

Proof. The proof of Proposition 1 relies on the fact that when the algorithm updates Es t or Eb t , it behaves as the ELLIPSOIDPRICING algorithm of Cohen et al. (2020). More precisely, we first note that under Assumption 1, Es 1 and Eb 1 contain θs and θb, respectively. Moreover, straightforward

Published as a conference paper at ICLR 2025

induction shows that if Es t contains θs, then Ks t+1 also contains θs, and so does Es t+1 (similar reasoning applies to θb and Eb t ).

This implies that for all rounds, st st st, and bt bt bt. In particular, if the condition in Line 5 is verified, we have st pt bt, and so the instantaneous regret at that round is 0.

We also underline that Algorithm 1 only explores to estimate θs (Line 6) if the condition max θ Es t x t θ min θ Es t x t θ ϵ

is verified, or equivalently if

x t B θ min θ Es t

Since the normalized contexts x t B are bounded in norm by 1 by Assumption 1, we can apply Lemma 1 by Cohen et al. (2020). This Lemma states that Algorithm 1 will execute the steps in Lines 7-8 at most 2d2 log(20A(d+1)/ ϵ

B ) times, after which the condition in Line 6 will never be verified again. A similar reasoning allows us to bound the number of exploration rounds for the buyer s parameter θb (Line 9). Therefore, the total number of exploration rounds is bounded by 2 2d2 log(20A(d + 1)/ ϵ

B ). We notice that Assumption 1 also implies that the valuations are in [ AB, AB], and so the instantaneous regret is bounded by 2AB. Thus, the regret of the exploration phase (Lines 6 and 9) is bounded by 8ABd2 log(20A(d + 1)/ ϵ

Finally, if the conditions on Lines 5, 6 and 9 are not verified, it implies that bt st bt st

bt + ϵ st + ϵ 2ϵ. Then, the instantaneous regret at that round is at most 2ϵ, and the total regret for these steps is at most 2ϵT. Choosing ϵ = ABd2

T yields the result.

C PROOF OF THEOREM 1

In order to prove Theorem 1, we provide a more detailed version of the EXPLORE-OR-COMMIT algorithm in Algorithm 5. Note that we adopt the convention 1/0 = + .

Before analyzing the EXPLORE-OR-COMMIT algorithm, we define the various quantities used within it. Specifically, T par t represents the set of indices from previous rounds that were dedicated to estimating the parameters θs and θb (Line 8). We denote by bθs t and bθb t the estimates for the parameters, and Vt is the corresponding empirical covariance matrix at the beginning of round t. The set T int contains the indices of the rounds used to estimate the integrals I and J (Line 13). Moreover, b Ik is the estimate of I(kϵ), and b Jk is the estimate of J(kϵ). The quantities b F k and b Dk represent our estimates of F s(kϵ) and Db(kϵ), respectively.

We use the following lemma to bound the total duration of the parameter exploration phase (the proof can be found in Appendix E.2). Lemma 4. Almost surely, the length of exploration phase T par T +1 is bounded as

|T par T +1| d log T +d

The following lemma provides bounds on the estimation errors for θs, θb, I, J, F s, and Db for the values of µ, Tint, and TFD specified in Theorem 2. Let E be the event

E := n k K, t / T par T T int x t (bθb t θb) ϵ and x t (bθs t θs) ϵ and b Ik I (kϵ) 2ϵ and b Jk J(kϵ) 2ϵ o .

Published as a conference paper at ICLR 2025

Algorithm 5 EXPLORE-OR-COMMIT (detailed version)

1: Input: parameter µ > 0, length of exploration phases Tint and TFD, discretization size ϵ, confidence level δ. 2: Initialize: T par 1 = T int = , bθs 1 = 0d, bθb 1 = 0d, V1 = Id, K = 2P/ϵ + 3, K = J K, KK, T F k = T D k = for all k K, b F k = b Dk = 0 for all k K.

3: while t T do 4: if xt V 1 t > µ then Estimate the parameters θb and θs

5: Draw pt U([ P, P]) and post prices (pt, pt) 6: Update T par t+1 T par t {t}

7: Update Vt+1 =

l T par t+1 xlx l + Id

8: Update parameter estimates

bθs t+1 = 2PVt+1 X

l T par t+1

xl, bθb t+1 = 2PVt+1 X

l T par t+1

9: else if |T int| < Tint then Estimate the integrals I and J 10: Draw pt U([ P, P]) and post prices (pt, pt) 11: Update T int T int {t} 12: if |T int| = Tint then 13: Compute estimate of integrals I and J 14: for k K do

l T int I n kϵ + x l bθb pl bl o , b Jk = 2P

l T int I n sl pl kϵ + x l bθso .

15: else if for some k K, |T F k | < TFD then Estimate Fk 16: Set pt = x t bθs + kϵ and post (pt, pt) 17: Update T F k T F k {t} 18: if |T F k | = TFD then set b F k = 2P

TFD P l T F k I {sl pl}

19: else if for some k K, |T D k | < TFD then Estimate Fk 20: Set pt = x t bθb + kϵ and post (pt, pt) 21: Update T D k T D k {t} 22: if |T D k | = TFD then set

l T D k I {pl bl}

23: else Post greedy price

24: set At =

(k, k ) K2 : k = x t (bθs t bθb t)+kϵ ϵ

25: kt = arg maxk: k ,(k,k ) At b Ik b F k + b Jk b Dk .

26: Set pt = x t bθs t + ktϵ and post (pt, pt)

Published as a conference paper at ICLR 2025

Let us define the exploitation (i.e., commit) phase as

T C := JTK \

T par T T int [

T F k T D k !

Moreover, let EEOC be the event

EEOC := E n k K, t T C, b F k F s(kϵ) (L + 1)ϵ and b Dk Ds(kϵ) (L + 1)ϵ o .

Then, we can lower bound the probability of event EEOC as follows (the proof is postponed to Appendix E.3).

Lemma 5. For the choice µ = ϵ P q

d log 1+B2T

δ + A 1 , Tint = 8P 2 log(1/δ)/ϵ2, and

TFD = 2 log(1/δ)ϵ 2, it holds that P EEOC 1 2δ 8δ(2K + 1).

Note that the choice δ = T(74 + 32Pϵ 1) 1 ensures that the event EEOC happens with probability at least 1 T 1.

The following Lemma bounds the error for estimating the gain from trade of a price p = x t bθs t + kϵ on the high-probability event EEOC (the proof is presented in Appendix E.4). Lemma 6. On the event EEOC, for all t T C, and all (k, k ) At, b Ik b F k + b Jk b Dk EGFT(xt, x t bθs t + kϵ) (10PL + 4P + 7)ϵ.

Next, we bound the discretization error. Let us define (k t , k t ) arg max (k,k ) At EGFT(xt, x t bθs t + kϵ).

Then, the following result holds (see Appendix E.5 for its proof). Lemma 7. On the event EEOC, we have that max p EGFT(xt, p) EGFT(xt, x t bθs t + k t ϵ) 2LPϵ.

We are now ready to prove Theorem 1. For p R, x Rd, we define (x, p) := max p EGFT(x, p ) EGFT(x, p).

We begin by decomposing the regret as

t T par T +1

(xt, pt) + X

t T int (xt, pt)+

(xt, pt) + X

t T C (xt, pt).

Using the fact that (xt, pt) 2P, we obtain

RT 2P |T par T +1| + Tint + 2|K|TFD + X

t T C (xt, pt).

On the one hand, Lemma 4 implies that |T par T +1| d log( T +d

d ) µ2 . For the choice µ =

ϵ P d log 1+B2T

δ 1/2 + A 1 , Tint = 8P 2 log(1/δ)ϵ 2, and TFD = 2 log(1/δ)ϵ 2, this im-

RT 2Pd log T + d

d log 1 + B2T

ϵ 2 + 16P 3 log(1/δ)ϵ 2+

ϵ + 9 log(1/δ)ϵ 2 + X

t T C (xt, pt).

Published as a conference paper at ICLR 2025

Note that the first term is of order ϵ 3 when ϵ is small enough. On the other hand, on the event EEOC, by Lemma 7 we have X

t T C (xt, pt)

t T C (xt, x t bθs t + k t ϵ) + X

t T C EGFT(xt, x t bθs t + k t ϵ) EGFT(xt, x t bθs t + ktϵ)

EGFT(xt, x t bθs t + k t ϵ) EGFT(xt, x t bθs t + ktϵ) ,

Moreover, our choice of kt ensures that for k t such that (kt, k t) At,

b Ik t b F kt + b Jkt b Dk t b Ik t b F k t + b Jk t b Dk t .

By Lemma 6, this implies that, on event EEOC,

EGFT(xt, x t bθs t + k t ϵ) EGFT(xt, x t bθs t + ktϵ) (20PL + 8P + 14)ϵ.

Thus, on the event EEOC,

RT 2Pd log T + d

d log 1 + B2T

ϵ 2 + 16P 3 log(1/δ)/ϵ2

ϵ + 3 log(1/δ)ϵ 2 + 2TLPϵ + T(20PL + 8P + 14)ϵ.

Substituting the values of ϵ and δ as per Theorem 1 and using Lemma 5 allows us to conclude the proof. In particular, the result in Theorem 1 holds with C = 22PL + 26P + 15, provided that T 74 + 32P, that T 1 + B2, that T d, 5P 2d log(T) 1, and that 160P 2d2 + 36P 3 T 1/4 log(T)5/4.

D PROOF OF THEOREM 2

D.1 DETAILED ALGORITHM

In order to prove Theorem 2, we provide a more detailed version of the SCOUTING BANDIT WITH INFORMATION POOLING algorithm in Algorithm 6. Note that in Lines 16 an 17, we adopt the convention 1/0 = + when computing the upperand lowerconfidence bounds for increment levels k, k that have not yet been selected.

Before analyzing the SBIP algorithm, we introduce the different quantities appearing in the algorithm. In words, T par t is the set of indices of the previous rounds that have been spent estimating the parameters θs and θb at the beginning of round t. Similarly, bθs t , bθb t, and Vt are the estimates of the parameters and the corresponding empirical covariance matrix at the beginning of round t. The set T int consists of the indices of the rounds used to estimate the integrals I and J. Moreover, b Ik estimates I(kϵ), and b Jk estimates J(kϵ). The quantities b F k t and b Dk t represent our estimates of F s(kϵ) and Db(kϵ) at the beginning of round t. Similarly, N s,k t (resp., N b,k t ) counts the number of rounds in the successive elimination phase where the increment pl x l bθs l was equal to kϵ (resp., where the increment pl x l bθb l was close to kϵ), and where we gained information on F s(kϵ) (resp., on Db(kϵ)). The set At collects the pairs (k, k ) K such that x t bθs t +kϵ x t bθb t +k ϵ is a possible price (within [ P, P]). Then, for (k, k ) At, the quantities UCBt(k, k ) and LCBt(k, k ) provide upper and lower confidence bounds on the expected gain from trade corresponding to this price.

The third phase is a successive elimination phase: at each round, we consider a set of possible optimal prices, with corresponding upper confidence bound larger than the highest lower confidence bound. This set is denoted by Kt. In order to ensure sufficient exploration of all potentially optimal price increments (kϵ, k ϵ) for (k, k ) Kt, we choose the price increment with the widest confidence interval: this corresponds to choosing the pair (k, k ) Kt such that either N s,k t or N b,k t

Published as a conference paper at ICLR 2025

Algorithm 6 SCOUTING BANDITS WITH INFORMATION POOLING (SBIP) (detailed version)

1: Input: parameter µ > 0, length of scouting phase Tint, discretization size ϵ, confidence level δ. 2: Initialize: T par 1 = T int = , bθs 1 = 0d, bθb 1 = 0d, V1 = Id, K = 2P/ϵ + 3, K = J K, KK, T SE s,k = T SE b,k = for all k K, N s,k 1 = N b,k 1 = 0 for all k K, b F k 1 = b Dk 1 = 0 for all k K.

3: while t T do 4: if xt V 1 t > µ then Estimate the parameters θb and θs

5: Draw pt U([ P, P]) and post (pt, pt) 6: Update T par t+1 = T par t {t}

7: Update Vt+1 = P

l T par t+1 xlx l + Id 1

8: Update parameter estimates

bθs t+1 = 2PVt+1 P

l T par t+1 I{pl sl} 1

2 xl, bθb t+1 = 2PVt+1 P

l T par t+1 I{pl bl} 1

9: else if |T int| < Tint then Estimate the integrals I and J 10: Draw pt U([ P, P]) and post (pt, pt) 11: Update T int = T int {t} 12: if |T int| = Tint then 13: for k K do

l T int I n kϵ + x l bθb pl bl o , b Jk = 2P

l T int I n sl pl kϵ + x l bθso

14: else Run Successive Elimination 15: Set At = (k, k ) K2 : k = (x t (bθs t bθb t)+kϵ)/ϵ

16: for (k, k ) At do

UCBt(k, k ) = b Ik b F k t + b Jk b Dk t + (12PL + 7)ϵ + 2P p

2 log(1/δ)/Ns,k t + q

2 log(1/δ)/N b,k t

LCBt(k, k ) = b Ik b F k t + b Jk b Dk t (12PL + 7)ϵ 2P p

2 log(1/δ)/Ns,k t + q

2 log(1/δ)/N b,k t

17: Set Kt = {(k, k ) At : UCBt(k, k ) max{LCBt(l, l ) : (l, l ) At}}

18: Set ks t = arg min n N s,k t : (k, k ) Kt o , and kb t = arg min n N b,k t : (k, k ) Kt o

19: if N s,ks t t N b,kb t t then 20: kt = ks t , and set k t to be such that (kt, k t) Kt 21: T SE s,kt = T SE s,kt {t} 22: else 23: k t = kb t, and set kt to be such that (kt, k t) Kt 24: T SE b,k t = T SE b,k t {t}

25: Set pt = x t bθs t + ktϵ and post (pt, pt)

26: Update b F kt t+1 = Nkt,s t b F kt t +I{st pt} N kt,s t +1 , b Dk t t+1 = N k t,b t b D k t t +I{bt pt}

N k t,b t +1

27: Update N kt,s t+1 = N kt,s t + 1, and N k t,b t+1 = N k t,b t + 1

28: Quantities that have not been updated during round t are kept the same for round t + 1

Published as a conference paper at ICLR 2025

is the lowest of the set. We denote (kt, k t) the pair of increments chosen in such way. In order to analyze this phase, we store the indexes of the rounds where we chose this pair because N s,kt t was the smallest in the set T SE s,kt. Analogously, the rounds where we chose this pair because N b,k t t was the smallest are stored in the set T SE b,k t).

D.2 REGRET ANALYSIS

The beginning of the proof of Theorem 2 is similar to that of Theorem 1. As in the previous case, we define the event

E := n k K, t / T par T +1 T int, x t (bθb t θb) ϵ, x t (bθs t θs) ϵ, b Ik I (kϵ) 2ϵ, b Jk J(kϵ) 2ϵ o .

Moreover, we define a new event ESBIP as

ESBIP := E \ n k K, t / T par T +1 T int, b F k t F s(kϵ) q

Ns,k t + Lϵ, b Dk Ds(kϵ) q

Nb,k t + 2Lϵ o .

The following Lemma shows that the event ESBIP happens with large probability. The detailed proof can be found in Appendix E.6.

Lemma 8. For the choice µ = ϵ P d log 1+B2T

δ 1/2 + A 1 and Tint = 8P 2 log(1/δ)/ϵ2, it

holds that P ESBIP 1 2δ 4δ(2K + 1) 4δ(2K + 1)T.

Note that the choice δ = T(38 + 16Pϵ 1 + 16Pϵ 1T + 36T) 1 ensures that the event ESBIP

happens with probability at least 1 T 1.

The following Lemma (whose proof is located in Appendix E.7) shows that the upper and lower confidence bounds used in Algorithm 6 hold conditioned on the high-probability event ESBIP.

Lemma 9. Under the assumptions of Lemma 8 and conditioned on the event ESBIP, we have that for all t / T par T +1 T int, and all (k, k ) At;

LCBt(k, k ) EGFT(xt, x t bθs t + kϵ) UCBt(k, k ).

Moreover, it holds that (k t , k t ) Kt, where we recall that

(k t , k t ) arg max (k,k ) At EGFT(xt, x t bθs t + kϵ).

Finally, we bound the number of times a sub-optimal price increment kϵ can be selected. For p R, x Rd, recall that we defined

r(x, p) = max p EGFT(x, p ) EGFT(x, p).

Lemma 10. Conditioned on the event ESBIP, if t T SE s,k, then the following condition holds

r(xt, pt) (50PL + 28)ϵ + 16P

Similarly, if t T SE b,k , then the following condition holds

r(xt, pt) (50PL + 28)ϵ + 16P

Published as a conference paper at ICLR 2025

The proof for this result can be found in Appendix E.8.

We are now ready to bound the regret of SBIP. We begin by decomposing the regret as

t T par T +1

r(xt, pt) + X

t T int r(xt, pt) + X

t/ T par T +1 T int r(xt, pt).

Since the gain from trade is bounded by 2P, this implies

RT 2P|T par T +1| + 2P|T int| + X

t/ T par T +1 T int r(xt, pt).

Then, using Lemma 4, by setting µ = ϵ P d log 1+B2T

δ 1/2 + A 1 and Tint =

8P 2 log(1/δ)/ϵ2 we have

2Pd log T + d

d log 1 + B2T

16ϵ 2P 3 log(1/δ) + X

t/ T par T +1 T int r(xt, pt). (7)

To bound the third term (i.e., the regret of the successive elimination phase), we decompose it as X

t/ T par T +1 T int r(xt, ptϵ) = X

r(xt, pt) + X

t T SE b,k r(xt, pt).

For k K, let us partition T SE s,k as follows. We denote by T SE s,k = |T SE s,k|, and we define t1, . . . , t T SE s,k the rounds where t T SE s,k. More formally, we have t1 < t2 < ... < t T SE s,k, and {t1, ..., t T SE s,k} = T SE s,k. We define a = log2 (2(50PL + 28)ϵ) , and for a J1, a K, we define

ta = 2 log(1/δ) (32P2a)2 .

With these notation, we have X

r(xt, pt) = X

i t1 T SE s,k

r(xti, pti)

ta T SE s,k<i ta+1 T SE s,k

r(xti, pti) + X

(ta+1) T SE s,k i T SE s,k

r(xti, pti).

For 1 a a, if ta i T SE s,k, by definition of ta, we have

(50PL + 28)ϵ 16P

N s,k ti 2 a

and Lemma 10 implies that, conditioned on the high-probability event ESBIP,

r(xti, pti) 2 a.

Using the fact that r(xt, pt) 2P, we obtain

r(xt, pt) 2Pt1 +

a=1 2 a (ta+1 ta + 1) + 2 a T SE s,k

a=1 2 a (ta+1 ta + 1) + 2 a T SE s,k.

Published as a conference paper at ICLR 2025

a=1 2 a (ta+1 ta) = 6 log(1/δ) (32P)2

= 6 log(1/δ) (32P)2 2a

3 log(1/δ) (32P)2

(50PL + 28)ϵ

Moreover, we have that

2 a T SE s,k 2(50PL + 28)ϵT SE s,k.

Therefore, X

r(xt, pt) 4P log(1/δ) (128P)2 + 3 log(1/δ) (32P)2

(50PL + 28)ϵ + 1 + 2(50PL + 28)ϵT SE s,k.

We proceed similarly to bound P

t T SE b,k r(xt, pt), and we obtain that

r(xt, pt) 4P log(1/δ) (128P)2 + 3 log(1/δ) (32P)2

(50PL + 28)ϵ + 1 + 2(50PL + 28)ϵT SE b,k.

Summing over k K and k K, and using the fact that P

k K T SE s,k + T SE b,k T, we find

t/ T par T +1 T int r(xt, ptϵ) 2|K|

4P log(1/δ) (128P)2 + 3 log(1/δ) (32P)2

(50PL + 28)ϵ + 1

+ 2(50PL + 28)ϵT. (8)

Combining Equations (7) and (8), we find that, conditioned on the event ESBIP,

RT 2Pd log T +d

d log 1+B2T

ϵ2 + 18P 3 log(1/δ)

4P log(1/δ) (128P)2 + 3 log(1/δ) (32P)2

(50PL + 28)ϵ + 1

+ 2(50PL + 28)ϵT.

By setting ϵ = (d2 log(T)2/T) 1 3 and δ = (38 + 16Pϵ 1)(T + 1)2 1, and using the definition of K, we find that there exists a constant e C depending on A, B C, and L such that with probability 1 20T 2/3,

RT e C (d log(T)T) 2/3.

This concludes the proof of Theorem 2. Note that we can choose e C = 461P 3 + 100PL + 56, and the result holds provided that T is large enough (namely, that T 2(38 + 16P), that 10 P/ϵ, that P 2 16

3 d log(T) A, that P 3 log(T) 1, that B2 + 1 T, and that (T/ log(T)2)1/3 3312/102 P 2).

E PROOFS OF AUXILIARY LEMMAS

E.1 PROOF OF LEMMA 1

Lemma 1. Under Assumptions 1 and 2, the expected gain from trade for x at price p is given by

EGFT(x, p) = I(δb)F s(δs) + J(δs)Db(δb), (4)

where δs := p x t θs, δb := p x t θb, I(δ) := R C δ Db(u)du, and J(δ) := R δ C F s(u)du.

Published as a conference paper at ICLR 2025

Proof. The proof relies on the characterization of the gain from trade given by Lemma 4.1 in Cesa Bianchi et al. (2021). Recall that, under Assumption 1, we have ξs t [ C, C], and ξb t [ C, C]. Then, denoting f s and f b the densities of ξs t and ξb t, we have EGFT(x, p)

(s,b) [x θs C,x θs+C] [x θb C,x θb+C] (b s)I{s p b}f b(x θb s)f s(x θs s) ds db.

Let f b : b 7 f b(x θb s) and f s : s 7 f s(x θs s) be the densities of st and bt conditionnally on xt = x, and note that outside of [x θb C, x θb +C] [x θs C, x θs +C], f s(s) f b(b) = 0. With these notation, we have

EGFT(x, p) = Z

(s,b) [ P,P ]2(b s)I{s p b} f b(b) f s(s) ds db

(s,b) [ P,p] [p,P ]

! f s(s) f b(b) ds db.

EGFT(x, p) = Z

(s,b) [ P,p] [p u,P ] f s(s) f b(b) ds db

(s,b) [u,p] [p,P ] f s(s) f b(b) ds db

This yields

EGFT(x, p) = Z

(s,b) [ P,p] [p,P ] f s(s) f b(b) ds db

(s,b) [ P,p] [u,P ] f s(s) f b(b) ds db

(s,b) [u,p] [p,P ] f s(s) f b(b) ds db

This, in turn, implies that

EGFT(x, p) = Z

(s,b) [ P,u] [p,P ] f s(s) f b(b) ds db

(s,b) [ P,p] [u,P ] f s(s) f b(b) ds db

[p,P ] f b(b) dd Z

[ P,u] f s(s) ds du

[ P,p] f s(s) ds Z

[u,P ] f b(b) db du.

By using the change in variables ξs = s x θs, us = u x θs, ξb = b x θb, and ub = u x θb, and the definition of f s and f b, we get

EGFT(x, p) = Z

[p x θb,P x θb] f(ξb) dd Z

[ P x θs,u x θs] f(ξs) dξs du

[ P x θs,p x θs] f(ξs) dξs Z

[u x θb,P x θb] f(ξb) dξb du

[p x θb,P x θb] f(ξb) dξb Z

[ P x θs,p x θs]

[ P x θs,us] f(ξs) dξs dus

[ P x θs,p x θs] f(ξs) dξs Z

[p x θb,P x θb]

[ub,P x θb] f(ξb) dξb dub.

Published as a conference paper at ICLR 2025

Now, under Assumption 1, f b and f s are null outside of [ C, C]. Moreover, P x θb C and P x θs C. Thus,

EGFT(x, p) = Z

[p x θb,C] f(ξb) dd Z

[ C,p x θs]

[ C,us] f(ξs) dξs dus

[ C,p x θs] f(ξs) dξs Z

[ub,C] f(ξb) dξb dub

=Db(p x θb) Z p x θs

C F s(us) dus + F s(p x θs) Z C

p x θb Db(ub) dub.

Using the definition of δs = p x θs and δb = p x θb, we obtain the desired result.

E.2 PROOF OF LEMMA 4

Lemma 4. Almost surely, the length of exploration phase T par T +1 is bounded as

|T par T +1| d log T +d

Proof. The elliptical potential Lemma (see, e.g., Proposition 1 in Carpentier et al. (2020)) implies that almost surely,

t T par T +1

l T par t xlx l +Id 1

v u u t|T par T +1|d log

|T par T +1| + d

Since for all t T par T +1, xt P

l T par t xlx l +Id 1 µ, this implies that

|T par T +1|µ

v u u t|T par T +1|d log

|T par T +1| + d

Now, almost surely, |T par T +1| T. This implies in particular that

|T par T +1| d log T +d

which concludes the proof.

E.3 PROOF OF LEMMA 5

Lemma 5. For the choice µ = ϵ P q

d log 1+B2T

δ + A 1 , Tint = 8P 2 log(1/δ)/ϵ2, and

TFD = 2 log(1/δ)ϵ 2, it holds that

P EEOC 1 2δ 8δ(2K + 1).

Proof. The Lemma is a consequence of the following two auxiliary results, bounding respectively the error for estimating the parameters θs and θb, and the integrals I and J.

Lemma 11. Let E1 be the event :

E1 := n t / T par T +1, x t (bθb t θb) ϵ and x t (bθs t θs) ϵ o .

Then, for the choice µ = ϵ P d log 1+B2T

δ 1/2 + A 1 , we have P (E1) 1 2δ.

The proof of the above lemma is deferred to Appendix E.9.

Published as a conference paper at ICLR 2025

Lemma 12. Let E2 be the event:

E2 := n |T int| < Tint or k K b Ik I (kϵ) 2ϵ and b Jk J(kϵ) 2ϵ o E1.

Then, for the choice µ = ϵ P d log 1+B2T

δ 1/2 + A 1 and Tint = 8P 2 log(1/δ)/ϵ2, we have

P (E2) 1 2δ 4δ(2K + 1).

The proof of the above lemma is deferred to Appendix E.10.

Next, we control the error | b F k F s(kϵ)| uniformly for k K. The results for | b Dk Db(kϵ)| follow from similar arguments. To do so, we rely on the following well known result (for the sake of completeness, we provide a proof in Appendix E.11).

Lemma 13. Let (yt)t 1 be a sequence of random variables adapted for a filtration Ft, such that yt E [yt|Ft 1] [m, M]. Assume that for t N , ιt {0, 1} is Ft 1-measurable, and define Nt := P l t ιl, and bµt := N 1 t P l t ιl(yl E [yl|Fl 1]) if Nt 1. Then, for any t N and δ (0, 1),

Nt = 0 or |bµt| (M m)

Moreover, for any t > 0 and δ (0, 1),

Nt = t and |bµt| (M m)

For t T, we define ιt := I t T F k , yt := I {st pt}, and we observe that for Ft = σ ((x1, . . . , xt+1, (I{s1 p1}, I{b1 p1}), . . . , (I{st pt}, I{bt pt})), ιt is Ft 1-measurable, and yt is Ft adapted. Moreover,

ιt E [yt | Ft 1] = ιt P x t θs + ξs t x t bθs t + kϵ

= ιt F s x t bθs t θs + kϵ .

Using Lemma 13, we find that with probability 1 2δ, either |T F k | < TFD, or b F k t

s t ιt F s x t bθs t θs + kϵ

Moreover, on the event E1, for all t / T par, |x t (bθs t θs)| ϵ. Using the fact that F s is L-Lipschitz, we find that F s x t bθs t θs + kϵ F s (kϵ) Lϵ.

Thus, with probability 1 2δ, either |T F k | < TFD, or

b F k t F s(kϵ)

Using Lemma 11 and 12, along with a union bound over k K yields the desired result for the choice TFD = 2 log(1/δ)ϵ 2.

Published as a conference paper at ICLR 2025

E.4 PROOF OF LEMMA 6

Lemma 6. On the event EEOC, for all t T C, and all (k, k ) At, b Ik b F k + b Jk b Dk EGFT(xt, x t bθs t + kϵ) (10PL + 4P + 7)ϵ.

Proof. The proof relies on the following result :

Lemma 14. On the event EETC, for all (k, k ) At, and p = x t bθs t + kϵ, we have EGFT(xt, p) b Ik b F k + b Jk b Dk 6PLϵ + 3ϵ + 2P |b F| + |b D| + |b I| + |b J|

where we define b I = I(k ϵ) b Ik , b J = J(kϵ) b Jk, b F = F(kϵ) b F k, b D = D(k ϵ) b Dk .

The proof of this intermediate result is in Appendix E.12.

To conclude the proof of Lemma 6, it remains to bound the gaps b F, b D, b I, and b J on the event EETC. By definition of the event EETC, on this event b I 2ϵ, b D (L + 1)ϵ, b J 2ϵ and b F (L + 1)ϵ. Thus, on the event EETC, EGFT(xt, p) b Ik b F k + b Jk b Dk 6PLϵ + 3ϵ + 4P(L + 1)ϵ + 4ϵ.

This concludes the proof.

E.5 PROOF OF LEMMA 7

Lemma 7. On the event EEOC, we have that max p EGFT(xt, p) EGFT(xt, x t bθs t + k t ϵ) 2LPϵ.

Proof. By definition of At,

sup p [ P,P ] EGFT(xt, p) EGFT(xt, x t bθs t + k t ϵ).

Since the mapping p 7 EGFT(x, p) is continuous, we can define

p t arg max p EGFT(xt, p), kt =

$ p t x t bθs t ϵ

, and k t =

x t bθs t bθb t + kt

We now show that ( kt, k t) At. By Lemma 1, we have that p t x t θs C, and p t x t θb C (otherwise EGFT(xt, p t ) = 0). This implies that p t x t θs [ 2P, 2P], and similarly that p t x t θb [ 2P, 2P].

On the one hand, on the event EETC, x t θs bθs t ϵ, so p t x t bθs t [ 2P ϵ, 2P +ϵ]. Thus, kt K. On the other hand,

x t bθs t bθb t + ktϵ = x t bθs t + ktϵ p t + (p t x t θb) + x t θb bθb t .

Then, on the event EETC,

x t bθs t bθb t + ktϵ [ 2P 2ϵ, 2P + 2ϵ],

, so k t [ 2P 3ϵ, 2P + 3ϵ]. Therefore, k t K. This implies that ( kt, k t) At, so

EGFT(xt, x t bθs t + k t ϵ) EGFT(xt, x t bθs t + ktϵ).

Finally, Lemma 1 and Assumption 2 imply that the function p 7 EGFT(xt, p) is 2LP-Lipschitz continuous. This, in turn, implies that

EGFT(xt, p t ) EGFT(xt, x t bθs t + ktϵ) 2LPϵ.

This proves the statement.

Published as a conference paper at ICLR 2025

E.6 PROOF OF LEMMA 8

Lemma 8. For the choice µ = ϵ P d log 1+B2T

δ 1/2 + A 1 and Tint = 8P 2 log(1/δ)/ϵ2, it

holds that P ESBIP 1 2δ 4δ(2K + 1) 4δ(2K + 1)T.

Proof. The proof relies on Lemma 11 and 12. On top of these results, we have to control the error | b F k t F s(kϵ)| uniformly for k K. To do so, we rely on Lemma 13. For t T, we define ιt = I t / T int T par T +1 and kt = k , yt = I {st pt}, and we note that for Ft = σ ((x1, . . . , xt+1, (I{s1 p1}, I{b1 p1}), . . . , (I{st pt}, I{bt pt})), ιt is Ft 1-measurable, and yt is Ft adapted. Moreover,

ιt E [yt | Ft 1] = ιt P x t θs + ξs t x t bθs t + kϵ

= ιt F s x t bθs t θs + kϵ .

Using Lemma 13, we find that with probability 1 2δt, either N s,k t = 0 or b F k t

s t ιt F s x t bθs t θs + kϵ

(where we recall that we adopt the convention 1/0 = ). Moreover, on the event E1, for all t / T par, |x t (bθs t θs)| ϵ. Since F s is L-Lipschitz, this implies

|F s x t bθs t θs + kϵ F s (kϵ) | Lϵ.

Thus, with probability 1 2δt, b F k t F s(kϵ)

N s,k t + Lϵ.

Using a union bound over k K and t / T int T par T +1 yields the desired result.

Next, we can bound | b Dk t Db(kϵ)| using similar arguments. In particular, we define ιt = I t / T int T par T +1 and k t = k , yt = I {bt pt} and note that

ιt E [yt | Ft 1] = ιt P x t θb + ξb t x t bθs t + ktϵ Ft 1

x t bθb t θb + kϵ + ϵ

x t bθs t x t bθb t + ktϵ ϵ k

By definition of At, we have that if ιt = 1, then x t bθs t x t bθb t +ktϵ ϵ k 1. Using the Lipschitz

continuity of Db, this implies that, conditioned on the event E1,

ιt E [yt | Ft 1] Db(kϵ) 2Lϵ.

The rest of the proof follows similarly.

E.7 PROOF OF LEMMA 9

Lemma 9. Under the assumptions of Lemma 8 and conditioned on the event ESBIP, we have that for all t / T par T +1 T int, and all (k, k ) At;

LCBt(k, k ) EGFT(xt, x t bθs t + kϵ) UCBt(k, k ).

Moreover, it holds that (k t , k t ) Kt, where we recall that

(k t , k t ) arg max (k,k ) At EGFT(xt, x t bθs t + kϵ).

Published as a conference paper at ICLR 2025

Proof. Let us prove the first part of Lemma 9. By Lemma 14, and by definition of the event ESBIP, we have for all (k, k ) At, and all t / T par T +1 T int, EGFT(xt, p) b Ik b F k + b Jk b Dk

6PLϵ + 3ϵ + 2P

N b,k t + 3Lϵ

(12PL + 7)ϵ + 2P

This concludes the first part of Lemma 9. The second claim follows immediately by noticing that by definition, (k t , k t ) At, and that, for all (k, k ) At,

LCBt(k, k ) EGFT(xt, x t bθs t + kϵ) EGFT(xt, x t bθs t + k t ϵ) UCBt(k t , k t ).

This concludes the proof.

E.8 PROOF OF LEMMA 10

Lemma 10. Conditioned on the event ESBIP, if t T SE s,k, then the following condition holds

r(xt, pt) (50PL + 28)ϵ + 16P

Similarly, if t T SE b,k , then the following condition holds

r(xt, pt) (50PL + 28)ϵ + 16P

Proof. Assume that t T SE s,k. Then, our choice of kt together with Lemma 9 ensures that, conditioned on the event ESBIP, UCBt(kt, k t) LCBt(k t , k t ). This implies that

LCBt(kt, k t) + (UCBt(kt, k t) LCBt(kt, k t))

UCBt(k t , k t ) + (LCBt(k t , k t ) UCBt(k t , k t )) .

Using again Lemma 9, this implies that

EGFT(xt, x t bθs + ktϵ) + (UCBt(kt, k t) LCBt(kt, k t)) + (UCBt(k t , k t ) LCBt(k t , k t ))

EGFT(xt, x t bθs + k t ϵ).

EGFT(xt, x t bθs + k t ϵ) EGFT(xt, x t bθs + ktϵ)

(UCBt(kt, k t) LCBt(kt, k t)) + (UCBt(k t , k t ) LCBt(k t , k t )) .

Then, conditioned on the event ESBIP,

UCBt(kt, k t) LCBt(kt, k t) 2(12PL + 7)ϵ + 4P

Since t T SE s,k, we know that N b,k t t N s,kt t . Then, we have that

UCBt(kt, k t) LCBt(kt, k t) 2(12PL + 7)ϵ + 8P

Published as a conference paper at ICLR 2025

Similarly, since t T SE s,k, we have that N b,k t t N s,kt t , and N s,k t t N s,kt t , so we also have

UCBt(k t , k t ) LCBt(k t , k t ) 2(12PL + 7)ϵ + 8P

EGFT(xt, x t bθs + k t ϵ) EGFT(xt, x t bθs + ktϵ) 4(12PL + 7)ϵ + 16P

By Lemma 7, this implies

r(xt, x t bθs + ktϵ) (50PL + 28)ϵ + 16P

The proof of the second claim follows from similar arguments.

E.9 PROOF OF LEMMA 11

Lemma 11. Let E1 be the event :

E1 := n t / T par T +1, x t (bθb t θb) ϵ and x t (bθs t θs) ϵ o .

Then, for the choice µ = ϵ P d log 1+B2T

δ 1/2 + A 1 , we have P (E1) 1 2δ.

Proof. Let us prove the bound |x t (bθs t θs)| ϵ with high probability; the bound on |x t (bθb t θb)| will follow from similar arguments.

We introduce the variables

xt := xt I t T par t+1 and yt := 2P I t T par t+1 I {pt st} 1

and the σ-algebra Ft = σ ((x1, . . . , xt+1, (I{s1 p1}, I{b1 p1}), . . . , (I{st pt}, I{bt pt})) . With these notation, we have that xt and {t T par t+1} are Ft-measurable. Moreover,

E [ yt|Ft 1] = I t T par t+1

P P [u st|Ft 1] du

= I t T par t+1

C I u x t θs + ξ f s(ξ) dξ du P

= I t T par t+1

P duf s(ξ) dξ P

= I t T par t+1

x t θs + Z C

C ξf s(ξ) dξ

= I t T par t+1 x t θs

where in the last equality we used that R C C ξf s(ξ) dξ = E [ξs t ] = 0. Thus, conditionally on Ft 1, yt x t θs is centered and it belongs to [ P, P], which implies that it is P-subgaussian. Now, for all t {1, . . . , T}, we have

l T par T +1

s T par T +1

I {pt st} 1

l<t xl x l + Id

Published as a conference paper at ICLR 2025

Using the fact that, for all t 1, xt B and θs A, and applying Theorem 2 by Abbasi Yadkori et al. (2011), we find that for all t 0, with probability 1 δ,

s<t xl x l +Id P

d log 1 + B2T

Note that our definitions of xt and yt ensure that

s T par T +1 xlx l +Id = bθs t θs P

s<t xl x l +Id.

Moreover, for all t,

|x t (bθs t θs)| x t P

s T par T +1 xlx l +Id

1 bθs t θs P

s T par T +1 xlx l +Id

In particular, if t / T par T +1, x t P

s T par T +1 xlx l +Id

1 µ, so with probability 1 δ,

|x t (bθs t θs)| µ

d log 1 + B2T

For the choice µ = ϵ P q

d log 1+B2T

δ + A 1 , this implies that

|x t (bθs t θs)| ϵ,

which concludes the proof.

E.10 PROOF OF LEMMA 12

Lemma 12. Let E2 be the event:

E2 := n |T int| < Tint or k K b Ik I (kϵ) 2ϵ and b Jk J(kϵ) 2ϵ o E1.

Then, for the choice µ = ϵ P d log 1+B2T

δ 1/2 + A 1 and Tint = 8P 2 log(1/δ)/ϵ2, we have

P (E2) 1 2δ 4δ(2K + 1).

Proof. We control the error b Ik I (kϵ) uniformly for k K; the result for b Jk J(kϵ) can be proved analogously.

For k K, and t T, let us define ιt = I t T int , yt = 2PI n kϵ + x t bθb t pl bt o and

note that for Ft = σ ((x1, . . . , xt+1, (I{s1 p1}, I{b1 p1}), . . . , (I{st pt}, I{bt pt})), ιt is Ft 1-measurable, and yt is Ft-adapted. Moreover,

E [yt|Ft 1] = 2P P h kϵ + x t bθb t pt bt | Ft 1 i

kϵ+x t bθb t

C I x t θb + ξ u f b(ξ) dξ du

kϵ+x t bθb t Db u x t θb du.

Using the change in variables u = u x t θb, this implies that

E [yt|Ft 1] = Z P x t θb

kϵ+x t bθb t x t θb Db (u ) du .

Published as a conference paper at ICLR 2025

Moreover, under Assumptions 1 and 2, P x t θb C and, for u [C, P x t θb], Db(u ) = 0. Thus,

E [yt|Ft 1] = Z C

kϵ+x t bθb t x t θb Db (u ) du

= I(kϵ + x t bθb t x t θb).

Finally, note that yt E [yt|Ft 1] is in [ 2P, 2P]. Then, using Lemma 13, we find that

|T int| = Tint and

P t T int I kϵ + x t bθb t θb

To conclude our proof, note that on the event E1, for all t T int, |x t (bθb t θb)| ϵ, so with probability 1 2δ, either |T int| < Tint or

I kϵ + x t bθb t θb I(kϵ) =

Z kϵ+x t (bθb t θb)

1 F b(λ) dλ

Z kϵ+x t (bθb t θb)

xt bθb t θb

Using the same reasoning to control the error in estimating J, taking a union bound for k K, and using Lemma 11, we find that on an event E2 E1 of probability larger than 1 2δ 4δ(2K + 1), either |T int| < Tint or

b Ik I(kϵ) 4P

and b Jk J(kϵ) 4P

simultaneously for all k K. For the choice Tint = 8P 2 log(1/δ)/ϵ2, we obtain the desired result.

E.11 PROOF OF LEMMA 13

Lemma 13. Let (yt)t 1 be a sequence of random variables adapted for a filtration Ft, such that yt E [yt|Ft 1] [m, M]. Assume that for t N , ιt {0, 1} is Ft 1-measurable, and define Nt := P

l t ιl, and bµt := N 1 t P

l t ιl(yl E [yl|Fl 1]) if Nt 1. Then, for any t N and δ (0, 1),

Nt = 0 or |bµt| (M m)

Moreover, for any t > 0 and δ (0, 1),

Nt = t and |bµt| (M m)

Proof. Let us define Zt := P

l t ιl(yl E [yl|Fl 1]), and for all x R let Mt := exp x Zt 1

8x2(M m)2Nt . We begin by showing that Mt is a super-martingale. Indeed,

Published as a conference paper at ICLR 2025

we have that

E h exp{x ιt(yt E [yt|Ft 1])} Ft 1 i = E h ιt exp{x(yt E [yt|Ft 1])} + (1 ιt) Ft 1 i

ιt exp x2(M m)2

exp x2(M m)2ιt

where we use the fact that (yt E [yt | Ft 1]) is bounded in [m, M] together with the conditional version of Hoeffding s Lemma. Noticing that

Mt = Mt 1 exp x ιt(yt E [yt|Ft 1]) x2(M m)2ιt

this proves that Mt is a super-martingale, and so E [Mt] E [M0] = 1.

Now, for all ϵ > 0 and all l N, and all x > 0, by a Markov-Chernoff argument,

P (Zt ϵ and Nt = l) = P I {Nt = l} ex Zt eϵx

e ϵx E I {Nt = l} ex Zt

= e ϵx+ x2(M m)2l

8 E I {Nt = l} ex Zt x2(M m)2l

Using the previous result, we have that

E I {Nt = l} ex Zt x2(M m)2l

8 = E I {Nt = l} ex Zt x2(M m)2Nt

E ex Zt x2(M m)2Nt

= E(Mt) E(M0) = 1.

This yields

P (Zt ϵ and Nt = l) e ϵx+ x2(M m)2l

In particular, for ϵ = (M m) q

2 and x = 4ϵ l(M m)2 ,

2 and Nt = l

This proves the first part of the Lemma. Summing over the values of l from 1 to t, we find that

Nt log(1/δ)

Similar arguments can be used to prove that

Nt log(1/δ)

In order to conclude the proof we observe that Zt = ˆµt Nt, we normalize by Nt, and observe that adding the case Nt = 0 can only increase the probability.

Published as a conference paper at ICLR 2025

E.12 PROOF OF LEMMA 14

Lemma 14. On the event EETC, for all (k, k ) At, and p = x t bθs t + kϵ, we have EGFT(xt, p) b Ik b F k + b Jk b Dk 6PLϵ + 3ϵ + 2P |b F| + |b D| + |b I| + |b J|

where we define b I = I(k ϵ) b Ik , b J = J(kϵ) b Jk, b F = F(kϵ) b F k, b D = D(k ϵ) b Dk .

Proof. By Lemma 1, for any price p = x t bθs t +kϵ and k such that (k, k ) At, and δs = p x t θs, δb = p x t θb, we have that

EGFT(xt, p) = I(δb)F s(δs) + J(δs)Db(δb).

EGFT(xt, p) = I(k ϵ) + I(δb) I(k ϵ) (F s(kϵ) + F s(δs) F s(kϵ)) +

(J(kϵ) + J(δs) J(kϵ)) Db(k ϵ) + Db(δb) Db(k ϵ) .

Moreover, by letting I = I(δb) I(k ϵ), F = F s(δs) F s(kϵ), J = J(δs) J(kϵ) and D = Db(δb) Db(k ϵ), this yields

EGFT(xt, p) =I(k ϵ)F s(kϵ) + J(kϵ)Db(k ϵ) + I(δb) F+

F s(kϵ) I + J(δs) D + Db(k ϵ) J.

Since I and J are bounded by 2P, and F and D are bounded by 1, this implies that EGFT(xt, p) I(k ϵ)F s(kϵ) + J(kϵ)Db(k ϵ) 2P| F| + | I| + 2P| D| + | J|.

Now, let us introduce es = δs kϵ, and eb = δb k ϵ. Then, we have

es = x t bθs t + kϵ kϵ + x t θs

= x t (bθs t θs).

On event EETC we have |es| ϵ. Since F is L-Lipschitz continuous, and F = F s(kϵ + es) F s(kϵ), this implies | F| Lϵ. Similarly, J is 1-Lipschitz continuous, and J = J(kϵ + es) J(kϵ), so | J| ϵ. Similarly,

eb = x t bθs t + kϵ x t θb + k ϵ

= x t (bθs t bθb t) + kϵ k ϵ + x t (bθb t θb).

By definition of At, under the event EETC we have |eb| 2ϵ. Since D is L-Lipschitz continuous and D = Db(k ϵ + eb) Db(k ϵ), | D| 2Lϵ. Similarly, I is 1-Lipschitz continuous, I = I(k ϵ + eb) I(k ϵ), so this implies that | I| 2ϵ. Putting everything together, we find that on the event EETC, EGFT(xt, p) I(k ϵ)F s(kϵ) + J(kϵ)Db(k ϵ) 6PLϵ + 3ϵ.

Similarly, denoting b I = I(k ϵ) b Ik , b D = D(k ϵ) b Dk , b J = J(kϵ) b Jk and b F = F(kϵ) b F k, we have I(k ϵ)F s(kϵ) + J(kϵ)Db(k ϵ) b Ik b F k + b Jk b Dk 2P|b F| + |b I| + 2P|b D| + |b J|.

This concludes the proof.

Published as a conference paper at ICLR 2025

F MOTIVATING EXAMPLE

Lemma 1 emphasizes that the expected gain from trade at a given price p depends on the quantities δs = p x t θs and δb = p x t θb. Remember that the difference in average valuations is given by = x t θb x t θs, and with this notation, δb = δs . Therefore, the expected gain from trade can be rewritten as a function of the pair (δs, ). As an immediate consequence, we see that the optimal increment δs only depends on the difference in average valuations : if p = x θs + δs maximizes EGFT(x, p), and if x is such that x θb x θs = x θb x θs, then p = x θs + δs maximizes EGFT(x , p ).

On the other hand, the following example shows that there no explicit dependence of the optimal price increment δs on the difference in average valuations . In words, when is small, we might prefer to choose an increment δs that leads to trade happening with lower probabilities but corresponds to higher rewards. By contrast, as increases, similar values of the increment δs will correspond to higher gains if the trade happens. Then, we might choose to post prices corresponding to a smaller increment δs to increase the probability that the trade happens.

This reasoning demonstrates that knowing that an increment δs is optimal for a difference in average valuations does not allow us to determine the optimal increment δ corresponding to a different value of the difference in average valuations. This implies that to precisely identify the optimal price increment δs for all differences in average valuations , it may be necessary to have accurate estimates of the functions F s and I for a broad range of values of δs. Similar arguments can be employed to argue that precise estimates of the value of Db and J for a wide range of values of δb = δs might also be necessary.

To illustrate this phenomenon, we construct an example where different levels of lead to entirely different choices of the optimal price increment δs. Specifically, we consider a scenario where, for certain values of (s1, s2, s3) R3, (b1, b2, b3) R3, (α1, α2, α3) in the simplex, and θ > 0 (to be defined later), the density f s (resp. f b) of the seller s noise ξs (resp. buyer s noise ξb) is given by:

f s(δ) = I{x [s1, s1 + θ] [s2, s2 + θ] [s3, s3 + θ]}

f b(δ) = α1I{x [b1, b1 + θ]} + α2I{x [b2, b2 + θ]} + α3I{x [b3, b3 + θ]}

The setting in illustrated in figure 1.

We assume that (s1, s2, s3) and (b1, b2, b3) verify s1+θ < b1, b1+θ < s2, s2+θ < b2, b2+θ < s3, and s3 + θ < b3. Then, it is straightforward to see that while > 0, and is small enough so that b1 + θ + < s2 and b2 + θ + < s3, the optimal increment belongs to the set {δ1, δ2, δ3}, where δ1, δ2, and δ3 belong respectively to [s1 + θ, b1], [s2 + θ, b2], and [s3 + θ, b3].

We also assume that α1 is much larger than α2, which in turn is much larger than α3. Then, the increment δ1 is such that the trade happens with the highest probability: indeed, if δs = δb > b1, the buyer rejects the trade with a high probability. For the same reasons, the probability of a trade happening at an increment δ2 is much lower, and the probability of a trade happening at increment δ3 is the lowest.

Finally, we assume that b3 s3 is much larger than b2 s2, which in turn is much larger than b1 s1. Then, the gain from any trade happening at increment δ1 is small compared to the gain from trades happening at increment δ2, which in turn is small compared to the gain from a trade happening at increment δ3.

For some well-chosen values of the parameters specified below, when the average gain from trade of the seller and the buyer are both zero (x t θs = x t θb = 0), the most profitable increment is δ3: the probability of the trade happening is less likely, but when it occurs, it is more profitable.

Now, to obtain the probability of a price p = x θs+δs being accepted by the buyer, it is sufficient to translate the cumulative distribution function (c.d.f.) of ξb (represented in Figure 1) by the quantity = x (θb θs). In particular, when > 0 is small enough so that b1 + θ + < s2 and b2 + θ + < s3, we observe that, on the one hand, the probability that the trade happens for the increments δs δ1, δ2, δ3 remains unchanged; however, the corresponding gains if the trade occurs all increase with . For some well-chosen values of the parameters, if is positive but sufficiently

Published as a conference paper at ICLR 2025

Figure 1: Illustration for the c.d.f. f s and f b corresponding to the example described in Section F. The c.d.f. f s is represented in red, the c.d.f. f b is represented in blue.

small, the expected gain from trade is maximized by δs = δ2. When is large, the expected gain from trade is maximized by choosing δs = δ1: in other words, this choice of increment ensures that the sale happens with the highest probability, and each sale leads to a reward of at least . In Figure 2, we plot the expected gain from trade for different values of as a function of δs. The values chosen for the parameters are as follows: (s1, s2, s3) = (0, 2, 6), (b1, b2, b3) = (0.01, 3, 20), (α1, α2, α3) = (0.85, 0.11, 0.04), and θ = 0.001. With these values, for = 0, the optimal increment is δ3 = 10; for = 1, it is δ2 = 2.5; and for = 1.5, it is δ3 = 0.01.

Figure 2: Expected gain from trade for different values of increment δs and difference in average valuations .

G PROOFS OF SECTION 5

Lemma 2. For each round t T B such that bt st, it holds: E[PROFITt(pt, qt)] (bt st)2

T , where the expectation is with respect to the choice of (pt, qt).

Proof. To simplify notation, and since we focus on a single round t T B, we omit the explicit dependence on t from pt, qt, it, st, and bt. We consider two cases. First, if b s 2/T, the inequality is immediately satisfied since the lhs is non-negative while the rhs is 0. Second, if b s > 2/T,

Published as a conference paper at ICLR 2025

let E be the event in which p [s, (s + b)/2]. Since p U([0, P]) we have P(E) = (b s)/2P. Moreover, we have

P i = log 1 b p

| A = 1 log T ,

which is well defined since p s + b

Let E denote the event in which i = log(1/(b p)) . Under E and E we get

q = p + 2 i = p + 2 log(1/(b p)) p + 2 (log(1/(b p))+1) = p + b p

Therefore, when b s + 2/T,

E[PROFIT(p, q)] = (q p) b s 2P log T

2 p b s 2P log T

This concludes the proof.

Lemma 3. For τ log T, it holds with probability at least 1 1/T that X

t JτK PROFITt(pt, qt) α 8P log(T)

t JτK [bt st]+ s

4P 2 log(T) X

t JτK [bt st]+ 2.

Proof. First, we observe that given the sequence (bt, st)t JT K, and a time step τ, by Azuma Hoeffding inequality we have that, with probability at least 1 δ, X

t JτK PROFITt(pt, qt) E[PROFITt(p, q)] s

2 log(1/δ) X

t JτK ([bt st]+)2.

Then, by applying a union bound, the above inequality holds simultaneously for all possible τ with probability at least 1 δT. Then, by setting δ = T 2, with probability at least 1 1/T, it holds: X

t JτK PROFITt(pt, qt)

t JτK E[PROFITt(p, q)] s

t JτK ([bt st]+)2

([bt st]+)2

4 log(1/δ) X

t JτK ([bt st]+)2 (by Lemma 2)

([bt st]+)2

t JτK ([bt st]+)2 2

1 τ ([bt st]+)2

t JτK ([bt st]+)2 2

t JτK ([bt st]+)2 2 (by Jensen s Inequality)

α 8P log(T)

t JτK [bt st]+ s

t JτK ([bt st]+)2 2 (by Definition 1)

α 8P log(T)

t JτK [bt st]+ s

4P 2 log(T) X

t JτK [bt st]+ 2.

Published as a conference paper at ICLR 2025

This concludes the proof.

Theorem 3. Given the two-bit algorithm A, the corresponding one-bit learning algorithm satisfies global budget balance and, with probability at least 1 1/T, has regret

R(1 bit) T O α 3T E log4 T + R(2) T .

Proof. The one-bit algorithm is global budget balanced by construction (see choice of B).

Then, we condition the high probability regret bound on the following events:

With probability 1 1/T the two-bit EOC algorithm guarantees a number of exploration rounds smaller than |T E| and regret at most R(2) T ;

With probability 1 1/T, it holds the inequality in Lemma 3

By Azuma-Hoeffding, with probability at least 1 1/T it holds

t=1 max p EGFT(xt, p)

t=1 [bt st]+ + p

16P 2τ log(T).

Then, the regret can be bounded as follows

t=1 max p EGFT(xt, p)

t=1 EGFT(xt, pt)

t=1 max p EGFT(xt, p) +

max p EGFT(xt, p) EGFT(xt, pt)

t=1 max p EGFT(xt, p) + 2T E + R(2) T

t=1 [bt st]+ + p

16P 2τ log(T) + 2T E + R(2) T

t=1 [bt st]+ + ατ + 2T E + R(2) T

t=1 [bt st]+ + 2T E + R(2) T ,

where in the second-to-last inequality we use that, since τ B/2P, it holds that τ 16P 2α 2 log T (by Equation (5)). The last inequality is by definition of α on the interval JτK.

Then, by Lemma 3 and since Pτ t=1[bt st]+ Pτ t=1 PROFITt(pt, qt) 2048P 4α 2 log3 T, we have

t JτK PROFITt(pt, qt) α 8P log(T)

t JτK [bt st]+ s

4P 2 log(T) X

t JτK [bt st]+ 2

α 8P log(T)

t JτK [bt st]+ s

8P 2 log(T) X

t JτK [bt st]+

α 16P log(T)

t JτK [bt st]+.

Published as a conference paper at ICLR 2025

Then, since B 2048P 4α 2T E log3 T and Pτ t=1 PROFITt(pt, qt) B +2P, by substituting in the expression above we obtain the desired bound on P

t JτK[bt st]+. In particular, we have

t=1 [bt st]+ + 2|T E| + R(2) T

t JτK PROFITt(pt, qt) + 2T E + R(2) T

M P 5 log4 T

α3 T E + 2T E + R(2) T ,

with M is a numeric constant independent on the problem instance.

H ALTERNATIVE ALGORITHM FOR NOISY VALUATIONS WITH TWO-BIT FEEDBACK

The regret of Algorithm 3 is primarily driven by the estimation of the c.d.f. F s and demand function Db uniformly over a grid of increments. We now present an alternative sub-optimal approach. Specifically, to leverage the fact that the optimal increment δs only depends on the difference in average valuations t, we could first execute the sub-routines EST-PAR and EST-INT, yielding estimates of x t θs, x t θb, I and J up to precision ϵ using e O(ϵ 2) samples. Then, we could round the value of t on a grid of size ϵ 1, and run independent Scouting Bandit algorithms (as described in Cesa-Bianchi et al. (2024)) for each of the ϵ 1 rounded values.

The grid size implies a discretization error of order O(ϵ). The highest regret occurs when each of the ϵ 1 independent Scouting Bandit algorithms runs for the same number of rounds Tϵ = Tϵ. Each algorithm incurs a regret e O(T 2/3 ϵ ), so their combined regret is e O(T 2/3 ϵ ϵ 1) = e O(T 2/3ϵ 1/3). For ϵ = T 1/4, this strategy also results in a regret of order e O(T 3/4).