# multislots_online_matching_with_high_entropy__81d9684b.pdf

Multi-slots Online Matching with High Entropy

Xingyu Lu 1 Qintong Wu 1 Wenliang Zhong 1

Online matching with diversity and fairness pursuit, a common building block in the recommendation and advertising, can be modeled as constrained convex programming with high entropy. While most existing approaches are based on the single slot assumption (i.e., assigning one item per iteration), they cannot be directly applied to cases with multiple slots, e.g., stock-aware top N recommendation and advertising at multiple places. Particularly, the gradient computation and resource allocation are both challenging under this setting due to the absence of a closed-form solution. To overcome these obstacles, we develop a novel algorithm named Online sub Gradient descent for Multi-slots Allocation (OG-MA). It uses an efficient pooling algorithm to compute closedform of the gradient then performs a roulette swapping for allocation, yielding a sub-linear regret with linear cost per iteration. Extensive experiments on synthetic and industrial data sets demonstrate that OG-MA is a fast and promising method for multi-slots online matching.

1. Introduction

Online matching plays a crucial role in many real-world applications, such as online advertising with guaranteed contracts (Mehta et al., 2007; Feldman et al., 2009) and goods recommendation under inventory constraints (Talluri & Van Ryzin, 1998; Zhong et al., 2015). Many works have been done to investigate the theoretical properties of this problem, as well as the industrial deployment. Most of them are built on a bipartite graph G(A, T, E) with the single-slot assumption. Here A denotes nodes resource to assign, while T refers to the node of users arriving one by one. And E is the edge set representing valid assignments. For each node

1Ant Group, Hangzhou, China. Correspondence to: Xingyu Lu <sing.lxy@antgroup.com>, Qintong Wu <qintong.wqt@antgroup.com>, Wenliang Zhong <yice.zwl@antgroup.com>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Figure 1: Left figure illustrates the typical online matching application that presents an advertisement in a single slot, and right figure shows a practical search engine results page where several advertisements are blended in multiple slots.

t T, a decision-maker takes an action to match one node from t s neighbors in A to maximize the sum weights of the matched subset edges under the resource constraints on A.

Pioneered works modeled online matching as linear programming (LP) with a primal-dual framework that maintains dual variables induced by resource constraints (Devanur & Hayes, 2009; Agrawal et al., 2014; Li et al., 2020). When a user node arrives, the primal solution is recovered (i.e. picking one item) , and then a gradient descent step is applied to the dual. However, the LP approaches do not take the diversity and fairness into consideration, and thus not suitable for specific tasks. To handle the above issue, the high entropy regularization, encouraging desired properties in advertising and recommendation (Qin & Zhu, 2013; Ahmed et al., 2017; Di Noia et al., 2017), is introduced to online matching (Zhong et al., 2015; Agrawal et al., 2018).

While the above works focus on the single-slots setting, there are a large number of applications involving multislots , such as the product feed recommendation and search result pages with ads (see Figure 1). However, it is nontrivial to extend existing methods to the multi-slots setting with diversity pursuit (Zhang, 2009; Yan et al., 2020). They either rely on a closed-form of gradient computation (Zhong et al., 2015; Agrawal et al., 2018) or assume the primal recovery is computationally lightweight (Balseiro et al., 2020; 2021) , which fail to hold in the presence of N > 1 slots. To the best of our knowledge, a high entropy online matching in the multi-slots setting remains unexplored and

Submission and Formatting Instructions for ICML 2022

would benefit a broad class of applications.

In this paper, we adopt a random permutation input model. That is, the arriving nodes T can be picked adversarially at the start, and the arrival order is allowed to be uniformly distributed over all permutations. Consider the problem on a bipartite graph GN(A, T, E) with N-slots of capacity c. In detail, ci is associated to some properties of the i-th slot, e.g. position bias (Chen et al., 2020) , and has a big impact on the cumulative reward and resources constraints. In each iteration, the decision-maker only observes the revenue and resource consumption in the hindsight for each node arrival t T, and then takes an action Xt [0, 1]A N to allocate N distinct nodes in A to N slots. Our goal is to propose a simple and efficient algorithm that generates maximum matching with high entropy and attains sub-linear regret.

1.1. Our Contribution

We develop a novel algorithm, Online sub Gradient descent for Multi-slots Allocation (OG-MA), to maximize a concave objective consisting of cumulative revenues and high entropy regularizer. In detail, the proposed algorithm first employs the primal-dual framework to update dual variables with online gradient descent(OGD) (Shalev-Shwartz et al., 2011). Next, the proposed algorithm decomposes the recovery of primal solution into a two-step approach. In the first step, we develop a fast Efficiency Pooling Projection (EPP) algorithm to compute gradients of dual variables. In the second step, we propose a Roulette Swapping Allocation (RSA) algorithm to generate the allocation Xt [0, 1]A N.

The proposed OG-MA is simple and efficient for large-scale applications. For each arriving request, our proposed twosteps approach presents an optimal allocation with linear complexity w.r.t. the number of slots N. Meanwhile, the update rule of dual variables is computationally lightweight, which only requires initial dual variables and step size as inputs. We show the effectiveness of the two-steps approach in Proposition 3.1 and prove the optimality of the EPP method in Theorem 5.1.

Under the assumption of the random permutation model, we derive the upper bound for the regret analysis of the proposed algorithm. In detail, our algorithm achieves O((

T) regret, where T is the number of arriving requests and K is the number of resources constraints. In Theorem 5.3, we show that our regret analysis is valid for general concave revenue functions. When the input model degenerates to a stochastic input model with request i.i.d. drawn, the proposed algorithm still attains O(

KT) regret, which is at the optimal order of existing results.

Extensive experiments on synthetic and industrial data sets validate our computation complexity analysis and sub-linear regret results. Further, we show that OG-MA algorithm

achieves positive diversity uplift by the trade-off between the entropy regularizer and the matching revenue.

2. Related Work

The online matching problem has been studied extensively in both theoretical analyses as well as practical implementations. The comparison between our work and the existing literature is summarized in Table 1.

Previous works mainly consider linear revenue functions in the online matching problem. Two prevalent models are adopted: the stochastic input model and the random permutation model. The stochastic input model, a special case of the random permutation model, assumes an online request to be drawn i.i.d. from an unknown distribution. (Devanur et al., 2011) presented an online algorithm with the knowledge of the value of benchmark to attain O(

T) regret, and (Arlotto & Gurvich, 2019) further confirmed that O(

T) is the lowest attainable regret under such setting. (Li & Ye, 2019) improved the regret order to O(log T) by periodically solving a linear program with the strongly convex assumption. Compared with the stochastic input model, the random permutation model in our setting is more general to match practical applications. (Devanur & Hayes, 2009; Feldman et al., 2010) presented a similar two-phase dual training algorithm which attains O(T 2/3) regret. (Agrawal et al., 2014; Kesselheim et al., 2014) developed a dual-based and a primal-beats-dual-based algorithm, respectively. Although their methods achieved O(

T) regret, an additional linear program is required to solve periodically. Our proposed algorithm only requires a single pass through the input data with straightforward update rules and still enjoys linear complexity. (Li et al., 2020) studied a simple and fast algorithm that attains O(log T

T)) regret order. Compared to (Li et al., 2020), our results further improve the regret dependency of the number of resources K from O(K) to O(

K), and work for general concave revenue functions.

Research interests of high entropy matching mainly focus on the single-slot setting. (Zhong et al., 2015) developed a fractional allocation using the dual-descent algorithm in stock constrained recommendation. (Agrawal et al., 2018) provided an offline multi-rounds proportional matching algorithm to maintain the diversity. The sub-problem of recovering primal solution has corresponding closed-formed solution in the aforementioned literature, but the desired properties cannot be extended to the multi-slots setting directly. This paper extends this line of research to the multi-slots setting and presents a fast and efficient online algorithm.

The high entropy matching also extends to a line of works on online allocation problems with concave revenues. Previous works studied general concave maximum objectives with convex feasibility constraints. (Agrawal & Devanur,

Submission and Formatting Instructions for ICML 2022

Table 1: Comparsion of our work with the existing literature on online matching problems

Papers Objective Input model1 Violation risk Linear complexity Regret

(Devanur & Hayes, 2009; Feldman et al., 2010) Linear R.P. No No O(T 2/3)

(Agrawal et al., 2014; Kesselheim et al., 2014) O(

(Li et al., 2020) Linear R.P. Yes Yes O(log T

(Devanur et al., 2011) Linear I.I.D. No No O(

(Li & Ye, 2019) O(log T)

(Agrawal & Devanur, 2014) Concave R.P. Yes No O(

(Balseiro et al., 2020; 2021) Concave I.I.D. No Yes O(

Our paper Concave R.P. No Yes O(log T

1 R.P. denotes the random permutation model and I.I.D. denotes the stochastic input model with samples i.i.d. drawn.

2014) adopted the random permutation model and developed a fast dual-decent online algorithm when taking the knowledge of the value benchmark. Although their results attain better regret order than ours, their algorithm requires a complex step to periodically solve a convex optimization program if the benchmark is unknown. In addition, their work allows the violation of resource constraints, which is hardly acceptable in real-world applications. Instead of the random permutation model, (Balseiro et al., 2020) solved similar problems using a dual mirror descent algorithm with the assumption of i.i.d. input model. We extend the regret analysis to the random permutation model. When the input model degenerates to the stochastic i.i.d. model, our analysis admits the same regret O(

T) but is more concise from a sub-gradient descent perspective. (Balseiro et al., 2021) studied a class of regularized online allocation problems that explicitly considers fairness among the resource consumption. However, their algorithms cannot extend to handle the diversity within each request in the multi-slots setting. Last but not least, the aforementioned researches all assumed that the recovery of the primal solution is computationally lightweight under the primal-dual framework, which is often contrary in real-world applications.

3. Problem Formulation

For the purpose of description, we adopt the terminology from online advertising, where advertisements and users represent the two sides of a bipartite graph GN(A, T, E).

First, we define T as a set of arriving users and A as the advertisements. The candidate advertisements for any t T can be represented by a subset Et of edge set E = {(t, a) : t T, a Et} with size At. Next, we define the decision matrix Xt [0, 1]A N that allocates N advertisements to

N slots for t-th user, where each entry xt,a,n is indexed by an advertisement a and a slot n. Then, the following conditions must be met during a user s visit:

Each slot must be assigned with an advertisement.

An advertisement can be allocated to at most one slot.

The above conditions can be transformed to following constraints:

a Et xt,a,n = 1, t T, 1 n N P

1 n N xt,a,n 1, t T, a Et

To characterize the capacity of impressions among different slots, we introduce vector c = (c1, c2, . . . , c N) (0, 1]N

to incorporate the N-slots information, where cn denotes the impressions capacity for the slot n. To simplify the description, the slots capacity vector c is assumed to be the same for all arriving users and satisfies the following non-increasing condition without losing generality:

1 c1 c2 c N > 0

We construct the following multi-slots matching problem with slots impressions capacity, resource constraints, and a high entropy regularizer:

t=1 r t Xtc + αH(Xtc)

t=1 M t Xtc TB

where rt RA + is the revenue vector of user t s candidate advertisements with the elements setting to zero if a / Et, Mt RA K ++ is the entry-wise positive cost matrix on K

Submission and Formatting Instructions for ICML 2022

resources, and B RK ++ is the resources constraint vector. In order to guarantee a feasible solution given the number of slots N for each t T, we assume that there exists at least N advertisements with revenue entries rt,a = 0 and resource consumption entries mt,a,k = 0. In addition, the revenue rt and resource consumption Mt of the matching problem in (2) are all scaled by the advertisements allocated impressions Xtc for each t T. The diversity of the matching is introduced by an entropy regularizer H RA R defined as:

a A xa log(xa) (3)

In equation (2), α denotes a parameter to control the tradeoff between revenue and diversity.

In the online setting, we adopt the random permutation model (further discussed in Assumption 5.2). In each time t, the algorithm receives a request (rt, Mt) and estimates a fractional matching ˆXt. The cumulative revenue is defined under the random permutations σ over 1, 2, . . . , T:

R( ˆX1, ˆX2, . . . , ˆXT |σ) = Eσ[

t=1 r t ˆXtc + αH( ˆXtc)]

The remaining resources are considered in the way that Pt 1 s=1 M s e Xsc + M t ˆXtc TB must be satisfied when generating a fractional matching ˆXt. Here, f Xt {0, 1}A N is a realization of ˆ Xt by the decision-maker A. Denote θ as the random variable that determines the realization of decision-maker A, the expect revenue of A can be represented by:

R(A|σ) = Eθ[R( ˆX1, ˆX2, . . . , ˆXT |σ)]

When all parameters of T users (i.e. {(rt, Mt)}T t=1) are known in advance, we can solve the optimal matching and obtain the optimal revenue:

R = max X X

( PT t=1 r t Xtc + αH(Xtc), s.t. PT t=1 M t Xtc TB

The regret of the decision-maker is defined as the following:

Regret(A|σ) = R R(A|σ) (5)

Our goal is to design an algorithm that efficiently solves the multi-slots matching problem (2) and constructs a fast real-time allocation. Meanwhile, low regret must be attained without violating the intended constraints.

3.1. The Dual Problem

Our main algorithm (Algorithm 1 in Section 4) is derived upon the dual-descent algorithm. The primary challenge lies in the complexity of solving a sub-problem in dual formation of (2). Given dual variables, the recovery of Xt requires O(N 3A3 t) complexity in the worst case where O(NAt) is

the number of variables in Xt. We address the challenge by introducing an intermediate variable yt := Xtc to decompose the sub-problem into a two-step optimization. In step one, we develop a formation of yt in which the number of variables is reduced to O(At). In step two, an real-time allocation Xt is recovered given yt.

Step-One (Optimization of yt). The yt characterizes advertisements expected impressions in all slots given Xt. Then, we can transform the problem (2) as following:

t=1 r t yt + αH(yt)

t=1 M t yt TB

where the domain constraints Y is defined as:

a Et(n) yt,a Pn s=1 cs, t T, n < N P

a Et yt,a = PN n=1 cn, t T

and Et(n) represents a subset of advertisements Et with size n. The constraints Y implies that cumulative impressions of the top-n advertisements cannot exceed the top-n slots impressions capacity within a multi-slots allocation.

Step-Two (Recovery of Xt). A fractional matching action Xt is recovered with an estimated ˆyt. The recovery procedure is characterized by a linear system: ˆXt = arg Xt X {Xtc = ˆyt} (7)

Proposition 3.1. The optimal X t of problem (2) is equivalent to the ˆXt solved by the two-step problem (6) and (7).

Next, we introduce dual variables λ RK + of the resources constraints to obtain the Lagrangian dual of (6) in a maxmin form. Since the primal objective is concave, we can swap the min and max under the strong duality as following:

min λ 0 max y Y L(λ, y) =

t=1 [r t yt + αH(yt)]

t=1 M t yt)

3.2. Properties of the Optimal Primal Solution of (8)

We present the analysis of (8) to elaborate an efficient algorithm. Given λ, we define vt,a as the contribution value to the maximum objective of advertisement a for user t, and let et,a be the efficiency value for primal solution yt,a by:

vt,a := exp(rt,a P

k λkmt,a,k α ); et,a := vt,a

Submission and Formatting Instructions for ICML 2022

Next, we arrange the advertisements index Et with size At in the order:

vt,1 vt,2 vt,At, t T. (9)

The following proposition shows that the optimal y t,a and its efficiency value e t,a are in the same order as vt,a.

Proposition 3.2. For the optimal solution y t (λ) of (8) given λ, it holds for all t that:

y t,1 y t,2 y t,At; e t,1 e t,2 e t,At.

Then, we can divide the Et into blocks that share the same efficiency value. The following gives a formal definition.

Definition 3.3. For any feasible primal solution yt of (8), there exists a unique disjoint block set {Br}l r=1 that can divide Et (sorted by (9)) into l blocks with:

1. et,i = et,j, i, j Br, 1 r l;

2. et,i = et,j, i Br, j Br+1, 1 r l 1.

The efficiency value of block Br is defined as :

a Br vt,a P

a Br yt,a . (10)

Further, each block Br can be formed by its elements:

B(p,q] := {a|p < a q, a Et}

Then, we present two propositions to show additional properties of optimal solution y t (λ) which help to design the algorithm. Proofs are shown in Appendix C and D.

Proposition 3.4. (Equal efficiency within a block). {Br} is the block set of y t (λ), it holds for any block Br := B(p,q] {Br} :

E(B (p,r]) = E(B (p,q]), p < r < q.

Proposition 3.5. (Decreasing efficiency between blocks). {Br} is the block set of y t (λ), if yt,r, yt,r+1 belong to different blocks, the following inequality holds:

E(B (p,r]) > E(B (r,q]), p < r < q.

Remark 3.6. When slot N = 1 and capacity c = 1, y t,a(λ) = vt,a/ P

a vt,a is a closed-formed solution of the dual problem (8), and it satisfies Proposition 3.2, 3.4, and 3.5.

Remark 3.7. We introduce the advantage of pool adjacent violators algorithm (De Leeuw et al., 2010), a widely used method in the isotonic regression with a quadratic minimum objective, into our algorithm design. Although the ideas are similar, the objective in our setting is essentially different. In the next section, we propose a fast algorithm through the analysis in Proposition 3.2, 3.4, and 3.5.

Algorithm 1 Online sub Gradient descent for Multi-slots Allocation (OG-MA)

Input: User set T, the step-size η; and initialize dual variables λ0 = 0. for t = 1 to T do

Receive a stochastic request with (rt, Mt). Solve the expected impressions ˆyt for all advertisements using efficiency pooling projection; Update the allocated impressions under the remaining resources:

( ˆyt, if Pt 1 s=1 M s e Xsc + M t ˆyt TB, 0, otherwise. (11)

Make the realization e Xt of primal solution Xt with given eyt by roulette swapping allocation. Compute gradients g(λt) of λt where:

g(λt) := B M t ˆyt.

Update λ by projected subgradient descent:

λt+1 = Projλ 0{λt ηg(λt)} (12)

4. Multi-Slots Online Algorithm

The proposed OG-MA algorithm is summarized in Algorithm 1 consisting of three steps. First, the Efficiency Pooling Projection (EPP) method is proposed to solve an estimated ˆyt(λ) of the dual problem (8). The allocated impressions eyt of advertisements is updated by (11) given quantities of remaining resources. Next, a Roulette Swapping Allocation (RSA) algorithm is adopted to solve step-two problem (7) and make a realization e Xt. At last, since the dual (8) is convex with respect to λ, OGD is employed to update the dual variables λ with project operation.

Efficiency Pooling Projection (Given λ and obtain yt). We assume that the candidate advertisements Et have been sorted by vt,1 vt,2 vt,At. In the initialization phase l = 0, we allocate the i-th advertisements to the i-th slot, so that we can decompose all advertisements Et into N + 1 blocks {B(0) r }N+1 r=1 as:

( {r}, 1 r N; {a|N < a At}, r = N + 1. (13)

The efficiency of each block is initialized as:

a Br vt,a P

a Br ca I(1 a N). (14)

Algorithm 2 shows how to generate an optimal block set {Br}. In each iteration l, if there exists two adjacent blocks

Submission and Formatting Instructions for ICML 2022

Algorithm 2 Efficiency Pooling Projection (EPP)

Input: User request (rt, Mt), dual variable λ. Sort Et in decreasing order by vt,a. Initialize Et into blocks {B(0) r }N+1 r=1 by (13), compute efficiency value E(B0 r) by (14) and set l = 0. repeat

Step1. Merge B(l)-blocks if E(B(l) r ) E(B(l) r+1). Step2. Update the merged blocks B(l+1) r := B(l) r and efficiency value E(B(l+1) r ) for all r, i.e.. Step3. If exists E(B(l+1) r ) E(B(l+1) r+1 ), then increase l = l + 1 and go back to Step1. until E(B(l) r ) > E(B(l) r+1) for all block r. Output: ˆyt,a = vt,a/E(Br), a Br and block index r.

with increasing efficiency E(B(l) r ) E(B(l) r+1), a pooling action is taken to merge blocks B(l) r = B(l) r B(l) r+1 and to enforce all elements within the new block share the same efficiency by (14). The algorithm stops when all the blocks efficiency decreases, e.g., E(B(l) r ) > E(B(l) r+1). At this point, properties in Proposition 3.4 and Proposition 3.5 are properly ensured. Finally, the assigned impressions ˆyt are obtained with minimal cost by ˆyt,a = vt,a/E(Br), a Br in the Definition 3.3.

Roulette Swapping Allocation(Given yt and recover Xt) To achieve the expected impressions eyt, Algorithm 3 is proposed to allocate advertisements to slots by swapping their positions. Let r(a) be the ranking order and yt,a be the actual allocated impressions of advertisement a. We allocate the i-th advertisement sorted by vt,a to the i-th slot in the initialize step. Then, we have r(a) = a, 1 a At and yt,a = ca when a N, otherwise, yt,a = 0. In each round, we compare the advertisement j s impressions yt,j with the expected value eyt,j, and then we perform the swapping operation on s S with probability ps computed by (15). This step in Algorithm 3 utilizes the excess impressions of advertisements in index set S to make up for under-allocated advertisements .

The swapping operation in Algorithm 3 is always feasible. Given yt,s > eyt,s eyt,j, the swapping probability ps and the cumulative value P

s S ps both lie in [0, 1].

Projected sub Gradient Descent (Online update λ). As shown in Algorithm 1, once an allocation e Xt is recovered, the gradient g(λt) of λt is readily obtained by g(λt) := B M t ˆyt. Then, the descent step of dual variables λt+1 is given by (12) with step-size η.

4.1. Time Complexity

The proposed OG-MA enjoys linear complexity w.r.t. the number of slots N. For each request t, the online allocation

Algorithm 3 Roulette Swapping Allocation (RSA)

Input: Expected impressions eyt computed by (11). Initialize position order r(a) = a, the expectation of allocated impressions yt,a = ca I(1 a N), a Et and index set S = {}. for j = 1 to At do

if yt,j > eyt,j then

Put j into the index set: S = S {j} else

Swap r(j) and r(s), s S with probability:

ps = eyt,j yt,j

eyt,s yt,s P

s S(eyt,s yt,s ). (15)

Update the allocated impressions of j by yt,j = eyt,j Update the allocated impressions of s S by:

yt,s = (1 ps)yt,s + psyt,j,

and then remove s from S if yt,s = eyt,s. end if end for Output: Allocate a Et to r(a)-th slot if r(a) N.

starts by sorting Et, and then applies EPP and RSA. The complexity of sorting Et is O(At log At). In EPP method, N + 1 blocks are initialized at first and then at least two adjacent blocks are merged in each iteration. The merge operation obtains O(1) complexity and the number of iterations is at most N, so that the EPP achieves O(N) linear time complexity. In RSA algorithm, the update rule yt,j = eyt,j in each round j ensures that the swap operation is taken at most At times. In each swapping operation, the time cost comes from computing the probabilities ps S (S contains at most N elements). Thus, the time complexity of the RSA is O(NAt). Combining the above results, the total complexity of OG-MA is O(N + NAt + At log At) for each t, outperforming the O(N 3A3 t) complexity of directly solving the optimal Xt by a general solver.

5. Regret Analysis

In this section, we explain the optimality of the solution ˆyt solved by EPP method, and present the regret bound of OG-MA under the random permutation model.

5.1. The Optimality of EPP

Theorem 5.1. The block set {Br} and the primal solution ˆyt solved by EPP method is optimal with given λ. Suppose Lt is the block size of {Br}, the dual problem (8) can be formulated as the convex optimization of dual variable λ:

Submission and Formatting Instructions for ICML 2022

min λ 0 L(λ) = Tλ B+

r=1 αI(Br)(ln V (Br) ln I(Br))

where I(Br) = P

a Br ˆyt,a and V (Br) = P

Proof sketch. The proof of Theorem 5.1 consists of two parts. First, we apply the results in Proposition 3.2 and further present the existence of an optimal {Br} to attain the results in Theorem (5.1). Then, we use the previous two propositions to show that the block set generated by Algorithm 2 is optimal in that it can neither be divided (conflicted with Proposition 3.5) nor merged (conflicted with Proposition 3.4) further. Complete proof is shown in Appendix E.

5.2. Regret Bound

To derive regret bound, we adopt the random permutation model. In this model, arriving users t T with {(rt, Mt)}T t=1 can be picked adversarially at the start, and the arrival order of each (rt, Mt) is uniformly distributed over all the permutations. The random permutation model is more general than the stochastic input model in which users arrive i.i.d. from an unknown distribution.

We state the following assumptions to formalize the random permutation model in the multi-slots setting:

Assumption 5.2. (Assumption on random permutation.) We assume that:

(i). The request (rt, Mt) arrives in a random order σ.

(ii). For each request (rt, Mt), there exits r R++ and m R++ such that ||rt|| r and ||M t Xtc|| m.

(iii). There exists b, b such that b Bk b, 1 k K.

Assumption (i) states the permuted observation of the input requests. Assumptions (ii) and (iii) provide a bound of the input request and resources constraints respectively. In the next theorem, r, m, b, and b will appear in the analysis of regret bound.

Theorem 5.3. Consider the OG-MA with step-size η > 0 and initial dual solution λ0 = 0. Suppose Assumption 5.2 is satisfied, the regret can be upper bounded by:

Regret(A|σ) K(m2 + b 2) 2 ηT + C2

b2η + KC log T

T + (K + 2m

where C = (PN n=1 cn)(r + α log A) and A is the number of advertisements.

Proof sketch. We prove the theorem in three steps. First, although our algorithm can not exceed the resource constraints in decision-making, we prove that the loss of primal objective is well bounded by the analysis technique of the stopping time (Balseiro et al., 2020). In other words, resources will not be depleted too early. Second, we consider the cumulative revenue without the concern of resources violation. Given λt in each arrival t, there exists an objective gap between the coming unobserved requests and the total requests in the random permutation model. We present two lemmas F.3 and F.4 to further bound the primal performance of our algorithm. Finally, we put the above two steps together to obtain the result of the theorem. All proofs are shown in Appendix F.

When the step-size η is chosen by O(1/

KT) in Theorem 5.3, we show that our algorithm attains a O((

T) regret, and therefore our proposed OG-MA achieves sub-linear regret as the number of arrivals and resource constraints vary. Although the algorithms (Agrawal et al., 2014; Kesselheim et al., 2014; Agrawal & Devanur, 2014) attain better regret than ours, they require known estimation on the benchmark or costly computation of periodically solving large-scale optimization problems. Adopting the stochastic input model, we can obtain the same O(

KT) regret as (Balseiro et al., 2020) with a more concise analysis by eliding the Lemma F.3. Moreover, the regret analysis in high entropy matching can extend to general matching with a concave objective.

Remark 5.4. By choosing step-size η = O(C/

KT), the regret order is O(C(

T). The C in Theorem 5.3 represents an upper bound to the maximum primal objective of an allocation. It is related to the top-N advertisements revenue and the total impressions capacity PN n=1 cn. Adopting the position-based click model (Craswell et al., 2008) with cn = 1/nγ, which states the fact that users generally have a higher probability of focusing on the top slots over the bottom slots, we imply that our regret is sub-linear w.r.t. N:

When γ = 1, our regret is of order O(log N);

2, our regret is of order O(

6. Experiments

In this section, we elaborately design the numeric experiment to reveal the difficulty of directly solving the primal problem when the number of slots increases. Next, we consider a large-scale problem, namely display ads, that aims at maximizing total revenue under budget constraints in the search system. The experiment results verify our regret analysis and confirm the diversity uplift by incorporating the high entropy regularizer.

Submission and Formatting Instructions for ICML 2022

(100,2) (100,5) (100,10) (500,10) (1000,10) Paramters (A_t, N)

Elapsed Time (s)

7.2x10 -2 7.3x10 -2 7.7x10 -2 4.4x10 -1 9.4x10 -1 1.4x10 2 1.5x10 2 2.1x10 2

3.2x10 3 Methods OG-MA DSDA with CVX

Figure 2: Computing time required escalating the scale of the problem by tweaking N and At.

6.1. Experimental Setup

Synthetic Dataset. We generate the synthetic data to simulate the request with (rt, Mt) at time t = 1, 2, . . . , T. In detail, we sample the rt following the approach of (Zhong et al., 2015) and set the amount of resource consumption to the constant mt,a,k = 1 for each a and k. Detailed data pre-processing procedure can be found in Appendix G.1.

Industrial Dataset. We utilize Ali Express Searching System Dataset - France to evaluate the performance of the proposed algorithm on a large-scale problem. The dataset contains 570,288 users and 8,822,801 samples in total. Each user t has a corresponding revenue function rt,a = f(t, a) for evaluating a list of advertisements in A. Next, we artificially construct 20 budget constraints B in total and limit N = 10 slots for each result page where advertisements are expected to be allocated. The k-th constraint value is defined as following: bk = C T/

A, where C is the total impression capacity C = Pn i ci, T and A denote the number of users and ads, respectively. The amount of resource consumption is set to the constant mt,a,k = 1 for each advertisement. The description of estimated revenue function f(t, a) can be found in Appendix G.2.

Experiment Protocol. We construct the user set T by randomly sampling from the dataset, and the size of T is set to T = 10000. Next, we randomly permute the set T to generate the random permutation σ. In the online setting, users arrive sequentially following the order of σ with the estimated revenue vector rt and a global resource matrix Mt. And the decision-maker is expected to solve the multi-slots matching with full knowledge of the capacity c. The choice of the impression capacity c can be found in Appendix G.3. Given the above random permutation model σ and the user t with corresponding rt and Mt, we solve the online allocation problem at each time t and compute the cumulative revenue. Note that we no longer allocate any impressions to advertisements whose resources are depleted, though the corresponding dual variable is continually updated.

2 4 6 8 10 # of slot

Elapsed Time (ns)

103 (a) real-world dataset

0 1000 2000 3000 4000 # of slot

Elapsed Time (ns)

108 (b) synthetic dataset

Figure 3: Elapsed inference time for different N.

6.2. Efficiency of OG-MA

Utilizing the synthetic dataset, we compare OG-MA to the extension of DSDA (Balseiro et al., 2021) in the multislots setting. Concretely, we implement DSDA via using CVX package (Diamond & Boyd, 2016) with MOSEK solver (Ap S, 2022), and apply the DSDA to directly solve the problem (2). We simulate the scale of the industrial applications by setting (At, N) to the moderate size (Huang et al., 2020) and report elapsed time averaging from 10 trials. As can be seen from Figure 2, the OG-MA is 3 4 order faster than DSDA with CVX in general. In particular, for the case (At = 100, N = 2), the baseline method requires hundreds of seconds to obtain the solution (Note that T = 10000 in our experiment), which is already unacceptable in real-world applications. Further, the baseline method takes 3200 seconds to optimize the problem with (At = 1000, N = 10). The above results are far beyond satisfactory since an online server has to deal with thousands of requests within seconds. On the opposite, the computation cost is only 1 second for OG-MA given (At = 1000, N = 10).

6.3. Inference Efficiency for Different N

Figure 3 plots the elapsed time for real-time inference with different slots N in both the industrial dataset and the synthetic dataset. We choose the regularization level α = 0.01 and decay factor γ = 1

2. The number of advertisements At for the industrial and the synthetic dataset is set to 20 and 215, respectively. In the industrial dataset, the left part of Figure 3 shows that the inference procedure is within a few milliseconds where the number of N is relatively small in real-world applications. Besides, in the synthetic dataset, we set N from 1 to 212 to evaluate the complexity for completeness. The right part of Figure 3 clearly shows that our OG-MA algorithm efficiently makes an online allocation, where the complexity grows sub-linearly w.r.t. the number of slots N. In addition, we argue that the input data leads to the gap between empirical (sub-linear) results as shown in Figure 3 and theoretical (linear) results in Section 4.1. Precisely, O(N + NAt + At log At) is the worst complex-

Submission and Formatting Instructions for ICML 2022

2000 4000 6000 8000 10000 T

2000 4000 6000 8000 10000 T

Figure 4: Empirical regret along horizon T for different N.

ity order of our OG-MA algorithm given fully adversarially designed input data.

6.4. Verification of Regret Bound

In this experiment, the regret analysis is validated using the Ali Express dataset. We plot the mean regret with 95% confidence interval over 100 random trials for each value of T and set α = 0.01. Along the horizon, our algorithm attains sub-linear regret consistently. As stated in Remark 5.4, the choice of click models affect the total impression capacity, which further determines the order of the upper regret bound. Figure 4 demonstrates regret variations of different hyperparameters γ for the choice of click models. Tuning the number of slots, γ = 1 leads to O(log N) regret while γ = 1

2 leads to O(

N) regret, which coincides with the theoretical analysis in Theorem 5.3.

6.5. Trade-off between Revenue and Diversity

1e-5 1e-4 1e-3 1e-2 1e-1 1e0 Regularization level

103 (a) revenue and entropy

1e-5 1e-4 1e-3 1e-2 1e-1 1e0 Regularization level

# of a specific ads

(b) distribution of a top ads

Figure 5: (a) shows the trade-off between entropy and revenue; (b) illustrates the diversity of a top ads allocation.

Figure 5 demonstrates how the revenue and entropy change in different regularization levels and presents the distribution of specific top advertisements allocated in different slots. We choose the regularization level α {10 5, 10 4, 10 3, 10 2, 10 1, 1} in the experiment. The results suggest that higher entropy leads to better diversity in matching while potentially decreasing the revenue. In detail, by setting α = 0.01, the algorithm achieves a good trade-off that matching diversity is noticeably improved with acceptable revenue reduction.

7. Conclusion

This paper proposes a new online matching framework for a multi-slots online matching problem. We introduce the high entropy in the matching objective, which generates diversified results, and prove that our algorithm enjoys a sublinear expected regret bound without constraint violations. Besides, we conduct extensive experiments to validate the superior performance of the proposed framework.

Submission and Formatting Instructions for ICML 2022

Agrawal, S. and Devanur, N. R. Fast algorithms for online stochastic convex programming. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pp. 1405 1424. SIAM, 2014.

Agrawal, S., Wang, Z., and Ye, Y. A dynamic near-optimal algorithm for online linear programming. Operations Research, 62(4):876 890, 2014.

Agrawal, S., Zadimoghaddam, M., and Mirrokni, V. Proportional allocation: Simple, distributed, and diverse matching with high entropy. In International Conference on Machine Learning, pp. 99 108. PMLR, 2018.

Ahmed, F., Dickerson, J. P., and Fuge, M. Diverse weighted bipartite b-matching. ar Xiv preprint ar Xiv:1702.07134, 2017.

Ap S, M. MOSEK Fusion API for Python. Release 9.3.20, 2022. URL https://docs.mosek.com/latest/ pythonfusion/index.html.

Arlotto, A. and Gurvich, I. Uniformly bounded regret in the multisecretary problem. Stochastic Systems, 9(3): 231 260, 2019.

Balseiro, S., Lu, H., and Mirrokni, V. Dual mirror descent for online allocation problems. In International Conference on Machine Learning, pp. 613 628. PMLR, 2020.

Balseiro, S., Lu, H., and Mirrokni, V. Regularized online allocation problems: Fairness and beyond. In International Conference on Machine Learning, pp. 630 639. PMLR, 2021.

Chapelle, O. and Zhang, Y. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th international conference on World wide web, pp. 1 10, 2009.

Chen, J., Dong, H., Wang, X., Feng, F., Wang, M., and He, X. Bias and debias in recommender system: A survey and future directions. ar Xiv preprint ar Xiv:2010.03240, 2020.

Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. An experimental comparison of click position-bias models. In Proceedings of the 2008 international conference on web search and data mining, pp. 87 94, 2008.

De Leeuw, J., Hornik, K., and Mair, P. Isotone optimization in r: pool-adjacent-violators algorithm (pava) and active set methods. Journal of statistical software, 32:1 24, 2010.

Devanur, N. R. and Hayes, T. P. The adwords problem: online keyword matching with budgeted bidders under random permutations. In Proceedings of the 10th ACM conference on Electronic commerce, pp. 71 78, 2009.

Devanur, N. R., Jain, K., Sivan, B., and Wilkens, C. A. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In Proceedings of the 12th ACM conference on Electronic commerce, pp. 29 38, 2011.

Di Noia, T., Rosati, J., Tomeo, P., and Di Sciascio, E. Adaptive multi-attribute diversity for recommender systems. Information Sciences, 382:234 253, 2017.

Diamond, S. and Boyd, S. CVXPY: A Pythonembedded modeling language for convex optimization. Journal of Machine Learning Research, 2016. URL https://stanford.edu/ boyd/papers/ pdf/cvxpy_paper.pdf. To appear.

Feldman, J., Korula, N., Mirrokni, V., Muthukrishnan, S., and P al, M. Online ad assignment with free disposal. In International workshop on internet and network economics, pp. 374 385. Springer, 2009.

Feldman, J., Henzinger, M., Korula, N., Mirrokni, V. S., and Stein, C. Online stochastic packing applied to display ad allocation. In European Symposium on Algorithms, pp. 182 194. Springer, 2010.

Guo, F., Liu, C., and Wang, Y. M. Efficient multiple-click models in web search. In Proceedings of the second acm international conference on web search and data mining, pp. 124 131, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015.

Huang, J.-T., Sharma, A., Sun, S., Xia, L., Zhang, D., Pronin, P., Padmanabhan, J., Ottaviano, G., and Yang, L. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2553 2561, 2020.

Kesselheim, T., T onnis, A., Radke, K., and V ocking, B. Primal beats dual on online packing lps in the randomorder model. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 303 312, 2014.

Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767 776. PMLR, 2015.

Submission and Formatting Instructions for ICML 2022

Li, X. and Ye, Y. Online linear programming: Dual convergence, new algorithms, and regret bounds. ar Xiv preprint ar Xiv:1909.05499, 2019.

Li, X., Sun, C., and Ye, Y. Simple and fast algorithm for binary integer and online linear programming. ar Xiv preprint ar Xiv:2003.02513, 2020.

Mehta, A., Saberi, A., Vazirani, U., and Vazirani, V. Adwords and generalized online matching. Journal of the ACM (JACM), 54(5):22 es, 2007.

Qin, L. and Zhu, X. Promoting diversity in recommendation by entropy regularizer. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107 194, 2011.

Talluri, K. and Van Ryzin, G. An analysis of bid-price controls for network revenue management. Management science, 44(11-part-1):1577 1593, 1998.

Van Der Vaart, A. W., van der Vaart, A. W., van der Vaart, A., and Wellner, J. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 1996.

Yan, J., Xu, Z., Tiwana, B., and Chatterjee, S. Ads allocation in feed via constrained optimization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3386 3394, 2020.

Zhang, M. Enhancing diversity in top-n recommendation. In Proceedings of the third ACM conference on Recommender systems, pp. 397 400, 2009.

Zhong, W., Jin, R., Yang, C., Yan, X., Zhang, Q., and Li, Q. Stock constrained recommendation in tmall. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2287 2296, 2015.

Submission and Formatting Instructions for ICML 2022

A. Proof of Proposition 3.1.

The proof of the proposition consists of two components. First, we obtain that the optimal solution X t to problem (2) is also a feasible solution to the two-step problems (6)(7) because X t satisfies the domain constraints Y in problem (6). Then, the inequality holds:

t=1 r t y t + αH(y t )

t=1 r t X t c + αH(X t c)

where y t is the optimal solution to problem(6). In the next step, we show that the linear equations(7) always have a feasible solution Xt with given optimal y t . This can be easily shown in the feasibility analysis of the swapping probabilities in Algorithm 3. Consequently, it brings the following inequality:

t=1 r t y t + αH(y t )

t=1 r t X t c + αH(X t c)

Combining above two inequalities, we conclude the equivalent result.

B. Proof of Proposition 3.2

We further introduce τt as the dual variables of domain constraints Y to obtain the Lagrangian dual formulation:

min λ 0,τ 0 max y Y

t=1 [r t yt + αH(yt) +

s=1 cs sup Et(n) { X

a Et(n) yt,a})] + λ (TB

t=1 M t yt) (16)

With the K.K.T. conditions, we can further obtain that the optimal y t can be formulated given λ and τt:

n=1 cn) vt,a QN 1 n=1 Pt,a,n P

a [vt,a QN 1 n=1 Pt,a,n] (17)

where Pt,a,n = exp( τt,n I(a S(Et,n))

α ) is a penalty term to enforce the cumulative top-n value of y t under domain constraints Y and I( ) is the indicator function. Here, S(yt, n) represents the top-n elements in yt.

We first prove that the top-n elements inverted by (9) are the top-n elements of optimal y t in the first half of Proposition 3.2. By contradiction, if there exists vt,i > vt,j and y t,i < y t,j, then we have Pt,i,n > Pt,j,n, n < N by the definition of Pt,a,n and non-negative properties of τt. This incurs vt,i QN 1 n=1 Pt,i,n > vt,j QN 1 n=1 Pt,j,n given vt,i > vt,j. Bring it into the formula (17), we have y t,i > y t,j which is conflict with the assumption. Hence, we have y t,a in the same order as vt,a.

In the next step, we show that the e t,i also in the same order as vt,a. Using the above result, Pt,a,n in (17) can be further

represented as Pt,a,n = exp( τt,n I(a n)

α ). For any y t,i y t,j, we have Pt,i,n Pt,j,n n, implying QN 1 n=1 Pt,i,n QN 1 n=1 Pt,j,n. Furthermore, the following holds: P

a [vt,a QN 1 n=1 Pt,a,n] QN 1 n=1 Pt,i,n P

a [vt,a QN 1 n=1 Pt,a,n] QN 1 n=1 Pt,j,n which implies e t,i e t,j and the proof is concluded.

C. Proof of Proposition 3.4.

Using the result in Proposition 3.2, we conclude that the efficiency value et,a and contribution value vt,a are in the same order for an optimal solution y t . Then, we gather elements together if et,a = et,a+1 for all a Ea sorted by vt,a, and thus we can formulate a unique optimal block set {Br} as defined in Definition 3.3. In each block, all elements share the same efficiency and the poof is concluded.

Submission and Formatting Instructions for ICML 2022

D. Proof of Proposition 3.5.

As stated in Proposition 3.5, if yt,r, yt,r+1 belong to different blocks in the optimal block set {Br} , then we have:

E(B (p,r]) =

p<a r vt,a P

p<a r y t,a

a [vt,a QN 1 n=1 Pt,a,n] PN n=1 cn

p<a r vt,a P

p<a r[vt,a QN 1 n=1 Pt,a,n]

a [vt,a QN 1 n=1 Pt,a,n] PN n=1 cn 1 QN 1 n=1 Pt,r,n

a [vt,a QN 1 n=1 Pt,a,n] PN n=1 cn 1 QN 1 n=1 Pt,r+1,n

r<a q vt,a P

r<a q y t,a = E(B (r,q])

The first and last equations come from the definition of block s efficiency in Definition (10). By substituting (17) into the first equation, we derive the second equation. The third, fourth, and fifth lines hold based on the analysis of the penalty term Pt,a,n in Proposition 3.2.

E. Proof of Theorem 5.1.

Following Proposition 3.2, we consider the minimum dual problem (8) with fixed λ. In particular, we have the following results by complementary slackness:

τ t,n = 0, Pn s=1 cs < sup Et(n) {P

a Et(n) y t,a}

τ t,n > 0, Pn s=1 cs = sup Et(n) {P

a Et(n) y t,a}. (19)

Let t = {a1, a2, ..., a Lt 1|a1 < a2... < a Lt 1} be the index set with Lt 1 indices which satisfies τ t,a > 0, a t . After the elements a Et is inverted by vt,a, it s easy to find that t constitutes Lt 1 cut points of Et that can constitute a disjoint block set {Br} with Lt blocks. Each block B r can be represented as B (ar 1,ar]. Then, the following efficiency equation holds:

E(B (ar 1,ar]) =

P ar 1<a<ar vt,a P

ar 1<a ar y t,a =

P ar 1<a<ar vt,a P

ar 1<a ar I(1 a N)ca (20)

We prove the block set solved by Algorithm 2 contain the same cut points with t by contradiction. If the block set {Br} solved by Algorithm 2 is not the optimal block, then one of the following two conditions must be met:

There exist a block B(p,q] {Br}, and an inner point r (p, q] belongs to the optimal separator points t .

There exist two Blocks B(p,r], B(r,q] {Br}, p < r < q that need to be merged as one of the optimal blocks {Br} .

If the first condition holds, then there must be a merge step in Algorithm 2 that leads to the follows: P

p<a r vt,a P

p<a r I(1 a N)ca = E(B(p,r]) E(B(r,q]) =

r<a q vt,a P

r<a q I(1 a N)ca . (21)

Besides, r belongs to the optimal separator points t , it holds that X

0<a r y t,a = X

0<a r I(1 a N)ca (22)

and 0 p < r < q N: X

p<a r y t,a X

p<a r I(1 a N)ca (23)

Submission and Formatting Instructions for ICML 2022

r<a q y t,a X

r<a q I(1 a N)ca (24)

By substituting (23) and (24) into (21), we can obtain E(B (p,r]) E(B (r,q]) which is conflicted with Proposition 3.5.

Similarly, if the second condition holds, we have the following inequality held by the stop condition in Algorithm 2: P

p<a r vt,a P

p<a r I(1 a N)ca = E(B(p,r]) > E(B(r,q]) =

r<a q vt,a P

r<a q I(1 a N)ca . (25)

Besides, if B (p,q] is an optimal block, then it holds that X

p<a r y t,a X

p<a r ca I(1 a N) (26)

Then we can obtain the following inequalities by combining (25) and (26) :

E(B (p,r]) =

p<a r vt,a P

p<a r y t,a

p<a r vt,a P

p<a r I(1 a N)ca >

p<a q vt,a P

p<a q I(1 a N)ca = E(B (p,q]). (27)

It is conflicted with Proposition 3.4. Hence, we have the block set solved by Algorithm 2 containing the same cut points as t . Moreover, the efficiency value E(Br) is equal to E(B r) for all r r.t. by (21). This proves that {Br} is optimal, and further obtains the formulation of the convex optimization of λ in Theorem 5.1.

F. Proof of Theorem 5.3

For a better description, we define ft(yt) := r t yt + αH(yt) and |c| := PN n=1 cn , and it holds that:

|f|max = max y Y ft(y) |c|(r + α log A), t T; |f|min = min y Y ft(y) 0, t T (28)

where A is the number of advertisements. We show that our regret analysis in this part is valid for general concave objective functions ft(yt) with finite upper and lower bounds.

First, we give the following lemma that states the dual space of the optimal λ under random order σ.

Lemma F.1. For any s requests selected from the sample set {(rt, Mt)}T t=1, λ s is the optimal dual solution of the dual problem:

min λs 0 max y Y

t=1 ft(yt) + λ s (s B

t=1 M t yt)

, then it holds that:

||λ s||1 |f|max

Proof. For any s requests {(rt, Mt)}s t=1, if λ s denotes the optimal dual solution, we have:

t=1 ft(yt(λ s)) =

t=1 ft(yt(λ s)) + (λ s) s X

t=1 (B M t yt(λ s))

t=1 ft(ˆy0) + (λ s) s X

t=1 (B M t ˆy0)

s(|f|min) + sb||λ s||1.

where ˆy0 denotes a decision matrix with entries equal to 0 for all advertisements (except the advertisements with zero revenue and resource consumption), and thus we have M t ˆy0 = 0, t T. The first inequality comes from the maximum properties of yt(λ s) in the dual problem (8). The second inequality comes from the definition of |f|min and b. Applying |f|max to the left half of inequalities, we can obtain the result in this lemma.

In the next part, we decompose the proof of Theorem 5.3 into three steps. First, although our algorithm can not exceed the resource constraints when making decisions, we prove that the loss of primal objective is well bound after the resources

Submission and Formatting Instructions for ICML 2022

are depleted. Second, we consider the cumulative revenue computed by solution ˆyt in Algorithm 2 without the concern of resource violation. The revenue R(ˆy1, ˆy2, . . . , ˆy T |σ) is independent of the random variable θ in Algorithm 3. Let τA denote the stopping time when the resources are depleted, and we can further bound the regret by:

Regret(A|σ) (R R(ˆy1, ˆy2, . . . , ˆy T )) + (|f|max |f|min)Eθ[T τA] (30)

Then, we present the bounds of the above two parts respectively and finally put them together.

Part1. The expected loss by resource violation.

Through the following lemma, we give a bound of Eθ[T τA]:

Lemma F.2. Consider the projected subgradient descent in Algorithm 1 with gradient g(λt) = B M t ˆyt and update rule λt+1 = Projλ 0{λt ηg(λt)}. Suppose λ0 = 0, then it holds that

||λt|| |f|max

b + ηm, t T.

Furthermore, the following inequality can bound the expectation of T τA given random variable θ by:

Eθ[T τA] 2m

Proof. The lemma can be proved by citing Proposition 2 in (Balseiro et al., 2020) under the gradient descent framework. Besides, the random variable θ in the real-time decision is independent of the request input model, so we can obtain that the analysis of the expected stopping time Eθ[T τA] in (Balseiro et al., 2020) still holds under the random permutation model.

Part2. The primal performance.

In the stochastic input model, the coming unobserved requests achieve the same expected primal objective with the total T requests under the t-th updated dual variables λt. In the random permutation model, there exists a gap between the above two expectations. In this part, we give a bound on the gap, then we prove that the updated dual solution λt in each iteration t is too bad dual solution for the coming T t requests. Specifically, we present the following two lemmas:

Lemma F.3. For any s requests arrive in a random order from total requests {(rt, Mt)}T t=1, we denote their optimal primal objective as

R s = max y Y

t=1 ft(yt).

Then, it holds that:

s R s] |f|max

2m log s s + |f|max K

Proof. We denote the dual formulation of arriving s requests as Ls(λ) = Ps t=1[ft(yt(λ)) + λ (B Mtyt)]. Then, we assume λ s as the optimal dual solution of R s, and it holds that:

1 s R s = 1

s Ls(λ s) = 1

s max y Y min λ 0

t=1 [ft(yt) + (λ) (B Mtyt)]

t=1 [ft(yt) + (λ s) (B Mtyt)]

t=1 [ft(yt(λ T )) + (λ s) (B Mtyt(λ T )]

t=1 ft(yt(λ T )) + (λ s) (B 1

s Mtyt(λ T ))

In the next step, we prove that for any s request selected from the total samples {(rt, Mt)}T t=1, the resource violation can be well controlled with high probability while the primal solution is recovered by λ T . Concretely, for each resource constraint

Submission and Formatting Instructions for ICML 2022

k, the following inequalities hold:

t=1 m t,kyt(λ T ) Bk d)

t=1 m t,kyt(λ T ) 1

t=1 m t,kyt(λ T ) d)

s(m)2/2 + dm)

where the first inequality is because the optimal dual solution λ T is feasible with 1

T PT t=1 m t,kyt(λ T ) Bk, and the second inequality comes from Hoeffding-Bernstein s Inequality for sampling without replacement (Van Der Vaart et al., 1996). By choosing the gradient violation threshold d =

2m log(s) s , it holds that:

s(m)2/2 + dm) 1

Taking the union of the above bounds with all K constraints, we readily obtain:

P({ 1 k K, 1

t=1 m t,kyt(λ T ) Bk

2mlog(s) s }) K

Finally, we obtain the bound in this lemma:

s R s] Eσ[ 1

t=1 ft(yt(λ T )) 1

t=1 ft(yt(λ T )) + (λ s) (1

s Mtyt(λ T ) B)]

= Eσ[(λ s) (1

s Mtyt(λ T ) B)]]

Eσ[||λ s||1]

2m log(s) s + |f|max K

The first inequality comes from (33), and the second line comes from the probability properties in the random permutation model. Applying (35) as a conditional probability to the expectation, we can obtain the third inequality. With the bound of ||λ s||1 and |f|max, we finish the proof.

Lemma F.4. Consider the projected subgradient descent step in Algorithm 1. Suppose λ0 = 0, then it holds that:

t=1 λ t g(λt) K(m2 + b 2) 2 ηT (37)

Proof. For any input random order σ, it holds that:

t=1 λ t g(λt) =

t=1 [(λ t λ t+1)g(λt) + λ t+1g(λt)]

t=1 [η||g(λt)||2 2 + λ t+1(λ t λ t+1 η )]

t=1 [η||g(λt)||2 2 + 1

2η (||λt||2 2 ||λt+1||2 2 ||λt λt+1||2 2)]

η 2||g(λt)||2 2 + 1

2η ||λ0||2 2 ||λT +1||2 2

K(m2 + b 2) 2 ηT

The first four equations come from the update rule of gradient descent and combination of cumulative term. The last

Submission and Formatting Instructions for ICML 2022

inequality is bounded by the l2-norm of the gradient in Assumption 5.2.

Using the above two lemmas, we have the bound of primal performance:

R R(ˆy1, ˆy2, . . . , ˆy T |σ) = Eσ[T 1

t=1 ft(yt)]

t R t ]] + Eσ[

t R t ft(yt)]]

t R t ]] + Eσ[

t=1 [ 1 T t + 1R T t+1(t, t + 1, ..., T) ft(yt)]]

where R T t+1(t, t + 1, ..., T) represents the optimal objective among the coming unobserved T t + 1 requests. The first part of the last inequality can be bound by Lemma F.3. For the second part of the last inequality, we have:

R (t, t + 1, ..., T) f(yt)]] Eσ[

s=t [fs(ys(λt)) + λ t (B ys(λt))] f(yt(λt))]]

s=t λ t (B ys(λt))]]

t=1 λtg(λt)]

where := T t + 1. The first inequality comes from the minimum properties of the optimal dual objective. The second inequality and the third equation hold because the coming unobserved T t + 1 requests are independent of the dual solution λt.

Applying the results in Lemma F.3 and Lemma F.4 to (39) and (40), we can bound the primal performance:

R R(ˆy1, ˆy2, . . . , ˆy T |σ) |f|max

T + |f|max K(log T + 1) (41)

In the above inequality, we apply the results that PT t=1 1/T log T + 1 and PT t=1 1/

T. Finally, we finish the proof of Theorem 5.3 by putting the results together from (31), (41), and (30).

G. Additional details on the numerical experiments

G.1. Synthetic Dataset

The data are generated from the model y = Xβ + ϵ, with ϵ N(0, σ2) and σ = 0.5. For a varying number of slots, we employ the setting: β = [0, . . . , 0 | {z } 0.2N , 2, . . . , 2 | {z } 0.3N , 0, . . . , 0 | {z } 0.2N , 2, . . . , 2 | {z } 0.3N ], where N is the number of slots. And X N(0, σ2) with

σ = 0.1. In the sampling procedure, we keep the sample if its value is in [0, 1], otherwise, we discard it.

G.2. Estimated Revenue

We employed a two-layers Multi-Layer Perceptron (MLP) to parameterize the revenue function f(t, a). Specifically, we set embedding size demb = 8 for each sparse feature, learning rate = 0.001, batch size = 256, l2 regularizer = 0.001, and select Adam as the optimizer. The two-layers MLP is of [32, 16] hidden units, and its parameters are initialized with He Normal initializer (He et al., 2015) of a standard deviation of 0.02.

G.3. Click Model

For simplicity, we adopt the position-based model (Craswell et al., 2008) that the impression capacity vector c = (c1, c2, . . . , cn) RN follows a decay probability of cn = 1/nγ, indicating that users have a higher probability examining the top slots than the bottom slots. Other click models (Guo et al., 2009; Chapelle & Zhang, 2009; Kveton et al., 2015) can be applied as well in the proposed algorithm.