The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Online Pandora's Boxes and Bandits

Hossein Esfandiari,¹ Mohammad Taghi Hajiaghayi,² Brendan Lucier,³ Michael Mitzenmacher⁴
¹Google Research, ²University of Maryland, ³Microsoft Research, ⁴Harvard University
esfandiari@google.com, hajiagha@cs.umd.edu, brlucier@microsoft.com, michaelm@eecs.harvard.edu

Abstract

We consider online variations of the Pandora's box problem (Weitzman 1979), a standard model for understanding issues related to the cost of acquiring information for decision-making. Our problem generalizes both the classic Pandora's box problem and the prophet inequality framework. Boxes are presented online, each with a random value and cost drawn jointly from some known distribution. Pandora chooses online whether to open each box given its cost, and then chooses irrevocably whether to keep the revealed prize or pass on it. We aim for approximation algorithms against adversaries that can choose the largest prize over any opened box, and use optimal offline policies to decide which boxes to open (without knowledge of the value inside).¹ We consider variations where Pandora can collect multiple prizes subject to feasibility constraints, such as cardinality, matroid, or knapsack constraints. We also consider variations related to classic multi-armed bandit problems from reinforcement learning. Our results use a reduction-based framework in which we separate the issue of the cost of acquiring information from the online decision process of which prizes to keep. Our work shows that in many scenarios, Pandora can achieve a good approximation to the best possible performance.

¹See Section 2 for formal definitions.

1 Introduction

Information learning costs play a large role in a variety of markets and optimization tasks. For example, in the academic job market, obtaining information about a potential match is a costly investment for both sides of the market. Economizing on information costs is an important component of efficiency in such settings.

A classic model for information learning costs is the Pandora's box problem, attributed to Weitzman (Weitzman 1979), which has the following form. Pandora has n boxes, where the i-th box contains a prize of value $v_i$ that has a known cumulative distribution function $F_i$. It costs $c_i$ to open the i-th box and reveal the actual value $v_i$. Pandora may open as many boxes as she likes, in any order. The payoff is the maximum-valued prize, minus the cost of the opened boxes. That is, if S is the subset of opened boxes, then the payoff Pandora seeks to maximize is
\[
  \max_{i \in S} v_i - \sum_{i \in S} c_i .
\]

The Pandora's box problem incorporates two key decision aspects: the ordering in which to open boxes, and when to stop. It has been proposed for applications such as buying or selling a house and searching for a job.

The original Pandora's box problem has a simple and elegant solution. The reservation price $v_i^*$ associated with an unopened box i is the value at which Pandora would be indifferent between taking a prize with that value and opening box i. That is,
\[
  v_i^* = \inf\{y : y \ge -c_i + E[\max(v_i, y)]\} = \inf\{y : c_i \ge E[\max(v_i - y, 0)]\} .
\]
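For a concrete feel for this formula, consider a small illustrative instance (the numbers below are chosen only for this example and are not from the paper): box i pays $v_i = 2$ with probability $1/2$ and $v_i = 0$ otherwise, and costs $c_i = 1/2$ to open.

```latex
% Illustrative reservation-price computation (assumed two-point prize distribution).
% For 0 <= y <= 2:
\[
  E[\max(v_i - y, 0)] = \tfrac{1}{2}(2 - y),
\]
% and setting this equal to the opening cost c_i = 1/2 gives
\[
  \tfrac{1}{2}(2 - y) = \tfrac{1}{2} \;\Longrightarrow\; v_i^* = 1 .
\]
% Pandora is indifferent between holding a prize worth 1 and paying 1/2 to open box i.
```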
Weitzman's result says that if Pandora is allowed to choose the ordering, she should keep opening boxes in order of decreasing reservation price, and should stop searching as soon as the largest prize value obtained exceeds the reservation price of every unopened box. An alternative to Weitzman's proof (Weitzman 1979) of this result was recently provided by Kleinberg, Waggoner, and Weyl (Kleinberg and Weyl 2016), who also present additional applications, including to auctions. Very recently, Singla (Singla 2018) generalized the approach of Kleinberg et al. (Kleinberg and Weyl 2016) to further applications in offline combinatorial problems such as matching, set cover, facility location, and prize-collecting Steiner tree.

In other similar problems, the ordering is chosen adversarially and adaptively. For example, in the prophet inequality setting, first introduced in 1977 by Garling, Krengel, and Sucheston (Krengel and Sucheston 1978; 1977), the boxes have no cost and the prize distributions are known, but the decision-maker has to decide after each successive box whether to stop the process and keep the corresponding prize; if not, the prize cannot be claimed later. It is known that there exists a threshold-based algorithm that in expectation obtains a prize value within a factor of two of the expected maximum prize, and the factor of two is tight (Krengel and Sucheston 1978; 1977). There have subsequently been many generalizations of the prophet inequality setting, especially to applications in online auctions (see, e.g., (Mohammad Taghi Hajiaghayi and Sandholm 2007; Alaei, Hajiaghayi, and Liaghat 2012; Hajiaghayi, Kleinberg, and Parkes 2004; Alaei 2014; Yan 2011; Kleinberg and Weinberg 2012; Babaioff, Immorlica, and Kleinberg 2007; Lachish 2014; Feldman, Svensson, and Zenklusen 2015; Goel and Mehta 2008; Korula and Pál 2009; Mahdian and Yan 2011; Karande, Mehta, and Tripathi 2011; Kesselheim et al. 2013; Guruganesh and Singla 2017; Esfandiari et al. 2017; Duetting, Feldman, and Lucier 2017)).

Another related and well-studied theme includes multi-armed bandit problems and, more generally, reinforcement learning (see, e.g., (Gittins and Jones 1974)). In this setting, each box corresponds to a strategy, or arm, that has a payoff in each round. An online algorithm chooses one arm from a set of arms in each round over n rounds. Viewed in the language of selection problems, this translates to a feasibility constraint on the set of boxes that can be opened. Multi-armed bandit problems have applications including online auctions, adaptive routing, and the theory of learning in games.

In this paper, we consider a class of problems that combine the cost considerations of Pandora's box with the online nature of prophet inequality problems. Again boxes are presented online, here with random values and costs drawn jointly from some distribution. Pandora chooses online whether to open each box, and then whether to keep it or pass on it. We aim for approximation algorithms against adversaries that can choose the largest prize over any opened box, and use optimal offline policies in deciding which boxes to open, without knowledge of the values inside. We consider variations where Pandora can collect multiple prizes subject to sets of constraints. For example, Pandora may be able to keep at most k prizes, the selected prizes may be required to form an independent set in a matroid, or the prizes might have associated weights that form a knapsack constraint.
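To keep the classic baseline concrete before turning to these online variants, the following is a minimal sketch of Weitzman's reservation-price policy for boxes with finite prize distributions; the function names and the discrete-distribution representation are illustrative assumptions rather than code from the original paper.

```python
import random

def reservation_price(support, probs, cost, iters=60):
    """Return inf{y >= 0 : cost >= E[max(v - y, 0)]} for a finite prize distribution."""
    def excess(y):
        return sum(p * max(v - y, 0.0) for v, p in zip(support, probs))
    lo, hi = 0.0, max(support)
    if excess(lo) <= cost:        # opening never pays off, even against an empty hand
        return 0.0
    for _ in range(iters):        # E[(v - y)+] is continuous and non-increasing in y
        mid = (lo + hi) / 2.0
        if excess(mid) <= cost:
            hi = mid
        else:
            lo = mid
    return hi

def weitzman(boxes, rng):
    """boxes: list of (support, probs, cost). Returns the realized payoff of one run."""
    sigmas = [reservation_price(s, p, c) for (s, p, c) in boxes]
    order = sorted(range(len(boxes)), key=lambda i: sigmas[i], reverse=True)
    best, paid = 0.0, 0.0
    for i in order:                  # open boxes in decreasing reservation price ...
        if best >= sigmas[i]:        # ... and stop once the best prize in hand beats
            break                    #     every remaining reservation price
        support, probs, cost = boxes[i]
        paid += cost
        best = max(best, rng.choices(support, weights=probs, k=1)[0])
    return best - paid

# Example: the two-point box from the worked example above, plus a longer-shot box.
print(weitzman([([0.0, 2.0], [0.5, 0.5], 0.5), ([0.0, 10.0], [0.9, 0.1], 0.2)],
               random.Random(0)))
```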
We also introduce variations related to classic multi-armed bandit problems and reinforcement learning, where there are feasibility constraints on the set of boxes that can be opened. Our work shows that in many scenarios, even without the power of ordering choices, Pandora can achieve a good approximation to the best possible performance.

Our main result is a reduction from this general class of problems, which we refer to as online Pandora's box problems, to the problem of finding threshold-based algorithms for the associated prophet inequality problems where all costs are zero. Our reduction is constructive, and results in a polynomial-time policy, given a polynomial-time algorithm for constructing thresholds in the reduced problem. We first describe the reduction in Section 2. Then, in Section 3, we show how to use known results from the prophet inequality literature to directly infer solutions to online Pandora's box problems. Finally, in Section 4, we establish an algorithm for a multi-armed bandit variant of the online Pandora's box problem by proving a novel multi-armed prophet inequality.

2 Pandora's Boxes Under General Constraints

In this section we consider a very general version of an online Pandora's box problem, with the goal of showing that, if there is a suitable corresponding prophet inequality algorithm, we can use it in a way that yields good approximation ratios for the Pandora's box problem.

We define the problem as follows. There is a sequence of boxes that arrive online, in an order chosen by an adversary (i.e., in worst-case order). Each box has a cost $c_i$, a value $v_i$, and a type $t_i$. The tuple $(c_i, v_i, t_i)$ is drawn from a joint distribution $F_i$. The distributions are known in advance. When a box is presented, we observe its type $t_i$. We can then choose whether to open the box. We note that, given the type $t_i$, the values $v_i$ and $c_i$ have conditional distributions depending on $t_i$. There is a set of constraints dictating which combinations of boxes can be opened; these constraints can depend on the indices of the boxes and their types. If we open the box, then $v_i$ and $c_i$ are revealed, and we pay $c_i$ for opening the box. We must then choose (irrevocably) whether to keep and collect $v_i$. There is also a set of constraints dictating which combinations of values can be kept; these constraints can depend on the indices of the boxes and their types. We denote the set of opened boxes by $S \subseteq \{1, \ldots, n\}$ and the set of kept boxes by $R \subseteq S$. The goal is to maximize the expected utility
\[
  E\Big[\sum_{i \in R} v_i - \sum_{i \in S} c_i\Big].
\]

One might want to consider an adversary that obtains $E[\max_{R,S} (\sum_{i \in R} v_i - \sum_{i \in S} c_i)]$, that is, a fully clairvoyant adversary. However, it is not possible to provide any competitive algorithm against such an adversary, even for the simple classical Pandora's box problem.²

We denote the expected utility of an algorithm Alg by $u_{Alg}$. We compare our algorithms against the (potentially exponential-time) optimal offline algorithm Opt that maximizes the expected utility. Specifically, Opt can see all the types $t_i$ of all the boxes, and Opt can choose to open boxes in any order. However, Opt does not learn the resulting cost and value $c_i$ and $v_i$ of a box until it is opened. Opt iteratively and adaptively opens boxes, and at the end chooses a subset of the opened boxes to keep. Of course, Opt must respect the constraints on opened boxes and kept boxes.
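To pin down the timing of information and decisions in this model, here is a minimal sketch of the online interaction just described; the class and method names are illustrative, and the feasibility checks are left abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class OnlinePandoraRun:
    """One pass over boxes arriving in adversarial order (illustrative skeleton)."""
    can_open: Callable[[List[int]], bool]   # feasibility of the opened set S
    can_keep: Callable[[List[int]], bool]   # feasibility of the kept set R (a subset of S)
    opened: List[int] = field(default_factory=list)
    kept: List[int] = field(default_factory=list)
    utility: float = 0.0

    def step(self, i, t_i, open_rule, keep_rule, draw: Callable[[], Tuple[float, float]]):
        # 1. Observe only the type t_i, then decide whether to open box i.
        if not open_rule(i, t_i) or not self.can_open(self.opened + [i]):
            return
        v_i, c_i = draw()            # 2. Opening reveals (v_i, c_i); the cost is paid now.
        self.opened.append(i)
        self.utility -= c_i
        # 3. Irrevocably decide whether to keep v_i, subject to the keeping constraint.
        if keep_rule(i, t_i, v_i) and self.can_keep(self.kept + [i]):
            self.kept.append(i)
            self.utility += v_i
```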
We first prove some fundamental lemmas that capture important structure for this Pandora's box problem. We then use these lemmas to provide a strong connection between the online Pandora's box problem and prophet inequalities. Our results allow us to translate several prophet inequality algorithms, such as prophet inequalities under capacity constraints, matroid constraints, or knapsack constraints, into algorithms for the online Pandora's box problem under the same constraints.

²Consider the following example with n identical (and independent) boxes, where we have no constraint on opening boxes and can accept exactly one box at the end. The value of each box is 0 with probability 1/2 and 2 with probability 1/2; the cost of each box is 1. Note that the cost of each box is equal to its expected value. Hence, the expected utility of any online algorithm is upper bounded by 0. However, with probability $1 - \frac{1}{2^n}$ at least one of the boxes has value 2. A fully clairvoyant adversary opens only a box with value 2 and obtains utility $2\big(1 - \frac{1}{2^n}\big) - 1 = 1 - \frac{1}{2^{n-1}}$. This is a positive utility when $n \ge 2$.

Fundamental Lemmas

Our lemmas allow us flexibility in considering the distribution of costs, and show how we can preserve approximation ratios.

Definition 1. Let $F_1, \ldots, F_n$ and $F'_1, \ldots, F'_n$ be two sequences of boxes. Denote the outcomes of $F_i$ and $F'_i$ by $(v_i, c_i, t_i)$ and $(v'_i, c'_i, t'_i)$, respectively. We say the two sequences are cost-equivalent if (a) they can be coupled so that for all i we have $v_i = v'_i$ and $t_i = t'_i$, and (b) $E[c_i \mid t_i] = E[c'_i \mid t_i]$ for all $t_i$.

Lemma 2. Let $F_1, \ldots, F_n$ and $F'_1, \ldots, F'_n$ be two cost-equivalent sequences of boxes. Let Alg' be an online (resp., offline) algorithm that achieves an expected utility $u_{Alg'}$ on boxes $F'_1, \ldots, F'_n$. Then there exists an online (resp., offline) algorithm Alg that achieves the same expected utility on boxes $F_1, \ldots, F_n$.

Proof: We will first suppose that Alg' is online, so that the order of arrival is predetermined and the types are revealed online. We define the algorithm Alg using the run of algorithm Alg' on a simulated set of boxes $F'_1, \ldots, F'_n$. When Alg' attempts to open a box $F'_i$ we do the following. We open $F_i$, and let $(v_i, c_i, t_i)$ be the outcome. Then we draw a triple $(v'_i, c'_i, t'_i)$ from $F'_i$, conditioning on $v'_i = v_i$ and $t'_i = t_i$, and report it to Alg'. Thus Alg opens a box if and only if Alg' does, and likewise keeps a box if and only if Alg' does.

Let $Y_i$ be a binary random variable that is 1 if Alg opens box i and 0 otherwise. Also, let $X_i$ be a binary random variable that is 1 if Alg keeps box i and 0 otherwise. Note that for any particular i, at the time the algorithm decides about $Y_i$, the cost $c_i$ is unknown to the algorithm. Moreover, $c_i$ may be correlated with $t_i$, but is independent of all observations of the algorithm from prior rounds. Therefore, after conditioning on $t_i$, $Y_i$ is independent of $c_i$. We have
\begin{align*}
u_{Alg'} &= E\Big[\sum_{i=1}^n X_i v'_i\Big] - \sum_{i=1}^n E[Y_i c'_i] && \text{definition of utility; linearity of expectation}\\
&= E\Big[\sum_{i=1}^n X_i v'_i\Big] - \sum_{i=1}^n E_{t'_i}\big[E[Y_i c'_i \mid t'_i]\big] && \text{draw $t'_i$ first}\\
&= E\Big[\sum_{i=1}^n X_i v'_i\Big] - \sum_{i=1}^n E_{t'_i}\big[E[Y_i \mid t'_i]\, E[c'_i \mid t'_i]\big] && \text{$Y_i$ indep. of $c'_i$ given $t'_i$}\\
&= E\Big[\sum_{i=1}^n X_i v_i\Big] - \sum_{i=1}^n E_{t_i}\big[E[Y_i \mid t_i]\, E[c_i \mid t_i]\big] && \text{cost-equivalence and the coupling}\\
&= E\Big[\sum_{i=1}^n X_i v_i\Big] - \sum_{i=1}^n E_{t_i}\big[E[Y_i c_i \mid t_i]\big] && \text{$Y_i$ indep. of $c_i$ given $t_i$}\\
&= E\Big[\sum_{i=1}^n X_i v_i - \sum_{i=1}^n Y_i c_i\Big] && \text{linearity of expectation}\\
&= u_{Alg}. && \text{Alg opens and keeps the same sets as Alg'}
\end{align*}

The case where Alg' is an offline algorithm is similar. The only difference is that the full profile of types is known to the algorithm in advance, and hence $Y_i$ and $X_i$ can depend on this profile.
We therefore fix the type profile, interpret the variables $X_i$ and $Y_i$ as being conditioned on this realization of the types, interpret all expectations with respect to this conditioning, and the argument proceeds as before (noting that the distribution of $c'_i$ can depend on $t'_i$, but is independent of the other types). Note that this actually simplifies the chain of equalities above, as the conditioning on $t'_i$ in the second line is trivial and unnecessary when $t'_i$ is fixed.

Lemma 3. Let $F_1, \ldots, F_n$ and $F'_1, \ldots, F'_n$ be two cost-equivalent sequences of boxes. Let Alg' be an online (resp., offline) α-approximation algorithm on boxes $F'_1, \ldots, F'_n$. Then there exists an online (resp., offline) α-approximation algorithm Alg on boxes $F_1, \ldots, F_n$.

Proof: Let Opt and Opt' be the optimal (offline) algorithms for boxes $F_1, \ldots, F_n$ and $F'_1, \ldots, F'_n$, respectively. Let Alg be the algorithm of Lemma 2 applied to Alg'. Moreover, note that applying Lemma 2 to Opt implies that there is some offline algorithm $\widetilde{Alg}$ on boxes $F'_1, \ldots, F'_n$ such that $u_{Opt} = u_{\widetilde{Alg}}$. We bound the approximation factor of Alg as follows:
\begin{align*}
\frac{u_{Alg}}{u_{Opt}} &= \frac{u_{Alg'}}{u_{Opt}} && \text{by Lemma 2}\\
&= \frac{u_{Alg'}}{u_{\widetilde{Alg}}} && u_{Opt} = u_{\widetilde{Alg}}\\
&\ge \frac{u_{Alg'}}{u_{Opt'}} && \text{definition of Opt'}\\
&\ge \alpha. && \text{Alg' is an α-approximation algorithm}
\end{align*}

Next we define the commitment Pandora's box problem on boxes $F''_1 = (v''_1, c''_1, t''_1), \ldots, F''_n = (v''_n, c''_n, t''_n)$. The commitment Pandora's box problem is similar to the Pandora's box problem with the following two restrictions, which we refer to as freeness and commitment, respectively.

Freeness: Opening any box is free, i.e., for all i we have $c''_i = 0$.

Commitment: If a box $F''_i$ is opened and the value $v''_i$ is the maximum possible value of $F''_i$, then $v''_i$ is kept.

Note that the commitment constraint is without loss of generality for an online algorithm, but is a non-trivial restriction for an offline algorithm.

For each i, and for each type $t_i$, we define a threshold $\sigma_i$ so that we have $E[v_i - \min(v_i, \sigma_i) \mid t_i] = E[c_i \mid t_i]$. If $c_i = 0$, we set $\sigma_i$ to the supremum of $v_i$. (It is possible for $\sigma_i$ to be infinite, with the natural interpretation.) We use the boxes $F''_i = (\min(v_i, \sigma_i), 0, t_i)$ in the following theorem.

Theorem 4. Let Alg'' be an α-approximation algorithm for the commitment Pandora's box problem on boxes $F''_1, \ldots, F''_n$. Then there exists an α-approximation algorithm Alg for the Pandora's box problem on $F_1, \ldots, F_n$.

Proof: First we define a sequence of boxes $F'_1, \ldots, F'_n$, where for all i we have $F'_i = (v_i, v_i - \min(v_i, \sigma_i), t_i)$. That is, we set $v'_i = v_i$, $t'_i = t_i$, and $c'_i = v_i - \min(v_i, \sigma_i)$. Note that by the definition of $\sigma_i$ we have $E[v_i - \min(v_i, \sigma_i) \mid t_i] = E[c_i \mid t_i]$ for all $t_i$, so the two sequences are cost-equivalent. Thus, by Lemma 3, an (online) α-approximation algorithm for the Pandora's box problem on $F'_1, \ldots, F'_n$ implies an α-approximation algorithm for the Pandora's box problem on $F_1, \ldots, F_n$, as desired.

Next, we construct the algorithm Alg' required by Lemma 3 using Alg''. To construct Alg', whenever Alg'' attempts to open a box $F''_i$, we open $F'_i$ and report $(v'_i - c'_i, 0, t_i) = (\min(v_i, \sigma_i), 0, t_i) = F''_i$ to Alg''. Alg' keeps the same set of boxes as Alg''.

Let $Y_i$ be a binary random variable that is 1 if Alg' opens box i and 0 otherwise. Also, let $X_i$ be a binary random variable that is 1 if Alg' keeps box i and 0 otherwise. Note that $v''_i = \min(v_i, \sigma_i) \le \sigma_i$, and $v''_i$ achieves its maximum value whenever $v_i \ge \sigma_i$; in this case, by the commitment constraint, we have $X_i = Y_i$.
Therefore we either have $v_i - \min(v_i, \sigma_i) = 0$ or $X_i = Y_i$, which gives us
\[
  Y_i \big(v_i - \min(v_i, \sigma_i)\big) = X_i \big(v_i - \min(v_i, \sigma_i)\big). \qquad (1)
\]
Then for any fixed profile of types, and taking expectations conditional on those type realizations, we have
\begin{align*}
u_{Alg'} &= E\Big[\sum_{i=1}^n X_i v_i - \sum_{i=1}^n Y_i \big(v_i - \min(v_i, \sigma_i)\big)\Big]\\
&= E\Big[\sum_{i=1}^n X_i v_i - \sum_{i=1}^n X_i \big(v_i - \min(v_i, \sigma_i)\big)\Big]\\
&= E\Big[\sum_{i=1}^n X_i \min(v_i, \sigma_i)\Big]\\
&= u_{Alg''},
\end{align*}
where the first equality holds since Alg' opens and keeps the same sets as Alg'' (so the indicators $X_i$ and $Y_i$ are shared), and the second equality uses (1).

Similarly, we can show $u_{Opt'} = u_{Opt''}$, where Opt' is the optimal algorithm for the Pandora's box problem on $F'_1, \ldots, F'_n$ and Opt'' is the optimal algorithm for the commitment Pandora's box problem on $F''_1, \ldots, F''_n$. Therefore Alg' is an α-approximation algorithm for the Pandora's box problem on $F'_1, \ldots, F'_n$, as promised.

A Reduction for Online Pandora's Box Problems

In this section we use Theorem 4 to provide a strong connection between the online Pandora's box problem under general constraints and prophet inequalities. This leads to our main result: we prove that a threshold-based algorithm for a prophet inequality problem, under any given feasibility constraints on the boxes that can be opened and/or the prizes that can be kept, immediately translates into an algorithm for the online Pandora's box problem. The reduction preserves the approximation factor of the threshold-based algorithm. This implies several approximation algorithms for the online Pandora's box problem under different constraints, which we discuss in Section 3.

Recall that, in the online Pandora's box problem, the algorithm is permitted to keep a set $R \subseteq S$ of boxes and collect the corresponding rewards, where R and S are restricted to come from arbitrary predefined collections of feasible sets. In the associated prophet inequality problem, the costs of all boxes are known to be 0. In Theorem 5 below, we use the notion of threshold-based algorithms for the prophet inequality problem, defined as follows. We say an algorithm is threshold-based if for every $i \in \{1, \ldots, n\}$ it has a threshold $\tau_i(t_i)$ (where the threshold can depend on the type as well as the index), and the algorithm keeps a box if and only if its value is not less than $\tau_i(t_i)$. The threshold $\tau_i(t_i)$ may be adaptive; that is, it may depend on any observation made prior to observing the i-th box.

Theorem 5. Let $Alg_\tau$ be a threshold-based α-approximation algorithm for the prophet inequality problem under a collection of constraints. Then there exists an α-approximation algorithm for the online Pandora's box problem under the same constraints.

Proof: We define $F''_i = (v''_i = \min(v_i, \sigma_i), c''_i = 0, t_i)$. Next we give an α-approximation algorithm Alg'' for the commitment Pandora's box problem on boxes $F''_1, \ldots, F''_n$; this together with Theorem 4 proves the theorem.

Let $\tau_i(\cdot)$ be the threshold function used by $Alg_\tau$ given the observed values $v''_1, \ldots, v''_n$. We define Alg'' as follows. Upon arrival of box $F''_i$, we first check whether $\sigma_i \ge \tau_i(t_i)$; note that this implicitly requires that the box is acceptable according to the constraints. If $\sigma_i \ge \tau_i(t_i)$ we open the box, and otherwise we skip it. This ensures the commitment constraint is respected. If we opened the box and $v''_i \ge \tau_i(t_i)$, we keep it; otherwise we ignore it and continue. It is easy to observe that $Alg_\tau$ and Alg'' produce the same outcome and hence have the same approximation factor.
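The reduction in the proof of Theorem 5 is constructive. The following sketch (with illustrative interface and helper names, not taken from the paper) shows its shape: compute $\sigma_i$ from the known conditional distributions, feed the capped values $\min(v_i, \sigma_i)$ to any threshold rule for the zero-cost prophet problem, and open a box only when $\sigma_i$ clears the current threshold.

```python
def online_pandora_from_thresholds(boxes, threshold_rule):
    """Theorem 5 reduction, sketched for boxes arriving online (illustrative names).

    boxes: iterable of (i, t_i, sigma_i, draw), where draw() -> (v_i, c_i) and
           sigma_i solves E[v_i - min(v_i, sigma_i) | t_i] = E[c_i | t_i]
           (computable, e.g., by the bisection used for reservation prices above).
    threshold_rule(i, t_i, history) -> tau_i, a threshold for the zero-cost
           prophet problem on the surrogate values min(v_i, sigma_i); it is
           assumed to encode the feasibility constraints (returning float('inf')
           for boxes that may not be kept).
    """
    history, utility = [], 0.0
    for i, t_i, sigma_i, draw in boxes:
        tau_i = threshold_rule(i, t_i, history)
        if sigma_i < tau_i:              # surrogate value can never reach tau_i: skip the box
            continue
        v_i, c_i = draw()                # open: pay the (random) cost, observe the value
        utility -= c_i
        surrogate = min(v_i, sigma_i)
        history.append((i, t_i, surrogate))
        if surrogate >= tau_i:           # keep exactly when the prophet rule would accept
            utility += v_i
    return utility
```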
3 Algorithms For Online Pandora's Box via Prophet Inequalities

Here we use the tools from the previous section to provide algorithms for the online Pandora's box problem under different kinds of constraints. First, as a warm-up, in Theorem 6 we show a 1/2-approximation algorithm for a simple version of the problem in which there are no constraints on the set of boxes that we can open, and we can keep the value of only one box. Note that this is the online version of the classical Pandora's box problem. Indeed, it is known that there is no $(1/2 + \epsilon)$-approximation algorithm for this problem even if all of the costs are 0, in which case the problem is equivalent to a basic prophet inequality problem (Kleinberg and Weinberg 2012). We will prove Theorem 6 directly, without appealing to Theorem 5, to provide insight into how the given thresholds translate into a policy for the Pandora's box problem; a small numerical companion to this policy appears at the end of this section.

Theorem 6. There exists a 1/2-approximation algorithm for the online Pandora's box problem with no constraints on opening boxes, where the value of exactly one box is kept.

Proof: We define $F''_i = (v''_i = \min(v_i, \sigma_i), c''_i = 0, t_i)$. We give a simple 1/2-approximation algorithm Alg'' for the commitment Pandora's box problem on boxes $F''_1, \ldots, F''_n$; this together with Theorem 4 proves the theorem.

Set $\tau$ such that $\Pr[\max_{i=1}^n v''_i \ge \tau] = \frac{1}{2}$. Let j be a random variable that indicates the first index such that $v''_j \ge \tau$. Let $v^* = v''_j$ if such an index j exists, and let $v^* = 0$ otherwise. It is known that $E[v^*] \ge \frac{1}{2}\, E[\max_{i=1}^n v''_i]$ (Samuel-Cahn 1984).

We define Alg'' as follows. Upon arrival of box $F''_i$, we first check whether $\sigma_i \ge \tau$. If it is, we open the box; otherwise we skip it. This ensures the commitment constraint; that is, if we observe $\sigma_i$, we will accept it. Next, if we opened the box and $v''_i \ge \tau$, we keep it and terminate. Otherwise we continue to the next box. It is easy to observe that Alg'' keeps $v^*$, and hence is a 1/2-approximation algorithm.

We now explore other applications of our reduction. Theorem 5, together with previously known approximation algorithms for prophet inequality problems with various types of constraints, implies the existence of approximation algorithms for variations of the online Pandora's box problem. Specifically, we consider the following variations.

Online k-Pandora's box problem: we are given a cardinality bound k, and at most k boxes can be kept.

Online knapsack Pandora's box problem: we have a capacity C, and the type $t_i$ of each box corresponds to a size. The total size of the boxes that are kept must be at most C.

Online matroid Pandora's box problem: we have a matroid constraint on the set of boxes, and the boxes that are kept must form an independent set of the matroid.

We note that all of the variations above place no constraints on opening boxes; in what follows, however, we study a variation of the problem with constraints on opening boxes as well. As we have mentioned, prophet inequality approximation algorithms exist for the settings of cardinality constraints (Alaei 2014), knapsack constraints (Duetting, Feldman, and Lucier 2017), and matroid constraints (Kleinberg and Weinberg 2012). Combining these results with Theorem 5 implies the following corollary.

Corollary 7. There is a $\big(1 - \frac{1}{\sqrt{k+3}}\big)$-approximation algorithm for the online k-Pandora's box problem, a 1/5-approximation algorithm for the online knapsack Pandora's box problem, and a 1/2-approximation algorithm for the online matroid Pandora's box problem.
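As the numerical companion promised above, the warm-up policy of Theorem 6 needs only the single threshold $\tau$ with $\Pr[\max_i \min(v_i, \sigma_i) \ge \tau] = 1/2$. A minimal sketch for estimating it by Monte Carlo, under assumed sampler and name conventions, is:

```python
def median_threshold(samplers, sigmas, trials=100_000):
    """Estimate tau with Pr[max_i min(v_i, sigma_i) >= tau] ~ 1/2 by Monte Carlo.

    samplers[i]() draws a value v_i and sigmas[i] is the cap sigma_i from Section 2;
    both are assumed to be given here, and the names are illustrative.
    """
    maxima = sorted(max(min(draw(), sig) for draw, sig in zip(samplers, sigmas))
                    for _ in range(trials))
    return maxima[len(maxima) // 2]      # empirical median of max_i min(v_i, sigma_i)
```

With this $\tau$ in hand, the policy of Theorem 6 opens box i only if $\sigma_i \ge \tau$ and keeps the first opened box whose capped value $\min(v_i, \sigma_i)$ reaches $\tau$; plugged into the reduction sketch above, it is simply a constant threshold rule that shuts off after the first acceptance.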
4 Pandora's Box with Multiple Arms

In the context of reinforcement learning, we next consider a multi-armed version of the Pandora's box problem. In each of J rounds, m boxes are presented; there are therefore mJ boxes in total. At most one box can be opened in each round. The m boxes presented in a given round are ordered; we can think of each box as having a type labeled 1 through m. All boxes of type t have the same cost $c_t$, and also have the same value distribution $F_t$. For notational convenience we will write $v_{tj}$ for the value in the box of type t presented at time j; for convenience we assume $v_{tj} > 0$. At the end of the J rounds, the player can keep at most one prize for each type of box. That is, if we write $S_t$ for the subset of boxes of type t opened by the player, then the objective is to maximize
\[
  \sum_t \Big( \max_{j \in S_t} v_{tj} - c_t\,|S_t| \Big).
\]
If no box of type t is opened, we define $\max_{j \in S_t} v_{tj}$ (i.e., the prize for that type) to be 0.

This is a variant of the online Pandora's box problem with a (partition) matroid constraint on the set of prizes that can be kept, and also a constraint on the subsets of boxes that can be opened. We can think of this problem as presenting boxes one at a time, where the boxes from a round are presented sequentially, with the additional constraint that at most one box per round can be opened. Note that types here are not random, but depend on the index of the box.

Our previous reduction applies, so solving this multi-armed Pandora's box problem reduces to developing the related prophet inequality. In this prophet inequality, boxes can be opened at no cost, but we must irrevocably choose whether or not to keep any given prize as it is revealed. We can still keep at most one prize of each type, and we can still open at most one box in each time period. Our question becomes: can we develop a threshold policy that achieves a constant-factor prophet inequality for this setting? In this case, a threshold policy corresponds to choosing a threshold $\tau_t$ for each box type t, and accepting a prize $v_{tj}$ from an opened box if and only if $v_{tj} > \tau_t$.

What is an appropriate benchmark for the prophet inequality? Note that the sum of the best prizes of each type, ex post, might not be achievable by any policy, due to the restriction on which boxes can be opened. We will therefore compare against the following weaker benchmark. We consider a prophet who must choose, in an online fashion, one box to open in each round, given knowledge of the prizes in previously opened boxes. Then, after having opened one box in each of the J rounds, the prophet can select the largest observed prize of each type. In other words, the prophet has the advantage of being able to choose from among the opened boxes in retrospect, but must still open boxes in an online fashion. Our goal is to obtain a constant approximation to the expected value enjoyed by such a prophet.

We begin with some observations about the choice of which box to open. First, the optimal policy for the prophet is to open boxes greedily; in particular, this policy can be implemented in time linear in m each round. We sketch this greedy rule below, and formalize it in Lemma 8.
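A minimal sketch of the greedy opening rule follows; the discrete representation of the value distributions and the function name are illustrative assumptions, not the paper's notation.

```python
def prophet_greedy_round(dists, best_so_far):
    """One round of the prophet's greedy opening rule (formalized in Lemma 8 below).

    dists[t]: list of (value, probability) pairs for type t (a discrete stand-in
    for F_t); best_so_far[t]: largest type-t value observed so far, or 0.0 if none.
    Returns the type whose opening maximizes the expected marginal gain
    E[(w_t - best_so_far[t])^+], i.e. the expected value as if the game ended now.
    """
    def marginal_gain(t):
        return sum(p * max(v - best_so_far[t], 0.0) for v, p in dists[t])
    return max(range(len(dists)), key=marginal_gain)
```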
Lemma 8. In each round j, it is optimal for the prophet to open a box of type t that maximizes his expected value, as if the game were to end after time j.

Proof: Note that the expected marginal gain of opening a box of type t can only decrease over time, and only as more boxes of that type are opened. Suppose t is the type with maximum expected marginal value at time j. Suppose further that the prophet does not open a box of type t at time j, and furthermore opens no box of type t until the final round J. Then t will be the type with maximum expected marginal value in round J, and therefore it would be optimal to open a box of type t in the last round. This implies that it is optimal to open at least one box of type t at some point between round j and the end of the game. The prophet is therefore at least as well off opening a box of type t immediately, since doing so does not affect the distribution of the revealed value, and it can only provide more information for determining which other boxes to open. It is therefore (weakly) optimal to open a box of type t at time j.

Similarly, once thresholds are fixed, an identical argument implies that the optimal threshold-based policy behaves greedily. In particular, the optimal policy can be implemented in polynomial time, given an arbitrary set of thresholds.

Lemma 9. Suppose the player's policy is committed to selecting a prize of type t, from an opened box of type t, if and only if its value exceeds $\tau_t$. Then, in each round j, it is optimal to open a box of a type $t^*$, from among those types for which a prize has not yet been accepted, that maximizes the expected value as if the game were to end after time j. That is, $t^* \in \arg\max_t \{ E[v_{tj} \mid v_{tj} > \tau_t] \cdot \Pr[v_{tj} > \tau_t] \}$.

We now claim that there are thresholds that yield a 2-approximate prophet inequality for this setting. First, some notation. By the principle of deferred randomness, we can think of the value of the prize in any given box as being determined only at the moment the box is opened. With this in mind, we will write $w_{tk}$ for the value observed in the k-th box of type t opened by the decision-maker. For example, $w_{t1}$ is the value contained in whichever box of type t is opened first, regardless of the exact time at which it is opened. Note that the behavior of any online policy is fully described by the profile of values $w = (w_{tk})$, and each $w_{tk}$ is a value drawn independently from distribution $F_t$. For such a profile w, write $y^*_{tk}(w)$ for the indicator variable that is 1 if the prophet opens at least k boxes of type t and keeps the k-th one opened. Then the expected value enjoyed by the prophet is
\[
  E_w\Big[ \sum_t \sum_k w_{tk}\, y^*_{tk}(w) \Big].
\]
For each box type t, we will set the threshold
\[
  \tau_t = \frac{1}{2}\, E_w\Big[ \sum_k w_{tk}\, y^*_{tk}(w) \Big].
\]
That is, $\tau_t$ is half of the expected value obtained by the prophet from boxes of type t.

To prove that these thresholds achieve a good approximation to the prophet's welfare, it will be useful to analyze the possible correlation between the number of boxes of type t opened by the prophet and whether any prize of type t is kept by the threshold algorithm. To this end, write $z^*_{tk}(w)$ for the indicator that the prophet opens at least k boxes of type t. Note that $z^*_{tk}(w) \ge y^*_{tk}(w)$ for all t, k, and w. We will also write $Q_t(w)$ for the indicator variable that is 1 if $w_{tk}\, z^*_{tk}(w) \le \tau_t$ for all k. That is, $Q_t(w) = 1$ if no box of type t opened by the prophet has value greater than $\tau_t$, and hence the threshold algorithm does not keep any prize of type t. The following lemma shows that $Q_t$ is positively correlated with $z^*_{tk}$, for every t and k.

Lemma 10. For all t and k, $E_w[Q_t(w)\, z^*_{tk}(w)] \ge E_w[Q_t(w)]\, E_w[z^*_{tk}(w)]$.

Proof: Fix the values of $w_{\ell k}$ for all $\ell \ne t$. Note that this also fixes the choice of which box the prophet will open on any round in which the prophet chooses not to open a box of type t. Due to the prophet's greedy method of opening boxes, in every round the prophet will choose to open the box of type t if and only if the maximum value from a box of type t seen so far (or 0 if no box of type t has been opened yet) is below a threshold determined by the values observed from the other boxes. In other words, the values $w_{\ell k}$ for $\ell \ne t$ define a sequence of non-decreasing thresholds
$h_0, h_1, h_2, \ldots, h_J$ with the following property. Suppose that, at time j, the prophet has previously opened $s < j$ boxes of type t, and has therefore opened $j - s - 1$ boxes of types other than t. Then the prophet will open a box of type t at time j if and only if
\[
  \max_{r \le s} \{ w_{tr} \} < h_{j-s-1}. \qquad (2)
\]
Here and below, we will take the maximum of an empty set to be 0.

We now claim that the prophet opens k or more boxes of type t, at or before time J, if and only if $\max_{r < k} \{ w_{tr} \} < h_{J-k}$. We prove this claim by induction on J. For the base case J = 1 we must have k = 1, and by (2) the prophet opens the box of type t if and only if $h_0 > 0$. Now suppose J > 1. If $\max_{r < k} \{ w_{tr} \}$