# Learning to Screen

Alon Cohen (Technion, Israel Institute of Technology, and Google Research; aloncohen@technion.ac.il), Avinatan Hassidim (Bar-Ilan University and Google Research; avinatanh@gmail.com), Haim Kaplan (Tel-Aviv University and Google Research; haimk@tau.ac.il), Yishay Mansour (Tel-Aviv University and Google Research; mansour.yishay@gmail.com), Shay Moran (Princeton University; shaymoran1@gmail.com; this work was done while the author was at Google Research).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** Imagine a large firm with multiple departments that plans a large recruitment. Candidates arrive one by one, and for each candidate the firm decides, based on her data (CV, skills, experience, etc.), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as an assignment problem between items (candidates) and categories (departments): the items arrive one by one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching. We consider two variants of this problem: (i) in the first variant it is assumed that the $n$ items are drawn independently from an unknown distribution $D$; (ii) in the second variant it is assumed that before the process starts, the algorithm has access to a training set of $n$ items drawn independently from the same unknown distribution (e.g., data of candidates from previous recruitment seasons). We give near-optimal bounds on the best-possible number of retained items in each of these variants. These results demonstrate that one can retain exponentially fewer items in the second variant (with the training set). Our algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms.

## 1 Introduction

Matching is the bread-and-butter of many real-life problems from the fields of computer science, operations research, game theory, and economics. Some examples include job scheduling, where we assign jobs to machines; economic markets, where we allocate products to buyers; online advertising, where we assign advertisers to ad slots; assigning medical interns to hospitals; and many more.

One particular example that motivates this work comes from labor markets. Imagine a firm that is planning a large recruitment. Candidates arrive one by one, and the HR department immediately decides whether to summon them for an interview. Moreover, the firm has multiple departments, each requiring different skills and having a different target number of hires. Different employees have different subsets of the required skills, and thus fit only certain departments and with a certain quality. The firm's HR department, following the interviews, decides which candidates to recruit and to which departments to assign them. The HR department has to maximize the total quality of the hired employees such that each department gets its required number of hires with the required skills. In addition, HR uses data from the previous recruitment season in order to minimize the number of interviews while not compromising the quality of the solution.

We study the following formulation of the problem above.
We receive $n$ items (candidates), where each item has a subset of $d$ properties (departments) denoted by $P_1, \dots, P_d$. We select $k$ items out of the $n$, subject to $d$ constraints of the form "exactly $k_i$ of the selected items must satisfy property $P_i$", where $\sum_{i=1}^{d} k_i = k$, and we assume that $d \ll k \ll n$. Furthermore, if item $c$ possesses property $P_i$, then it has a value $v_i(c)$ associated with this property. Our goal is to compute a matching of maximum value that associates $k$ items to the $d$ properties subject to the constraints above.

We consider matching algorithms in the following online setting. The algorithm receives $n$ items online, drawn independently from $D$, and either rejects or retains each item. Then, the algorithm utilizes the retained items and outputs an (approximately) optimal feasible solution.

We present a naive greedy algorithm that returns the optimal solution with probability at least $1-\delta$ and retains $O(k\log(k/\delta))$ items in expectation. We prove that no other algorithm with the same guarantee can retain fewer items in expectation. Thus, to further reduce the number of retained items, we add an initial preprocessing phase in which the algorithm learns an online policy from a training set. The training set is a single problem instance that consists of $n$ items drawn independently from the same unknown distribution $D$. We address the statistical aspects of this problem and develop efficient learning algorithms. In particular, we define a class of thresholds-policies. Each thresholds-policy is a simple rule for deciding whether to retain an item. We present uniform convergence rates for both the number of items retained by a thresholds-policy and the value of the resulting solution. We show that these quantities deviate from their expected value by order of $\sqrt{k}$ (rather than the easier $\sqrt{n}$ bound; recall that we assume $k \ll n$), which we prove using concentration inequalities and tools from VC theory. Using these concentration inequalities, we analyze an efficient online algorithm that returns the optimal offline solution with probability at least $1-\delta$ and retains a near-optimal $O(k\log\log(1/\delta))$ number of items in expectation (compare with the $O(k\log(k/\delta))$ retained items when no training set is given).

**Related work.** Our model is related to the online secretary problem, in which one needs to select the best secretary in an online manner (see Ferguson, 1989). Our setting differs from this classical model due to the two-stage process and the complex feasibility constraints. Nonetheless, we remark that there are a few works on the secretary model that allow delayed selection (see Vardi, 2015; Ezra et al., 2018) as well as matroid constraints (Babaioff et al., 2007). These works differ from ours in the way the decision is made, the feasibility constraints, and the learning aspect of receiving a single problem instance as a training example. Correa et al. (2018) consider a distributional setting for the single-choice prophet inequality problem. Similarly to the setting considered here, they assume that the data is drawn independently from an unknown distribution and that the algorithm has access to a training set sampled from the same distribution. However, the objective is quite different from ours: the goal is to pick a stopping time $\tau$ such that the $\tau$-th sample approximately maximizes the value among all samples (including those that were not yet seen).
Another related line of work in algorithmic economics studies the statistical learnability of pricing schemes (see, e.g., Morgenstern and Roughgarden, 2015, 2016; Hsu et al., 2016; Balcan et al., 2018). The main difference of these works from ours is that our training set consists of a single example (namely, the set of items that are used for training), whereas in their setting (as in most typical statistical learning settings) the training set consists of many i.i.d. examples. This difference also affects the technical tools used for obtaining generalization bounds. For example, some of our bounds exploit Talagrand's concentration inequality rather than the more standard Chernoff/McDiarmid/Bernstein inequalities. We note that Talagrand's inequality and other advanced inequalities have been applied in machine learning in the context of learning combinatorial functions (Vondrák, 2010; Blum et al., 2017). See also the survey by Bousquet et al. (2004) or the book by Boucheron et al. (2013) for a more thorough review of concentration inequalities. Furthermore, there is a large body of work on online matching in which the vertices arrive in various models (see Mehta et al., 2013; Gupta and Molinaro, 2016). We differ from this line of research by allowing a two-stage algorithm and by requiring the optimal matching to be output in the second stage. Celis et al. (2017, 2018) study similar problems of ranking and voting with fairness constraints. In fact, the optimization problem that they consider allows more general constraints, and the value of a candidate is determined from votes/comparisons. The main difference with our framework is that they do not consider a statistical setting (i.e., there is no distribution over the items and no training set for preprocessing) and focus mostly on approximation algorithms for the optimization problem.

## 2 Model and Results

Let $X$ be a domain of items, where each item $c \in X$ can possess any subset of $d$ properties denoted by $P_1, \dots, P_d$ (we view $P_i \subseteq X$ as the set of items having property $P_i$). Each item $c$ has a value $v_i(c) \in [0,1]$ associated with each property $P_i$ such that $c \in P_i$. We are given a set $C \subseteq X$ of $n$ items as well as counts $k_1, \dots, k_d$ such that $\sum_{i=1}^{d} k_i = k$. Our goal is to select exactly $k$ items in total, constrained on selecting exactly $k_i$ items with property $P_i$. We assume that these constraints are exclusive, in the sense that each item in $C$ can be used to satisfy at most one of the constraints. Formally, a feasible solution is a subset $S \subseteq C$ such that $|S| = k$ and there is a partition of $S$ into $d$ disjoint subsets $S_1, \dots, S_d$ such that $S_i \subseteq P_i$ and $|S_i| = k_i$. We aim to compute a feasible subset $S$ that maximizes $\sum_{i=1}^{d}\sum_{c \in S_i} v_i(c)$. Furthermore, we assume that $d \ll k \ll n$; namely, the number of constraints is much smaller than the number of items that we have to select, which in turn is much smaller than the total number of items in $C$. In order to avoid feasibility issues we assume that there is a set $C_{\mathrm{dummy}}$ that contains $k$ dummy 0-value items possessing all $d$ properties (the algorithm always has access to $C_{\mathrm{dummy}}$, and we do not view these items as part of $C$).
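To make the model concrete, here is a minimal Python sketch of the objects above: items with property-dependent values, a feasibility check, and the value of a candidate solution. The data layout (a dictionary mapping property indices to values) and the function names are illustrative choices for this sketch, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Item:
    # values[i] is v_i(c); defined only for the properties i that the item possesses
    values: dict  # e.g., {0: 0.7, 3: 0.2} means c is in P_1 and P_4

def is_feasible(assignment, capacities):
    """assignment: list of (item, property_index) pairs; capacities: [k_1, ..., k_d].
    Checks that every selected item is assigned to a property it possesses and that
    exactly k_i items are assigned to property i."""
    counts = [0] * len(capacities)
    for item, i in assignment:
        if i not in item.values:
            return False
        counts[i] += 1
    return counts == list(capacities)

def solution_value(assignment):
    """Total value sum_i sum_{c in S_i} v_i(c) of an assignment."""
    return sum(item.values[i] for item, i in assignment)

# Tiny example with d = 2 properties and capacities k_1 = k_2 = 1 (so k = 2).
a = Item({0: 0.9})                 # possesses only P_1
b = Item({0: 0.4, 1: 0.8})         # possesses both properties
dummy = Item({0: 0.0, 1: 0.0})     # a 0-value dummy item (as in C_dummy)
sol = [(a, 0), (b, 1)]
assert is_feasible(sol, [1, 1]) and abs(solution_value(sol) - 1.7) < 1e-9
```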
**Formulation as bipartite matching.** We first discuss the offline version of this allocation problem; that is, we assume that $C$ and the capacities $k_i$ are all given as input before the algorithm starts, and we are interested in an algorithm for computing an optimal set $S$: a set of items of maximum total value that satisfies the constraints. This problem is equivalent to a maximum-weight matching problem in a bipartite graph $(L, R, E, w)$ defined as follows:

- $L$ is the set of vertices on one side of the bipartite graph. It contains $k$ vertices, where each constraint $i$ is represented by $k_i$ of these vertices.
- $R$ is the set of vertices on the other side of the bipartite graph. It contains a vertex for each item $c \in C$ and for each dummy item $c \in C_{\mathrm{dummy}}$.
- $E$ is the set of edges. Each vertex in $R$ is connected to each vertex of each of the constraints that it satisfies.
- The weight $w(\ell, r)$ of an edge $(\ell, r) \in E$ is $v_\ell(r)$: the value of item $r$ associated with the property $P_\ell$ that $\ell$ represents.

There is a natural correspondence between saturated matchings in this graph, that is, matchings in which every $\ell \in L$ is matched, and feasible solutions (i.e., solutions that satisfy the constraints) of the allocation problem. Thus, a saturated matching of maximum value corresponds to an optimal solution. It is well known that the problem of finding such a maximum-weight bipartite matching can be solved in polynomial time (see, e.g., Lawler, 2001).

**Problem definition.** In this work we consider the following online learning model. We assume that $n$ items are sequentially drawn i.i.d. from an unknown distribution $D$ over $X$. Upon receiving each item, we decide whether to retain it or to reject it irrevocably (the first stage of the algorithm). Thereafter, we select a feasible solution consisting only of retained items (the second stage of the algorithm); in addition to the retained items, the algorithm has access to $C_{\mathrm{dummy}}$, and therefore a feasible solution always exists. Most importantly, before accessing the online sequence and taking irrevocable online decisions of which items to reject, we have access to a training set $C_{\mathrm{train}}$ consisting of $n$ independent draws from $D$.

### 2.1 Results

#### 2.1.1 Oblivious online screening

We begin by studying a greedy algorithm that does not require a training set. In the online phase, this algorithm acts greedily, keeping an item if and only if it participates in the best solution among the items seen thus far. Then, the algorithm computes an optimal matching among the retained items. The particular details of the algorithm, as well as the proof of the following guarantee, are given in the supplementary material.

**Theorem 1.** Let $\delta \in (0,1)$. The greedy algorithm outputs the optimal solution with probability at least $1-\delta$ and retains $O\big(k\log(\min\{k/\delta,\, n/k\})\big)$ items in expectation.

As we shall see in the next section, learning from the training set allows one to retain exponentially fewer items than is implied by the theorem above (that is, the expected number of retained items is reduced from order of $\log n$ to $\log\log n$). It is then natural to ask to what extent the training phase is essential for such an improvement. We answer this question in the supplementary material by proving a lower bound on the number of retained items for any algorithm that does not use a training phase. This lower bound applies already in the simple setting where $d = 1$: here, each item consists only of a value $v \in [0,1]$, and the goal of the algorithm is to retain as few items as possible while guaranteeing with high probability that the top $k$ maximal values are retained.

**Theorem 2.** Let $\delta \in (0,1)$. For every algorithm $A$ which retains the maximal $k$ elements with probability at least $1-\delta$, there exists a distribution $\mu$ such that the expected number of retained elements for input sequences $v_1, \dots, v_n \sim \mu^n$ is at least $\Omega\big(k\log(\min\{k/\delta,\, n/k\})\big)$.

Thus, Theorem 2 implies that the $\Theta(k\log(n/k))$ bound of Theorem 1 cannot be improved even if we allow failure probability $\delta = \Theta(k^2/n)$.
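The details of the greedy algorithm are deferred to the supplementary material; the following Python sketch shows one natural way to implement the rule described above ("retain an arriving item iff it participates in the best solution among the items retained so far plus itself"), reusing the `Item` representation from the earlier sketch. The offline subroutine solves the saturated-matching formulation with `scipy.optimize.linear_sum_assignment`; the use of a large negative weight for missing item-property edges and the data layout are implementation choices of this sketch, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

NEG = -1e9  # weight of a forbidden edge (the item does not possess the property)

def best_assignment(items, capacities):
    """Optimal offline solution over `items` (plus implicit 0-value dummies):
    returns (item_index, property_index) pairs of a maximum-value feasible solution.
    Row r of the weight matrix is one "slot" of constraint i (k_i slots per
    constraint); the columns are the items followed by k dummy columns of weight 0,
    so a saturated matching always exists and forbidden edges are never used."""
    k = sum(capacities)
    slots = [i for i, ki in enumerate(capacities) for _ in range(ki)]
    w = np.zeros((k, len(items) + k))
    for r, i in enumerate(slots):
        for c, item in enumerate(items):
            w[r, c] = item.values.get(i, NEG)  # v_i(c) if c is in P_i, else forbidden
    rows, cols = linear_sum_assignment(w, maximize=True)
    return [(c, slots[r]) for r, c in zip(rows, cols)
            if c < len(items) and w[r, c] > NEG / 2]

def greedy_screen(stream, capacities):
    """First stage of the oblivious greedy algorithm: retain an arriving item iff it
    participates in the best solution among the previously retained items plus itself."""
    retained = []
    for item in stream:
        chosen = {c for c, _ in best_assignment(retained + [item], capacities)}
        if len(retained) in chosen:  # the new item has index len(retained)
            retained.append(item)
    return retained
```

In the second stage one would simply run `best_assignment(retained, capacities)` on the retained items.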
#### 2.1.2 Online screening with learning

We now design online algorithms that, before the online screening process begins, use $C_{\mathrm{train}}$ to learn a thresholds-policy $T \in \mathcal{T}$ such that with high probability: (i) the number of items that are retained in the online phase is small, and (ii) there is a feasible solution consisting of $k$ retained items whose value is optimal (or close to optimal). Thresholds-policies are studied in Section 3 and are defined as follows.

**Definition 3** (Thresholds-policies). A thresholds-policy is parametrized by a vector $T = (t_1, \dots, t_d)$ of thresholds, where $t_i$ corresponds to property $P_i$ for $1 \le i \le d$. The semantics of $T$ is as follows: given a sample $C$ of $n$ items, each item $c \in C$ is retained if and only if there exists a property $P_i$ satisfied by $c$ such that its value $v_i(c)$ passes the threshold $t_i$. More formally, $c$ is retained if and only if there exists $i \in \{1, \dots, d\}$ such that $c \in P_i$ and $v_i(c) \ge t_i$.

Having proven uniform convergence results for thresholds-policies (see Section 3.1), we show the following in Section 4.

**Theorem 4.** There exists an algorithm that learns a thresholds-policy $T$ from a single training sample $C_{\mathrm{train}} \sim D^n$, such that after processing the ("real-time") input sample $C \sim D^n$ using $T$:
- it outputs an optimal solution with probability at least $1-\delta$, and
- the expected number of retained items in the first phase is $O\big(k(\log d + \log\log(n/k) + \log\log(1/\delta))\big)$.

Thus, with the additional information given by the training set, the algorithm of Theorem 4 improves the number of retained items from $k\log(k/\delta)$ to $k\log\log(1/\delta)$, a significant improvement over Theorem 1. Finally, in the supplementary material we prove that the algorithm from Theorem 4 is nearly optimal, in the sense that it is impossible to significantly improve the number of retained items even if we allow the algorithm to fully know the distribution over input items (so, in a sense, having access to $n$ i.i.d. samples from the distribution is as good as knowing it completely).

**Theorem 5.** Consider the case where $k = d$ and $k_1 = \dots = k_d = 1$. There exist a universe $X$ and a fixed distribution $D$ over $X$ such that for $C \sim D^n$ the following holds: any online learning algorithm (which possibly knows $D$) that retains a subset $S \subseteq C$ of items that contains an optimal solution with probability at least $1-\delta$ must satisfy $\mathbb{E}\,|S| = \Omega(k\log\log(1/\delta))$.

## 3 Thresholds-policies

We next discuss a framework for designing algorithms that exploit the training set to learn policies that are applied in the first phase of the matching process. We would like to frame this in standard ML formalism by phrasing the problem as learning a class $\mathcal{H}$ of policies such that:

- $\mathcal{H}$ is not too small: the policies in $\mathcal{H}$ should yield solutions with high values (optimal, or near-optimal).
- $\mathcal{H}$ is not too large: $\mathcal{H}$ should satisfy some uniform convergence properties; i.e., the performance of each policy in $\mathcal{H}$ on the training set is close, with high probability, to its expected real-time performance on the sampled items during the online selection process.

Indeed, as we now show, these demands are met by the class $\mathcal{T}$ of thresholds-policies (Definition 3). We first show that the class of thresholds-policies contains an optimal policy, and in the sequel we show that it satisfies attractive uniform convergence properties.
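As a concrete illustration of Definition 3, here is a small Python sketch of a thresholds-policy acting on items represented as in the earlier sketches (a dictionary of property-value pairs); `thresholds` is a list $(t_1, \dots, t_d)$ indexed by property, and the function names are illustrative.

```python
def retains(item, thresholds):
    """Definition 3: retain c iff there is a property i with c in P_i and v_i(c) >= t_i."""
    return any(v >= thresholds[i] for i, v in item.values.items())

def apply_policy(stream, thresholds):
    """First-stage screening with a thresholds-policy T = (t_1, ..., t_d)."""
    return [item for item in stream if retains(item, thresholds)]
```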
**An assumption (values are unique).** We assume that for each constraint $P_i$, the marginal distribution of the value $v_i(c)$ of $c \sim D$ conditioned on $c \in P_i$ is atomless; namely, $\Pr_{c\sim D}[v_i(c) = v \mid c \in P_i] = 0$ for every $v \in [0,1]$. This assumption can be removed by adding artificial tie-breaking rules, but making it simplifies some of the technical statements.

**Theorem 6** (There is a thresholds-policy that retains an optimal solution). For any set of items $C$, there exists a thresholds vector $T \in \mathcal{T}$ that retains exactly $k$ items, and these items form an optimal solution for $C$.

*Proof.* Let $S$ denote the set of $k$ items in an optimal solution for $C$, and let $S_i \subseteq S \cap P_i$ be the subset of $S$ that is assigned to the constraint $P_i$. Define $t_i = \min_{c \in S_i} v_i(c)$ for every $1 \le i \le d$. Clearly, $T$ retains all the items in $S$. Assume towards contradiction that $T$ retains an item $c_j \notin S$, and let $P_i$ be a constraint such that $c_j \in P_i$ and $v_i(c_j) \ge t_i$. Since by our assumption on $D$ all the values are distinct, it follows that $v_i(c_j) > t_i$. Thus, we can modify $S$ by replacing the item of minimum value in $S_i$ with $c_j$ and increase the total value. This contradicts the optimality of $S$.

We next establish generalization bounds for the class of thresholds-policies.

### 3.1 Uniform convergence of the number of retained items

For a sample $C \sim D^n$ and a thresholds-policy $T \in \mathcal{T}$, we denote by $R_i^T(C) = \{c \in C : c \in P_i \text{ and } v_i(c) \ge t_i\}$ the set of items that are retained by the threshold $t_i$, and we denote its expected size by $\rho_i^T = \mathbb{E}_{C\sim D^n}\big[|R_i^T(C)|\big]$. Similarly, we denote by $R^T(C) = \bigcup_i R_i^T(C)$ the set of items retained by $T$, and by $\rho^T$ its expected size. We prove that the sizes of $R_i^T(C)$ and $R^T(C)$ are concentrated around their expectations uniformly for all thresholds-policies. The following theorems establish these uniform convergence results: with high probability, $|R_i^T(C)| \approx \rho_i^T$ and $|R^T(C)| \approx \rho^T$ simultaneously for all $T \in \mathcal{T}$ and all $i \le d$.

**Theorem 7** (Uniform convergence of the number of retained items). With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all policies $T \in \mathcal{T}$ simultaneously:
1. if $\rho^T \ge k$, then $(1-\varepsilon)\rho^T \le |R^T(C)| \le (1+\varepsilon)\rho^T$, and
2. if $\rho^T < k$, then $\rho^T - \varepsilon k \le |R^T(C)| \le \rho^T + \varepsilon k$,

where $\varepsilon = O\Big(\sqrt{\tfrac{d\log(d)\log(n/k) + \log(1/\delta)}{k}}\Big)$.

**Theorem 8** (Uniform convergence of the number of retained items per constraint). With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all policies $T \in \mathcal{T}$ and all $i \le d$ simultaneously:
1. if $\rho_i^T \ge k$, then $(1-\varepsilon)\rho_i^T \le |R_i^T(C)| \le (1+\varepsilon)\rho_i^T$, and
2. if $\rho_i^T < k$, then $\rho_i^T - \varepsilon k \le |R_i^T(C)| \le \rho_i^T + \varepsilon k$,

where $\varepsilon = O\Big(\sqrt{\tfrac{\log(d)\log(n/k) + \log(1/\delta)}{k}}\Big)$.

The proofs of Theorems 7 and 8 are based on standard VC-based uniform convergence results; technically, the proof boils down to bounding the VC dimension of the families $\mathcal{R} = \{R^T : T \in \mathcal{T}\}$ and $\mathcal{Q} = \{R_i^T : T \in \mathcal{T},\ i \le d\}$. Indeed, in the supplementary material we prove the following.

**Lemma 9.** $\mathrm{VC}(\mathcal{R}) = O(d\log d)$.

**Lemma 10.** $\mathrm{VC}(\mathcal{Q}) = O(\log d)$.

Using Lemmas 9 and 10, we can now apply standard uniform convergence results from VC theory to derive Theorems 7 and 8.

**Definition 11** (Relative $(p,\varepsilon)$-approximation; Har-Peled and Sharir, 2011). Let $\mathcal{F}$ be a family of subsets of a domain $X$, and let $\mu$ be a distribution on $X$. A set $Z \subseteq X$ is a relative $(p,\varepsilon)$-approximation for $\mathcal{F}$ if for each $f \in \mathcal{F}$ we have:
1. if $\mu(f) \ge p$, then $(1-\varepsilon)\mu(f) \le \hat\mu(f) \le (1+\varepsilon)\mu(f)$, and
2. if $\mu(f) < p$, then $\mu(f) - \varepsilon p \le \hat\mu(f) \le \mu(f) + \varepsilon p$,

where $\hat\mu(f) = |Z \cap f|/|Z|$ is the ("empirical") measure of $f$ with respect to $Z$.

The proofs of Theorems 7 and 8 now follow by plugging $p = k/n$ into Har-Peled and Sharir [2011, Theorem 2.11], which we state in the next proposition.

**Proposition 12** (Har-Peled and Sharir, 2011). Let $\mathcal{F}$ and $\mu$ be as in Definition 11, and suppose $\mathcal{F}$ has VC dimension $m$. Then, with probability at least $1-\delta$, a random sample of size $\Omega\Big(\tfrac{1}{\varepsilon^2 p}\big(m\log\tfrac{1}{p} + \log\tfrac{1}{\delta}\big)\Big)$ is a relative $(p,\varepsilon)$-approximation for $\mathcal{F}$.
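For concreteness, here is one way to see where the parameters of Theorems 7 and 8 come from; this is a sketch of our reading of the argument, and the formal proofs are in the supplementary material. Our sample $C \sim D^n$ has size $n$; taking $p = k/n$, the size requirement of Proposition 12 reads $n \ge \frac{c}{\varepsilon^2 p}\big(m\log\frac{n}{k} + \log\frac{1}{\delta}\big)$ for a suitable constant $c$, which rearranges to

$$\varepsilon \;=\; O\!\left(\sqrt{\frac{m\log(n/k) + \log(1/\delta)}{k}}\right).$$

Since $\hat\mu(R^T) = |R^T(C)|/n$, $\mu(R^T) = \rho^T/n$, and $pn = k$, multiplying the two cases of Definition 11 by $n$ gives exactly the two cases of Theorem 7 with $m = \mathrm{VC}(\mathcal{R}) = O(d\log d)$ (Lemma 9); Theorem 8 follows in the same way from the family $\mathcal{Q}$ with $m = O(\log d)$ (Lemma 10).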
### 3.2 Uniform convergence of values

We now prove a concentration result for the value of an optimal solution among the retained items. Unlike the number of retained items, the value of an optimal solution corresponds to a more complex random variable, and analyzing the concentration of its empirical estimate requires more advanced techniques.

We denote by $V_T(C)$ the value of the optimal solution among the items retained by the thresholds-policy $T$, and we denote its expectation by $\nu_T = \mathbb{E}_{C\sim D^n}\big[V_T(C)\big]$. We show that $V_T(C)$ is concentrated uniformly for all thresholds-policies.

**Theorem 13** (Uniform convergence of values). With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all policies $T \in \mathcal{T}$ simultaneously: $|\nu_T - V_T(C)| \le \varepsilon k$, where $\varepsilon = O\Big(\sqrt{\tfrac{d\log k + \log(1/\delta)}{k}}\Big)$.

Note that unlike most uniform convergence results, which guarantee simultaneous convergence of empirical averages to expectations, here $V_T(C)$ is not an average of the $n$ samples but a more complicated function of them. We also note that a bound of $\widetilde{O}(\sqrt{n})$ (rather than $\widetilde{O}(\sqrt{k})$) on the additive deviation of $V_T(C)$ from its expectation can be derived using McDiarmid's inequality [McDiarmid, 1989]. However, this bound is meaningless when $\sqrt{n} > k$ (because $k$ upper bounds the value of the optimal solution). We use Talagrand's concentration inequality [Talagrand, 1995] to derive the $O(\sqrt{k})$ upper bound on the additive deviation. Talagrand's concentration inequality allows us to utilize the fact that an optimal solution uses only $k \ll n$ items, and therefore replacing an item that does not participate in the solution does not affect its value.

To prove the theorem we need the following concentration inequality for the value of the optimal selection in hindsight. Note that by Theorem 6 this value equals $V_T(C)$ for some $T$.

**Lemma 14.** Let $\mathrm{OPT}(C)$ denote the value of the optimal solution for a sample $C$. We have that
$$\Pr_{C\sim D^n}\Big[\,|\mathrm{OPT}(C) - \mathbb{E}[\mathrm{OPT}(C)]| \ge \alpha\,\Big] \le 2\exp(-\alpha^2/2k).$$
So, for example, $|\mathrm{OPT}(C) - \mathbb{E}[\mathrm{OPT}(C)]| \le \sqrt{2k\log(2/\delta)}$ with probability at least $1-\delta$.

To prove this lemma we use the following version of Talagrand's inequality (which appears, for example, in lecture notes by van Handel [2014]).

**Proposition 15** (Talagrand's concentration inequality). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function, and suppose that there exist $g_1, \dots, g_n : \mathbb{R}^n \to \mathbb{R}$ such that for any $x, y \in \mathbb{R}^n$,
$$f(x) - f(y) \le \sum_{i=1}^{n} g_i(x)\,\mathbb{1}[x_i \neq y_i]. \tag{1}$$
Then, for independent random variables $X = (X_1, \dots, X_n)$ we have
$$\Pr\big[\,|f(X) - \mathbb{E}[f(X)]| > \alpha\,\big] \le 2\exp\left(-\frac{\alpha^2}{2\sup_x \sum_{i=1}^{n} g_i^2(x)}\right).$$

*Proof of Lemma 14.* We apply Talagrand's concentration inequality to the random variable $\mathrm{OPT}(C)$. Our $X_i$'s are the items $c_1, \dots, c_n$ in the order in which they are given. We show that Eq. (1) holds for $g_i(C) = \mathbb{1}[c_i \in S]$, where $S = S(C)$ is a fixed optimal solution for $C$ (we use some arbitrary tie-breaking among optimal solutions). We then have $\sum_{i=1}^{n} g_i^2(C) = |S| = k$, thus completing the proof. Now, let $C, C'$ be two samples of $n$ items. Recall that we need to show that
$$\mathrm{OPT}(C) - \mathrm{OPT}(C') \le \sum_{i=1}^{n} g_i(C)\,\mathbb{1}[c_i \neq c'_i].$$
We use $S$ to construct a solution $S'$ for $C'$ as follows. Let $S_j \subseteq S$ be the subset of $S$ matched to $P_j$. For each $i$, if $c_i \in S_j$ for some $j$ and $c_i = c'_i$, then we add $c'_i$ to $S'_j$. Otherwise, we add a dummy item from $C_{\mathrm{dummy}}$ to $S'_j$ (with value zero). Let $V(S')$ denote the value of $S'$.
Note that the difference between the values of $S$ and $S'$ is the total value of all items $c_i \in S$ such that $c_i \neq c'_i$. Since the item values are bounded in $[0,1]$ we get that
$$\mathrm{OPT}(C) - V(S') = \sum_{j}\sum_{c_i \in S_j} v_j(c_i)\,\mathbb{1}[c_i \neq c'_i] \;\le\; \sum_{j}\sum_{c_i \in S_j} \mathbb{1}[c_i \neq c'_i] \;=\; \sum_{i=1}^{n} g_i(C)\,\mathbb{1}[c_i \neq c'_i].$$
The proof is complete by noticing that $\mathrm{OPT}(C') \ge V(S')$.

We also require the following construction of a bracketing of $\mathcal{T}$, which is formally presented in the supplementary material.

**Lemma 16.** There exists a collection $N$ of thresholds-policies with $|N| \le k^{O(d)}$ such that for every thresholds-policy $T \in \mathcal{T}$ there are $T^+, T^- \in N$ satisfying:
1. $V_{T^-}(C) \le V_T(C) \le V_{T^+}(C)$ for every sample of items $C$; note that by taking expectations this implies $\nu_{T^-} \le \nu_T \le \nu_{T^+}$, and
2. $\nu_{T^+} - \nu_{T^-} \le 10$.

*Proof of Theorem 13.* The items in $C$ that are retained by $T$ are independent samples from a distribution $D'$ that is sampled as follows: (i) sample $c \sim D$, and (ii) if $c$ is retained by $T$ then keep it, and otherwise discard it. This means that $V_T(C)$ is in fact the value of the optimal solution of $C$ with respect to $D'$. Since Lemma 14 applies to every distribution, we can apply it to $D'$ and get that for any fixed $T \in \mathcal{T}$,
$$\Pr_{C\sim D^n}\big[\,|\nu_T - V_T(C)| \ge \alpha\,\big] \le 2\exp(-\alpha^2/2k).$$
Now, by the union bound, for $N$ as in Lemma 16, the probability that there is $T \in N$ such that $|\nu_T - V_T(C)| \ge \alpha$ is at most $|N|\cdot 2\exp(-\alpha^2/2k)$. Thus, since $|N| \le k^{O(d)}$, it follows that with probability at least $1-\delta$,
$$(\forall T \in N):\quad |\nu_T - V_T(C)| \le O\Big(\sqrt{k\big(d\log k + \log(1/\delta)\big)}\Big). \tag{2}$$
We now show why uniform convergence for $N$ implies uniform convergence for $\mathcal{T}$. Combining Lemma 16 with Equation (2) we get that with probability at least $1-\delta$, every $T \in \mathcal{T}$ satisfies:
$$
\begin{aligned}
|\nu_T - V_T(C)| &\le \max\{|\nu_{T^+} - V_{T^-}(C)|,\ |\nu_{T^-} - V_{T^+}(C)|\} && \text{(by Item 1 of Lemma 16)}\\
&\le \max\{|\nu_{T^-} - V_{T^-}(C)|,\ |\nu_{T^+} - V_{T^+}(C)|\} + 10 && \text{(by Item 2 of Lemma 16)}\\
&\le O\Big(\sqrt{k\big(d\log k + \log(1/\delta)\big)}\Big). && \text{(by Eq. (2))}
\end{aligned}
$$
Here the first inequality follows from Item 1 by noticing that if $[a,b]$ and $[c,d]$ are intervals on the real line and $x \in [a,b]$, $y \in [c,d]$, then $|x - y| \le \max\{|b - c|, |d - a|\}$, and plugging in $x = \nu_T$, $y = V_T(C)$, $a = \nu_{T^-}$, $b = \nu_{T^+}$, $c = V_{T^-}(C)$, $d = V_{T^+}(C)$. This finishes the proof by setting $\varepsilon$ such that $\varepsilon k = O\big(\sqrt{k(d\log k + \log(1/\delta))}\big)$.

## 4 Algorithms based on learning thresholds-policies

We next exemplify how one can use the above properties of thresholds-policies to design algorithms. A natural algorithm would be to use the training set to learn a thresholds-policy $T$ that retains an optimal solution with $k$ items from the training set, as specified in Theorem 6, and then use this online policy to retain a subset of the $n$ items in the first phase. Theorem 7 and Theorem 13 imply that with probability $1-\delta$, the number of retained items is at most
$$m = k + O\Big(\sqrt{kd\log(d)\log(n/k) + k\log(1/\delta)}\Big)$$
and the value of the resulting solution is at least $\mathrm{OPT} - O\big(\sqrt{kd\log k + k\log(1/\delta)}\big)$.

We can improve this algorithm by combining it with the greedy algorithm of Theorem 1, described in the supplementary material. During the first phase, we retain an item $c$ only if (i) $c$ is retained by $T$, and (ii) $c$ participates in the optimal solution among the items that were retained thus far. Theorem 1 then implies that out of these $m$ items, greedy keeps a subset of $O\big(k(\log\log n + \log\log\tfrac{1}{\delta})\big)$ items in expectation that still contains a solution of value at least $\mathrm{OPT} - O\big(\sqrt{kd\log k + k\log(1/\delta)}\big)$.
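Continuing the earlier sketches (the `Item` representation, `best_assignment`, and `retains`), the following Python sketch shows one possible reading of this combined algorithm: learn thresholds from an optimal solution of the training set as in the proof of Theorem 6, and in the online phase keep an item only if it passes the learned thresholds and participates in the best solution among the items kept so far. The helper names, and the choice to set $t_i = +\infty$ when the optimal training solution assigns only dummies to $P_i$, are illustrative assumptions of this sketch, not details from the paper.

```python
import math

def learn_thresholds(train_items, capacities):
    """Theorem 6-style thresholds learned from the training set: t_i is the smallest
    value assigned to property P_i in an optimal solution of C_train (and +infinity,
    i.e. "retain nothing for P_i", if only dummy items were assigned to P_i)."""
    thresholds = [math.inf] * len(capacities)
    for c, i in best_assignment(train_items, capacities):
        thresholds[i] = min(thresholds[i], train_items[c].values[i])
    return thresholds

def screen_with_learned_policy(stream, thresholds, capacities):
    """Online first phase: keep an arriving item only if (i) the learned
    thresholds-policy retains it and (ii) it participates in the best solution
    among the items kept so far plus itself."""
    retained = []
    for item in stream:
        if not retains(item, thresholds):
            continue
        chosen = {c for c, _ in best_assignment(retained + [item], capacities)}
        if len(retained) in chosen:
            retained.append(item)
    return retained

# Second phase: output best_assignment(retained, capacities) on the retained items
# (the 0-value dummies are always available, so a feasible solution exists).
```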
We can further improve the value of the solution and guarantee that it is optimal (with respect to all $n$ items) with probability $1-\delta$. This is based on the observation that if the set of retained items contains the top $k$ items of each property $P_i$, then it also contains an optimal solution. Thus, we can compute a thresholds-policy $T$ that retains the top $k + O\big(\sqrt{k\log(d)\log(n/k) + k\log(1/\delta)}\big)$ items of each property from the training set (if the training set does not have this many items with some property, then set the corresponding threshold to 0). It then follows from Theorem 8 that with probability $1-\delta$, $T$ will retain the top $k$ items of each property in the first online phase and therefore will retain an optimal solution. Now, Theorem 8 implies that with probability $1-\delta$ the total number of items that are retained by $T$ in real time is at most $m = dk + O\big(d\sqrt{k\log(d)\log(n/k) + k\log(1/\delta)}\big)$. By filtering the retained elements with the greedy algorithm of Theorem 1 as before, it follows that the total number of retained items is at most
$$k + k\log\frac{m}{k} = O\Big(k\big(\log d + \log\log n + \log\log\tfrac{1}{\delta}\big)\Big)$$
with probability $1-\delta$. This proves Theorem 4.

## Acknowledgements

We thank an anonymous reviewer for their remarks regarding a previous version of this manuscript. Their remarks and questions eventually led us to prove Theorem 2.

## References

- M. Babaioff, N. Immorlica, and R. Kleinberg. Matroids, secretary problems, and online mechanisms. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 434–443. Society for Industrial and Applied Mathematics, 2007.
- M. Balcan, T. Sandholm, and E. Vitercik. A general theory of sample complexity for multi-item profit maximization. In EC, pages 173–174. ACM, 2018.
- A. Blum, I. Caragiannis, N. Haghtalab, A. D. Procaccia, E. B. Procaccia, and R. Vaish. Opting into optimal matchings. In SODA, pages 2351–2363. SIAM, 2017.
- S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 9780191747106.
- O. Bousquet, U. von Luxburg, and G. Rätsch, editors. Advanced Lectures on Machine Learning, ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, volume 3176 of Lecture Notes in Computer Science. Springer, 2004.
- L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.
- L. E. Celis, L. Huang, and N. K. Vishnoi. Multiwinner voting with fairness constraints. In IJCAI, pages 144–151, 2018.
- J. R. Correa, P. Dütting, F. A. Fischer, and K. Schewior. Prophet inequalities for independent random variables from an unknown distribution. CoRR, abs/1811.06114, 2018. URL http://arxiv.org/abs/1811.06114.
- T. Ezra, M. Feldman, and I. Nehama. Prophets and secretaries with overbooking. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 319–320. ACM, 2018.
- T. S. Ferguson. Who solved the secretary problem? Statistical Science, 4(3):282–289, 1989.
- S. Greenberg and M. Mohri. Tight lower bound on the probability of a binomial exceeding its expectation. CoRR, abs/1306.1433, 2013.
- A. Gupta and M. Molinaro. How the experts algorithm can help solve LPs online. Mathematics of Operations Research, 41(4):1404–1431, 2016.
- S. Har-Peled and M. Sharir. Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462–496, 2011.
- J. Hsu, J. Morgenstern, R. M. Rogers, A. Roth, and R. Vohra. Do prices coordinate markets? In STOC, pages 440–453. ACM, 2016.
- E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Courier Corporation, 2001.
- C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989. Cambridge University Press, Cambridge, 1989.
- A. Mehta et al. Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science, 8(4):265–368, 2013.
- S. Moran, M. Snir, and U. Manber. Applications of Ramsey's theorem to decision tree complexity. Journal of the ACM (JACM), 32(4):938–949, 1985.
- J. Morgenstern and T. Roughgarden. On the pseudo-dimension of nearly optimal auctions. In NIPS, pages 136–144, 2015.
- J. Morgenstern and T. Roughgarden. Learning simple auctions. In COLT, volume 49 of JMLR Workshop and Conference Proceedings, pages 1298–1318. JMLR.org, 2016.
- N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.
- M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 81(1):73–205, 1995.
- R. van Handel. Probability in high dimension. Technical report, Princeton University, 2014.
- S. Vardi. The returning secretary. In 32nd International Symposium on Theoretical Aspects of Computer Science, page 716, 2015.
- J. Vondrák. A note on concentration of submodular functions. CoRR, abs/1005.2791, 2010.