# PASTA: Pessimistic Assortment Optimization

Juncheng Dong*1, Weibin Mo*2, Zhengling Qi3, Cong Shi4, Ethan X. Fang5, Vahid Tarokh1

Abstract

We consider a class of assortment optimization problems in an offline data-driven setting. A firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortments, customer choices, and revenues. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, we propose Pessimistic ASsortment opTimizAtion (PASTA for short), an algorithm designed based on the principle of pessimism that can correctly identify the optimal assortment while only requiring the offline data to cover the optimal assortment under general settings. In particular, we establish a regret bound for the offline assortment optimization problem under the celebrated multinomial logit model. We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Numerical studies demonstrate the superiority of the proposed method over the existing baseline method.

*Equal contribution. 1Department of Electrical and Computer Engineering, Duke University, Durham, NC 27705, United States. 2Mitchell E. Daniels, Jr. School of Business, Purdue University, West Lafayette, IN 47907, United States. 3Department of Decision Sciences, George Washington University, Washington, DC 20052, United States. 4Herbert Business School, University of Miami, Coral Gables, FL 33146, United States. 5Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27705, United States. Correspondence to: Zhengling Qi.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

One of the most critical problems faced by a seller is to select products for presentation to potential buyers. Faced with limited display space and storage costs in both brick-and-mortar and online retailing, the seller needs to carefully choose a set of products from the vast collection of all available products for display to its customers. The problem of selecting an assortment, i.e., a collection of products from all available products, in order to maximize the seller's revenue is the assortment optimization problem. The choice behavior of customers (McFadden, 1981) is clearly of great importance in assortment optimization. Without loss of generality, we assume the choice of each customer can be described by a preference vector $\theta^\ast$. This subsumes the seminal multinomial logit (MNL) model (McFadden, 1973), which is arguably the most well-studied and widespread model in the assortment optimization literature (see Section 6 for more details) (Talluri & van Ryzin, 2004; Caro & Gallien, 2007; Rusmevichientong et al., 2010; Davis et al., 2013; Chen et al., 2021a; Aouad et al., 2022). In practice, $\theta^\ast$ is often unknown and needs to be estimated.
Assuming no historical data on customers, dynamic assortment optimization adaptively learns $\theta^\ast$ in a trial-and-error fashion by updating the assortment and observing the subsequent choices of customers sequentially (Caro & Gallien, 2007; Chen et al., 2020; Rusmevichientong et al., 2020; Chen et al., 2021b; Li et al., 2022). Meanwhile, in our era of Big Data, companies often collect abundant customer data. Therefore, it is often in companies' best interest to learn from existing (potentially massive) offline datasets rather than starting from scratch. Moreover, offline learning is beneficial since online exploration can sometimes be expensive or infeasible. Hence, we take a first step toward formally studying the following important question faced by every seller.

Research Question: Given a pre-collected offline dataset of historically offered assortments, customer choices, and revenues, how can we find an efficient and theoretically justified offline algorithm to estimate the optimal assortment without unrealistic assumptions on the offline dataset?

When the dataset is not adaptively collected, it is not uncommon to encounter the challenge of insufficient data coverage. For estimators such as the maximum likelihood estimator (MLE) to approximate $\theta^\ast$ accurately, the offline dataset must include sufficiently many assortments and customer choices. In other words, the data-collecting process needs to sufficiently explore different assortments (chosen by the seller) and different choices (made by the customers). This is unlikely to happen for offline datasets because the seller would not choose unreasonable assortments whose expected revenues are obviously suboptimal, and the customers would not choose products against their preferences.

Major Contributions. The main contribution of this work is two-fold. First, based on the principle of pessimism, we propose the Pessimistic ASsortment opTimizAtion (PASTA for short) framework, which correctly identifies the optimal assortment. In particular, our framework only requires that the offline dataset covers the optimal assortment instead of all possible (combinatorially many) assortments. Second, we derive the first finite-sample regret bound for offline assortment optimization under the multinomial logit (MNL) model (see Section 6), one of the most widely used models of customer choice. We subsequently propose an algorithm, also named PASTA, that efficiently solves the pessimistic assortment optimization problem. Experiments on simulated datasets (for which $\theta^\ast$ is known) corroborate the efficacy of pessimistic assortment optimization.

Paper Organization. We briefly review the related work in Section 2 and the preliminaries in Section 3. We propose pessimistic assortment optimization in Section 4. In Section 5, we present the theoretical results. In Section 6, we study pessimistic assortment optimization under the MNL model as a concrete example. In Section 7, we propose an algorithm that solves the problem efficiently. We provide experimental results in Section 8, after which we conclude.

2. Related Work

Assortment Optimization. The assortment optimization problem under the MNL model without any constraints was first studied in Talluri & van Ryzin (2004). More complicated assortment optimization problems under various types of constraints, including space requirements (Rusmevichientong et al., 2009) and cardinality (Rusmevichientong et al., 2010), were considered subsequently.
Davis et al. (2013) proposed a linear programming (LP) formulation of the assortment optimization problem that includes several previous works as special cases corresponding to different constraints in the LP. This line of work assumes that the true parameters of the customer choice models are known, or at least can be accurately estimated (Gallego & Topaloglu, 2014; Feldman & Topaloglu, 2015; Flores et al., 2019; Désir et al., 2020; Liu et al., 2020; Aouad et al., 2021). Another closely related line of work is dynamic assortment optimization (Caro & Gallien, 2007). In this setting, the seller, without any prior information about the customers, has a finite selling horizon during which it observes the choices of customers and, based on the observed behavior, optimizes its assortments in an adaptive, trial-and-error fashion (Sauré & Zeevi, 2013; Wang et al., 2018; Chen et al., 2021a; Rusmevichientong et al., 2020; Chen et al., 2020; 2021b). In contrast to the online setting of dynamic assortment optimization, our work departs from the existing literature by focusing on the offline setting, where the seller has only pre-collected datasets and no control over the data-collecting process.

Pessimism in Offline Learning. The principle of pessimism has been successfully used in reinforcement learning (RL) for finding an optimal policy with pre-collected datasets. On the empirical side, it has improved the performance of both model-based and value-based approaches in the offline setting (e.g., Yu et al., 2020; Kidambi et al., 2020; Kumar et al., 2020). The importance of pessimism has also been analyzed and verified theoretically in the RL setting (Jin et al., 2021; Fu et al., 2022). Our main contribution is to take a pessimistic approach to assortment optimization problems and demonstrate its empirical and theoretical value. Moreover, our work differs from the above works by focusing on a decision-making problem with exponentially many choices.

3. Preliminary

Let $[N] := \{1, 2, \ldots, N\}$ denote the set of $N$ distinct items. For each item $i$, a feature vector $x_i \in \mathbb{R}^d$ is available. Assume that $\{x_i\}_{i \in [N]}$ are fixed vectors. Denote the collection of all possible assortments under consideration by $\mathcal{S} \subseteq 2^{[N]} \setminus \{\emptyset\}$. For the offline data, we define a random triplet $(S, A, R)$ for each customer, where $S \subseteq [N]$ denotes the assortment presented to the customer, $A \in S \cup \{0\}$ denotes the item purchased by the customer ($A = 0$ when no purchase is made), and $R$ denotes the corresponding revenue. The ultimate goal of assortment optimization is to find an optimal set of items $s \in \mathcal{S}$ for all customers so as to maximize the expected revenue. The specific goal of this work is to study how to leverage offline data, consisting of i.i.d. samples of the random triplet $(S, A, R)$, in order to learn an optimal assortment.

For assortment optimization with offline data, a fundamental question is how to estimate the expected revenue of an unexplored assortment $s \in \mathcal{S}$. This amounts to addressing the causal relationship between assortment and revenue. Under the celebrated potential outcome framework (Rubin, 1974), let the random variable $R(s)$ be the potential revenue under an intervention that sets the assortment to $s \in \mathcal{S}$. Our goal is to find an optimal assortment
$$s^\ast \in \arg\max_{s \in \mathcal{S}} \mathbb{E}[R(s)].$$
Note that the expected potential revenue $\mathbb{E}[R(s)]$, defined in the counterfactual world, may not be identifiable from the observed data without additional assumptions.
Throughout this paper, we make the following standard consistency and unconfoundedness assumptions from causal inference.

Assumption 3.1 (Consistency). With probability one, the observed revenue coincides with the potential revenue of the observed assortment. That is, $R = R(S)$ almost surely.

Assumption 3.2 (Unconfoundedness). The potential revenues are independent of the observed assortment, i.e., $\{R(s)\}_{s \in \mathcal{S}} \perp\!\!\!\perp S$.

Assumption 3.1 ensures that the observed revenue is consistent with the potential revenue of purchasing item $A$ ($\neq 0$), or of no purchase if $A = 0$, under the observed assortment $S$. Assumption 3.2 rules out possible unobserved factors that could confound the causal effect of assortment on revenue (see Footnote 1). Denote by $\pi_S(s) := \mathbb{P}(S = s)$ the probability of observing assortment $s$ in the offline data. To non-parametrically identify $\mathbb{E}[R(s)]$ for every $s \in \mathcal{S}$, we further require the following positivity assumption (Imbens & Rubin, 2015).

Footnote 1: With the observed features $\{x_j\}_{j \in [N]}$, it is possible to relax Assumption 3.2 to a more plausible condition: the independence holds conditional on the observed features. For notational simplicity and without loss of generality, we consider Assumption 3.2.

Assumption 3.3 (Positivity Everywhere). For every $s \in \mathcal{S}$, the probability $\pi_S(s)$ of observing assortment $s$ is positive, i.e., $\pi_S(s) > 0$.

Assumption 3.3 requires that every assortment can be observed with positive probability in the offline data. This is a strong assumption that will later be relaxed to Assumption 5.1 (I), which requires positivity only at the optimum. With Assumptions 3.1-3.3, we can identify the effect of an assortment via inverse propensity score weighting (Rosenbaum & Rubin, 1983): for any $s \in \mathcal{S}$,
$$\mathbb{E}[R(s)] = \mathbb{E}\left[\frac{\mathbb{1}(S = s)\, R}{\pi_S(s)}\right], \qquad (1)$$
where the expectation on the right-hand side is taken with respect to the data distribution of $(S, A, R)$. However, when the number of possible assortments $|\mathcal{S}|$ grows exponentially in $N$, Assumption 3.3 rarely holds for all $s \in \mathcal{S}$ in practice, given potentially limited offline data, particularly when $N$ is large. Moreover, when an assortment $s$ yields an inferior expected revenue $\mathbb{E}[R(s)]$, it may not be considered by the seller at all. As a consequence, the probability $\pi_S(s)$ of observing such an assortment is zero. These issues may prevent us from estimating (1) for every assortment $s \in \mathcal{S}$. One may be tempted to use the following identification strategy:
$$\mathbb{E}[R(s)] = \mathbb{E}(R \mid S = s) \;\; \text{(by Assumptions 3.1-3.2)} \;=\; \mathbb{E}[\mathbb{E}(R \mid S = s, A)] \;=\; \sum_{i \in s \cup \{0\}} \pi_A(i \mid s; x)\, r_{s,i}, \qquad (2)$$
where $x := \{x_j\}_{j \in [N]}$ are the features across items, $\pi_A(i \mid s; x) := \mathbb{P}(A = i \mid S = s)$ is the customer's choice probability (McFadden, 1973) of purchasing the $i$-th item given an assortment $s$, and $r_{s,i} := \mathbb{E}(R \mid S = s, A = i)$ is the conditional expected revenue given the assortment $s$ with the $i$-th item being purchased. For ease of notation, we omit the features $x$ in $\pi_A$ when there is no confusion. Identifying $\mathbb{E}[R(s)]$ as above requires knowledge of $\pi_A(i \mid s)$ and $r_{s,i}$, which can be learned from data. Although this identification approach does not explicitly depend on $\pi_S(s)$, full identification of $\pi_A(i \mid s)$ and $r_{s,i}$ requires the positivity of $\pi_S(s)$ for every $s \in \mathcal{S}$, as assumed above. Despite the aforementioned challenge of insufficient coverage over assortments, we argue that finding an optimal assortment $s^\ast$ may not require $\pi_S(s) > 0$ everywhere, but only at the optimal assortment $s^\ast$.
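To make the identification formula (1) concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper; all names are ours) that estimates $\mathbb{E}[R(s)]$ from logged triples $(S_i, A_i, R_i)$ by inverse propensity weighting, assuming the logging probabilities $\pi_S$ are known:

```python
import numpy as np

def ipw_revenue_estimate(target, assortments, revenues, propensities):
    """Estimate E[R(s)] for a target assortment via Eq. (1).

    assortments : list of frozensets, the logged assortments S_i
    revenues    : array of observed revenues R_i
    propensities: array of logging probabilities pi_S(S_i) > 0
    """
    match = np.array([s == target for s in assortments])
    # Inverse propensity weighting: 1(S_i = s) * R_i / pi_S(s)
    return np.mean(match * revenues / propensities)

# Toy usage: two logged assortments over three items.
rng = np.random.default_rng(0)
logged = [frozenset({1, 2}), frozenset({1, 3})]
pi_S = {frozenset({1, 2}): 0.7, frozenset({1, 3}): 0.3}
S = [logged[i] for i in rng.choice(2, size=1000, p=[0.7, 0.3])]
R = rng.uniform(0.5, 0.8, size=1000)  # stand-in revenues
probs = np.array([pi_S[s] for s in S])
print(ipw_revenue_estimate(frozenset({1, 2}), S, R, probs))
```

The estimator is unbiased only when the target assortment has positive logging probability, which is exactly the coverage issue discussed next.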
In particular, based on (2), when computing
$$s^\ast \in \arg\max_{s \in \mathcal{S}} \sum_{i \in s \cup \{0\}} \pi_A(i \mid s)\, r_{s,i}, \qquad (3)$$
we do not necessarily need to estimate $\pi_A(i \mid s)$ and $r_{s,i}$ well for $s \neq s^\ast$, as long as sub-optimal assortments can be safely ruled out during the optimization. Our insight is that the estimation of $\pi_A(i \mid s)$ and $r_{s,i}$ for assortments $s$ that are rarely seen in the data often incurs large errors. Deploying pessimism by taking the estimation error into consideration can rule out those assortments (Jin et al., 2021), while standard predict-then-optimize (Bertsimas & Kallus, 2020) or empirical maximization approaches (Zhao et al., 2012) may suffer from an overestimation of $\mathbb{E}[R(s)]$. Hence, in our proposed pessimistic assortment optimization framework, we only require positivity at the optimum, $\pi_S(s^\ast) > 0$, which is a much weaker requirement than Assumption 3.3.

In this paper, we focus on handling the estimation error in $\pi_A(i \mid s)$ while assuming that $r_{s,i}$ is known. This is a typical assumption in the assortment optimization literature (Talluri & van Ryzin, 2004; Davis et al., 2013; Flores et al., 2019; Aouad et al., 2021). Our framework can be naturally extended to the scenario where the $r_{s,i}$'s also need to be estimated. For optimization tractability, we further assume that $r_{s,i} = r_i$, i.e., the expected revenue depends only on the purchased item and not on the underlying assortment. This assumption is reasonable in many applications where the revenue is a deterministic consequence of the purchased item. It can also be relaxed under our pessimism framework, but doing so could result in a more complicated assortment optimization problem. Below, without loss of generality, we assume that $r_i \geq 0$ for $i \in [N]$, while $r_0 = 0$ (no purchase incurs zero revenue).

For any vector $x$, let $x^\top$ and $\|x\|_2$ respectively denote the transpose and $\ell_2$-norm of $x$. For any set $A$, let $|A|$ denote its cardinality. For any two sequences $\{\varpi(n)\}_{n \geq 1}$ and $\{\gamma(n)\}_{n \geq 1}$, we write $\varpi(n) \gtrsim \gamma(n)$ (resp. $\varpi(n) \lesssim \gamma(n)$) whenever there exists a constant $c_1 > 0$ (resp. $c_2 > 0$) such that $\varpi(n) \geq c_1 \gamma(n)$ (resp. $\varpi(n) \leq c_2 \gamma(n)$). Moreover, we write $\varpi(n) \asymp \gamma(n)$ whenever $\varpi(n) \gtrsim \gamma(n)$ and $\varpi(n) \lesssim \gamma(n)$.

4. Pessimistic Offline Assortment Optimization

In this section, we introduce our pessimistic offline assortment optimization framework. To this end, based on Eq. (2), we first estimate the choice probability $\pi_A(i \mid s)$ from offline data, and then evaluate the objective of optimization problem (3) using a plug-in estimator of $\pi_A(\cdot)$. Consider a generic model $\pi_A(a \mid s; \theta^\ast, x)$ with unknown true parameter $\theta^\ast$. Again, for ease of notation, we omit the features $x$ in $\pi_A$ when there is no confusion. We remark that $\theta^\ast$ could be either finite-dimensional or infinite-dimensional. Given an offline dataset $\mathcal{D} = \{(S_i, A_i, R_i)\}_{i=1}^n$, where $n$ is the sample size, one can estimate the model parameter $\theta^\ast$ via the maximum likelihood estimator (MLE). Specifically, define the likelihood-based loss function
$$\hat{L}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log \pi_A(A_i \mid S_i; \theta).$$
Then the MLE of the unknown parameter $\theta^\ast$ is $\hat{\theta}_{\mathrm{ML},n} \in \arg\min_{\theta \in \Theta} \hat{L}_n(\theta)$, where $\Theta$ is a pre-specified parameter space. Let
$$V(s; \hat{\theta}_{\mathrm{ML},n}) := \sum_{i \in s} \pi_A(i \mid s; \hat{\theta}_{\mathrm{ML},n})\, r_i.$$
Here, $V(s; \theta^\ast)$ is the value function of $s$ under the customer choice model $\pi_A$ with parameter $\theta^\ast$. The plug-in estimator of the optimal assortment based on (3) is
$$\hat{s}_{\mathrm{ML},n} \in \arg\max_{s \in \mathcal{S}} V(s; \hat{\theta}_{\mathrm{ML},n}).$$
The MLE-based approach first plugs in the MLE of $\theta^\ast$, and then directly optimizes the corresponding estimated value function.
As discussed before, a disadvantage of this estimate-then-optimize approach is that the estimation error of $\hat{\theta}_{\mathrm{ML},n}$ caused by insufficient data coverage may result in an overestimation of $V(s; \theta^\ast)$, which propagates to the downstream optimization. Alternatively, we can quantify the estimation uncertainty by considering the following likelihood-ratio-test-based confidence region (Owen, 1990):
$$\Omega_n(\alpha_n) := \{\theta \in \Theta : \hat{L}_n(\theta) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) \leq \alpha_n\},$$
where $\alpha_n > 0$ is pre-specified. Later we analyze the MNL model as a special case (see Section 6). With $\alpha_n$ chosen of order $O(d/n)$, we establish in Lemma 6.1 that $\theta^\ast \in \Omega_n(\alpha_n)$ with high probability. Such a guarantee does not require any data coverage assumption on assortments. From now on, for simplicity, we drop $\alpha_n$ and write $\Omega_n$ for $\Omega_n(\alpha_n)$ when there is no ambiguity.

In order to robustify assortment optimization against plug-in estimation errors, we consider a pessimistic version of (3) that takes the estimation uncertainty encoded by $\Omega_n$ into account. Specifically, we propose Pessimistic ASsortment opTimizAtion (PASTA), which solves
$$\hat{s}_{\mathrm{PASTA},n} \in \arg\max_{s \in \mathcal{S}} \min_{\theta \in \Omega_n} V(s; \theta). \qquad (4)$$
Here, for a fixed assortment $s \in \mathcal{S}$, the inner minimization computes the worst-case value among all model parameters $\theta$ within the confidence set $\Omega_n$. In particular, if the estimated value $V(s; \hat{\theta}_{\mathrm{ML},n})$ for $s$ is highly uncertain due to insufficient data coverage, the worst-case value $\min_{\theta \in \Omega_n} V(s; \theta)$ is likely much smaller than $V(s; \theta^\ast)$. In that case, the outer maximization of (4) may prefer another assortment with a relatively higher worst-case value. In this way, the inner layer of (4) rules out assortments that appear infrequently in the offline data. One essential advantage of this strategy is that it avoids overestimating the value function. In other words, under the plug-in approach, with non-negligible probability, the estimated value $V(s; \hat{\theta}_{\mathrm{ML},n})$ can be much larger than the truth $V(s; \theta^\ast)$, which can lead the MLE-based approach to select a sub-optimal assortment. In contrast, PASTA is aware of insufficient data coverage, and is hence more pessimistic about highly uncertain value estimates. In the next section, we theoretically analyze the advantage of the PASTA approach.
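The ingredients of this section are easy to prototype. Below is a minimal sketch (our own illustration, not the authors' code) that fits the MLE by minimizing the empirical negative log-likelihood and then tests membership in the likelihood-ratio confidence region $\Omega_n$; the MNL form of $\pi_A$ anticipates Section 6, and all helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def nll(theta, S_list, A_list, X):
    """Empirical negative log-likelihood L_n(theta) under an MNL model.
    Items are indexed 1..N; A = 0 means no purchase (row 0 of X unused)."""
    total = 0.0
    for s, a in zip(S_list, A_list):
        util = np.exp(X[sorted(s)] @ theta)      # exp(x_j^T theta), j in s
        denom = 1.0 + util.sum()                 # "+1" is the no-purchase option
        p = 1.0 / denom if a == 0 else np.exp(X[a] @ theta) / denom
        total -= np.log(p)
    return total / len(S_list)

def fit_mle(S_list, A_list, X, d):
    """MLE: minimize L_n (unconstrained here for simplicity)."""
    res = minimize(nll, np.zeros(d), args=(S_list, A_list, X))
    return res.x, res.fun                        # (theta_hat, L_n(theta_hat))

def in_omega_n(theta, L_min, alpha_n, S_list, A_list, X):
    """Membership in Omega_n = {L_n(theta) - L_n(theta_hat) <= alpha_n}."""
    return nll(theta, S_list, A_list, X) - L_min <= alpha_n
```

The membership oracle is all the inner minimization of (4) needs in order to stay inside the confidence region.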
5. Theoretical Results

In this section, we show that the PASTA method (4) enjoys a generic regret guarantee under the weak assumption of positivity at the optimum, i.e., $\pi_S(s^\ast) > 0$. Specifically, given $\hat{s}_{\mathrm{PASTA},n}$ in (4), we adopt the following regret as the performance metric:
$$\mathcal{R}(\hat{s}_{\mathrm{PASTA},n}) = V(s^\ast; \theta^\ast) - V(\hat{s}_{\mathrm{PASTA},n}; \theta^\ast).$$
We aim to derive a bound on $\mathcal{R}(\hat{s}_{\mathrm{PASTA},n})$ under generic conditions. Denote by $L(\theta) = -\mathbb{E}[\log \pi_A(A \mid S; \theta)]$ the population loss function. All detailed proofs can be found in the Appendix. We first show that whenever $\theta^\ast \in \Omega_n$, i.e., the confidence region covers the true parameter, the regret of the PASTA method is controlled by the worst-case estimation error over $\theta \in \Omega_n$ of the value function at the optimal assortment $s^\ast$.

Lemma 5.1. Let $\hat{s}_{\mathrm{PASTA},n}$ be the solution of the PASTA method defined in (4). If $\theta^\ast \in \Omega_n$, then
$$\mathcal{R}(\hat{s}_{\mathrm{PASTA},n}) \leq \max_{\theta \in \Omega_n} \{V(s^\ast; \theta^\ast) - V(s^\ast; \theta)\}.$$

Proof of Lemma 5.1.
$$\begin{aligned}
V(s^\ast; \theta^\ast) - V(\hat{s}_{\mathrm{PASTA},n}; \theta^\ast)
&\leq V(s^\ast; \theta^\ast) - \min_{\theta \in \Omega_n} V(\hat{s}_{\mathrm{PASTA},n}; \theta) && (\text{by } \theta^\ast \in \Omega_n) \\
&\leq V(s^\ast; \theta^\ast) - \min_{\theta \in \Omega_n} V(s^\ast; \theta) && (\text{since } \hat{s}_{\mathrm{PASTA},n} \text{ solves (4)}) \\
&= \max_{\theta \in \Omega_n} \{V(s^\ast; \theta^\ast) - V(s^\ast; \theta)\}.
\end{aligned}$$

Lemma 5.1 highlights the benefit of pessimistic optimization. In particular, regret guarantees for estimate-then-optimize approaches require controlling the estimation error uniformly over all $s \in \mathcal{S}$, that is, $\sup_{s \in \mathcal{S}} |V(s; \hat{\theta}_{\mathrm{ML},n}) - V(s; \theta^\ast)|$, in order to control the error induced by the greedy policy, i.e., $|V(\hat{s}_{\mathrm{ML},n}; \hat{\theta}_{\mathrm{ML},n}) - V(\hat{s}_{\mathrm{ML},n}; \theta^\ast)|$. This typically entails that the value function be estimated uniformly well across $s \in \mathcal{S}$, which explains why an estimate-then-optimize approach requires the positivity Assumption 3.3 for all $s \in \mathcal{S}$. In contrast, our Lemma 5.1 shows that controlling the estimation error at $s^\ast$ suffices. Therefore, our PASTA method only requires positivity at the optimum. Next, we impose the following assumptions to obtain the regret guarantee of our algorithm.

Assumption 5.1. (I) [Positivity at Optimum] The probability of observing the optimal assortment is positive, that is, $\pi_S(s^\ast) > 0$. (II) [Likelihood-Based Concentration] For any $0 < \delta < 1$, with probability at least $1 - \delta$: (1) $\theta^\ast \in \Omega_n$, and (2) for all $\theta \in \Omega_n$, $\big| L(\theta) - L(\theta^\ast) - \{\hat{L}_n(\theta) - \hat{L}_n(\theta^\ast)\} \big| \leq \alpha_n$.

We emphasize that PASTA only requires positivity at the optimum. Compared to positivity at all assortments in Assumption 3.3, our Assumption 5.1 (I) is much weaker and hence more plausible. Assumption 5.1 (II) is a generic likelihood-based concentration condition; we later verify that it indeed holds under the general MNL model (Lemmas 6.1 and 6.2). In particular, Statement (1) of Part (II) requires the validity of the likelihood-ratio-test-based confidence region $\Omega_n$, while Statement (2) of Part (II) requires the concentration of the likelihood-based localized empirical process (van der Vaart & Wellner, 1996). Positivity at the optimum is associated with a finite constant $C_{s^\ast} = 1/\pi_S(s^\ast)$ that governs the learning performance. We also denote by $r_{s^\ast} := \max_{j \in s^\ast} r_j$ the largest possible revenue among all items in $s^\ast$. Notice that both constants $C_{s^\ast}$ and $r_{s^\ast}$ depend only on the optimal assortment $s^\ast$. In the following lemma, we establish the estimation error bound at the optimal assortment $s^\ast$.

Lemma 5.2. Under Assumption 5.1, for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have for any $\theta \in \Omega_n$,
$$V(s^\ast; \theta^\ast) - V(s^\ast; \theta) \lesssim r_{s^\ast} C_{s^\ast} \sqrt{\alpha_n}.$$

Combining Lemmas 5.1 and 5.2, we obtain the regret bound for PASTA in the following theorem.

Theorem 5.3. Under Assumption 5.1, for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\mathcal{R}(\hat{s}_{\mathrm{PASTA},n}) \lesssim r_{s^\ast} C_{s^\ast} \sqrt{\alpha_n}.$$

6. Application: Multinomial Logit Model

In this section, we consider the multinomial logit (MNL) model for the customer choice probabilities $\pi_A(a \mid s)$. This is one of the most widely used models in the assortment optimization literature (Feng et al., 2022). Under the MNL model, we verify Assumption 5.1 (II) and establish the corresponding regret bound for PASTA. Given the item-specific features $\{x_i\}_{i \in [N]}$, MNL assumes that a customer's preference for the $i$-th item is proportional to $\exp(x_i^\top \theta^\ast)$, where $\theta^\ast \in \Theta$ is the underlying unknown parameter. Here, we assume that the parameter space $\Theta \subseteq \mathbb{R}^d$ is compact with $\theta_{\max} := \sup_{\theta \in \Theta} \|\theta\|_2 < \infty$. Given an assortment $s$, the customer choice probability under MNL is
$$\pi_A(i \mid s; \theta^\ast) = \frac{\exp(x_i^\top \theta^\ast)}{1 + \sum_{j \in s} \exp(x_j^\top \theta^\ast)}, \quad \forall i \in s. \qquad (5)$$
Moreover, the probability of no purchase is $\pi_A(0 \mid s; \theta^\ast) = 1/\big(1 + \sum_{j \in s} \exp(x_j^\top \theta^\ast)\big)$. Based on (3) and the MNL model (5), the objective function for assortment optimization can be written as
$$V(s; \theta) = \frac{\sum_{i \in s} r_i \exp(x_i^\top \theta)}{1 + \sum_{i \in s} \exp(x_i^\top \theta)}.$$
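As a quick numeric illustration of (5) and of $V(s; \theta)$ (a sketch of ours, not the authors' code; the toy numbers are arbitrary), the following computes MNL choice probabilities and the expected revenue of an assortment:

```python
import numpy as np

def mnl_probs(s, theta, X):
    """Choice probabilities (5): returns (P(no purchase), {i: P(i)})."""
    idx = sorted(s)
    util = np.exp(X[idx] @ theta)
    denom = 1.0 + util.sum()
    return 1.0 / denom, dict(zip(idx, util / denom))

def value(s, theta, X, r):
    """V(s; theta) = sum_{i in s} pi_A(i | s; theta) * r_i."""
    _, probs = mnl_probs(s, theta, X)
    return sum(probs[i] * r[i] for i in s)

# Example with N = 3 items (rows 1..3 of X), d = 2 features; row 0 unused.
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, -0.3], [-0.4, 0.8]])
r = {1: 0.6, 2: 0.7, 3: 0.55}
theta_star = np.array([0.3, -0.2])
print(value({1, 3}, theta_star, X, r))
```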
We first justify Statement (1) of Assumption 5.1 (II) under the MNL model. To this end, by the compactness of $\Theta$, there exists a finite constant $C_A > 0$ such that $1/\pi_A(i \mid s; \theta) \leq C_A$ for all $\theta \in \Theta$, $s \in \mathcal{S}$, and $i \in s$.

Lemma 6.1. Consider the MNL model (5) with a compact parameter space $\Theta$. Assume that $\theta^\ast \in \Theta$. For any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\hat{L}_n(\theta^\ast) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) \lesssim \frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}.$$

Lemma 6.1 suggests that, with $\alpha_n$ chosen as $\frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}$, we can guarantee that $\theta^\ast \in \Omega_n$ with high probability, which justifies Statement (1) of Assumption 5.1 (II). In particular, the order of $\alpha_n$ is $O(d/n)$. Notice that Lemma 6.1 does not depend on the distribution of $S$, which implies that no data coverage assumption on the observed assortments is required. The assumption that $\theta^\ast$ lies in a compact set $\Theta$ amounts to requiring that, given any assortment, every product has a chance of being selected by the customer in the data. This is a mild requirement since $\theta^\ast$ is always finite. Next, we justify Statement (2) of Assumption 5.1 (II).

Lemma 6.2. Consider the MNL model (5). Suppose the conditions of Lemma 6.1 hold, and that $L(\theta)$ and $\hat{L}_n(\theta)$ are uniformly strongly convex. Let $\alpha_n = \frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}$. For any $0 < \delta < 1$, with probability at least $1 - \delta$, we have for all $\theta \in \Omega_n$,
$$\big| L(\theta) - L(\theta^\ast) - \{\hat{L}_n(\theta) - \hat{L}_n(\theta^\ast)\} \big| \leq \alpha_n.$$

Finally, with positivity only at the optimum (Assumption 5.1 (I)), we can apply Theorem 5.3 to establish the regret bound for PASTA under MNL.

Theorem 6.3. Consider the MNL model (5). Assume that the conditions of Lemma 6.2 hold and that $\pi_S(s^\ast) > 0$. Fix $\delta \in (0, 1)$. Suppose $\hat{s}_{\mathrm{PASTA},n}$ is the output of PASTA with $\alpha_n = \frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}$. Then with probability at least $1 - \delta$, we have
$$\mathcal{R}(\hat{s}_{\mathrm{PASTA},n}) \lesssim r_{s^\ast} C_{s^\ast} \sqrt{\frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}}.$$

We remark that under the MNL model (5), the order of the regret is $O(\sqrt{d/n})$. This follows from the concentration rate of the MNL empirical likelihood ratio in Lemma 6.1. This rate matches those in the literature under parametric model assumptions (Qian & Murphy, 2011; Mo & Liu, 2022). However, the existing literature requires positivity $\pi_S(s) > 0$ at every $s \in \mathcal{S}$, whereas Theorem 6.3 only requires positivity $\pi_S(s^\ast) > 0$ at the optimal assortment $s^\ast$. Furthermore, one can show that $\min_{i \in [N]} \pi_A(i \mid S = [N]; \theta) \leq 1/N$ for any $\theta \in \Theta$, which implies $C_A \geq N$. Therefore, our regret bound is of order at least $\sqrt{N}$, where $N$ is the total number of available items. Establishing a minimax lower bound for offline assortment optimization in terms of $N$, $n$, $d$, and the cardinality of $s^\ast$ is an interesting open problem that we leave for future work.

7. PASTA Algorithm

In this section, we propose an efficient algorithm for solving the max-min problem (4) under the MNL model. With
$$V(s; \theta) = \frac{\sum_{i \in s} r_i \exp(x_i^\top \theta)}{1 + \sum_{j \in s} \exp(x_j^\top \theta)}$$
and the confidence set $\Omega_n$, we wish to solve $\max_{s \in \mathcal{S}} \min_{\theta \in \Omega_n} V(s; \theta)$. The proposed iterative algorithm is executed for at most $T$ iterations. At the $t$-th iteration, given $s_t$ and $\theta_t$ from the previous iteration, we consecutively execute the following two steps:

Step 1: Compute the optimal assortment $s_{t+1}$ given $\theta_t$ (see Section 7.1).
Step 2: Compute the worst-case parameter $\theta_{t+1}$ given $s_{t+1}$ (see Section 7.2).

The corresponding pseudo-code is presented in Algorithm 1 below.

Algorithm 1 PASTA
Input: offline dataset $\{(S_i, A_i, R_i)\}_{i=1}^n$; $\alpha_n$; $\{r_i\}_{i=1}^N$; $\{x_i\}_{i=1}^N$; maximum number of iterations $T$.
Output: the solution $\hat{s}$ to the pessimistic assortment optimization problem.
  $\hat{L}_n(\theta) := -\frac{1}{n} \sum_{i=1}^n \log \pi_A(A_i \mid S_i; \theta)$
  $\hat{\theta}_{\mathrm{ML},n} \leftarrow \arg\min_{\theta \in \Theta} \hat{L}_n(\theta)$
  $\Omega_n \leftarrow \{\theta \in \Theta : \hat{L}_n(\theta) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) \leq \alpha_n\}$
  $\theta_0 \leftarrow \hat{\theta}_{\mathrm{ML},n}$   /* initialize $\theta_0$ as the MLE */
  for $t = 1$ to $T$ do
    $s_t \leftarrow \mathrm{SolveLP}(\theta_{t-1}, \{r_i\}_{i=1}^N, \{x_i\}_{i=1}^N)$   /* Section 7.1 */
    $\theta_t \leftarrow \mathrm{SolveGD}(s_t, \Omega_n, \{r_i\}_{i=1}^N, \{x_i\}_{i=1}^N)$   /* Section 7.2 */
  end for
  $\hat{s} \leftarrow s_T$
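A compact Python rendering of Algorithm 1's alternating loop follows (our sketch; `solve_lp` and `solve_gd` stand in for the subroutines of Sections 7.1 and 7.2 and are assumed to be implemented as described there):

```python
def pasta(data, alpha_n, T, fit_mle, nll, solve_lp, solve_gd):
    """Alternating max-min solver for problem (4), following Algorithm 1.

    fit_mle(data)          -> (theta_hat, L_n(theta_hat))
    nll(theta, data)       -> empirical negative log-likelihood L_n(theta)
    solve_lp(theta)        -> revenue-maximizing assortment for fixed theta
    solve_gd(s, in_region) -> worst-case theta in Omega_n for fixed s
    """
    theta_hat, L_min = fit_mle(data)
    # Membership oracle for Omega_n = {L_n(theta) - L_n(theta_hat) <= alpha_n}
    in_region = lambda theta: nll(theta, data) - L_min <= alpha_n
    theta, s = theta_hat, None
    for _ in range(T):
        s = solve_lp(theta)              # Step 1 (Section 7.1)
        theta = solve_gd(s, in_region)   # Step 2 (Section 7.2)
    return s
```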
7.1. Optimal Assortment Computation

Given the MNL model parameter $\theta_t$, computing the assortment $s_{t+1}$ that maximizes the expected revenue can be formulated as a linear programming (LP) problem. An assortment $s$ can be represented by an $N$-dimensional binary vector $\gamma \in \{0, 1\}^N$ with $\gamma_j = 1$ if and only if $j \in s$. Suppose that $\mathcal{S}$ corresponds to the following feasible set for $\gamma$ with $M$ linear inequality constraints:
$$\Gamma = \Big\{\gamma \in \{0, 1\}^N : \sum_{j \in [N]} a_{ij} \gamma_j \leq b_i \text{ for } i \in [M]\Big\},$$
where the matrix of constraint coefficients $[a_{ij}]_{i \in [M], j \in [N]}$ is totally unimodular (Pang, 2017). In other words, through the one-to-one correspondence between $s$ and $\gamma$, we have $s \in \mathcal{S}$ if and only if $\gamma \in \Gamma$. Next, we denote by $v_i = \exp(x_i^\top \theta_t)$ the preference score of the $i$-th item. The customer choice probability under the MNL model (5) becomes $\pi_A(i \mid s) = v_i / (1 + \sum_{j \in s} v_j)$. The optimization for $s_{t+1}$ can be formulated as
$$\max_{\gamma \in \Gamma} \frac{\sum_{i \in [N]} r_i v_i \gamma_i}{1 + \sum_{i \in [N]} v_i \gamma_i}, \qquad (6)$$
which is equivalent to the following linear programming problem (Davis et al., 2013):
$$\begin{aligned}
\max_{w_j : j \in [N] \cup \{0\}} \quad & \sum_{j \in [N]} r_j w_j \\
\text{subject to} \quad & \sum_{j \in [N]} w_j + w_0 = 1, \\
& \sum_{j \in [N]} a_{ij} \frac{w_j}{v_j} \leq b_i w_0 \quad \forall i \in [M], \\
& 0 \leq \frac{w_j}{v_j} \leq w_0 \quad \forall j \in [N]. \qquad (7)
\end{aligned}$$
In particular, we can recover the optimal solution $\gamma^\ast$ of Problem (6) from the optimal solution $w^\ast$ of Problem (7) via
$$\gamma_j^\ast = \frac{w_j^\ast}{v_j w_0^\ast} \quad \forall j \in [N]. \qquad (8)$$
To conclude, at the $t$-th iteration, in order to compute an optimal assortment $s_{t+1}$ for a given $\theta_t$, we first solve the LP (7) for $w^\ast$ and then recover $\gamma^\ast$ via (8). The updated assortment $s_{t+1}$ is obtained through the correspondence $i \in s_{t+1}$ if and only if $\gamma_i^\ast = 1$.
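For concreteness, here is a sketch of the LP step (7)-(8) using `scipy.optimize.linprog`, specialized to the simple cardinality constraint $\sum_j \gamma_j \leq K$ (so $M = 1$, $a_{1j} = 1$, $b_1 = K$); this is our illustration rather than the authors' implementation, with items indexed $0, \ldots, N-1$ for array convenience:

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp(theta, r, X, K):
    """One PASTA LP step: revenue-maximizing assortment of size <= K
    for fixed MNL parameter theta, via the LP (7) and recovery (8)."""
    N = len(r)
    v = np.exp(X @ theta)                 # preference scores v_j
    c = -np.append(r, 0.0)                # maximize sum_j r_j w_j; last var is w_0
    A_eq = np.ones((1, N + 1))            # sum_j w_j + w_0 = 1
    b_eq = np.array([1.0])
    # Inequalities: sum_j w_j / v_j <= K * w_0  and  w_j / v_j <= w_0
    A_ub = np.zeros((N + 1, N + 1))
    A_ub[0, :N] = 1.0 / v
    A_ub[0, N] = -K
    A_ub[1:, :N] = np.diag(1.0 / v)
    A_ub[1:, N] = -1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(N + 1), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (N + 1))
    w, w0 = res.x[:N], res.x[N]
    gamma = w / (v * w0)                  # recovery formula (8)
    return {j for j in range(N) if gamma[j] > 0.5}
```

Total unimodularity of the constraint matrix is what guarantees that the recovered $\gamma$ is (numerically) binary, so the 0.5 threshold is safe.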
7.2. Model Parameter Computation

Given the optimized assortment $s_{t+1}$ from Section 7.1, we search for the worst-case MNL parameter $\theta_{t+1}$ in the confidence set $\Omega_n$ that minimizes the expected revenue. In particular, we employ a gradient descent with line search (GDLS) method to compute $\theta_{t+1}$ by solving
$$\min_{\theta \in \Omega_n} V(s_{t+1}; \theta). \qquad (9)$$
Here, we remark that
$$V(s_{t+1}; \theta) = \frac{\sum_{i \in s_{t+1}} r_i \exp(x_i^\top \theta)}{1 + \sum_{i \in s_{t+1}} \exp(x_i^\top \theta)}$$
is a locally Lipschitz function of $\theta$. Given a feasible initial parameter $\theta^{(0)} \in \Omega_n$, we run at most $L$ gradient descent steps. Let $\beta_\ell$ be the step size of the $\ell$-th step. At each step $\ell = 1, 2, \ldots, L$, we perform a line search to maintain feasibility. In particular, given $\theta^{(\ell-1)} \in \Omega_n$, we first evaluate the gradient $\xi_\ell = \nabla_\theta V(s_{t+1}; \theta^{(\ell-1)})$. We then initialize $\beta_\ell$ with a pre-specified step size $\tilde{\beta}$ and check whether $\theta^{(\ell)} = \theta^{(\ell-1)} - \beta_\ell \xi_\ell$ is feasible, i.e., $\theta^{(\ell)} \in \Omega_n$. If not, we set $\beta_\ell \leftarrow c \beta_\ell$ for a pre-specified $c \in (0, 1)$ and recompute $\theta^{(\ell)} = \theta^{(\ell-1)} - \beta_\ell \xi_\ell$. This search is repeated until $\theta^{(\ell)}$ is feasible. The pseudo-code for the overall process is given in Algorithm 2. Note that $L$, $\tilde{\beta}$, and $c$ are all hyper-parameters. In all of our numerical studies, we set $L = 2$, $\tilde{\beta} = 0.01$, and $c = 1/2$, which performs well empirically.

Algorithm 2 Gradient Descent with Line Search (GDLS)
Input: assortment $s_{t+1}$; feasible set $\Omega_n$; initial parameter $\theta^{(0)}$; initial step size $\tilde{\beta}$; step shrinkage constant $c$; number of descent steps $L$.
Output: the updated parameter $\theta_{t+1}$.
  for $\ell = 1$ to $L$ do
    $\xi_\ell \leftarrow \nabla_\theta V(s_{t+1}; \theta^{(\ell-1)})$   /* compute the gradient */
    $\beta_\ell \leftarrow \tilde{\beta}$
    $\theta^{(\ell)} \leftarrow \theta^{(\ell-1)} - \beta_\ell \xi_\ell$
    while $\theta^{(\ell)} \notin \Omega_n$ do
      $\beta_\ell \leftarrow c \beta_\ell$   /* decrease the step size */
      $\theta^{(\ell)} \leftarrow \theta^{(\ell-1)} - \beta_\ell \xi_\ell$
    end while
  end for
  $\theta_{t+1} \leftarrow \theta^{(L)}$

8. Experiments

We compare the PASTA method with assortment optimization without pessimism (referred to as the baseline method in the sequel). Both methods are evaluated on synthetic data for which the optimal assortment $s^\ast$ and the true parameter $\theta^\ast$ are known, so that true regrets can be computed. We describe the data generation process and the baseline method in detail below.

8.1. Data Generation

We consider assortment optimization scenarios described by $(N, K, d, n, p)$, where $N$ is the total number of available products; $K$ is the cardinality constraint on the assortments, i.e., $\mathcal{S} = \{s : |s| \leq K\}$; $d$ is the dimension of $\theta^\ast$ and of $\{x_j\}_{j=1}^N$; $n$ is the sample size of the offline dataset; and $p$ is the probability of sampling the optimal assortment $s^\ast$. Similar to Chen et al. (2020), we first generate the true preference vector $\theta^\ast$ as a uniformly random unit $d$-dimensional vector. For $i \in \{1, \ldots, N\}$, we generate $r_i$ (the reward of product $i$) uniformly from the range $[0.5, 0.8]$, and generate $x_i$ (the feature of product $i$) as a uniformly random unit $d$-dimensional vector such that $\exp(x_i^\top \theta^\ast) \leq \exp(-0.6)$, to avoid degenerate cases in which the optimal assortment includes too few items. Given this information, the true optimal assortment $s^\ast$ can be computed. We then generate an offline dataset $\mathcal{D} = \{(S_i, A_i, R_i)\}_{i=1}^n$ with $n$ samples. For $i \in \{1, \ldots, n\}$, we generate $S_i$ from the distribution $\pi_S$ with $\pi_S(s^\ast) = p$ and $\pi_S(s) = \frac{1-p}{|\mathcal{S}| - 1}$ for $s \neq s^\ast$, where $0 < p < 1$ is the probability of observing the optimal assortment $s^\ast$. After the assortment $S_i$ is sampled, the customer choice (action) $A_i$ is sampled according to the MNL probabilities in Eq. (5) with the true parameter $\theta^\ast$.
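A minimal sketch of this generator (ours; it assumes $s^\ast$ is given, and for brevity draws the non-optimal assortments as uniform random subsets of size $|s^\ast|$ rather than uniformly over all of $\mathcal{S}$):

```python
import numpy as np

def random_unit(d, rng):
    u = rng.normal(size=d)
    return u / np.linalg.norm(u)

def generate_instance(N, d, n, p, s_star, theta_star, rng):
    """Synthetic offline data as in Section 8.1. Items are indexed 1..N;
    index 0 is the no-purchase option (row 0 of X and r[0] unused/zero)."""
    r = np.zeros(N + 1)
    r[1:] = rng.uniform(0.5, 0.8, size=N)
    X = np.zeros((N + 1, d))
    for i in range(1, N + 1):               # rejection-sample features;
        x = random_unit(d, rng)             # illustrative only, slow for large d
        while x @ theta_star > -0.6:        # enforce exp(x^T theta*) <= exp(-0.6)
            x = random_unit(d, rng)
        X[i] = x
    data = []
    for _ in range(n):
        s = s_star if rng.random() < p else frozenset(
            rng.choice(np.arange(1, N + 1), size=len(s_star), replace=False))
        items = sorted(s)
        util = np.exp(X[items] @ theta_star)
        probs = np.append(util, 1.0) / (1.0 + util.sum())   # last entry: A = 0
        pick = rng.choice(len(items) + 1, p=probs)
        a = 0 if pick == len(items) else items[pick]
        data.append((s, a, r[a]))           # r[0] = 0: no purchase, zero revenue
    return r, X, data
```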
8.2. Baseline

In our experiments, the baseline method uses gradient descent to find the MLE $\hat{\theta}_{\mathrm{ML},n}$ that minimizes the empirical negative log-likelihood, and then, given $\hat{\theta}_{\mathrm{ML},n}$, solves the assortment optimization problem via the linear program (7).

8.3. Performance Comparison

For a given $(N, K, d, n, p)$, we repeat the data generation process of Section 8.1 to randomly generate 50 offline datasets, and record the solutions of PASTA and of the baseline method on each. For the hyper-parameters, we set $\alpha_n = 2\hat{L}_{\mathrm{ML}}$, where $\hat{L}_{\mathrm{ML}} = \hat{L}_n(\hat{\theta}_{\mathrm{ML},n})$, and the maximum number of iterations $T = 30$. We measure performance with two metrics: (1) the average regret of the solutions, which indicates how far the performance of a solution is from the optimal performance (i.e., the revenue of $s^\ast$); and (2) the assortment accuracy of the solutions with respect to the optimal assortment $s^\ast$, defined as the ratio of the number of correctly chosen products to the number of products in $s^\ast$. The key results are summarized below.

Effect of Sample Size. We set $N = 40$, $K = 8$, $d = 16$, and $p = 0.9$, and gradually increase the sample size $n$. The results, presented in Figure 1, indicate that PASTA significantly outperforms the baseline method. While the performance of the baseline method improves with an increasing number of samples, the PASTA method maintains a regret that is less than 25% of that of the baseline. The same experiment repeated with more products ($N = 60$, $K = 15$) shows that the gain of the PASTA method is stable, as also presented in Figure 1.

Effect of the Probability of Sampling the Optimal Assortment. We set $N = 40$, $K = 8$, $d = 16$, $n = 150$, and let $p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. We also study the effect of $p$ with an increased total number of products ($N = 60$, $K = 15$). As can be seen in Figure 2, the gain of pessimistic assortment optimization is consistent and robust across values of $p$.

Effect of Feature Dimension. We set $N = 20$, $K = 5$, $p = 0.9$, $n = 150$, and let $d \in \{8, 20, 32, 64, 128\}$. In order to characterize the effect of the dimension $d$, we generate the $d$ elements of $\theta^\ast$ independently from Uniform$[-1, 1]$. The results are presented in Figure 3. While the regrets of both the baseline method and pessimistic assortment optimization increase with the feature dimension, the PASTA method maintains its performance gain as $d$ varies.

Figure 1. Performance comparison between PASTA and the baseline method with a varying number of samples ($n$). Left: average regret (lower is better); right: assortment accuracy (higher is better).

Figure 2. Comparison between PASTA and the baseline method with varying probability of the optimal assortment ($p$). Top row: $N = 40$; bottom row: $N = 60$.

Figure 3. Comparison between PASTA and the baseline method with increasing dimension of product features ($d$).

9. Conclusion

This work addresses the issue of insufficient data coverage in offline assortment optimization, a challenge that intensifies as the number of possible assortments grows quickly with the number of items $N$. We presented a framework of pessimistic assortment optimization and provided theoretical justification for our approach. We then performed an in-depth study of the multinomial logit (MNL) model and derived a finite-sample regret bound for pessimistic assortment optimization under this popular model. We presented an efficient algorithm to solve the pessimistic assortment optimization problem for MNL, and demonstrated significant improvements of our approach over the baseline method through extensive numerical studies.

Acknowledgements

Ethan X. Fang is partially supported by NSF DMS-2230795 and NSF DMS-2230797. Cong Shi is partially supported by an Amazon Research Award.

References

Aouad, A., Farias, V., and Levi, R. Assortment optimization under consider-then-choose choice models. Management Science, 67(6):3368-3386, 2021.

Aouad, A., Feldman, J., and Segev, D. The exponomial choice model for assortment optimization: an alternative to the MNL model? Management Science, 2022.

Bertsimas, D. and Kallus, N. From predictive to prescriptive analytics. Management Science, 66(3):1025-1044, 2020.

Caro, F. and Gallien, J. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53(2):276-292, 2007.

Chen, X., Wang, Y., and Zhou, Y. Dynamic assortment optimization with changing contextual information. Journal of Machine Learning Research, 2020.

Chen, X., Shi, C., Wang, Y., and Zhou, Y. Dynamic assortment planning under nested logit models. Production and Operations Management, 30(1):85-102, 2021a.
Chen, X., Wang, Y., and Zhou, Y. Optimal policy for dynamic assortment planning under multinomial logit models. Mathematics of Operations Research, 46(4):1639-1657, 2021b.

Davis, J. M., Gallego, G., and Topaloglu, H. Assortment planning under the multinomial logit model with totally unimodular constraint structures, 2013. URL https://people.orie.cornell.edu/jmd388/publications/MNLConstr.pdf. Working Paper, Cornell University, Ithaca, NY.

Désir, A., Goyal, V., Segev, D., and Ye, C. Constrained assortment optimization under the Markov chain based choice model. Management Science, 66(2):698-721, 2020.

Diaconis, P. and Saloff-Coste, L. Logarithmic Sobolev inequalities for finite Markov chains. The Annals of Applied Probability, 6(3):695-750, 1996.

Feldman, J. and Topaloglu, H. Bounding optimal expected revenues for assortment optimization under mixtures of multinomial logits. Production and Operations Management, 24(10):1598-1620, 2015.

Feng, Q., Shanthikumar, J. G., and Xue, M. Consumer choice models and estimation: A review and extension. Production and Operations Management, 31(2):847-867, 2022.

Flores, A., Berbeglia, G., and Van Hentenryck, P. Assortment optimization under the sequential multinomial logit model. European Journal of Operational Research, 273(3):1052-1064, 2019.

Fu, Z., Qi, Z., Wang, Z., Yang, Z., Xu, Y., and Kosorok, M. R. Offline reinforcement learning with instrumental variables in confounded Markov decision processes, 2022. URL https://arxiv.org/abs/2209.08666. Working Paper, Northwestern University, Evanston, IL.

Gallego, G. and Topaloglu, H. Constrained assortment optimization for the nested logit model. Management Science, 60(10):2583-2601, 2014.

Harsha, P. Communication complexity, 2011. URL https://www.tifr.res.in/~prahladh/teaching/2011-12/comm/lectures/l12.pdf. Lecture Notes, Tata Institute of Fundamental Research (TIFR), Mumbai, India.

Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge, UK, 2015.

Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5084-5096. PMLR, 2021.

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. MOReL: Model-based offline reinforcement learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 21810-21823. Curran Associates, Inc., 2020.

Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative Q-learning for offline reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020. Curran Associates Inc.

Li, S., Luo, Q., Huang, Z., and Shi, C. Online learning for constrained assortment optimization under Markov chain choice model, 2022. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4079753. Working Paper, University of Michigan, Ann Arbor, MI.

Liu, N., Ma, Y., and Topaloglu, H. Assortment optimization under the multinomial logit model with sequential offerings. INFORMS Journal on Computing, 32(3):835-853, 2020.

McFadden, D. Conditional logit analysis of qualitative choice behavior. Institute of Urban and Regional Development, Berkeley, CA, 1973.
McFadden, D. Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, pp. 198-272, 1981.

Mo, W. and Liu, Y. Efficient learning of optimal individualized treatment rules for heteroscedastic or misspecified treatment-free effect models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(2):440-472, 2022.

Owen, A. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90-120, 1990.

Pang, R. Totally unimodular matrices, 2017. URL https://theory.stanford.edu/~jvondrak/MATH233B-2017/lec3.pdf. Lecture Notes, Stanford University, Stanford, CA.

Qian, M. and Murphy, S. A. Performance guarantees for individualized treatment rules. The Annals of Statistics, 39(2):1180-1210, 2011.

Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55, 1983.

Rubin, D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

Rusmevichientong, P., Shen, M., and Shmoys, D. A PTAS for capacitated sum-of-ratios optimization. Operations Research Letters, 37:230-238, 2009.

Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666-1680, 2010.

Rusmevichientong, P., Sumida, M., and Topaloglu, H. Dynamic assortment optimization for reusable products with random usage durations. Management Science, 66(7):2820-2844, 2020.

Sauré, D. and Zeevi, A. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387-404, 2013.

Sen, B. A gentle introduction to empirical process theory and applications, 2018. URL http://www.stat.columbia.edu/~bodhi/Talks/Emp-Proc-Lecture-Notes.pdf. Lecture Notes, Columbia University, New York, NY.

Talluri, K. and van Ryzin, G. Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15-33, 2004.

van der Vaart, A. W. and Wellner, J. A. Weak Convergence and Empirical Processes: with Applications to Statistics. Springer, New York, NY, 1996.

Wang, Y., Chen, X., and Zhou, Y. Near-optimal policies for dynamic multinomial logit assortment selection models. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. MOPO: Model-based offline policy optimization. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 14129-14142. Curran Associates, Inc., 2020.

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106-1118, 2012.

A. Proof of Theoretical Results

Throughout the proofs, we use $\theta^\ast$ to denote the true parameter and $n$ to denote the number of samples. We use $\hat{L}_n$ to denote the empirical negative log-likelihood, i.e., $\hat{L}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log \pi_A(A_i \mid S_i; \theta)$, where $\{(S_i, A_i, R_i)\}_{i=1}^n$ is the offline dataset. We use $L$ to denote the population negative log-likelihood, i.e., $L(\theta) = -\mathbb{E}[\log \pi_A(A \mid S; \theta)]$, and $\hat{\theta}_{\mathrm{ML},n} \in \arg\min_{\theta \in \Theta} \hat{L}_n(\theta)$ to denote the MLE of $\theta^\ast$. The confidence region is $\Omega_n := \{\theta \in \Theta : \hat{L}_n(\theta) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) \leq \alpha_n\}$.
For a general pair of random variables $(X, Y)$, assume that the conditional probability density function of $Y$ given $X$ is parametrically modeled by $p(y \mid x; \theta)$. For technical reasons, we consider the following distances.

Definition A.1 (Squared Hellinger Distance).
$$h^2\big(p(\cdot \mid x; \theta_1), p(\cdot \mid x; \theta_2)\big) = \frac{1}{2} \int \Big(\sqrt{p(y \mid x; \theta_1)} - \sqrt{p(y \mid x; \theta_2)}\Big)^2 \, dy. \qquad (10)$$

Definition A.2 (Hellinger Distance).
$$h\big(p(\cdot \mid x; \theta_1), p(\cdot \mid x; \theta_2)\big) = \sqrt{h^2\big(p(\cdot \mid x; \theta_1), p(\cdot \mid x; \theta_2)\big)}. \qquad (11)$$

Definition A.3 (Generalized Squared Hellinger Distance).
$$H^2(\theta_1, \theta_2) = \mathbb{E}_X\Big[h^2\big(p(\cdot \mid X; \theta_1), p(\cdot \mid X; \theta_2)\big)\Big]. \qquad (12)$$

Definition A.4 (Generalized Hellinger Distance).
$$H(\theta_1, \theta_2) = \mathbb{E}_X\Big[\sqrt{h^2\big(p(\cdot \mid X; \theta_1), p(\cdot \mid X; \theta_2)\big)}\Big]. \qquad (13)$$

In our theoretical results, we take $p(y \mid x; \theta) = \pi_A(a \mid s; \theta)$, the conditional density of $A$ given $S$ (hereafter denoted $A \mid S$).

A.1. Proof of Lemma 5.2

Under Assumption 5.1, for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have for any $\theta \in \Omega_n$,
$$V(s^\ast; \theta^\ast) - V(s^\ast; \theta) \lesssim r_{s^\ast} C_{s^\ast} \sqrt{\alpha_n}, \qquad (14)$$
where $r_{s^\ast} := \max_{j \in s^\ast} r_j$ is the largest possible revenue among all items in the optimal assortment.

Proof of Lemma 5.2. For any $\theta$ such that $\hat{L}_n(\theta) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) \leq \alpha_n$, i.e., $\theta \in \Omega_n$, we have
$$\begin{aligned}
V(s^\ast; \theta^\ast) - V(s^\ast; \theta)
&\leq \big|V(s^\ast; \theta) - V(s^\ast; \theta^\ast)\big| \\
&\leq \big|V(s^\ast; \theta) - V(s^\ast; \hat{\theta}_{\mathrm{ML},n})\big| + \big|V(s^\ast; \hat{\theta}_{\mathrm{ML},n}) - V(s^\ast; \theta^\ast)\big| \\
&\leq 2 \max_{\theta' \in \Omega_n} \big|V(s^\ast; \theta') - V(s^\ast; \hat{\theta}_{\mathrm{ML},n})\big| \quad (\text{by Assumption 5.1 (II), } \theta^\ast \in \Omega_n).
\end{aligned}$$
With Lemma B.2, we have for any $\theta \in \Theta$,
$$\big|V(s^\ast; \theta) - V(s^\ast; \hat{\theta}_{\mathrm{ML},n})\big| \leq r_{s^\ast} C_{s^\ast}\, \mathbb{E}_S \big\|\pi_A(\cdot \mid S; \theta) - \pi_A(\cdot \mid S; \hat{\theta}_{\mathrm{ML},n})\big\|_1,$$
where $\|\cdot\|_1$ is the $\ell_1$-norm, $r_{s^\ast} := \max_{j \in s^\ast} r_j$, and $C_{s^\ast} = 1/\pi_S(s^\ast)$. In Lemma B.3, we establish that
$$\mathbb{E}_S\big\|\pi_A(\cdot \mid S; \theta) - \pi_A(\cdot \mid S; \hat{\theta}_{\mathrm{ML},n})\big\|_1 \lesssim \sqrt{H^2(\theta, \hat{\theta}_{\mathrm{ML},n})},$$
where $H^2$ is the generalized squared Hellinger distance (12) with $p(y \mid x; \theta) = \pi_A(a \mid s; \theta)$ the conditional density of $A \mid S$. Combining the two inequalities above, for any $\theta \in \Theta$,
$$\big|V(s^\ast; \theta) - V(s^\ast; \hat{\theta}_{\mathrm{ML},n})\big| \lesssim r_{s^\ast} C_{s^\ast} \sqrt{H^2(\theta, \hat{\theta}_{\mathrm{ML},n})}. \qquad (15)$$
In the following, we use the fact that $\log x \leq 2(\sqrt{x} - 1)$ for any $x \geq 0$ to show that for any $s \in \mathcal{S}$ and any $\theta$:
$$\begin{aligned}
-\int \pi_A(a \mid s; \theta^\ast) \log \frac{\pi_A(a \mid s; \theta)}{\pi_A(a \mid s; \theta^\ast)}\, da
&\geq -2 \int \pi_A(a \mid s; \theta^\ast) \left(\sqrt{\frac{\pi_A(a \mid s; \theta)}{\pi_A(a \mid s; \theta^\ast)}} - 1\right) da \\
&= \int \pi_A(a \mid s; \theta^\ast)\, da + \int \pi_A(a \mid s; \theta)\, da - 2 \int \sqrt{\pi_A(a \mid s; \theta)\, \pi_A(a \mid s; \theta^\ast)}\, da \\
&= \int \Big(\sqrt{\pi_A(a \mid s; \theta)} - \sqrt{\pi_A(a \mid s; \theta^\ast)}\Big)^2 da,
\end{aligned}$$
which, upon taking the expectation over $S$, implies that
$$L(\theta) - L(\theta^\ast) \geq 2 H^2(\theta, \theta^\ast). \qquad (16)$$
By Lemma B.4, we have for any $\theta \in \Omega_n$,
$$H^2(\theta, \hat{\theta}_{\mathrm{ML},n}) \leq 2 H^2(\theta^\ast, \theta) + 2 H^2(\theta^\ast, \hat{\theta}_{\mathrm{ML},n}) \leq \{L(\hat{\theta}_{\mathrm{ML},n}) - L(\theta^\ast)\} + \{L(\theta) - L(\theta^\ast)\}. \qquad (17)$$
From Assumption 5.1 (II), with probability at least $1 - \delta$, for any $\theta \in \Omega_n$,
$$\big| L(\theta) - L(\theta^\ast) - \{\hat{L}_n(\theta) - \hat{L}_n(\theta^\ast)\} \big| \leq \alpha_n.$$
In other words, under Assumption 5.1, with probability at least $1 - \delta$, for any $\theta \in \Omega_n$,
$$L(\theta) - L(\theta^\ast) \leq \big|\hat{L}_n(\theta) - \hat{L}_n(\theta^\ast)\big| + \alpha_n \leq 2\alpha_n. \qquad (18)$$
Plugging Eq. (18) into Eq. (17) (noting that $\hat{\theta}_{\mathrm{ML},n} \in \Omega_n$), we have with probability at least $1 - \delta$, for any $\theta \in \Omega_n$,
$$H^2(\theta, \hat{\theta}_{\mathrm{ML},n}) \leq 4\alpha_n. \qquad (19)$$
Combining the above inequality with Eq. (15), we have with probability at least $1 - \delta$,
$$\big|V(s^\ast; \theta) - V(s^\ast; \hat{\theta}_{\mathrm{ML},n})\big| \lesssim r_{s^\ast} C_{s^\ast} \sqrt{\alpha_n} \quad \text{for all } \theta \in \Omega_n.$$
This concludes the proof.
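As an aside, the Hellinger quantities used throughout these proofs are easy to compute exactly for MNL's discrete choice distributions; a small numeric sketch (ours, for intuition only, with arbitrary toy values):

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance (10) between two discrete distributions."""
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def mnl_dist(s, theta, X):
    """Full choice distribution over s + {no purchase} under MNL (5)."""
    util = np.exp(X[sorted(s)] @ theta)
    return np.append(util, 1.0) / (1.0 + util.sum())

# Generalized squared Hellinger distance (12): average h^2 over the
# logging distribution of assortments (two assortments, weights 0.7/0.3).
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, -0.3], [-0.4, 0.8]])  # row 0 unused
t1, t2 = np.array([0.3, -0.2]), np.array([0.1, 0.1])
assortments, weights = [{1, 2}, {1, 3}], [0.7, 0.3]
H2 = sum(w * hellinger_sq(mnl_dist(s, t1, X), mnl_dist(s, t2, X))
         for s, w in zip(assortments, weights))
print(H2)
```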
A.2. Proof of Lemma 6.1

Consider the MNL model (5) with a compact parameter space $\Theta$. Assume that $\theta^\ast \in \Theta$. For $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\hat{L}_n(\theta^\ast) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) \lesssim \frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}.$$

Proof of Lemma 6.1. Fix $0 < \delta < 1$ and let $\alpha_n = \frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}$. Define the oracle confidence set
$$\tilde{\Omega}_n := \{\theta \in \Theta : L(\theta) - L(\theta^\ast) \leq \alpha_n\}.$$
In particular, $\theta^\ast \in \tilde{\Omega}_n$. By Lemmas B.1 and A.2, with probability at least $1 - \delta/2$,
$$L(\hat{\theta}_{\mathrm{ML},n}) - L(\theta^\ast) \leq 2 C_A H^2(\hat{\theta}_{\mathrm{ML},n}, \theta^\ast) \lesssim \alpha_n,$$
that is, $\hat{\theta}_{\mathrm{ML},n} \in \tilde{\Omega}_n$. Define the function class
$$\mathcal{F}_n := \Big\{f_\theta = \log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)} : \theta \in \tilde{\Omega}_n\Big\}.$$
In particular, $\|f\|_\infty \leq 2 \log C_A$ for any $f \in \mathcal{F}_n$. Let $\|P_n - P\|_{\mathcal{F}_n} := \sup_{f \in \mathcal{F}_n} |(P_n - P)(f)|$. By Talagrand's inequality, with probability at least $1 - \delta/2$, for any $f \in \mathcal{F}_n$,
$$(P_n - P)(f) \lesssim \mathbb{E}\|P_n - P\|_{\mathcal{F}_n} + \sqrt{\frac{\log(2/\delta)}{n} \Big\{(\log C_A)\, \mathbb{E}\|P_n - P\|_{\mathcal{F}_n} + \sigma^2_{\mathcal{F}_n}\Big\}} + \frac{(\log C_A) \log(2/\delta)}{n}, \qquad (21)$$
where $\sigma^2_{\mathcal{F}_n} := \sup_{f \in \mathcal{F}_n} \mathbb{E}(f - \mathbb{E}f)^2$ is the variance term. For the variance term, we have
$$\sigma^2_{\mathcal{F}_n} \leq \sup_{\theta \in \tilde{\Omega}_n} \mathbb{E}\Big(\log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)}\Big)^2 \leq \sup_{\theta \in \tilde{\Omega}_n} \mathbb{E}\big[2 C_A h^2\big(\pi_A(\cdot \mid S; \theta^\ast), \pi_A(\cdot \mid S; \theta)\big)\big] \quad (\text{by Lemma B.5})$$
$$= 2 C_A \sup_{\theta \in \tilde{\Omega}_n} H^2(\theta^\ast, \theta) \leq C_A \sup_{\theta \in \tilde{\Omega}_n} \big[L(\theta) - L(\theta^\ast)\big] \quad (\text{by (16)}) \;\leq\; C_A \alpha_n \quad (\text{by definition of } \tilde{\Omega}_n).$$
For the expected envelope, our goal is to apply Sen (2018, Theorem 7.13), restated as Theorem B.6. Consider the covering number $N(\epsilon, \mathcal{F}_n, L_2(Q))$ for any $\epsilon > 0$ and finitely supported probability measure $Q$. By Lemma A.1, under the MNL model (5), $\mathcal{F}_n$ is a class of $L$-Lipschitz functions, for some $L < \infty$, with respect to the index space $(\Theta, \|\cdot\|_2)$. Then, in terms of the bracketing number $N_{[\,]}$ and the covering number $N$, for any $\epsilon > 0$ and probability measure $Q$,
$$N(\epsilon L, \mathcal{F}_n, L_2(Q)) \leq N_{[\,]}(2\epsilon L, \mathcal{F}_n, L_2(Q)) \leq N(\epsilon, \Theta, \|\cdot\|_2).$$
Since $\Theta \subseteq \mathbb{R}^d$ is compact, $N(\epsilon, \Theta, \|\cdot\|_2) \lesssim (1/\epsilon)^d$, and therefore $N(\epsilon, \mathcal{F}_n, L_2(Q)) \lesssim (L/\epsilon)^d$. By Theorem B.6,
$$\mathbb{E}\|P_n - P\|_{\mathcal{F}_n} \lesssim \sqrt{\frac{d}{n}\, \sigma^2_{\mathcal{F}_n} \log \frac{L}{\sigma_{\mathcal{F}_n}}} \,\vee\, \Big\{\frac{d}{n}\, (2 \log C_A) \log \frac{L}{\sigma_{\mathcal{F}_n}}\Big\} \lesssim \alpha_n.$$
Talagrand's inequality (21) then yields $(P_n - P)(f) \lesssim \alpha_n$ for all $f \in \mathcal{F}_n$. In particular, on the event $\hat{\theta}_{\mathrm{ML},n} \in \tilde{\Omega}_n$, which corresponds to $f_{\hat{\theta}_{\mathrm{ML},n}} \in \mathcal{F}_n$, we have
$$\hat{L}_n(\theta^\ast) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}) = -P_n f_{\hat{\theta}_{\mathrm{ML},n}} = -(P_n - P) f_{\hat{\theta}_{\mathrm{ML},n}} + \underbrace{L(\theta^\ast) - L(\hat{\theta}_{\mathrm{ML},n})}_{\leq 0} \lesssim \alpha_n.$$
This completes the proof.

Lemma A.1. Consider the MNL model:
$$\pi_A(i \mid s; \theta) = \frac{\exp(x_i^\top \theta)}{1 + \sum_{j \in s} \exp(x_j^\top \theta)}, \qquad \pi_A(0 \mid s; \theta) = \frac{1}{1 + \sum_{j \in s} \exp(x_j^\top \theta)}, \qquad \forall i \in s,\ s \in \mathcal{S},\ \theta \in \Theta.$$
Let $\theta_{\max} := \max_{\theta \in \Theta} \|\theta\|_2$ and $x_{\max} := \max_{j \in [N]} \|x_j\|_2$. If $\Theta$ is compact, that is, $\theta_{\max} < \infty$, then the log-likelihood ratio $\log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)}$ is a uniformly Lipschitz function of $\theta \in \Theta$.

Proof of Lemma A.1. With the convention $x_0 = 0$, direct differentiation gives
$$\Big\|\frac{\partial}{\partial \theta} \log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)}\Big\|_2 = \Big\|{-x_A} + \sum_{j \in S} \frac{\exp(x_j^\top \theta)}{1 + \sum_{k \in S} \exp(x_k^\top \theta)}\, x_j\Big\|_2 \leq x_{\max} + 2 N x_{\max} \exp(x_{\max} \theta_{\max}) =: L < \infty.$$
That is, $\theta \mapsto \log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)}$ is $L$-Lipschitz.

Lemma A.2 (Concentration of Parametric MLE in Hellinger Distance). Consider the MNL model (5). For $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$H^2(\hat{\theta}_{\mathrm{ML},n}, \theta^\ast) \lesssim \frac{d}{n} \log \frac{2\theta_{\max}}{\delta}.$$

Proof of Lemma A.2. This follows from Corollary 2 of Fu et al. (2022) as a special case, where our data are generated i.i.d. instead of forming a general Markov chain.

A.3. Proof of Lemma 6.2

Consider the MNL model (5). Suppose the conditions of Lemma 6.1 hold, and $L(\theta)$ and $\hat{L}_n(\theta)$ are uniformly strongly convex. Let $\alpha_n = \frac{C_A d}{n} \log \frac{2\theta_{\max}}{\delta}$. For $0 < \delta < 1$, with probability at least $1 - \delta$, we have for all $\theta \in \Omega_n$,
$$\big| L(\theta) - L(\theta^\ast) - \{\hat{L}_n(\theta) - \hat{L}_n(\theta^\ast)\} \big| \leq \alpha_n.$$

Proof of Lemma 6.2. Fix $0 < \delta < 1$. By the strong convexity of $L(\theta)$ and $\hat{L}_n(\theta)$, there exists a constant $\mu > 0$ such that for any $\theta \in \Theta$,
$$\mu \|\theta - \theta^\ast\|_2^2 \leq L(\theta) - L(\theta^\ast), \qquad \mu \|\theta - \hat{\theta}_{\mathrm{ML},n}\|_2^2 \leq \hat{L}_n(\theta) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n}).$$
By Lemma 6.1, with probability at least $1 - \delta/2$, we have $\hat{\theta}_{\mathrm{ML},n} \in \tilde{\Omega}_n$. Then for any $\theta \in \Omega_n$,
$$\begin{aligned}
\|\theta - \theta^\ast\|_2 &\leq \|\theta - \hat{\theta}_{\mathrm{ML},n}\|_2 + \|\hat{\theta}_{\mathrm{ML},n} - \theta^\ast\|_2 && (\text{by the triangle inequality}) \\
&\leq \frac{1}{\sqrt{\mu}} \sqrt{\hat{L}_n(\theta) - \hat{L}_n(\hat{\theta}_{\mathrm{ML},n})} + \frac{1}{\sqrt{\mu}} \sqrt{L(\hat{\theta}_{\mathrm{ML},n}) - L(\theta^\ast)} && (\text{by strong convexity}) \\
&\lesssim \sqrt{\alpha_n} && (\text{by } \theta \in \Omega_n \text{ and } \hat{\theta}_{\mathrm{ML},n} \in \tilde{\Omega}_n, \text{ respectively}).
\end{aligned}$$
The above implies that, with probability at least $1 - \delta/2$, $\Omega_n \subseteq \bar{\Omega}_n$, where $\bar{\Omega}_n$ is a ball centered at $\theta^\ast$ with radius of order $\sqrt{\alpha_n}$ (we suppress the constant):
$$\bar{\Omega}_n := \big\{\theta \in \Theta : \|\theta - \theta^\ast\|_2 \leq \sqrt{\alpha_n}\big\}.$$
Define $\bar{\mathcal{F}}_n := \Big\{f_\theta = \log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)} : \theta \in \bar{\Omega}_n\Big\}$.
Let $\|P_n - P\|_{\bar{\mathcal{F}}_n} := \sup_{f \in \bar{\mathcal{F}}_n} |(P_n - P)(f)|$ be the envelope. By Talagrand's inequality, with probability at least $1 - \delta/2$, for any $f \in \bar{\mathcal{F}}_n$,
$$|(P_n - P)(f)| \lesssim \mathbb{E}\|P_n - P\|_{\bar{\mathcal{F}}_n} + \sqrt{\frac{\log(2/\delta)}{n} \Big\{(\log C_A)\, \mathbb{E}\|P_n - P\|_{\bar{\mathcal{F}}_n} + \sigma^2_{\bar{\mathcal{F}}_n}\Big\}} + \frac{(\log C_A) \log(2/\delta)}{n}. \qquad (23)$$
For the variance term, we have
$$\sigma^2_{\bar{\mathcal{F}}_n} := \sup_{f \in \bar{\mathcal{F}}_n} \mathbb{E}(f - \mathbb{E}f)^2 \leq \sup_{\theta \in \bar{\Omega}_n} \mathbb{E}\Big(\log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)}\Big)^2 \lesssim \sup_{\theta \in \bar{\Omega}_n} \|\theta - \theta^\ast\|_2^2 \quad (\text{by Lipschitzness, Lemma A.1}) \;\leq\; \alpha_n \quad (\text{by definition of } \bar{\Omega}_n).$$
For the expected envelope, by Theorem B.6, we further have
$$\mathbb{E}\|P_n - P\|_{\bar{\mathcal{F}}_n} \lesssim \sqrt{\frac{d}{n}\, \sigma^2_{\bar{\mathcal{F}}_n} \log \frac{L}{\sigma_{\bar{\mathcal{F}}_n}}} \,\vee\, \Big\{\frac{d}{n}\, (2 \log C_A) \log \frac{L}{\sigma_{\bar{\mathcal{F}}_n}}\Big\} \lesssim \sqrt{\frac{d}{n}\, \alpha_n} \lesssim \alpha_n.$$
Therefore, Talagrand's inequality (23) gives $|(P_n - P)(f)| \lesssim \alpha_n$ for any $f \in \bar{\mathcal{F}}_n$. In other words, with probability at least $1 - \delta$, on the event $\Omega_n \subseteq \bar{\Omega}_n$, which corresponds to $\{f_\theta\}_{\theta \in \Omega_n} \subseteq \bar{\mathcal{F}}_n$, we have for all $\theta \in \Omega_n$,
$$\big| L(\theta) - L(\theta^\ast) - \{\hat{L}_n(\theta) - \hat{L}_n(\theta^\ast)\} \big| \leq \sup_{f \in \bar{\mathcal{F}}_n} |(P_n - P)(f)| \lesssim \alpha_n.$$

B. Technical Lemmas

Lemma B.1. Suppose $C_A > 2$ and $C_A \geq 1/\pi_A(i \mid s; \theta)$ for all $\theta \in \Theta$, $s \in \mathcal{S}$, and $i \in s$. Then for any $\theta \in \Theta$,
$$|L(\theta) - L(\theta^\ast)| \leq 2 C_A H^2(\theta, \theta^\ast). \qquad (24)$$

Proof of Lemma B.1. By definition,
$$|L(\theta^\ast) - L(\theta)| = \Big| \mathbb{E}\Big\{ \mathbb{E}_A \Big[\log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)} \,\Big|\, S\Big] \Big\} \Big|.$$
In particular, for a fixed $s$, we have
$$\begin{aligned}
\mathbb{E}\Big[\log \frac{\pi_A(A \mid S; \theta^\ast)}{\pi_A(A \mid S; \theta)} \,\Big|\, S = s\Big]
&= \mathrm{KL}\big(\pi_A(\cdot \mid s; \theta^\ast) \,\big\|\, \pi_A(\cdot \mid s; \theta)\big) \\
&\leq \log(C_A - 1)\, \Big\{1 - \big[1 - h^2\big(\pi_A(\cdot \mid s; \theta^\ast), \pi_A(\cdot \mid s; \theta)\big)\big]^2\Big\} \\
&\qquad (\text{by the log-Sobolev inequality, Diaconis \& Saloff-Coste (1996, Theorem A.1)}) \\
&= \log(C_A - 1)\, \big\{2 h^2 - (h^2)^2\big\} \quad \text{with } h^2 = h^2\big(\pi_A(\cdot \mid s; \theta^\ast), \pi_A(\cdot \mid s; \theta)\big) \\
&\leq 2 C_A\, h^2\big(\pi_A(\cdot \mid s; \theta^\ast), \pi_A(\cdot \mid s; \theta)\big).
\end{aligned}$$
Therefore,
$$|L(\theta^\ast) - L(\theta)| \leq \mathbb{E}\big[2 C_A h^2\big(\pi_A(\cdot \mid S; \theta^\ast), \pi_A(\cdot \mid S; \theta)\big)\big] = 2 C_A H^2(\theta^\ast, \theta).$$

Lemma B.2. Let $C_{s^\ast} := 1/\pi_S(s^\ast)$ and $r_{s^\ast} := \max_{j \in s^\ast} r_j$. Then the following inequality holds for any $\theta_1, \theta_2 \in \Theta$:
$$\big|V(s^\ast; \theta_1) - V(s^\ast; \theta_2)\big| \leq r_{s^\ast} C_{s^\ast}\, \mathbb{E}_S \big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1,$$
where $\|\cdot\|_1$ denotes the $\ell_1$-norm.

Proof of Lemma B.2.
$$\begin{aligned}
\big|V(s^\ast; \theta_1) - V(s^\ast; \theta_2)\big|
&= \Big|\mathbb{E}_S\Big[\frac{\mathbb{1}(S = s^\ast)}{\pi_S(s^\ast)} \sum_{i \in s^\ast} r_i \big(\pi_A(i \mid S; \theta_1) - \pi_A(i \mid S; \theta_2)\big)\Big]\Big| \\
&\leq \Big\|\frac{\mathbb{1}(S = s^\ast)}{\pi_S(S)}\Big\|_\infty\, \mathbb{E}_S \Big|\sum_{i \in s^\ast} r_i \big(\pi_A(i \mid S; \theta_1) - \pi_A(i \mid S; \theta_2)\big)\Big| \\
&\leq r_{s^\ast} C_{s^\ast}\, \mathbb{E}_S \sum_{i \in s^\ast} \big|\pi_A(i \mid S; \theta_1) - \pi_A(i \mid S; \theta_2)\big| \\
&\leq r_{s^\ast} C_{s^\ast}\, \mathbb{E}_S \big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1,
\end{aligned}$$
where the first equality uses the inverse-propensity representation of the value (cf. Eq. (1)), the second step uses Hölder's inequality, and $\|\mathbb{1}(S = s^\ast)/\pi_S(S)\|_\infty = C_{s^\ast}$ because $\mathbb{1}(S = s^\ast)/\pi_S(S)$ is zero everywhere except at $S = s^\ast$. The last step follows from the definition of the $\ell_1$-norm.

Lemma B.3. For any $\theta_1, \theta_2 \in \Theta$,
$$\mathbb{E}_S \big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1 \lesssim \sqrt{H^2(\theta_1, \theta_2)}.$$

Proof of Lemma B.3. We first use the facts that (1) $\|\cdot\|_1 = 2\,\mathrm{TV}$, where $\mathrm{TV}$ is the total variation distance, and (2) $\mathrm{TV} \leq \sqrt{2}\, h$ (Harsha, 2011), where $h$ is the Hellinger distance, to obtain that for any $s \in \mathcal{S}$,
$$\big\|\pi_A(\cdot \mid s; \theta_1) - \pi_A(\cdot \mid s; \theta_2)\big\|_1 \leq 2\sqrt{2}\, h\big(\pi_A(\cdot \mid s; \theta_1), \pi_A(\cdot \mid s; \theta_2)\big). \qquad (31)$$
Squaring both sides of (31) gives
$$\big\|\pi_A(\cdot \mid s; \theta_1) - \pi_A(\cdot \mid s; \theta_2)\big\|_1^2 \leq 8\, h^2\big(\pi_A(\cdot \mid s; \theta_1), \pi_A(\cdot \mid s; \theta_2)\big).$$
Taking expectations with respect to $S$ on both sides,
$$\mathbb{E}_S\Big[\big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1^2\Big] \leq 8\, H^2(\theta_1, \theta_2).$$
By Jensen's inequality,
$$\mathbb{E}_S \big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1 \leq \sqrt{\mathbb{E}_S\Big[\big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1^2\Big]}.$$
This implies that
$$\mathbb{E}_S \big\|\pi_A(\cdot \mid S; \theta_1) - \pi_A(\cdot \mid S; \theta_2)\big\|_1 \lesssim \sqrt{H^2(\theta_1, \theta_2)}.$$

Lemma B.4 (Properties of $H$ and $H^2$). For any $\theta_1, \theta_2, \theta_3 \in \Theta$, the following inequalities hold:
$$H(\theta_1, \theta_2) \leq H(\theta_1, \theta_3) + H(\theta_2, \theta_3); \qquad [H(\theta_1, \theta_2)]^2 \leq H^2(\theta_1, \theta_2) \leq H(\theta_1, \theta_2); \qquad H^2(\theta_1, \theta_2) \leq 2 H^2(\theta_1, \theta_3) + 2 H^2(\theta_2, \theta_3).$$

Proof of Lemma B.4.
For ease of notation, for $i = 1, 2, 3$, we write $p_i$ for $\pi_A$ parametrized by $\theta_i$, i.e., $p_i(a \mid s) = \pi_A(a \mid s; \theta_i)$.

(1) For any $s \in \mathcal{S}$, $\sqrt{h^2(p_1(\cdot \mid s), p_2(\cdot \mid s))}$ is the ordinary Hellinger distance, which satisfies the triangle inequality. Hence
$$\sqrt{h^2(p_1(\cdot \mid s), p_2(\cdot \mid s))} \leq \sqrt{h^2(p_1(\cdot \mid s), p_3(\cdot \mid s))} + \sqrt{h^2(p_2(\cdot \mid s), p_3(\cdot \mid s))}. \qquad (32)$$
Taking expectations of both sides of (32) with respect to $S$ and using the definition of $H$ gives $H(\theta_1, \theta_2) \leq H(\theta_1, \theta_3) + H(\theta_2, \theta_3)$.

(2) The first inequality follows from Jensen's inequality:
$$\Big(\mathbb{E}\sqrt{h^2(p_1(\cdot \mid S), p_2(\cdot \mid S))}\Big)^2 \leq \mathbb{E}\Big[h^2(p_1(\cdot \mid S), p_2(\cdot \mid S))\Big]. \qquad (33)$$
For the second inequality, write
$$H(\theta_1, \theta_2) - H^2(\theta_1, \theta_2) = \mathbb{E}_S\big[h(p_1(\cdot \mid S), p_2(\cdot \mid S)) - h^2(p_1(\cdot \mid S), p_2(\cdot \mid S))\big] = \mathbb{E}_S\big[\{1 - h(p_1(\cdot \mid S), p_2(\cdot \mid S))\}\, h(p_1(\cdot \mid S), p_2(\cdot \mid S))\big].$$
For any $s$, $1 - h(p_1(\cdot \mid s), p_2(\cdot \mid s)) \geq 0$ because the Hellinger distance is at most 1, and $h(p_1(\cdot \mid s), p_2(\cdot \mid s)) \geq 0$ because it is a metric. Hence the expectation of this non-negative function is non-negative, and the inequality follows.

(3) For any real $a, b, c$, we have $(a - b)^2 \leq 2(a - c)^2 + 2(b - c)^2$. With this fact, for any $s$,
$$\begin{aligned}
h^2(p_1(\cdot \mid s), p_2(\cdot \mid s)) &= \frac{1}{2}\int \Big(\sqrt{p_1(a \mid s)} - \sqrt{p_2(a \mid s)}\Big)^2 da \\
&\leq \frac{1}{2}\int \Big[2\Big(\sqrt{p_1(a \mid s)} - \sqrt{p_3(a \mid s)}\Big)^2 + 2\Big(\sqrt{p_2(a \mid s)} - \sqrt{p_3(a \mid s)}\Big)^2\Big] da \\
&= 2 h^2(p_1(\cdot \mid s), p_3(\cdot \mid s)) + 2 h^2(p_2(\cdot \mid s), p_3(\cdot \mid s)).
\end{aligned}$$
Taking expectations over $S$ yields $H^2(\theta_1, \theta_2) \leq 2 H^2(\theta_1, \theta_3) + 2 H^2(\theta_2, \theta_3)$.

Lemma B.5 (Log-Density Ratio Variance Bound). Suppose $X \sim p$ is an $\mathbb{R}$-valued random variable with probability density function $p$, and $p_1, p_2$ are two other probability density functions for $X$ such that $p_1$ and $p_2$ are uniformly bounded from below by $C^{-1}$ on the support of $p$. Then we have
$$\mathbb{E}\Big(\log \frac{p_1(X)}{p_2(X)}\Big)^2 \leq 2 C h^2(p_1, p_2),$$
where $h^2$ is the squared Hellinger distance in (10).

Proof of Lemma B.5. By $\log(x) \leq 2(\sqrt{x} - 1)$ for any $x \geq 0$, applied to both $x = p_1/p_2$ and $x = p_2/p_1$,
$$\mathbb{E}\Big(\log \frac{p_1(X)}{p_2(X)}\Big)^2 \leq 4 \int \max\left\{\Big(1 - \sqrt{\frac{p_2(x)}{p_1(x)}}\Big)^2, \Big(1 - \sqrt{\frac{p_1(x)}{p_2(x)}}\Big)^2\right\} p(x)\, dx = 4 \int \Big(\sqrt{p_1(x)} - \sqrt{p_2(x)}\Big)^2 \max\left\{\frac{1}{p_1(x)}, \frac{1}{p_2(x)}\right\} p(x)\, dx \leq 2 C h^2(p_1, p_2).$$

Theorem B.6 (Sen (2018, Theorem 7.13)). Let $\mathcal{F}$ be a measurable function class such that $\sup_{f \in \mathcal{F}} \|f\|_\infty \leq f_{\max}$ for some constant $f_{\max} < \infty$. Assume that for $A \geq e f_{\max}$, $d \geq 2$, $0 \leq \epsilon \leq f_{\max}$, and every finitely supported probability measure $Q$, the covering number (Sen, 2018) satisfies
$$N(\epsilon, \mathcal{F}, L_2(Q)) \lesssim \Big(\frac{A}{\epsilon}\Big)^d. \qquad (41)$$
Let $\sigma^2_{\mathcal{F}} := \sup_{f \in \mathcal{F}} \mathbb{E}(f - \mathbb{E}f)^2$. Then we have
$$\mathbb{E}\|P_n - P\|_{\mathcal{F}} \lesssim \sqrt{\frac{d}{n}\, \sigma^2_{\mathcal{F}} \log \frac{A}{\sigma_{\mathcal{F}}}} \,\vee\, \Big\{\frac{d}{n}\, f_{\max} \log \frac{A}{\sigma_{\mathcal{F}}}\Big\}.$$

Proof of Theorem B.6. In this proof, we denote by $X$ the underlying random variable and by $\{X_i\}_{i=1}^n$ its $n$ i.i.d. copies, and for any $f \in \mathcal{F}$, $P_n(f) := \frac{1}{n}\sum_{i=1}^n f(X_i)$ and $P(f) := \mathbb{E}[f(X)]$. Without loss of generality, assume that $0 \in \mathcal{F}$ and $P(f) = 0$ for any $f \in \mathcal{F}$. Let $\{\epsilon_i\}_{i=1}^n$ be i.i.d. Rademacher random variables independent of $\{X_i\}_{i=1}^n$. By symmetrization,
$$\mathbb{E}\|P_n - P\|_{\mathcal{F}} \leq 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(X_i)\Big|. \qquad (42)$$
Conditional on $\{X_i\}_{i=1}^n$, by Dudley's entropy bound,
$$\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n \epsilon_i f(X_i)\Big| \lesssim \int_0^{\sigma_{\mathcal{F},n}} \sqrt{\log N(u, \mathcal{F}, L_2(P_n))}\, du, \qquad (43)$$
In particular, logp A{σF,nq ě logp A{fmaxq ě 1 by assumption, which satisfies the condition for Lemma B.7. Combining (42), (43) and (44), we have σ2 F,n log A σF,n 1 2Epσ2 F,nq log A2 by the concavity of u ÞÑ where E takes expectation with respect to t Xiun i 1. Notice that Epσ2 F,nq E sup f PF Pnpf 2q ď E sup f PF ! |p Pn Pqpf 2q| Ppf 2q ) ď E}Pn P}F2 σ2 F, where we define F2 . tf 2 : f P Fu. We aim to apply (42), (43), (44), (45) to F2. Notice that supf 2PF2 }f 2}8 ď f 2 max, σ2 F2,n . supf 2PF2 }f 2}2 L2p Pnq ď f 2 max supf PF }f}2 L2p Pnq f 2 maxσ2 F,n. By }f 2 g2}L2p Pnq a Pnrpf gq2pf gq2s ď Pnpf gq2 2fmax}f g}2 L2p Pnq for any f, g P F, we further have Np2fmaxϵ, F2, L2p Pnqq ď Npϵ, F, L2p Pnqq À ˆA Therefore, applying (42), (43), (44) and (45) to F2, we have E}Pn P}F2 À 1 2f 2max Epσ2 F,nq log 4A2 Define B . b 1 2Epσ2 F,nq log A2 Epσ2 F,nq. Then we have Epσ2 F,nq σ2 F À By u ÞÑ u log A2 u is non-decreasing on u P p0, A2{es and non-increasing on u P r A2{e, 8q, we have 2Epσ2 F,nq log A2 Epσ2 F,nq À 1 log A2 σ2 F b d nfmax B A2 In particular, B ď b 2e , σ2 F ď f 2 max ď A2{e2 ă A2{e. Then the cap A2{e is inactive as d{n Ñ 0 asymptotically. Therefore, In particular, B is bounded by both roots of the corresponding quadratic equation: d nfmax log A 2 4σ2 F log A d nfmax log A Pessimistic Assortment Optimization Combined with (45), we further have d nσ2 F log A nfmax log A Lemma B.7. Suppose a, A ą 0 such that logp A{aq ě 1. Then we have Proof of Lemma B.7. Define u du, a ą 0; Then f is continuous at 0. Moreover, for a ą 0, we have which is nonnegative if logp A{aq ě 1. As a Ñ 0 , we further have u du pby integration-by-partq lim inf aÑ0 fpaq Therefore, for any a ě 0, we have fpaq ě 0, which concludes the proof.