# Statistical Properties of Robust Satisficing

Zhiyi Li 1, Yunbei Xu 2, Ruohan Zhan 3 4

Abstract

The Robust Satisficing (RS) model is an emerging approach to robust optimization, offering streamlined procedures and robust generalization across various applications. However, the statistical theory of RS remains unexplored in the literature. This paper fills the gap by comprehensively analyzing the theoretical properties of the RS model. Notably, the RS structure offers a more straightforward path to deriving statistical guarantees than the seminal Distributionally Robust Optimization (DRO), resulting in a richer set of results. In particular, we establish two-sided confidence intervals for the optimal loss without the need to solve a minimax optimization problem explicitly. We further provide finite-sample generalization error bounds for the RS optimizer. Importantly, our results extend to scenarios involving distribution shifts, where discrepancies exist between the sampling and target distributions. Our numerical experiments show that the RS model consistently outperforms the baseline empirical risk minimization in small-sample regimes and under distribution shifts. Furthermore, compared to the DRO model, the RS model exhibits lower sensitivity to hyperparameter tuning, highlighting its practicability for robustness considerations.

1 School of Mathematical Sciences, Peking University, Beijing, China. 2 Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA. 3 Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong SAR. 4 HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China. Correspondence to: Ruohan Zhan.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Robust methods are optimization techniques that guarantee performance even when environments vary slightly (Ronchetti, 2021). These methods are resilient against variations or uncertainties, ensuring consistent and reliable outcomes. The robustness they provide is particularly valuable when limited sample sizes may not fully capture the underlying distribution, or when the target environment differs from the sampling distribution. The applications of robust methods span various domains. In machine learning, they are used to enhance the robustness of algorithms, ensuring strong performance even under adversarial attacks on the input data (Blanchet et al., 2019; Sim et al., 2021). In energy systems, they are adopted to optimize operation and planning, including bidding strategies in electricity markets, operation scheduling of power systems, and integration of renewable energy (Li et al., 2023; Huang et al., 2023). In supply chains, they are employed to optimize production planning, inventory management, logistics, and transportation (Chen & Chen, 2023; Deng et al., 2023; Wang et al., 2023). These examples represent only a fraction of the wide-ranging applications of robust methods. In fact, robust methods can be applied to any field that involves optimization problems, making them a vital tool for decision-making under uncertainty.
Among various robust methods, Distributionally Robust Optimization (DRO) is a pivotal approach (Hu & Hong, 2013; Bayraksan & Love, 2015; Esfahani & Kuhn, 2015). DRO's significance lies in its more robust handling of ambiguity compared to conventional stochastic programming models. This is achieved by optimizing the worst-case performance over a set of potential distributions rather than for a single distribution. Specifically, the DRO problem is formulated as follows:

$$\min_{x \in \mathcal{X}} \; \max_{P \in \mathcal{P}_r} \; \mathbb{E}_P[h(x, \xi)], \qquad (1)$$

where $\mathcal{P}_r = \{P \in \mathcal{P} : d(P, \hat{P}_N) \le r\}$. Above, $x$ represents the decision variable, which is contained in a non-empty decision space $\mathcal{X}$, and $\xi$ is a random variable. The function $h(x, \xi)$ denotes the loss associated with $x$ and $\xi$. $\hat{P}_N$ is the empirical distribution derived from the data (other nominal distributions are also viable; for example, when a parametric distributional class is available, the distribution estimated by maximum likelihood can serve as a substitute for $\hat{P}_N$). The function $d(\cdot, \cdot)$ is a distance measure quantifying discrepancies between distributions. The hyperparameter $r$, referred to as the radius, defines the ambiguity set $\mathcal{P}_r$, a subset of $\mathcal{P}$ that encompasses all feasible distributions. It plays a crucial role in controlling robustness: the larger the value of $r$, the greater the robustness demanded.

Despite its strengths, DRO has a few shortcomings. First, DRO can be over-conservative in practice, as Esfahani & Kuhn pointed out, because the framework optimizes the worst-case scenario over a distribution set that may be unnecessarily large for incorporating the target distribution. Second, selecting an appropriate and interpretable radius $r$ is a challenging task in practice, as noted by Sim et al. This difficulty stems from the abstract nature of the radius, which characterizes a distance in distribution space and is hard to translate intuitively into tangible, real-world values. In addition, there is a growing demand for incorporating globalized distributions, as opposed to restricting to an ambiguity set under the DRO framework, to further increase robustness (Liu et al., 2023).

To address these issues, the Robust Satisficing (RS) model has been proposed (Long et al., 2023), structured as follows:

$$k_\tau = \min_{x \in \mathcal{X},\, k \ge 0} \; k \quad \text{s.t.} \quad \mathbb{E}_P[h(x, \xi)] \le \tau + k\, d(P, \hat{P}_N) \quad \forall P \in \mathcal{P}. \qquad (2)$$

Here, the hyperparameter is no longer the radius $r$ of the ambiguity set, but a reference value $\tau$, which can be interpreted as an anticipated cost in practical applications. The constraint in (2) ensures that once the expected loss under a certain distribution exceeds the anticipated cost, the excess is not too large: it is controlled by a multiple of that distribution's distance to the empirical distribution $\hat{P}_N$. Hence, the RS model trades some training-set performance for robustness on the target distribution, as it does not aim to minimize the empirical loss. Unlike DRO, which focuses on worst-case optimization, RS follows a satisficing strategy to avoid over-conservatism, thereby providing better generalization performance on the target distribution. Another key aspect of the RS model is its global consideration of probability distributions, unlike DRO's restriction to an ambiguity set.
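To make the two formulations concrete, the following sketch contrasts them on a toy problem in which the candidate distributions form a small finite family with known expected losses and known distances to the empirical distribution. This is purely illustrative and not from the paper; the family, the distances, and the losses are hypothetical placeholders.

```python
import numpy as np

# Hypothetical toy setup: candidate distributions P_1..P_K, each summarized by
# its expected loss under a fixed decision x and its distance d(P_k, P_hat_N).
expected_loss = np.array([0.9, 1.1, 1.4, 2.0])   # E_{P_k}[h(x, xi)]
distance      = np.array([0.0, 0.2, 0.5, 1.5])   # d(P_k, P_hat_N); index 0 is P_hat_N itself

def dro_worst_case(r):
    """DRO objective (1) for fixed x: worst expected loss over the radius-r ambiguity set."""
    return expected_loss[distance <= r].max()

def rs_fragility(tau):
    """RS value for fixed x: smallest k with E_P[h] <= tau + k * d(P, P_hat_N) for every candidate P."""
    if expected_loss[distance == 0].max() > tau:
        return np.inf                          # infeasible: tau is below the empirical loss, cf. (4)
    pos = distance > 0
    return (np.maximum(expected_loss[pos] - tau, 0.0) / distance[pos]).max()

print(dro_worst_case(r=0.5))     # worst case inside the radius-0.5 ball
print(rs_fragility(tau=1.0))     # smallest feasible multiplier k for reference value tau = 1.0
```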
Current research on the RS model primarily centers on forming tractable new optimization models and on experimental analysis. Notable examples of tractable RS models encompass risk-based linear optimization and linear optimization with recourse (Sim, 2023; Long et al., 2023), illustrating RS's practical optimization and generalization advantages over DRO models. Ruan et al. proposed Robust Satisficing Markov Decision Processes and demonstrated their superiority over traditional robust MDPs through experiments. Saday et al. proposed the Robust Bayesian Satisficing model, established an upper bound on regret, and outperformed Distributionally Robust Bayesian Optimization in experiments. Despite RS's notable results in the realm of optimization, to the best of our knowledge there are no existing studies on the statistical properties of the RS model. This gap leads to the central research questions of this paper: Are there statistical guarantees for the RS model? What statistical merits does it hold, potentially surpassing DRO?

1.1. Contributions

Our work delves into the statistical theory of the RS model, with a focus on deriving and analyzing its statistical properties. In particular, we provide a two-sided confidence interval for the optimal loss using the reference value, and present a non-asymptotic upper bound on the generalization error. These results fill a crucial gap in the literature, where statistical guarantees for the RS model have been seldom studied. Notably, our results extend beyond cases where the sampling distribution matches the target distribution of interest (a setting where robustness remains relevant due to potential discrepancies between the empirical and sampling distributions, especially in small-sample regimes); we also consider scenarios involving distribution shifts, where disparities exist between the sampling distribution and the target distribution. We highlight the contributions of this paper as follows:

1) We obtain two-sided, non-asymptotic confidence intervals for the optimal loss $J^\star$ in the RS model, where $J^\star$ is the minimum expected loss under the true distribution. Notably, this result does not require solving a minimax optimization problem explicitly.

2) We present finite-sample generalization error bounds for the optimizer derived from the RS model, achieved through an insightful and succinct derivation.

3) We demonstrate that, even under distribution shifts, our key findings (confidence intervals and generalization error bounds for the RS optimizer) remain valid. These results incorporate an additional term, a finite multiple of the distance between the sampling and target distributions. This adaptation highlights the RS model's robust generalization abilities.

4) Our numerical experiments reveal that the RS model's advantages over the empirical risk minimization baseline become more pronounced in small-sample regimes or with increasing distribution shifts. Furthermore, our analysis reveals the relationship between the RS and DRO models under Lipschitz loss scenarios, which also highlights that the RS model has lower sensitivity to hyperparameter tuning than DRO.

In all these aspects, we perform an extensive comparison with DRO. Our analysis reveals that these advantageous properties are closely associated with the inherent structure of the RS model itself. It becomes evident that obtaining statistical guarantees is more straightforward within the RS framework than within the DRO framework.
2. Problem Setup

We start by describing our learning problem. Let $\xi \in \Xi$ be the $m$-dimensional random variable of observations and $x \in \mathcal{X}$ be the decision variable to be learned. Let $h(x, \xi)$ be the loss function, which can accommodate a wide range of machine learning problems as detailed in Appendix D. We use $J^\star$ to denote the minimum expected loss under the optimal decision variable $x^\star$:

$$J^\star := \inf_{x \in \mathcal{X}} \mathbb{E}_{P^\star}[h(x, \xi)] = \mathbb{E}_{P^\star}[h(x^\star, \xi)]. \qquad (3)$$

Given $N$ observations $\{\xi_i\}_{i=1}^N$ sampled from the distribution $P^\star$, the decision maker wants to learn a decision variable such that the expected loss is minimized. Consider the Robust Satisficing (RS) model (2); we focus on the Wasserstein distance as the distance measure between distributions. Here, $\tau$ is the reference value, which can be interpreted as the anticipated cost in practical applications, and its choice is discussed further in Section 2.1. $\mathcal{P}$ is the set of all feasible distributions, on which the RS model does not impose any constraint; this allows the RS model to consider probability distributions globally. $\hat{P}_N$ denotes the empirical distribution of the samples $\{\xi_i\}_{i=1}^N$, which converges to $P^\star$ as the sample size goes to infinity. Finally, $d_W$ denotes the type-1 Wasserstein distance between two distributions (we follow the literature (Long et al., 2023) and consider the Wasserstein distance instead of an $f$-divergence to avoid requiring that $P$ be absolutely continuous with respect to $\hat{P}_N$, which is impractical for continuous distributions):

$$d_W(Q_1, Q_2) := \inf_{\Pi} \int_{\Xi \times \Xi} c(\xi_1, \xi_2)\, \Pi(d\xi_1, d\xi_2),$$

where $\Pi$ is a joint distribution over $(\xi_1, \xi_2)$ whose marginal distributions on $\xi_1$ and $\xi_2$ are $Q_1$ and $Q_2$, respectively; the cost function $c(\cdot, \cdot)$ is chosen as the type-1 version, $c(x, y) = \|x - y\|_2$. Let $\hat{x}_N$ be the solution derived from the RS model (2), whose reformulation is elaborated in Section 2.2. Our goal is to provide statistical guarantees on $\hat{x}_N$ and $J^\star$.

2.1. Reference Value τ

The reference value $\tau$ introduced in the RS model (2) is critical in controlling the robustness of the learned solution $\hat{x}_N$. Conceptually, following the satisficing criterion, the RS model ensures that any excess beyond the reference value $\tau$ under a given distribution is controlled by a multiple of the distance between that distribution and the empirical distribution of the data. A larger $\tau$ indicates increased robustness in the RS model. Choosing $P = \hat{P}_N$ in (2), we immediately obtain

$$\tau \ge \mathbb{E}_{\hat{P}_N}[h(x, \xi)]. \qquad (4)$$

Inspired by this, Long et al. suggest choosing $\tau$ as

$$\tau_\epsilon := (1 + \epsilon) \inf_{x} \mathbb{E}_{\hat{P}_N}[h(x, \xi)], \qquad (5)$$

where $\epsilon$ is referred to as the tolerance rate: the relative excess in empirical loss that the RS model allows. In other words, the reference value we choose (or tolerate) is a fraction $\epsilon$ larger than the smallest cost achievable under the empirical distribution. We adopt this approach, focusing on characterizing the role of $\epsilon$ in the statistical guarantees provided by the RS model. Additionally, $\epsilon$ is the primary hyperparameter we adjust and analyze in the numerical experiments.
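As a concrete illustration of the choice (5), the following sketch computes $\tau_\epsilon$ for an $\ell_1$-loss linear model fit by empirical risk minimization. It is a minimal example on assumed toy data, not the paper's code; the optimizer choice (Powell) is simply one derivative-free option for the non-smooth $\ell_1$ objective.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Assumed toy data: linear model y = u @ x_true + noise (mirrors Section 5's setup).
N, d = 50, 2
x_true = np.array([2.0, 1.0])
U = rng.normal(0.5, np.sqrt(0.5), size=(N, d))
y = U @ x_true + rng.normal(0.0, np.sqrt(0.1), size=N)

def empirical_loss(x):
    # E_{P_hat_N}[h(x, xi)] with the l1 loss h((u, y), x) = |y - u @ x|.
    return np.mean(np.abs(y - U @ x))

# ERM: inf_x E_{P_hat_N}[h(x, xi)], solved with a derivative-free method (the l1 loss is non-smooth).
erm = minimize(empirical_loss, x0=np.zeros(d), method="Powell")

eps = 0.2                              # tolerance rate
tau_eps = (1.0 + eps) * erm.fun        # reference value as in (5)
print(erm.x, tau_eps)
```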
2.2. Reformulation

The original RS optimization (2) requires enumerating over all possible distributions in $\mathcal{P}$, which may not be tractable. We now reformulate model (2), following the practice of Long et al. Let $\eta$ and $\xi$ be samples from $P$ and $\hat{P}_N$ respectively, and let $\pi(\eta \mid \xi)$ be the conditional distribution of $\eta$ given $\xi$. We have

$$\sup_{P \in \mathcal{P}} \big\{ \mathbb{E}_P[h(x, \eta)] - k\, d_W(P, \hat{P}_N) \big\} = \sup_{\pi} \iint \big[ h(x, \eta) - k\, c(\xi, \eta) \big]\, d\hat{P}_N(\xi)\, d\pi(\eta \mid \xi) = \mathbb{E}_{\hat{P}_N}\Big[ \sup_{z \in \Xi} \big\{ h(x, z) - k\, c(\xi, z) \big\} \Big],$$

where the last equality is achieved by choosing the maximizer $\pi$ as a Dirac distribution concentrating its mass at the point maximizing $\{h(x, \cdot) - k\, c(\xi, \cdot)\}$. The RS model (2) can then be reformulated as

$$\min_{x \in \mathcal{X},\, k \ge 0} \; k \quad \text{s.t.} \quad \mathbb{E}_{\hat{P}_N}\Big[ \sup_{z \in \Xi} \big\{ h(x, z) - k\, c(\xi, z) \big\} \Big] \le \tau. \qquad (7)$$

With this, the optimizer $\hat{x}_N$ of the RS model can be obtained hierarchically. First, for a fixed decision variable $x$, let $k_\tau(x)$ be the smallest $k$ that satisfies the RS constraint:

$$k_\tau(x) := \min\Big\{ k \ge 0 : \mathbb{E}_{\hat{P}_N}\Big[ \sup_{z \in \Xi} \big\{ h(x, z) - k\, c(\xi, z) \big\} \Big] \le \tau \Big\}.$$

Then $\hat{x}_N$ is the minimizer of $k_\tau(x)$:

$$\hat{x}_N := \arg\min_{x} k_\tau(x). \qquad (8)$$

We note that a similar reformulation technique was employed by Blanchet & Murthy to derive tractable solutions for DRO. In the DRO framework, however, the distribution is restricted to an ambiguity set, necessitating a Lagrange multiplier for the constraint and the existence of strong duality. These requirements introduce additional assumptions, including continuity of the functions $h(x, \xi)$ and $c(\cdot, \cdot)$. In contrast, robust satisficing, which does not limit the distribution set, avoids these extra assumptions.

3. Statistical Properties

This section presents our main results on the statistical properties of the optimizer $\hat{x}_N$ in the RS model (2). We start by describing the assumptions required for our analysis.

Assumption 1 (Exponential tail decay of the random variable). There exists an $a > 1$ such that $\mathbb{E}_{P^\star}[\exp(\|\xi\|^a)] < \infty$.

Assumption 1 requires that $\xi$ is relatively light-tailed. It plays a key role in bounding the rate at which the empirical distribution $\hat{P}_N$ approximates the true distribution $P^\star$ under the type-1 Wasserstein distance (Fournier & Guillin, 2015); see Proposition 2 for details. This assumption is relatively mild and applies to a broad range of distributions, including sub-Gaussian random variables.

Assumption 2 (Lipschitz continuity of the loss function). The loss $h(x, \xi)$ is Lipschitz in $\xi$ with a uniform constant $L$.

Assumption 2 is essential for deriving the dual form of the type-1 Wasserstein distance (Esfahani & Kuhn, 2015); see Proposition 1 for details. It holds for a wide range of machine learning problems, which we elaborate on in Appendix D. Note that we do not need Lipschitz continuity of $h(x, \xi)$ with respect to $x$; we defer the detailed discussion to Appendix E.

3.1. Confidence Intervals for the Optimal Loss

This section provides both non-asymptotic and asymptotic confidence intervals for the optimal loss $J^\star$, the smallest attainable expected loss as defined in (3), as well as for the true loss of $\hat{x}_N$.

Theorem 1 (Confidence intervals for the optimal loss). Suppose Assumptions 1 and 2 hold. For any $N$, let $\beta_N$ be the confidence level. With probability at least $1 - \beta_N$,

$$-L\, r_N + \frac{\tau_\epsilon}{1 + \epsilon} \;\le\; J^\star \;\le\; \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \;\le\; k_{\tau_\epsilon}\, r_N + \tau_\epsilon, \qquad (9)$$

where $r_N$, referred to as the remainder, solves

$$\beta_N = \begin{cases} c_1 \exp\!\big(-c_2 N r_N^{\max\{m, 2\}}\big) & \text{if } r_N \le 1, \\ c_1 \exp\!\big(-c_2 N r_N^{a}\big) & \text{if } r_N > 1, \end{cases}$$

with $c_1, c_2$ positive constants that depend only on the exponential decay rate $a$ and the dimension $m$. Moreover, when the confidence sequence $\{\beta_N\}$ satisfies $\sum_{N=1}^{\infty} \beta_N < \infty$, we have

$$\mathbb{P}\Big\{ -L\, r_N + \frac{\tau_\epsilon}{1 + \epsilon} \le J^\star \le \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \le k_{\tau_\epsilon}\, r_N + \tau_\epsilon \;\text{ for all sufficiently large } N \Big\} = 1. \qquad (10)$$

Table 1 outlines typical selections of $\beta_N$ and the corresponding rates of decay of the remainder $r_N$.

Table 1. Choices of the confidence level $\beta_N$ and the corresponding remainder $r_N$.

| Choice of $\beta_N$ | Corresponding $r_N$ |
|---|---|
| $\beta_N = \beta$ (a fixed constant) | $r_N = \big(\tfrac{\log(c_1 \beta^{-1})}{c_2 N}\big)^{1/\max\{m,2\}}$ if $N \ge \tfrac{\log(c_1 \beta^{-1})}{c_2}$; $\;r_N = \big(\tfrac{\log(c_1 \beta^{-1})}{c_2 N}\big)^{1/a}$ otherwise |
| $\beta_N = \exp(-\gamma \sqrt{N})$, $\gamma > 0$ | $r_N = \big(\tfrac{\log c_1 + \gamma \sqrt{N}}{c_2 N}\big)^{1/\max\{m,2\}}$ if $c_2 N - \gamma \sqrt{N} \ge \log c_1$; $\;r_N = \big(\tfrac{\log c_1 + \gamma \sqrt{N}}{c_2 N}\big)^{1/a}$ otherwise |
| $\beta_N = N^{-\alpha}$, $\alpha > 0$ | $r_N = \big(\tfrac{\log c_1 + \alpha \log N}{c_2 N}\big)^{1/\max\{m,2\}}$ if $c_2 N - \alpha \log N \ge \log c_1$; $\;r_N = \big(\tfrac{\log c_1 + \alpha \log N}{c_2 N}\big)^{1/a}$ otherwise |
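To make the definition of the remainder $r_N$ concrete, the following sketch inverts the tail bound for a given $\beta_N$. The constants $c_1, c_2$ are unknown in practice (Theorem 1 only guarantees their existence), so the values used here are placeholders; the example assumes $\beta_N < c_1$.

```python
import numpy as np

def remainder(beta_N, N, m, a, c1=1.0, c2=1.0):
    """Solve beta_N = c1 * exp(-c2 * N * r**p) for r, with p = max(m, 2) when the
    resulting r <= 1 and p = a otherwise (the two regimes of Theorem 1).
    c1, c2 are placeholder constants; assumes beta_N < c1."""
    log_term = np.log(c1 / beta_N) / (c2 * N)
    r_small = log_term ** (1.0 / max(m, 2))   # candidate from the r <= 1 branch
    if r_small <= 1.0:
        return r_small
    return log_term ** (1.0 / a)              # otherwise use the r > 1 branch

# Example: beta_N = exp(-sqrt(N)), the second choice in Table 1 with gamma = 1.
N, m, a = 1000, 10, 2.0
print(remainder(np.exp(-np.sqrt(N)), N, m, a))
```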
Notably, the last two choices of $\beta_N$ in Table 1 satisfy $\sum_{N=1}^{\infty} \beta_N < \infty$ and $\lim_{N \to \infty} r_N = 0$, under which (10) implies asymptotic consistency of $\mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)]$ (this asymptotic interval also applies to $J^\star$ directly by the convergence of the empirical loss to the true loss as $N$ increases):

$$\mathbb{P}\Big\{ \frac{\tau_\epsilon}{1 + \epsilon} \le \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \le \tau_\epsilon \;\text{ for all sufficiently large } N \Big\} = 1. \qquad (11)$$

We recognize the challenge posed by the curse of dimensionality, as indicated by the exponent $m$ in $r_N$; this is a common issue associated with the Wasserstein distance (Esfahani & Kuhn, 2015; Kuhn et al., 2019), and we leave it as a promising future research question. We also note that the upper bound in (9) involves $k_{\tau_\epsilon}$, which may be difficult to derive analytically. Fortunately, the following lemma provides an upper bound for $k_{\tau_\epsilon}$.

Lemma 1 (Fragility upper bound). Under Assumption 2, we have $k_\tau \le L$, where $k_\tau$ is the optimal value of the RS model (2).

Lemma 1 is noteworthy on its own. As pointed out by Long et al., $k_\tau$ characterizes the fragility of the model, with lower values indicating more robustness. Lemma 1 bounds $k_\tau$ by the Lipschitz constant $L$, showing that the model fragility is controlled.

Remark 1. In the proof in Appendix B.2, we establish the following relationship:

$$-L\, d_W(P^\star, \hat{P}_N) + \frac{\tau_\epsilon}{1 + \epsilon} \;\le\; J^\star \;\le\; \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \;\le\; k_{\tau_\epsilon}\, d_W(P^\star, \hat{P}_N) + \tau_\epsilon. \qquad (12)$$

Equation (12) shows that the true loss of $\hat{x}_N$ (the optimizer obtained from the RS model) under the target distribution $P^\star$ also falls within the confidence interval provided by (9). Furthermore, (9) facilitates the derivation of the upper bound on the generalization error in Theorem 3.

Remark 2. Equation (9) provides a guideline for determining the sample size sufficient to achieve a predefined accuracy at a specified confidence level $\beta_N$. This sample size is primarily quantified by the width of the confidence interval and is mainly driven by $r_N$. Table 1 lists various selections of $\beta_N$ along with the corresponding $r_N$, which allows us to compute explicitly the sample size required for specific scenarios.

By combining Lemma 1 with Theorem 1, we obtain a simpler form of confidence interval for $J^\star$, which depends only on the Lipschitz constant $L$ and the reference value $\tau_\epsilon$, eliminating the need to compute $k_{\tau_\epsilon}$ from the RS model.

Corollary 2. Suppose Assumptions 1 and 2 hold. For any $N$ and confidence level $\beta_N$, let $r_N$ be solved as in Theorem 1. With probability at least $1 - \beta_N$,

$$-L\, r_N + \frac{\tau_\epsilon}{1 + \epsilon} \le J^\star, \qquad \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \le L\, r_N + \tau_\epsilon.$$

The remainder $r_N$ becomes negligible for the choices of $\beta_N$ listed in Table 1. Thus Corollary 2 indicates that, as $N \to \infty$, the expected loss $\mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)]$ of the optimizer $\hat{x}_N$ also falls within the interval $[\tau_\epsilon / (1 + \epsilon),\, \tau_\epsilon]$. This characterizes the loss value the optimizer can achieve and shows that the regret of our optimizer $\hat{x}_N$ (the gap between $\mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)]$ and the optimal loss) is asymptotically controlled by the length of this interval.

To conclude this section, we offer a brief comparison of our confidence intervals with those derived by Esfahani & Kuhn. In their DRO framework, they define

$$J_N = \inf_{x \in \mathcal{X}} \; \sup_{P \in \mathbb{B}(\hat{P}_N,\, \epsilon(\beta_N))} \mathbb{E}_P[h(x, \xi)],$$

where $\mathbb{B}(\hat{P}_N, \epsilon(\beta_N))$ is a Wasserstein ball centered at $\hat{P}_N$ with radius $\epsilon(\beta_N)$. Under similar assumptions, Esfahani & Kuhn show that $\mathbb{P}\{J^\star \le J_N\} \ge 1 - \beta_N$, which provides only an upper bound for the optimal loss $J^\star$. Moreover, this upper bound $J_N$ requires solving the minimax problem in the DRO framework. In contrast, our confidence intervals from Corollary 2 are derived through the ERM problem, which is considerably easier to solve than the minimax problem, and our results provide two-sided rather than one-sided confidence intervals.
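As a small illustration of Corollary 2, the snippet below assembles the two interval endpoints. The Lipschitz constant, tolerance rate, reference value, and the constants $c_1 = c_2 = 1$ are all assumed placeholder values chosen for the example.

```python
import numpy as np

# Two-sided interval of Corollary 2 with placeholder constants c1 = c2 = 1
# (Theorem 1 only guarantees that such constants exist).
L_lip, eps, tau_eps = 1.0, 0.2, 1.1           # assumed Lipschitz constant, tolerance rate, reference value
N, m = 1000, 10
beta_N = np.exp(-np.sqrt(N))                  # one of the choices in Table 1
r_N = (np.log(1.0 / beta_N) / N) ** (1.0 / max(m, 2))   # the r_N <= 1 branch

lower = tau_eps / (1.0 + eps) - L_lip * r_N   # lower confidence bound on J*
upper = tau_eps + L_lip * r_N                 # upper confidence bound on E_{P*}[h(x_hat_N, xi)]
print(lower, upper)
```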
3.2. Finite-Sample Generalization Error Bound

We now characterize the generalization error of the optimizer $\hat{x}_N$ derived from the RS model. The generalization error, denoted $R(P^\star, \hat{x}_N)$, is defined as

$$R(P^\star, \hat{x}_N) := \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] - J^\star = \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] - \mathbb{E}_{P^\star}[h(x^\star, \xi)].$$

Theorem 3. Suppose Assumptions 1 and 2 hold. With probability at least $1 - \beta_N$,

$$R(P^\star, \hat{x}_N) \le \epsilon\, J^\star + (2 + \epsilon)\, L\, r_N, \qquad (13)$$

where $r_N$ is the remainder solved as in Theorem 1. Taking the expectation with respect to the data,

$$\mathbb{E}_{P^\star}\big[ R(P^\star, \hat{x}_N) \big] \le \epsilon\, J^\star + O\!\big(L\, N^{-\min\{1/m,\, 1/2\}}\big).$$

Remark 3. We elaborate on "the expectation with respect to the data". Recall that the optimizer $\hat{x}_N$ is derived from sample data, which are random variables following the source distribution $P^\star$. As a result, $\hat{x}_N$ and its generalization error upper bound are also random variables, and we take the expectation over the randomness of the sample data to obtain the expected version of the generalization error bound.

Theorem 3 explicitly characterizes how the generalization error is influenced by $\epsilon$. By reducing $\epsilon$ as the sample size $N$ increases, indicating less tolerance for excess empirical loss when more data are available, we can bound the generalization error more succinctly, as outlined in the following result.

Corollary 4. Suppose Assumptions 1 and 2 hold. Choose the reference value $\tau_{\epsilon_N}$ with $\epsilon_N = N^{-\min\{1/m,\, 1/2\}}$. Then

$$\mathbb{E}_{P^\star}\big[ R(P^\star, \hat{x}_N) \big] = O\!\big(L\, N^{-\min\{1/m,\, 1/2\}}\big).$$
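For readability, the plug-in step behind Corollary 4 can be spelled out as follows; treating $J^\star$ as a fixed, problem-dependent constant that is absorbed into the big-$O$ term is our reading of the statement.

```latex
% Substituting \epsilon_N = N^{-\min\{1/m,\,1/2\}} into the expected bound of Theorem 3:
\mathbb{E}_{P^\star}\!\big[R(P^\star,\hat{x}_N)\big]
  \;\le\; \epsilon_N\, J^\star + O\!\big(L\,N^{-\min\{1/m,\,1/2\}}\big)
  \;=\; J^\star\, N^{-\min\{1/m,\,1/2\}} + O\!\big(L\,N^{-\min\{1/m,\,1/2\}}\big)
  \;=\; O\!\big(L\,N^{-\min\{1/m,\,1/2\}}\big),
% where the last step treats J^\star as a problem-dependent constant.
```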
4. Guarantees under Distribution Shift

As discussed, the distribution selection under the RS framework is globalized, eliminating the need to pre-select a radius that restricts the distribution domain. We take this advantage further and integrate it into the earlier derivations, which allows us to straightforwardly derive confidence intervals and finite-sample generalization error bounds under distribution shifts. Consider samples drawn from the source distribution $P^\star$, with empirical distribution $\hat{P}_N$, and let $\hat{x}_N$ be the decision variable learned from the RS model (2). Under distribution shift, we evaluate the performance of $\hat{x}_N$ on another distribution $\tilde{P}$, which may differ from $P^\star$ by a certain degree of discrepancy. Define the optimal loss under the new distribution $\tilde{P}$ as

$$\tilde{J} := \inf_{x \in \mathcal{X}} \mathbb{E}_{\tilde{P}}[h(x, \xi)] = \mathbb{E}_{\tilde{P}}[h(\tilde{x}, \xi)],$$

where $\tilde{x} = \arg\min_x \mathbb{E}_{\tilde{P}}[h(x, \xi)]$. For the learned decision variable $\hat{x}_N$, denote the corresponding generalization error as

$$R(\tilde{P}, \hat{x}_N) := \mathbb{E}_{\tilde{P}}[h(\hat{x}_N, \xi)] - \tilde{J} = \mathbb{E}_{\tilde{P}}[h(\hat{x}_N, \xi)] - \mathbb{E}_{\tilde{P}}[h(\tilde{x}, \xi)].$$

Our goal is to derive confidence intervals for $\tilde{J}$ and a generalization error bound for $R(\tilde{P}, \hat{x}_N)$.

Theorem 5 (Distribution shift). Suppose Assumptions 1 and 2 hold. For any $N$, let $\beta_N$ be some nominal confidence level. With probability at least $1 - \beta_N$,

$$-L\, r_N - L\, d_W(P^\star, \tilde{P}) + \frac{\tau_\epsilon}{1 + \epsilon} \;\le\; \tilde{J} \;\le\; \mathbb{E}_{\tilde{P}}\big[h(\hat{x}_N, \xi)\big] \;\le\; k_{\tau_\epsilon}\, r_N + k_{\tau_\epsilon}\, d_W(P^\star, \tilde{P}) + \tau_\epsilon,$$

$$R(\tilde{P}, \hat{x}_N) \;\le\; \epsilon\, \tilde{J} + (2 + \epsilon)\, L\, d_W(P^\star, \tilde{P}) + (2 + \epsilon)\, L\, r_N,$$

where the remainder $r_N$ is solved as in Theorem 1. Taking the expectation over the data,

$$\mathbb{E}_{P^\star}\big[ R(\tilde{P}, \hat{x}_N) \big] \;\le\; \epsilon\, \tilde{J} + (2 + \epsilon)\, L\, d_W(P^\star, \tilde{P}) + O\!\big(L\, N^{-\min\{1/m,\, 1/2\}}\big).$$

This theorem shows that the results under distribution shifts merely require adding a multiple of the shift distance.

Remark 4. While our results face the curse-of-dimensionality issue commonly associated with the Wasserstein distance, they embody a trade-off: in higher dimensions, despite the slow decay of the remainder term $r_N$, a greater degree of distribution shift is tolerable. Specifically, when the distribution shift decays at the rate $N^{-\min\{1/m,\, 1/2\}}$, this rate can be merged with the remainder term to yield

$$\mathbb{E}_{P^\star}\big[ R(\tilde{P}, \hat{x}_N) \big] \le \epsilon\, \tilde{J} + O\!\big(L\, N^{-\min\{1/m,\, 1/2\}}\big).$$

This implies that the RS model can accommodate a distribution shift of order up to $N^{-\min\{1/m,\, 1/2\}}$ while still maintaining performance comparable to scenarios with no shift.

Finally, we compare our results with DRO. Under the DRO framework, if the distribution shifts, the radius must be large enough for the ambiguity set to contain the shifted distribution. However, as the ball expands, the worst-case expected value within DRO's conservative minimax framework deteriorates. In contrast, the RS framework, benefiting from its globalized distribution selection, only needs to include a linear multiple of the shift distance to handle the same situation.

5. Numerical Experiments

In this section, we conduct numerical evaluations of the RS model under both the original sampling distribution and distribution shifts. We compare RS with the baseline method, empirical risk minimization (ERM). Additionally, we establish connections with DRO and demonstrate that RS exhibits lower sensitivity to hyperparameter tuning. All experiments are based on the data-generating process detailed below. We define the random variable $\xi$ as $\xi = (u, y)$, with $u \in \mathbb{R}^{m_u}$ the feature variable and $y \in \mathbb{R}$ the label variable. The sampling distribution $P^\star$ is specified as follows: the feature variable $u$ is drawn from a normal distribution,

$$u \sim \mathcal{N}\big([0.5, 0.5, \ldots, 0.5]^{\top},\; 0.5\, I_{m_u}\big); \qquad (16)$$

and the label variable $y$ is generated via a linear model,

$$y = u \cdot x^\star + e,$$

where $\cdot$ denotes the inner product, $x^\star$ is the true model parameter, and $e$ is exogenous noise sampled from $\mathcal{N}(0, 0.1)$. $P^\star$ satisfies Assumption 1 because the Gaussian distribution is light-tailed. Let the training data $\{(u_i, y_i)\}_{i=1}^N$ be i.i.d. samples from $P^\star$. We use the $\ell_1$ loss for the model parameter $x$,

$$h(u, y, x) = |y - u \cdot x|,$$

which satisfies the Lipschitz condition in Assumption 2. For the cost function used in the type-1 Wasserstein distance, we follow Blanchet et al. (2019) and slightly modify the original $\ell_2$-norm definition as follows (the purpose of this adjustment is to keep the subsequent exposition concise; results under the original $\ell_2$ norm are given in Appendix C.2):

$$c(\xi_1, \xi_2) = c\big((u_1, y_1), (u_2, y_2)\big) = \begin{cases} \|u_1 - u_2\|_2 & \text{if } y_1 = y_2, \\ +\infty & \text{otherwise.} \end{cases}$$

The learned parameter $\hat{x}_N$ is evaluated on the target distribution $\tilde{P}$. The marginal distribution of $u$ under $\tilde{P}$ is identical to that under $P^\star$, following (16), but the label variable $y$ is generated with a potentially different parameter $\tilde{x}$:

$$y = u \cdot \tilde{x} + e.$$
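A minimal sketch of this data-generating process is given below. It is our own illustration, not the paper's code; in particular, reading $\mathcal{N}(0, 0.1)$ as a noise variance of 0.1 is an assumption about the paper's notation.

```python
import numpy as np

def sample(n, x_param, m_u, rng):
    """Draw n samples (u, y): u ~ N(0.5 * 1, 0.5 * I), y = u @ x_param + e, e ~ N(0, 0.1)."""
    u = rng.normal(loc=0.5, scale=np.sqrt(0.5), size=(n, m_u))
    y = u @ x_param + rng.normal(loc=0.0, scale=np.sqrt(0.1), size=n)
    return u, y

def l1_loss(U, y, x):
    """Empirical l1 loss (1/N) sum |y_i - u_i @ x|, Lipschitz in xi as required by Assumption 2."""
    return np.mean(np.abs(y - U @ x))

def cost(xi1, xi2):
    """Modified type-1 Wasserstein ground cost: ||u1 - u2||_2 if the labels agree, +inf otherwise."""
    (u1, y1), (u2, y2) = xi1, xi2
    return np.linalg.norm(u1 - u2) if y1 == y2 else np.inf

rng = np.random.default_rng(0)
x_star = np.array([2.0, 1.0] * 5)        # the m_u = 10 setting of Section 5.1
U_train, y_train = sample(100, x_star, m_u=10, rng=rng)
print(l1_loss(U_train, y_train, x_star))
```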
In the following sections, we evaluate the performance of RS in two scenarios: $\tilde{P} = P^\star$ (i.e., $\tilde{x} = x^\star$), representing settings without distribution shift, and $\tilde{P} \ne P^\star$ (i.e., $\tilde{x} \ne x^\star$), representing settings with distribution shift. We use the mean squared error (MSE) on the target distribution as the performance metric. We conclude this section by drawing connections between RS and DRO.

5.1. RS Performance on the Sampling Distribution

In this section, we evaluate the RS optimizer $\hat{x}_N$ under the sampling distribution. Although the target distribution coincides with the sampling distribution, discrepancies between the empirical and sampling distributions may still arise, particularly in small-sample regimes with high-dimensional random variables. We therefore consider a relatively high-dimensional setting with dimension $m_u = 10$ and true model parameter $x^\star = \tilde{x} = [2.0, 1.0, \ldots, 2.0, 1.0]^{\top}$, and investigate the generalization performance of $\hat{x}_N$ across various sample sizes.

Figure 1. Performance across various sample sizes: RS outperforms the ERM baseline in small-sample regimes.

Figure 1 shows how the RS model's performance varies with different settings of the tolerance rate $\epsilon$ and across various sample sizes. For smaller sample sizes, the RS model outperforms the baseline that does not incorporate robustness; among the RS models, those configured with a larger $\epsilon$ (indicating a greater emphasis on robustness) perform better. This underscores the importance of accounting for robustness, particularly when there is a notable gap between the sampling and empirical distributions in small-sample regimes. As the sample size increases, the relative benefit of the RS model decreases. This trend is expected, since a larger dataset gives a better picture of the sampling distribution, making the ERM baseline increasingly effective.

5.2. RS Performance under Distribution Shift

We now evaluate the performance of RS under distribution shift. Let the model parameter in the sampling distribution be $x^\star = [2.0, 1.0]^{\top}$, and let the model parameter in the target distribution be

$$\tilde{x} = [\,2.00 - 0.05 \cdot \text{DEGREE},\; 1.00 + 0.025 \cdot \text{DEGREE}\,]^{\top},$$

where DEGREE is a positive number characterizing the degree of distribution shift (our focus here is on evaluating robustness under smoothly perturbed parameters rather than arbitrary perturbations of $x^\star$; this setup reflects the observation that both DRO and RS, known for their robustness, typically yield smoother parameters than those derived from direct empirical risk minimization). As DEGREE increases, the discrepancy between $x^\star$ and $\tilde{x}$ grows, leading to a larger distribution shift between the sampling distribution $P^\star$ and the target distribution $\tilde{P}$.

Figure 2. Performance across various degrees of distribution shift: RS outperforms the ERM baseline under distribution shifts.

Figure 2 shows how the RS model performs when configured with different tolerance rates $\epsilon$ and under various distribution shifts. For minor distribution shifts (DEGREE less than 2), the RS model's performance is comparable to the baseline, deteriorating slightly for models with larger $\epsilon$ (stronger robustness control). However, with more substantial distribution shifts (DEGREE greater than 5), the RS model almost consistently outperforms the baseline, exhibiting stronger robustness under larger shifts. This result highlights the potential of the RS framework to provide strong generalization guarantees in uncertain environments.
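The shift parameterization and a target-distribution evaluation can be sketched as follows. Reading the reported MSE as the prediction error on fresh samples from the shifted target distribution is our assumption; the sketch is otherwise self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

def shifted_parameter(degree):
    # x_tilde = [2.00 - 0.05 * DEGREE, 1.00 + 0.025 * DEGREE] from Section 5.2.
    return np.array([2.00 - 0.05 * degree, 1.00 + 0.025 * degree])

def target_mse(x_hat, degree, n_test=100_000):
    # Prediction MSE on fresh samples from the shifted target distribution P_tilde.
    x_tilde = shifted_parameter(degree)
    U = rng.normal(0.5, np.sqrt(0.5), size=(n_test, 2))
    y = U @ x_tilde + rng.normal(0.0, np.sqrt(0.1), size=n_test)
    return np.mean((y - U @ x_hat) ** 2)

x_hat = np.array([2.0, 1.0])   # e.g., a parameter estimated from source-distribution data
print(target_mse(x_hat, degree=5.0))
```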
5.3. Connection to DRO under Lipschitz Loss

We proceed to compare RS and DRO. We establish an explicit correspondence between the hyperparameters of RS and DRO for this class of problems, and we conduct experiments to compare their sensitivities to hyperparameter tuning.

5.3.1. Hyperparameter Correspondence

Consider a Lipschitz loss function $L(\cdot)$. Following the DRO literature (Blanchet et al., 2019; Shafieezadeh-Abadeh et al., 2019), the general Lipschitz-loss learning problem under the DRO framework can be reformulated as

$$\min_{x} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) + r\, \|x\|_2. \qquad (17)$$

Building on the reformulation presented in Section 2.2, the same Lipschitz-loss learning problem under the RS framework is equivalent to

$$\min_{x} \; \|x\|_2 \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) \le \tau. \qquad (18)$$

Applying the Lagrangian method to (18), with strong duality holding, we further deduce that (18) simplifies to the following expression (see Appendix C.1 for the proof):

$$\sup_{\lambda > 0} \; \inf_{x} \; \frac{\frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) + \lambda \|x\|_2 - \tau}{\lambda}. \qquad (19)$$

We immediately observe a clear link between RS and DRO for the general Lipschitz-loss learning problem: given a reference value $\tau$, solving the RS model (19) yields the optimizer $(\hat{\lambda}_N, \hat{x}_N)$. Then, by setting the radius $r$ in the DRO model (17) to $\hat{\lambda}_N$, the DRO model (17) produces the same optimizer for the model parameter $x$. Recall that the reference value is set to $\tau_\epsilon = (1 + \epsilon) \inf_x \mathbb{E}_{\hat{P}_N}[h(x, \xi)]$ throughout this paper, where $\epsilon$ is the tolerance rate controlling the robustness of RS. Thus each hyperparameter $\epsilon$ in the RS model is associated with a specific radius $r(\epsilon)$, the hyperparameter in the DRO model.

Figure 3. Correspondence between the RS tolerance rate parameter $\epsilon$ and the DRO radius parameter $r$.

Figure 3 illustrates the relationship between the robustness-controlling hyperparameters of the two models: the DRO radius $r$ and the RS tolerance rate $\epsilon$. Notably, the function $r(\epsilon)$ is concave in $\epsilon$, flattening as $1 + \epsilon$ approaches 1.6. This indicates that, to achieve comparable performance, the RS model can accommodate larger variations in $\epsilon$ than the DRO model can in $r$, implying that RS is less sensitive to hyperparameter tuning. This observation is further supported by the following experiments.
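The correspondence can be checked numerically. The sketch below solves the RS problem (18) with CVXPY for the $\ell_1$ loss, reads off the multiplier of the empirical-loss constraint, and re-solves the DRO problem (17) with the implied radius. The data are assumed toy data, and recovering $r$ as the reciprocal of the constraint's dual value is our reading of the correspondence; it presumes the loss constraint is active at the optimum.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 2
x_star = np.array([2.0, 1.0])
U = rng.normal(0.5, np.sqrt(0.5), size=(N, d))
y = U @ x_star + rng.normal(0.0, np.sqrt(0.1), size=N)

# Reference value tau_eps = (1 + eps) * min_x empirical l1 loss, as in (5).
x = cp.Variable(d)
erm = cp.Problem(cp.Minimize(cp.sum(cp.abs(y - U @ x)) / N))
erm.solve()
eps = 0.2
tau_eps = (1 + eps) * erm.value

# RS model (18): minimize ||x||_2 subject to the empirical loss being at most tau_eps.
x_rs = cp.Variable(d)
loss_con = cp.sum(cp.abs(y - U @ x_rs)) / N <= tau_eps
rs = cp.Problem(cp.Minimize(cp.norm(x_rs, 2)), [loss_con])
rs.solve()

# Implied DRO radius: r = 1 / mu, with mu the dual value of the (active) loss constraint.
r_implied = 1.0 / loss_con.dual_value

# DRO model (17) with that radius should recover (approximately) the same parameter.
x_dro = cp.Variable(d)
dro = cp.Problem(cp.Minimize(cp.sum(cp.abs(y - U @ x_dro)) / N + r_implied * cp.norm(x_dro, 2)))
dro.solve()
print(x_rs.value, x_dro.value, r_implied)
```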
5.3.2. Numerical Sensitivity Analysis

We now conduct experiments to evaluate the sensitivity of RS and DRO to hyperparameter tuning. Specifically, we vary the tolerance rate $\epsilon$ in the RS model and the radius $r$ in the DRO model. We set the model parameter in the sampling distribution to $x^\star = [2.0, 1.0]^{\top}$, as in Section 5.2, and set the target environment to $\tilde{x} = [1.80, 0.90]^{\top}$.

Figure 4. DRO performance under distribution shifts: the DRO model shows higher sensitivity to the hyperparameter $r$.

Figure 5. RS performance under distribution shifts: the RS model shows lower sensitivity to the hyperparameter $\epsilon$.

Figure 4 shows that DRO outperforms the baseline for radii smaller than 0.24, with the optimal $r$ around 0.215. Figure 5 shows that RS exceeds the baseline when $1 + \epsilon$ is below 1.4, with the optimal $1 + \epsilon$ around 1.285. The smallest MSEs of the DRO and RS models are comparable. However, the MSE surges drastically in response to changes of $r$ for DRO in Figure 4, in contrast to the milder variation of the MSE with $\epsilon$ for RS shown in Figure 5. This difference in hyperparameter sensitivity aligns with the relationship between $r$ and $\epsilon$ depicted in Figure 3. In particular, within a 15% relative error range around the optimal hyperparameters, the MSE for DRO may spike to 0.30, whereas RS maintains a more stable MSE of about 0.125. This suggests that RS offers greater flexibility in setting the tolerance rate $\epsilon$, unlike DRO, which requires more precise tuning of the radius $r$.

6. Conclusions

This paper explores the statistical properties of the RS model, a recent robust optimization framework introduced by Long et al. (2023). We provide theoretical guarantees for the RS model, including two-sided confidence intervals for the optimal loss and finite-sample generalization error bounds. These guarantees extend to scenarios involving distribution shifts, highlighting the RS model's robust generalization performance. Our numerical experiments reveal the superiority of the RS model over the baseline empirical risk minimization method, particularly in small-sample regimes and under distribution shifts. We establish explicit connections between the RS and DRO frameworks within specific models, showing that RS exhibits lower sensitivity to hyperparameter tuning than DRO, which makes it a more practical and interpretable choice. Future research directions include extending our analysis of RS to other distribution distances such as f-divergences, proving lower bounds on the generalization error, and applying RS to various practical applications.

Acknowledgement

We would like to thank the reviewers for their constructive feedback, which has greatly improved the current version of our work. Yunbei Xu acknowledges support from ARO through award W911NF-21-1-0328 and from the Simons Foundation and NSF through award DMS-2031883. Ruohan Zhan is partly supported by the Guangdong Provincial Key Laboratory of Mathematical Foundations for Artificial Intelligence (2023B1212010001) and the Project of Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone (HZQB-KCZYB-2020083).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bayraksan, G. and Love, D. K. Data-driven stochastic programming using phi-divergences. In The Operations Research Revolution, pp. 1-19. INFORMS, 2015.

Blanchet, J. and Murthy, K. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565-600, 2019.

Blanchet, J., Kang, Y., and Murthy, K. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830-857, 2019.

Chen, S. and Chen, Y. Designing a resilient supply chain network under ambiguous information and disruption risk. Computers & Chemical Engineering, 179:108428, 2023.

Deng, M., Bian, B., Zhou, Y., and Ding, J. Distributionally robust production and replenishment problem for hydrogen supply chains. Transportation Research Part E: Logistics and Transportation Review, 179:103293, 2023.

Esfahani, P. M. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116, 2015.

Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707-738, 2015.

Hu, Z. and Hong, L. J. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 1(2):9, 2013.
Huang, H., Li, Z., Gooi, H. B., Qiu, H., Zhang, X., Lv, C., Liang, R., and Gong, D. Distributionally robust energy-transportation coordination in coal mine integrated energy systems. Applied Energy, 333:120577, 2023.

Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pp. 130-166. INFORMS, 2019.

Li, Y., Han, M., Shahidehpour, M., Li, J., and Long, C. Data-driven distributionally robust scheduling of community integrated energy systems with uncertain renewable generations considering integrated demand response. Applied Energy, 335:120749, 2023.

Liu, F., Chen, Z., and Wang, S. Globalized distributionally robust counterpart. INFORMS Journal on Computing, 2023.

Long, D. Z., Sim, M., and Zhou, M. Robust satisficing. Operations Research, 71(1):61-82, 2023.

Ronchetti, E. The main contributions of robust statistics to statistical science and a new challenge. Metron, 79(2):127-135, 2021.

Ruan, H., Zhou, S., Chen, Z., and Ho, C. P. Robust satisficing MDPs. In International Conference on Machine Learning, pp. 29232-29258. PMLR, 2023.

Saday, A., Yıldırım, Y. C., and Tekin, C. Robust Bayesian satisficing. arXiv preprint arXiv:2308.08291, 2023.

Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1-68, 2019.

Sim, M. The analytics of robust satisficing: predict, optimize, satisfice, then fortify. In Book of Abstracts, pp. 42, 2023.

Sim, M., Zhao, L., and Zhou, M. Tractable robust supervised learning models. Available at SSRN 3981205, 2021.

Wang, D., Yang, K., and Yang, L. Risk-averse two-stage distributionally robust optimisation for logistics planning in disaster relief management. International Journal of Production Research, 61(2):668-691, 2023.

A. Key Propositions

Proposition 1. For any distributions $Q_1, Q_2 \in \mathcal{M}(\Xi)$, we have

$$d_W(Q_1, Q_2) = \sup_{f \in \mathcal{L}} \Big\{ \int_{\Xi} f(\xi)\, Q_1(d\xi) - \int_{\Xi} f(\xi)\, Q_2(d\xi) \Big\}, \qquad (20)$$

where $\mathcal{L}$ denotes the space of all Lipschitz functions with $|f(\xi) - f(\xi')| \le \|\xi - \xi'\|$ for all $\xi, \xi' \in \Xi$, and $\|\cdot\|$ is a norm. This is the dual representation of the Wasserstein distance, and its use in our proofs relies on Assumption 2.

We will also utilize the following bound on the Wasserstein distance between $P^\star$ and $\hat{P}_N$.

Proposition 2 (Fournier & Guillin, 2015). If Assumption 1 holds, then

$$\mathbb{P}^N\big\{ d_W(P^\star, \hat{P}_N) \ge r \big\} \le \begin{cases} c_1 \exp\!\big(-c_2 N r^{\max\{m, 2\}}\big) & \text{if } r \le 1, \\ c_1 \exp\!\big(-c_2 N r^{a}\big) & \text{if } r > 1, \end{cases}$$

for all $N \ge 1$, dimension $m \ne 2$ of $\xi$, and $r > 0$, where $c_1, c_2$ are positive constants that depend only on $a$ and $m$.
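As a quick empirical illustration of the decay described in Proposition 2, the following sketch estimates the one-dimensional Wasserstein distance between an empirical measure and a large reference sample from the same Gaussian. It is only a Monte Carlo sanity check of the qualitative behavior, not part of the paper.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(size=200_000)          # stands in for the true distribution P*

for N in (50, 200, 800, 3200):
    # Average d_W(P*, P_hat_N) over a few replications; it should shrink roughly like N^{-1/2}
    # in one dimension (m = 1, so max{m, 2} = 2 governs the rate).
    draws = [wasserstein_distance(rng.normal(size=N), reference) for _ in range(20)]
    print(N, np.mean(draws))
```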
B. Proofs of Lemmas and Theorems

B.1. Proof of Lemma 1

Choosing $x = \hat{x}_N$ in (2), we have

$$\mathbb{E}_P[h(\hat{x}_N, \xi)] \le \tau + k_\tau\, d_W(P, \hat{P}_N) \quad \forall P \in \mathcal{P}. \qquad (21)$$

Moreover, by the definition of $k_\tau$, for any $\delta > 0$ we can choose a distribution $P_1$ satisfying

$$\mathbb{E}_{P_1}[h(\hat{x}_N, \xi)] \ge \tau + (k_\tau - \delta)\, d_W(P_1, \hat{P}_N). \qquad (22)$$

Subtracting (21) from (22),

$$(k_\tau - \delta)\, d_W(P_1, \hat{P}_N) - k_\tau\, d_W(P, \hat{P}_N) \le \mathbb{E}_{P_1}[h(\hat{x}_N, \xi)] - \mathbb{E}_P[h(\hat{x}_N, \xi)] \le L\, d_W(P_1, P) \qquad (23)\text{--}(25)$$

for all $\delta > 0$ and $P \in \mathcal{P}$, where the second inequality uses (20). By the arbitrariness of $\delta$, we can drop the $\delta$ terms; choosing $P = \hat{P}_N$ then gives

$$k_\tau\, d_W(P_1, \hat{P}_N) \le L\, d_W(P_1, \hat{P}_N). \qquad (26)$$

If $d_W(P_1, \hat{P}_N) > 0$, this completes the proof. We now explain why $d_W(P_1, \hat{P}_N) > 0$ holds. Indeed, $d_W(P_1, \hat{P}_N) = 0$ if and only if $P_1 = \hat{P}_N$ almost surely. But if $P_1 = \hat{P}_N$, then (22) gives $\mathbb{E}_{\hat{P}_N}[h(\hat{x}_N, \xi)] \ge \tau$, while choosing $P = \hat{P}_N$ in (21) gives $\mathbb{E}_{\hat{P}_N}[h(\hat{x}_N, \xi)] \le \tau$, so $\mathbb{E}_{\hat{P}_N}[h(\hat{x}_N, \xi)] = \tau$. This cannot hold in general because $\epsilon$ in $\tau = \tau_\epsilon$ can be varied arbitrarily.

B.2. Proof of Theorem 1

For the right-hand side,

$$J^\star = \mathbb{E}_{P^\star}[h(x^\star, \xi)] \le \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \le k_{\tau_\epsilon}\, d_W(P^\star, \hat{P}_N) + \tau_\epsilon,$$

where the second inequality follows from the model (2) itself and the fact that $k_\tau(\hat{x}_N) = k_\tau$. For the other side, by the definition of $J^\star$, for any $\eta > 0$ choose $x_\eta$ satisfying

$$\mathbb{E}_{P^\star}[h(x_\eta, \xi)] \le J^\star + \eta. \qquad (27)$$

Then

$$\tau_\epsilon = (1 + \epsilon) \inf_x \mathbb{E}_{\hat{P}_N}[h(x, \xi)] \le (1 + \epsilon)\, \mathbb{E}_{\hat{P}_N}[h(x_\eta, \xi)] \le (1 + \epsilon)\big[ L\, d_W(P^\star, \hat{P}_N) + \mathbb{E}_{P^\star}[h(x_\eta, \xi)] \big] \le (1 + \epsilon) L\, d_W(P^\star, \hat{P}_N) + (1 + \epsilon)(J^\star + \eta) \qquad (28)\text{--}(31)$$

for all $\eta > 0$. The second inequality follows from (20) and the third uses (27). By the arbitrariness of $\eta$, we have $\tau_\epsilon \le (1 + \epsilon) L\, d_W(P^\star, \hat{P}_N) + (1 + \epsilon) J^\star$, which yields

$$-L\, d_W(P^\star, \hat{P}_N) + \frac{\tau_\epsilon}{1 + \epsilon} \le J^\star.$$

For (9), we simply invoke Proposition 2:

$$\mathbb{P}\Big\{ -L\, r_N + \frac{\tau_\epsilon}{1 + \epsilon} \le J^\star \le k_{\tau_\epsilon}\, r_N + \tau_\epsilon \Big\} \ge \mathbb{P}\big\{ d_W(P^\star, \hat{P}_N) \le r_N \big\} \ge 1 - \beta_N, \qquad (32)$$

where $\beta_N$ is given in (9). Denote the events $A_N = \big\{ -L\, r_N + \tfrac{\tau_\epsilon}{1+\epsilon} \le J^\star \le k_{\tau_\epsilon}\, r_N + \tau_\epsilon \big\}$, so that $\mathbb{P}(A_N^c) \le \beta_N$. Because $\sum_{N=1}^\infty \beta_N < \infty$, the Borel-Cantelli lemma gives

$$\mathbb{P}\big( \limsup_{N \to \infty} A_N^c \big) = 0, \qquad (33)$$

so $\mathbb{P}(\liminf_{N \to \infty} A_N) = 1$, which implies the consistency claim. Finally, if $\lim_{N \to \infty} r_N(\beta_N) = 0$, we can take $N \to \infty$ and obtain (11).

B.3. Proof of Theorem 3

Using Theorem 1 and Remark 1, we obtain

$$\mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \le k_{\tau_\epsilon}\, d_W(P^\star, \hat{P}_N) + \tau_\epsilon \le L\, d_W(P^\star, \hat{P}_N) + \tau_\epsilon \le (2 + \epsilon)\, L\, d_W(P^\star, \hat{P}_N) + (1 + \epsilon)\, J^\star,$$

where the second inequality uses Lemma 1 and the third uses the bound on $\tau_\epsilon$ established in the proof of Theorem 1. Analogously to (32),

$$\mathbb{P}\big\{ \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \le (1 + \epsilon)\, J^\star + (2 + \epsilon)\, L\, r_N \big\} \ge \mathbb{P}\big\{ d_W(P^\star, \hat{P}_N) \le r_N \big\} \ge 1 - \beta_N,$$

and subtracting $J^\star$ yields (13). Taking the expectation over the data,

$$\mathbb{E}_{P_{\mathrm{data}}}\big[ d_W(P^\star, \hat{P}_N) \big] = \int_0^\infty \mathbb{P}_{\mathrm{data}}\big\{ d_W(P^\star, \hat{P}_N) \ge r \big\}\, dr \le \int_0^1 c_1 \exp\!\big(-c_2 N r^{\max\{m,2\}}\big)\, dr + \int_1^\infty c_1 \exp\!\big(-c_2 N r^{a}\big)\, dr.$$

Substituting $r = N^{-1/\max\{m,2\}} t$ in the first integral and $r = N^{-1/a} t$ in the second,

$$= N^{-\frac{1}{\max\{m,2\}}} \int_0^{N^{1/\max\{m,2\}}} c_1 \exp\!\big(-c_2 t^{\max\{m,2\}}\big)\, dt + N^{-\frac{1}{a}} \int_{N^{1/a}}^{\infty} c_1 \exp\!\big(-c_2 t^{a}\big)\, dt \le N^{-\frac{1}{\max\{m,2\}}} \int_0^{\infty} c_1 \exp\!\big(-c_2 t^{\max\{m,2\}}\big)\, dt + N^{-\frac{1}{a}} \int_0^{\infty} c_1 \exp\!\big(-c_2 t^{a}\big)\, dt = O\!\big(N^{-\min\{1/m,\, 1/2\}}\big).$$

The first inequality is derived from Proposition 2, and the last step uses the convergence of the two exponential integrals. The exponent $a$ can be removed from the final rate: if Assumption 1 holds with some $a > 2$, it also holds with $a = 2$; if $a < 2$, the term $N^{-1/a}$ decays faster than $N^{-1/2}$ and is absorbed by the minimum. The result is therefore $O(N^{-\min\{1/m,\, 1/2\}})$. Finally,

$$\mathbb{E}_{P_{\mathrm{data}}}\big[ \mathbb{E}_{P^\star}[h(\hat{x}_N, \xi)] \big] \le (2 + \epsilon)\, L\, \mathbb{E}_{P_{\mathrm{data}}}\big[ d_W(P^\star, \hat{P}_N) \big] + (1 + \epsilon)\, J^\star = (1 + \epsilon)\, J^\star + O\!\big(L\, N^{-\min\{1/m,\, 1/2\}}\big),$$

and subtracting $J^\star$ gives the expected-value bound in Theorem 3.

B.4. Proof of Theorem 5

The proof follows directly from the proofs of Theorems 1 and 3. Combining with the previously proven relation (12), we obtain the analogous result

$$-L\, d_W(\tilde{P}, \hat{P}_N) + \frac{\tau_\epsilon}{1 + \epsilon} \le \tilde{J} \le \mathbb{E}_{\tilde{P}}[h(\hat{x}_N, \xi)] \le k_{\tau_\epsilon}\, d_W(\tilde{P}, \hat{P}_N) + \tau_\epsilon. \qquad (34)$$

We then use the triangle inequality $d_W(\tilde{P}, \hat{P}_N) \le d_W(P^\star, \hat{P}_N) + d_W(\tilde{P}, P^\star)$, which produces the extra term $d_W(\tilde{P}, P^\star)$; for the term $d_W(P^\star, \hat{P}_N)$, we continue to use the tail probability of Proposition 2 to obtain the high-probability and expected-value guarantees.

C. Supplementary Results for Section 5

C.1. Derivation of the Equivalent Models in Section 5.3.1

Proof of (17). For convenience, let $x_a$ denote the augmented parameter vector $(x, -1)^{\top}$, so that $x_a \cdot \xi = u \cdot x - y$ for $\xi = (u, y)$. Consider the Lipschitz loss $h(x, \xi) = L(x_a \cdot \xi)$. For DRO, under mild assumptions, Esfahani & Kuhn have shown that

$$\max_{P \in \{P : d(P, \hat{P}_N) \le r\}} \mathbb{E}_P[h(x, \xi)] = \inf_{\lambda \ge 0} \Big\{ \lambda r + \frac{1}{N} \sum_{i=1}^{N} \sup_{\xi} \big( h(x, \xi) - \lambda\, c(\xi, \xi_i) \big) \Big\}. \qquad (35)$$
Next, denote $\Delta = u - u_i$. Using the argument of Shafieezadeh-Abadeh et al., we obtain

$$\sup_{\xi} \big( h(x, \xi) - \lambda\, c(\xi, \xi_i) \big) = \sup_{\xi} \big( L(x_a \cdot \xi) - \lambda\, c(\xi, \xi_i) \big) = \sup_{u} \big( L(u \cdot x - y_i) - \lambda \|u - u_i\|_2 \big) = \sup_{\Delta} \big( L((u_i + \Delta) \cdot x - y_i) - \lambda \|\Delta\|_2 \big) = \begin{cases} L(u_i \cdot x - y_i) & \text{if } \lambda \ge \|x\|_2, \\ +\infty & \text{otherwise.} \end{cases}$$

The second equality uses the definition of our modified cost function: if $y$ in $\xi$ and $y_i$ in $\xi_i$ are not equal, the cost becomes $+\infty$, so such $\xi$ cannot attain the supremum; hence only the distance of the feature variable $u$ is retained. Returning to (35),

$$\inf_{\lambda \ge 0} \Big\{ \lambda r + \frac{1}{N} \sum_{i=1}^{N} \sup_{\xi} \big( h(x, \xi) - \lambda\, c(\xi, \xi_i) \big) \Big\} = \inf_{\lambda \ge \|x\|_2} \Big\{ \lambda r + \frac{1}{N} \sum_{i=1}^{N} L(u_i \cdot x - y_i) \Big\} = r \|x\|_2 + \frac{1}{N} \sum_{i=1}^{N} L(u_i \cdot x - y_i).$$

Therefore,

$$\min_{x \in \mathcal{X}} \; \max_{P \in \{P : d(P, \hat{P}_N) \le r\}} \mathbb{E}_P[h(x, \xi)] = \min_{x} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) + r \|x\|_2. \qquad (36)$$

Proof of (18). We have already given the reformulation (7) of the RS model in Section 2.2. Its constraint is

$$\frac{1}{N} \sum_{i=1}^{N} \Big[ \sup_{z \in \Xi} \big\{ h(x, z) - k\, c(\xi_i, z) \big\} \Big] \le \tau. \qquad (37)$$

Following the proof of (17),

$$\frac{1}{N} \sum_{i=1}^{N} \Big[ \sup_{z \in \Xi} \big\{ h(x, z) - k\, c(\xi_i, z) \big\} \Big] = \begin{cases} \frac{1}{N} \sum_{i=1}^{N} L(u_i \cdot x - y_i) & \text{if } k \ge \|x\|_2, \\ +\infty & \text{otherwise.} \end{cases}$$

Since $\tau$ serves as the upper bound in (37), ensuring that the left-hand side of (37) is finite requires $k \ge \|x\|_2$. Therefore, in (7), minimizing $k$ is equivalent to minimizing $\|x\|_2$, i.e., $k = \min_x \|x\|_2$, subject to the constraint $\frac{1}{N} \sum_{i=1}^{N} L(u_i \cdot x - y_i) \le \tau$.

Proof of (19). Starting from (18), we write its dual form:

$$\inf_{x} \; \sup_{\lambda > 0} \; \Big\{ \|x\|_2 + \lambda \Big( \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) - \tau \Big) \Big\}.$$

Notably, this is a convex problem. As long as $\tau > \min_x \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x)$, which is mentioned in (4), there exists a point in the relative interior, hence Slater's condition and strong duality hold. Subsequently, to facilitate the comparison with DRO, we replace $\lambda$ with $1/\lambda$ to obtain

$$\inf_{x} \; \sup_{\lambda > 0} \; \frac{\frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) + \lambda \|x\|_2 - \tau}{\lambda}.$$

Finally, note that the above expression is convex in $x$ and concave in $\lambda$. By the minimax theorem, we can interchange the order of $\sup$ and $\inf$ to obtain the desired formula (19).

C.2. Other Equivalent Formulations

This section answers the question raised in the earlier footnote on the cost function. If we instead use the $\ell_2$ norm of the entire vector as the cost function, i.e., $c(\xi_1, \xi_2) = \|\xi_1 - \xi_2\|_2$, then the DRO model is equivalent to

$$\min_{x \in \mathcal{X}} \; \max_{P \in \{P : d(P, \hat{P}_N) \le r\}} \mathbb{E}_P[h(x, \xi)] = \min_{x} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) + r \|x_a\|_2,$$

where $x_a$ is the augmented parameter vector. Similarly, the RS model is equivalent to

$$\min_{x} \; \|x_a\|_2 \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) \le \tau,$$

with dual equivalent form

$$\sup_{\lambda > 0} \; \inf_{x} \; \frac{\frac{1}{N} \sum_{i=1}^{N} L(y_i - u_i \cdot x) + \lambda \|x_a\|_2 - \tau}{\lambda}.$$

Therefore, after modifying the definition of the cost function, the only difference is whether the regularization term involves the $\ell_2$ norm of the parameter $x$ itself or the $\ell_2$ norm of the augmented vector $x_a$ obtained by appending one constant coordinate.

C.3. The Function $r(\epsilon)$ in Ten Dimensions

In Section 5.1, the feature variable is ten-dimensional. As a supplement to Section 5.3.1, we also plot the $r$-$\epsilon$ relationship for the ten-dimensional feature variable. Figure 6 shows a similar relationship between the robustness-controlling hyperparameters of the two models, the DRO radius $r$ and the RS tolerance rate $\epsilon$: the overall trend is concave, and the curve flattens around $1 + \epsilon = 1.4$. Moreover, where the function flattens, the corresponding radius $r$ is smaller than in the two-dimensional case of Figure 3.

Figure 6. The function $r(\epsilon)$ for $m_u = 10$.
D. Some Classical Loss Functions

Here we consider $h(x, \xi)$ as a loss function for machine learning applications, where $x$ denotes the parameters of the classification or regression model, and $\xi = (\xi_f, \xi_l)^{\top}$ represents the data, with $\xi_f$ the feature variable and $\xi_l$ the label variable. For binary classification problems, the loss function can be written as

$$h(x, \xi) = L(\xi_l \cdot x^{\top} \xi_f). \qquad (38)$$

For regression problems, the loss function can be written as

$$h(x, \xi) = L(\xi_l - x^{\top} \xi_f). \qquad (39)$$

Table 2 presents a few common loss functions.

Table 2. Some classical loss functions.

| $L(z)$ | Classification (C) or Regression (R) | Learning model |
|---|---|---|
| Hinge loss: $\max\{0, 1 - z\}$ | C | SVM |
| Smooth hinge loss: $\tfrac{1}{2} - z$ if $z \le 0$; $\tfrac{1}{2}(1 - z)^2$ if $0 < z < 1$; $0$ if $z \ge 1$ | C | Smooth SVM |
| Log loss: $\log(1 + e^{-z})$ | C | Logistic regression |
| Squared loss: $z^2$ | R | MSE |
| $\ell_1$ loss: $\lvert z \rvert$ | R | MAE |
| Huber loss: $\tfrac{1}{2} z^2$ if $\lvert z \rvert \le \delta$; $\delta(\lvert z \rvert - \tfrac{1}{2}\delta)$ otherwise | R | Huber regression |
| $\delta$-insensitive loss: $\max\{0, \lvert z \rvert - \delta\}$ | R | Support vector regression |
| Pinball loss: $\max\{-\delta z, (1 - \delta) z\}$ | R | Quantile regression |

Apart from the squared loss, all loss functions in Table 2 are Lipschitz, so Assumption 2 is relatively weak and reasonable. Furthermore, in practical applications $x$ and $\xi$ are often bounded, in which case even the squared loss is Lipschitz on the bounded domain.

E. Discussion on the Lipschitz Continuity Assumption

In this paper we do not assume Lipschitz continuity of the loss function with respect to $x$, but with respect to $\xi$. This assumption follows Esfahani & Kuhn (2015), with the aim of using the inequality in Proposition 1. We recognize, however, that assuming Lipschitz continuity in the parameter $x$ is a common condition. In response, we offer a conservative answer: at least for the regression problem in Appendix D, where $h(x, \xi) = L(\xi_l - x^{\top} \xi_f)$, if both the random variable space $\Xi$ and the parameter space $\mathcal{X}$ are bounded, then as long as $L(\cdot)$ is a Lipschitz function, $h(x, \xi)$ is Lipschitz with respect to both $x$ and $\xi = (\xi_f, \xi_l)$. In such scenarios, the Lipschitz assumptions in $x$ and in $\xi$ hold simultaneously.
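For reference, the losses in Table 2 can be written compactly as below. This is an illustrative sketch of the table only; the value of $\delta$ is a user-chosen parameter assumed here to be 0.5.

```python
import numpy as np

delta = 0.5  # parameter for the Huber, delta-insensitive, and pinball losses (assumed value)

losses = {
    "hinge":             lambda z: np.maximum(0.0, 1.0 - z),
    "smooth_hinge":      lambda z: np.where(z <= 0, 0.5 - z,
                                   np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0)),
    "logloss":           lambda z: np.log1p(np.exp(-z)),
    "squared":           lambda z: z ** 2,                         # not globally Lipschitz
    "l1":                lambda z: np.abs(z),
    "huber":             lambda z: np.where(np.abs(z) <= delta, 0.5 * z ** 2,
                                            delta * (np.abs(z) - 0.5 * delta)),
    "delta_insensitive": lambda z: np.maximum(0.0, np.abs(z) - delta),
    "pinball":           lambda z: np.maximum(-delta * z, (1.0 - delta) * z),
}

z = np.linspace(-3, 3, 7)
for name, fn in losses.items():
    print(name, np.round(fn(z), 3))
```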