# learning_configurations_for_datadriven_multiobjective_optimization__2f6c0d34.pdf

Learning Configurations for Data-Driven Multi-Objective Optimization

Zhiyang Chen 1 Hailong Yao 2 3 Xia Yin 1

Multi-objective optimization problems arise widely in various fields. In practice, multiobjective optimization is generally solved by heuristics with tunable parameters that are highly application-specific. Tuning parameters based on real-world instances (a.k.a. algorithm configuration) are generally empirical without theoretical guarantees. In this work, we establish the theoretical foundation of data-driven multi-objective optimization through the lens of machine learning theory. We provide generalization guarantees on selecting parameters for multi-objective optimization algorithms based on sampled problem instances. Moreover, if the performance metric of the algorithm is the Pareto volume, we can PAC-learn the approximately optimal configuration in polynomial time. We apply our framework to various algorithms, including approximation algorithms, local search, and linear programming. Experiments on multiple problems verify our theoretical findings.

1. Introduction

Multi-objective optimization problems arise widely in various fields, including operations research (Herzel et al., 2021), machine learning (Sener & Koltun, 2018), and design automation (Chen & Young, 2020). A multi-objective optimization problem may involve multiple contradictory objectives that require simultaneous optimization. Therefore, a best-of-all-world solution that is optimal for every objective may not exist. Algorithm designers instead seek a trade-off between multiple objectives.

A notion of optimality for multi-objective optimization is the so-called Pareto optimality. However, computing the

1Tsinghua University, China 2University of Science and Technology Beijing, China 3Key Laboratory of Advanced Materials and Devices for Post-Moore Chips, Ministry of Education of China. Correspondence to: Hailong Yao <hailongyao@ustb.edu.cn>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

exact set of Pareto-optimal solutions is generally intractable for most multi-objective optimization problems, since the optimal Pareto set may contain an exponential number of solutions. In practice, heuristic algorithms are often applied to tackle multi-objective optimization. For example, approximation algorithms (Ravi & Goemans, 1996; Chen & Young, 2020) are designed to simultaneously approximate multiple objectives. Another widely used heuristic is the weighted sum method, e.g., turning a two-objective optimization problem maxx(c1(x), c2(x)) into a single-objective one by optimizing a convex combination of objectives,

max x (1 ρ) c1(x) + ρ c2(x),

where 0 ρ 1.

However, both approximation algorithms and the weighted sum method contain tunable parameters. An approximation algorithm (Ravi & Goemans, 1996; Chen & Young, 2020) may contain parameters to balance the approximation ratio of different objectives. The weighted sum method also requires setting the weight ρ. The user may select a parameter configuration to optimize the algorithmic performance, or even select multiple configurations to form a solution portfolio to provide more decision options. The selection of parameters significantly impacts the algorithmic performance. There is usually no universal best configuration, and the user needs to tune the parameters based on application-specific problem instances carefully. Many algorithm configuration methods (Schede et al., 2022) are proposed to automatically select parameters with good performances. Researchers even apply neural networks (Li et al., 2021) to predict good parameters for multi-objective optimization.

Although multi-objective optimization algorithms are widely used in practice, the theoretical foundation of parameter selection is limited. In this work, we study the problem of algorithm configuration for multi-objective optimization through the lens of machine learning theory. Recently, there has been a growing interest in the sample complexity of learning algorithm parameters for a given application domain, a.k.a., data-driven algorithm design (Balcan, 2020). Concretely, there is a probabilistic distribution over instances of a multi-objective optimization problem. Given a training set sampled from this distribution, the learning algorithm selects one or multiple parameters that optimize the algorithmic performance from a parameterized family

Learning Configurations for Data-Driven Multi-Objective Optimization

using empirical risk minimization. We aim to provide generalization guarantees (a.k.a., sample complexity bounds) that bound the number of required samples to guarantee the performance difference to be small enough between training samples and the real distribution.

1.1. Main results

We summarize the contributions of this work as follows.

The first theoretical framework for data-driven multiobjective optimization. We establish a general framework for data-driven multi-objective optimization algorithms. Previous work (Gupta & Roughgarden, 2017; Balcan et al., 2018a; 2021a; Sakaue & Oki, 2022) on data-driven algorithm design shows that many parameterized algorithms, especially combinatorial algorithms, exhibit a piecewise structural property. The same structure also arises in multiobjective optimization algorithms. We present a general theorem that bounds the sample complexity of parameterized multi-objective optimization algorithms with respect to the complexity of the piecewise structure.

We also discuss how to efficiently learn the parameter from training samples (a.k.a., empirical risk minimization): If the performance metric function is the Pareto (hyper-)volume, we can 1 1

e -approximately PAC-learn the optimal parameter in polynomial time by exploiting the submodular property of the Pareto volume. We further show that the approximation ratio is tight assuming P = NP.

Sample complexity bounds of learning configurations for various algorithms. We apply our general framework to various multi-objective optimization algorithms, including approximation algorithms (shallow-light Steiner trees and bi-criterion minimum spanning trees), local search (vanilla local search and simulated annealing), and linear programming (general LPs and Markov decision processes).

For local search and linear programs, we find that a technical challenge is that the complexities of the piecewise structures may be exponentially large in the worst case, leading to polynomial sample complexity with respect to the problem size. We can overcome this barrier by conducting a beyond-worstcase analysis. The main idea is to use smoothed analysis. If the problem instance distribution contains a small noise, we can reduce the number of pieces in performance metric functions to polynomial size, thus leading to logarithmic sample complexity bounds.

Table 1 gives an overview of our sample complexity results. As an extension, we also discuss the application of machine learning models in predicting parameters in Appendix D.

Empirical verifications of theoretical bounds. We verify our theoretical findings with experimental results. We show that the empirical risk minimization algorithm on shallow-

Table 1. An overview of sample complexity results for parameterized multi-objective optimization, where k is the number of parameters and ε is the generalization error.

Problem Worst-case Smoothed

Shallow-light trees O(k ε 2) N/A Bi-criterion MSTs O(k ε 2) N/A Local search O(k T ε 2) O(k ε 2) Simulated annealing O(k T ε 2) O(k ε 2) General linear programs O(k n ε 2) O(k ε 2) Markov decision processes O(k |S| ε 2) O(k ε 2)

light trees indeed improves the Pareto volume metric of multi-objective optimization. We also compute the number of pieces in the piecewise structure on two algorithms, local search for VLSI circuit partitioning and a resource-gathering Markov decision process. We find that, in practice, the intrinsic complexity of the performance metric function is indeed small, thus leading to small generalization bounds. This shows that the smoothed analysis setting is reasonable.

1.2. Organization

The rest of this paper is organized as follows: Section 2 introduces basic notations and the problem formulation of this work. Section 3 presents the general framework of sample complexity analysis for data-driven multi-objective optimization algorithms. In Sections 4, 5, and 6, we apply our framework to approximation algorithms, local search, and linear programs, respectively. Section 7 provides empirical verifications of our results. Finally, we conclude this paper and discuss directions for future work in Section 8.

A comparison with related work is discussed in Appendix A. Preliminaries are introduced in Appendix B. Technical proofs are given in Appendix C. Appendix D discusses some extensions of our framework. Appendix E supplements the detailed experimental settings.

2. Notations and Problem Formulation

Notations. We use O(f(n)) = f(n) poly log n to suppress logarithmic factors. We also use [n] to denote the set {1, 2, . . . , n} and I( ) to denote the indicator function. Preliminary notations for multi-objective optimization is omitted to Appendix B.1.

Problem formulation. Given a multi-objective optimization problem Π, we use I to denote the set of possible problem instances that the algorithm takes as input. We assume there is an unknown, domain-specific probabilistic distribution D over I.

We have an algorithm A to solve problem Π parameterized by ρ P. Suppose the problem Π has d optimization objectives, we use A : P I Rd + to denote the objective

Learning Configurations for Data-Driven Multi-Objective Optimization

vector of algorithm A s output on a particular instance I I with parameter ρ P.

The aim of algorithm configuration is to select k parameters P = (ρ1, . . . , ρk) Pk for A with good performance in expectation over the distribution D. Suppose the outputs of algorithm A are A(ρ1, I), . . . , A(ρk, I) using parameters ρ1, . . . , ρk on an instance I. We use u(P, I) : Pk I [0, H] to denote a performance metric for A(ρ1, I), . . . , A(ρk, I), where H is a constant1 that upper bounds the performance metric. For example, the performance metric can be the Pareto volume Vol (see Definition B.1). We aim to maximize the expected performance of A, i.e., max P Pk(EI D[u(P, I)]).

We provide generalization guarantees (a.k.a., sample complexity bounds) for learning the parameters P. Given m training data I1, . . . , Im I that are independently sampled from D, a generalization guarantee bounds the difference between the average performance on training data and the expected performance on D. We use the idea of uniform convergence from statistical learning theory. Uniform convergence provides generalization guarantees for any choice of parameters P Pk. Therefore, it is independent of the learning algorithm.

Definition 2.1 (Uniform convergence). The parameterized algorithm A has generalization error ε(m, δ), if for any distribution D over X, with probability at least 1 δ over m i.i.d. samples I1, . . . , Im D, we have

m Pm i=1 u(P, Ii) EI D[u(P, I)] ε, for any P

Pk. The least number of samples m(ε, δ) that ensures such a guarantee is called the sample complexity of u(P, I) on distribution D.

We are also interested in providing guarantees for the complete learning procedure. Given m training data I1, . . . , Im, we use an empirical risk minimization algorithm to solve max P Pk 1

m Pm i=1 u(P, Ii). By uniform convergence bounds, the learned P is probably approximately correct (PAC) for the optimal parameter.

Definition 2.2 (PAC-learning). A learning algorithm (α, β, δ)-PAC-learns the (approximate) optimal parameter, if given m i.i.d. samples from D, with probability at least 1 δ, the learning algorithm finds k parameters P such that

EI D[u(P, I)] α max P Pk EI D[u(P , I)] β,

for some 0 < α 1 and β > 0.

1This assumption is standard in previous work of data-driven algorithm design (Gupta & Roughgarden, 2017), and is indeed reasonable. Although the performance metric may depend on the problem size, we can always normalize the metric and consider the relative performance, given a fixed problem instance distribution.

We omit basic tools of statistical learning theory for proving sample complexity bounds to Appendix B.2.

3. General Framework for Sample Complexity

3.1. Uniform convergence

In our main results, we consider the simple case that the problem has only 2 objectives, and hence the algorithm generally has only one real parameter. We discuss the extension to more than 2 objectives in Section D.

We study the setting where the performance of the algorithm has a piecewise-constant structure with respect to the parameters. This structure widely appears in many optimization algorithms, including greedy heuristics (Gupta & Roughgarden, 2017), branch-and-bound (Balcan et al., 2018a), dynamic programming (Balcan et al., 2021a), A* search (Sakaue & Oki, 2022), etc. Formally, the piecewiseconstant structure is defined as follows.

Definition 3.1 (Piecewise-constant algorithm). For any problem instance I, we can partition the parameter space P R into t intervals, such that in each interval, the output of the algorithm A( , I) is a constant function.

Theorem 3.2. Given m i.i.d. training samples I1, . . . , Im from the instance distribution D, with probability at least 1 δ, for any set of k parameters P = (ρ1, . . . , ρk)

Pk, we have

m Pm i=1 u(P, Ii) EI D[u(P, I)] =

1 m k log t + log 1

Equivalently, this generalization guarantee can be presented using the language of sample complexity: For any ε > 0 and 0 < δ < 1, m = Ω 1

ε2 (k log t + log δ 1) samples are sufficient to ensure that with probability 1 δ, the generalization error is at most ε. This theorem can be proved by classic pseudo-dimension techniques.

However, as we will show in the following sections, the number of pieces t can be exponential with respect to the problem size. This gives a polynomial bound on the sample complexity (or generalization gap). To overcome this barrier, we apply smoothed analysis. In practice, worst-case bounds are generally loose. These bounds are seldom tight since random noises appear in real-world instances. Smoothed analysis assumes the instance distribution D is not arbitrary, but a distribution of smoothed instances.

Concretely, to sample an instance I, we first sample a smoothed instance I from some distribution D. A smoothed instance is an instance with a small random perturbation, i.e., a probabilistic distribution. Then, we obtain instance I by sampling the random perturbation from I. We set a smoothness parameter κ to evaluate the randomness of the

Learning Configurations for Data-Driven Multi-Objective Optimization

perturbation. (As κ goes from 1 to + , our analysis interpolates between average-case and worst-case analysis.) In this setting, the number of pieces t is a random variable with respect to I. We can also obtain generalization guarantees by upper bounds on EI I[t] for any smoothed instance I. Since t is a random variable, we can no longer use pseudodimension. Instead, we consider a smoothed version of Rademacher complexity and apply Massart s lemma.

Theorem 3.3. Given m i.i.d. instances I1, . . . , Im where Ii Ii and Ii D from a smoothed instance distribution D for each i [m], with probability at least 1 δ, for any set of k parameters P = (ρ1, . . . , ρk) Pk,

m Pm i=1 u(P, Ii) E I D EI I[u(P, I)] =

1 m k log EI I[t] + k log m + log 1

3.2. Empirical risk minimization for Pareto volume

We also study the problem of empirical risk minimization to learn the optimal parameters. If the performance metric function u is the Pareto volume Vol (see Appendix B.1), i.e., u(P, I) = Vol({A(ρ1, I), . . . , A(ρk, I)}), we can approximately PAC-learn k parameters P Pk by exploiting the submodular property of the Pareto volume.

Definition 3.4. A set function f : 2S R is submodular, if for any T1 T2 S, and any x S \ T2, we have f(T1 {x}) f(T1) f(T2 {x}) f(T2).

Lemma 3.5. Let S = {u1, . . . , un} be a set of objective vectors. The function Vol : 2S [0, H], which maps a subset T S to the Pareto volume of T, is monotone and submodular.

Therefore, we can use the folklore 1 1

e -approximation algorithm for submodular maximization to perform empirical risk minimization. Let I1, . . . , Im denote the training samples from instance distribution D. Empirical risk minimization finds k parameters P = (ρ1, . . . , ρk) to maximize the empirical Pareto volume, i.e.,

max P Pk 1 m

i=1 Vol({A(ρ1, Ii), . . . , A(ρk, Ii)}). (1)

Theorem 3.6. Given m i.i.d. samples from the instance distribution D, and suppose for any instance I, we have an oracle to compute each piece of A( , I). There exists a learning algorithm that finds k parameters P = (ρ1, . . . , ρk) that 1 1

e, ε, δ -PAC-learns the optimal parameter using m = O( 1

ε2 (k log t + log δ 1)) samples in poly(m, t) time.

Note that a similar result also holds in the smoothed analysis setting by taking an expectation. In practice, the oracle can be implemented by, e.g., performing a binary search on the parameter space. See Appendix E for details.

On the negative side, we show that (1 1

e)-approximation is the best bound we can achieve assuming P = NP by making a reduction from maximum k-set-cover.

Theorem 3.7. The empirical risk minimization problem (1) cannot be (1 1

e + ε)-approximated in poly(m, t) time for any ε > 0, unless P = NP.

4. Applications to Approximation Algorithms

We first apply our framework to approximation algorithms. We consider two approximation heuristics that are of practical values in VLSI design and network optimization.

4.1. Shallow-light Steiner trees

Given a weighted and undirected graph G = (V, E) with a set of terminals S V and a root r S, a Steiner tree of G is a sub-tree that inter-connects S with root r. For any sub-graph G G, let d G (u, v) denote the shortest path distance between u and v on G , and w(G ) denote the total edge weight of G .

In the shallow-light Steiner tree (SLT) problem (Chen & Young, 2020), the objective is to construct a Steiner tree that simultaneously minimizes the total weight and the path length from the root r to each terminal in S. We say a Steiner tree T is α-shallow, if d T (r, s) α d G(r, s) for each s S, and is β-light, if w(T) β w(T ), where T is a Steiner tree with the minimum total weight (a.k.a., the minimum Steiner tree). Our goal is to find (α, β)-SLTs minimizing both α and β. An SLT is an interpolation between the minimum Steiner tree and the shortest path tree.

A direct application of SLTs is the construction of timingdriven routing trees for VLSI design. In VLSI design, we consider rectilinear routing on a 2D plane, i.e., SLTs in metric space (R2, 1). (In fact, our results can be extended to the general case (Rd, p) for any d, p 1.)

Chen & Young (2020) propose a practical heuristic, namely SALT, for constructing rectilinear SLTs. For a parameter ρ 0, it obtains a (1 + ρ, 2 + log 2

ρ )-SLT. However, this bound is generally loose in practice. It is important to tune the value of ρ to obtain a tighter Pareto curve. We study the problem of tuning the parameter ρ of SALT.

The SALT algorithm is described as follows: The algorithm begins by calling a heuristic (Chu & Wong, 2008) to obtain an approximate Steiner minimum tree T0. It then performs a depth-first search on T0. During the search process, the distance dist(v) from the root to each point v on T0 is maintained. Whenever a node v is accessed such that dist(v) > (1 + ρ) r v 1 (which induces shallowness larger than 1 + ρ), we call v a breakpoint and update dist(v) = r v 1. After the search, a shortest path tree is reconstructed for all breakpoints by calling a heuristic by

Learning Configurations for Data-Driven Multi-Objective Optimization

C ordova & Lee (1994). The algorithm finally post-optimizes the tree by performing some safe refinements.

Let A(ρ, I) be the output of SALT on an instance I with the parameter ρ, we have the following lemma. Lemma 4.1. For any SLT instance I, we can partition [0, + ) into at most n2 intervals, such that in each interval, A( , I) is constant.

This lemma can be proved by counting the values of ρ that affect the breakpoints. This gives a sample complexity bound by Theorem 3.2.

Corollary 4.2. Let u( , I) denote any performance metric function for SLT instance I. The sample complexity of u for SALT is O k

4.2. Bi-criterion minimum spanning trees

Given an undirected graph G = (V, E) with each edge e E associated with two positive weights w1(e) and w2(e), we consider the problem of bi-criterion minimum spanning trees, i.e., constructing spanning trees T such that the total weights (W1(T), W2(T)) = (P

e T w1(e), P

e T w2(e)) are both minimized. Ravi & Goemans (1996) consider the budget-constrained version of this problem, and propose a (1, 2)-approximation algorithm. Concretely, given a budget constraint W2, the algorithm considers the following problem,

e E w1(e) xe,

e E w2(e) xe W2,

xe {0, 1}, e E,

{e | xe = 1, e E} forms a spanning tree.

Let W 1 be the optimal weight of this problem. The Ravi Goemans algorithm finds a spanning tree with total weights (W1, W2) such that W1 W 1 and W2 2 W2 in almostlinear time. Note that Ravi-Goemans cannot find a solution T that exactly satisfies the constraint W2(T) W2. To balance W1(T) and W2(T), we need to carefully tune the value of W2. In the following, we consider the problem of learning parameters W2 for Ravi-Goemans.

The algorithm can be described as follows: Prune all edges with w2(e) > W2. Let

L(z) = min spanning tree x

e E (w1(e) + w2(e) z)xe W2 z

denote the Lagrangian function of the problem. The algorithm solves the Lagrangian relaxation maxz 0 L(z). Let z denote the optimal solution. The algorithm finds the spanning tree T with the least P

e T w2(e) such that P

e T w2(e) W2. See Figure 1 for an illustration.

slope = 𝑊2 𝑊2

Figure 1. An illustration of Goemans-Ravi algorithm.

Let A(ρ, I) denote the output of Goemans-Ravi on an instance I with W2 equal to ρ, we have the following lemma.

Lemma 4.3. For any instance I, A( , I) is a piecewiseconstant function with at most O(|V |2) pieces.

Corollary 4.4. Let u( , I) denote any performance metric function for bicriterion MST instance I. The sample complexity of u for Goemans-Ravi is O k

5. Applications to Local Search

5.1. Vanilla local search

In this section, we consider multi-objective optimization using local search algorithms. Local search, as well as its variant, simulated annealing, is widely used in various combinatorial optimization problems. If the problem has multiple objectives, the most natural and popular method is to optimize a weighted sum of the objectives. Such examples include Sechen & Sangiovanni-Vincentelli (1986), Chen et al. (2008), Yu et al. (2022), etc.

We define the following general model for local search: Let S denote the states of the search algorithm. (Note that |S| can be very large, e.g., exponential in the problem size.) Each state S S has a set of neighbors N(S) S \ {S}. We use n to denote the maximum number of neighbors for all states S. Each state S is also associated with two costs c1(S), c2(S). Without loss of generality, we assume c1(S), c2(S) [0, 1]. We use a weighted sum method to find a state ˆS such that c( ˆS) = (1 ρ) c1( ˆS) + ρ c2( ˆS) is minimized.

We consider the standard local search algorithm by Johnson et al. (1988). Initially, the local search algorithm starts from an arbitrary but fixed state S0 S. For each iteration t [T], the algorithm enumerates the neighbors of the current state S, and finds S N(S) with the minimal c(S ). If c(S ) c(S), let S S . Otherwise, the algorithm stops. Notice that a local search algorithm may converge after an exponential number of iterations. Thus, we define T, an upper bound of the iteration number. If the algorithm does not converge, we directly output the state S after iteration T, which is a technique generally applied in practice.

Learning Configurations for Data-Driven Multi-Objective Optimization

Let D be a distribution of problem instances, A(ρ, I) denote the output of the local search algorithm using parameter ρ on instance I, and u(P, I) denote the performance metric. We have the following lemma.

Lemma 5.1. For any instance I, we can partition [0, 1] into at most (n + 1)T intervals, such that A( , I) is constant in each interval. Moreover, there exists an instance such that the number of pieces is at least nΩ(T ).

By Theorem 3.2, this gives a sample complexity bound of O k T

ε2 . We cannot further improve this bound using Theorem 3.2 since Lemma 5.1 also gives an asymptotically tight lower bound.

However, we can overcome this barrier using smoothed analysis. The philosophy of smoothed analysis is that the instances in practice always contain some randomness so that the worst case almost never exists.

We say a real random variable is κ-bounded, if its probability density function is upper bounded by κ. For example, a uniform distribution over an interval of length 1/κ is κbounded. We call an instance κ-smoothed, if the costs c1(S) and c2(S) are sampled individually and independently from κ-bounded distributions on [0, 1] for each S S. Therefore, a κ-smoothed instance is an instance distribution such that the costs are perturbed with a small random noise.

Using smoothed analysis, we can reduce the number of pieces to polynomial size, leading to an O k

ε2 sample complexity. The key proof technique is that, due to the random noise, the discontinuities produced in each iteration are (poly(T, n) κ)-bounded. Thus, if the length of a piece is small, in the next iteration, it will not produce new pieces in expectation as much as in the worst case. By an induction on the iteration number, we show a polynomial upper bound on the expected number of pieces.

Theorem 5.2. Let D be a distribution over κ-smoothed local search instances. The sample complexity of u on D is O k

5.2. Simulated annealing

Simulated annealing (SA) is a variant of the local search method, which has gained great success in many practical problems. We extend our sample complexity results for multi-objective local search to simulated annealing.

We consider the following simulated annealing algorithm:

1. Initialize the state S S0 for some arbitrary but fixed S0 and t 1. Mark S0 as visited. 2. Randomly sample a state S N(S), and compute = c(S) c(S ). 3. If S is not visited, mark S as visited and with probability min{exp( /τt), 1}, set S S (where τt > 0

for any t [T] is the temperature of the algorithm). 4. Let t t + 1. If t < T, goto Step 2.

Note that we assume the algorithm does not access any state twice. We make this assumption to provide enough probabilistic independence to make the smoothed analysis possible. This assumption is not very strong since the problem state is usually high-dimensional in practice, and the algorithm seldom repetitively accesses a state.

Since simulated annealing is a randomized algorithm, we also make the following assumption for simplicity of analysis. We assume the randomness of the algorithm is predetermined as part of the input. Concretely, in Step 2 of the algorithm, for each pair of state S and iteration t, we assume the sample S = sample(S, t) N(S) is fixed as we vary the value of ρ. In Step 3, the algorithm accepts the new state by checking whether prob(t) min{exp( /τt), 1} where prob(t) is uniformly sampled over [0, 1] for t = 1, . . . , T before executing the algorithm. Now, an instance I is in the form of (inst, sample( , ), prob( )), where inst denotes the local search problem instance, and sample, prob denote the randomness of the algorithm. Then, the distribution is a joint distribution over problem instances and algorithmic randomness. This formulation is reasonable, since in practice, for an instance I in the training set, we cannot obtain the expected performance of the algorithm on I, but rather observe the performance of the algorithm on I for some random seed. Thus, we can regard the random seed coming from the instance distribution.

Let D be a joint distribution over instances and algorithmic randomness, A(ρ, I) denote the output of the algorithm using parameter ρ on I, and u(P, I) denote the performance metric. Using a similar analysis to Lemma 5.1, the worstcase sample complexity is also O k T

ε2 , and we can reduce the bound by smoothed analysis. Theorem 5.3. The sample complexity of u for κ-smoothed simulated annealing is O k

6. Applications to Linear Programming

6.1. General LP

Another application of our framework is multi-objective linear programs. Consider the following problem,

max x c T x, s.t. Ax b,

where x Rn, A Rr n, b Rr, and c = (1 ρ) c1 + ρ c2 Rn is the convex combination of two objectives c1 and c2. Let A(ρ, I) denote the output on instance I with parameter ρ. Let u(P, I) denote a metric function on multiobjective LP instance I with ρ P = {ρ1, . . . , ρk}. We study the sample complexity of tuning P to optimize the metric EI[u( , I)].

Learning Configurations for Data-Driven Multi-Objective Optimization

Shadow path

Figure 2. An illustration of shadow path (blue) and its projection to the (c T 1x, c T 2x) plane.

The sample complexity of u( , I) for multi-objective LPs is O kn

ε2 . This is obvious since the number of pieces for A( , I) is upper bounded by the number of vertices of the polyhedron Q = {x | Ax b}, which is O(rn).

We can also use smoothed analysis to improve the worstcase sample complexity bound. Let Q[c] denote the face of Q maximizing c Tx. We aim to bound the smoothed number of faces Q[c] as ρ varies from zero to one. This is the same as the shadow path of Q, which is proposed in previous work studying the smoothed complexity of the Simplex method for LPs. We can hence apply off-the-shelf results of shadow paths to obtain sample complexity bounds.

Definition 6.1 (Shadow path). Let Q Rn be a polyhedron. Given two vectors c1, c2 Rn, there exists a unique sequence of faces Q(c1, c2) = (v0, e1, v1, . . . , el, vl) of Q, and a sequence 0 = ρ0 < ρ1 < < ρl < ρl+1 = 1 such that (1) for 1 i l, ei = Q[(1 ρi)c1 + ρic2]; (2) for 0 i l and ρ (ρi, ρi+1), vi = Q[(1 ρ)c1 + ρc2]; (3) for 1 i l 1, vi = ei ei+1. Q(c1, c2) is called the shadow path of Q with respect to c1, c2 (see Figure 2).

Under random perturbations, we have dim ei = 1 for 1 i l almost surely. Therefore, without loss of generality, the shadow path is indeed a path where {vi} are vertices and {ei} are edges.

To illustrate the optimization objectives, we can project the shadow path to the plane by mapping each point x Q(c1, c2) to (c T 1 x, c T 2 x). This results in a concave curve and we use slope(ei) to denote the slope of ei after projection.

Spielman & Teng (2004) study the shadow path size of the smoothed LP model, max c Tx, (A + ˆA)x b + ˆb, where the rows of (A, b) are normalized with l2 norms at most 1, and each entry of ˆA,ˆb is an independent Gaussian noise2

with zero mean and 1/κ2 variance.

Theorem 6.2 (Spielman & Teng (2004)). The expected size of any shadow path for Q = {x | (A + ˆA)x b + ˆb} is upper bounded by poly(n, r, κ).

2The Gaussian distribution N(0, κ 2) is κ 2π -bounded. Sadly, current results of smoothed analysis for linear programs only hold for particular distributions, such as Gaussian or Laplacian noises.

Corollary 6.3. The sample complexity for smoothed multiobjective linear programs is O k

6.2. Special LP: Markov decision processes

The smoothed shadow path bound for general linear programs requires Gaussian noises on all coefficient entries. We can relax the smoothed analysis model if the linear program has a special structure.

In the following, we study multi-objective Markov decision processes (MDPs), which can be formulated as a special kind of linear programming. MDP is the fundamental model for reinforcement learning and stochastic optimization. An MDP consists of the following components: A set of states S, a set of actions A, a transition probability map P[St+1 | St, At], a reward function R : S A [ 1, 1], and a discount factor γ (0, 1). We aim to find a policy π : S A to maximize the expected accumulated reward using π. For the case of multi-objective MDPs, a natural approach is to optimize the weighted sum of multiple reward functions, e.g., Goldie & Mirhoseini (2020) and Yang et al. (2023).

It is well-known that MDPs can be solved by linear programming (Bertsekas & Tsitsiklis, 1996):

s S d(s) V (s),

s.t. V (s) R(s, a) + γ X

s S P[s | s, a]V (s ),

Here, V is the value of Bellman s equation, and d can be interpreted as a distribution over S such that P

s S d(s) = 1. Then, the objective is equal to the total expected accumulated reward if the initial state is sampled from this distribution. Since the optimal policy is independent of the initial state, we can set d as an arbitrary positive vector. For simplicity of analysis, we set d(s) = 1/|S|, i.e., a uniform distribution over S.

To define a multi-objective MDP, we assume the reward function R(s, a) = (1 ρ) R1(s, a) + ρ R2(s, a) for 0 ρ 1. We aim to learn a set of k values of ρ to maximize the performance metric u. In the worst case, an upper bound for the sample complexity of u( , I) is O(k|S|ε 2), since there are at most |A||S| choices of the policy. Thus, we can partition [0, 1] into at most |A||S| intervals such that the optimal policy is constant for ρ in each interval.

Now, we consider the smoothed analysis setting. We assume that the rewards R1(s, a) and R2(s, a) for each pair (s, a) are sampled independently from κ-bounded distributions over [ 1, 1]. Such instances are called κ-smoothed. Theorem 6.4. The sample complexity for κ-smoothed multiobjective MDPs is O k

Learning Configurations for Data-Driven Multi-Objective Optimization

To apply the shadow path method, we instead consider the dual problem of the MDP:

a A R(s, a) µ(s, a),

a A µ(s, a) = d(s) + γ X

a A P[s | s , a]µ(s , a),

µ(s, a) 0, s S, a A.

Lemma 6.5. The expected size of the shadow path of the κ-smoothed dual MDP problem is upper bounded by O |S|2|A|κ

Theorem 6.4 is a direct corollary of Lemma 6.5. The main proof technique is to observe that for each vi on the shadow path, there exists some (s, a) such that µ(s, a) = 0 for vi and µ(s, a) > 0 for vi+1. We show that the contribution of such (s, a) to slope(ei+1) is O(|S|κ/(1 γ))-bounded, thus leading to a polynomial bound in expectation. Remark 6.6. Although we prove a polynomial bound for the shadow path size of MDPs in the smoothed setting, we are unable to prove a super-polynomial lower bound in the worst case. Note that the classic policy iteration method is known to solve MDPs in polynomial time (Ye, 2011). It is possible that the shadow path size is also polynomial in the worst case. We left such analysis as future work.

7. Experiments

In this section, we perform empirical verifications of our theoretical results. Due to the page limit, we omit the detailed experimental settings to Appendix E.

7.1. Generalization bounds

First, we verify the effectiveness of smoothed analysis in our sample complexity analysis. We evaluate the generalization bound on local search and Markov decision processes. For local search, we consider the VLSI circuit partitioning problem (Hu et al., 2022). For MDP, we consider a resourcegathering problem in previous work on multi-objective reinforcement learning (Barrett & Narayanan, 2008).

We tune the parameter ρ of the weighted sum method, and compute the maximum number of pieces in the piecewise structure of the solution found by the algorithm. We report the maximum number of pieces over problem instances of a particular size. We also report the generalization error. Since the performance metric may depend on the specific application, we simply compute the maximum gap of two objectives averaged on training and evaluation instances. Note that the generalization error depends on the range of the objective. We normalize the objective by the maximum value among data sets to make the error between [0, 1].

20 40 60 80 100 Problem size

Number of pieces

Number of pieces Gen. Error

Generalization error

Local search

10 20 30 40 Problem size

Number of pieces

Number of pieces Gen. Error

Generalization error

Figure 3. The number of discontinuities in the piecewise structure (red) and the generalization error (blue) for local search (left) and MDPs (right).

Table 2. The Pareto volume of learned parameters for SALT. Our empirical risk minimization algorithm achieves larger Pareto volumes, compared with the vanilla geometric series setting in SALT.

k ERM (ours) Vanilla SALT k ERM (ours) Vanilla SALT

1 0.1157 0.0159 7 0.1618 0.1233 2 0.1338 0.0206 8 0.1625 0.1399 3 0.1487 0.0374 9 0.1632 0.1511 4 0.1529 0.0557 10 0.1636 0.1573 5 0.1559 0.0782 11 0.1638 0.1610 6 0.1607 0.1030 12 0.1639 0.1626

The results are illustrated in Figure 3. The problem size denotes the number of nodes (for circuit partitioning) and the number of rows and columns of the grid environment (for MDP). Although in the worst case, the number of pieces may be exponential in the problem size, the empirical results show that the number of pieces grows mildly. This indicates that the worst-case behavior does not appear in practice and our smoothed analysis matches experiments. As a consequence, the generalization error is also small.

7.2. Optimizing Pareto volume

Then, we experimentally show that the empirical risk minimization algorithm in Section 3.2 can optimize the Pareto volume of learned parameters. We conduct experiments on shallow-light Steiner tree benchmarks from real-world VLSI designs.

The original implementation of SALT uses a simple heuristic that selects the values of ρ from a geometric series. We compare our method with this baseline. Let k denote the number of learned parameters. The result is listed in Table 2. The empirical risk minimization algorithm obtains larger Pareto volume for the shallowness-lightness curve, especially when k is small. This indicates the strength of an empirical risk minimization algorithm in tuning a multiobjective optimization algorithm.

8. Conclusion

This work studies the sample complexity of data-driven algorithm configuration for multi-objective optimization algorithms. We present sample complexity bounds for several applications, in which the algorithmic behavior exhibits a piecewise-constant structure.

Learning Configurations for Data-Driven Multi-Objective Optimization

Our work suggests several directions for future research. First, as discussed in Section D.2, can we extend the smoothed analysis to the case of more than 2 objectives? We conjecture that a smoothed sample complexity of O(kdε 2) is possible for d objectives. Second, can we extend the sample complexity bound for MDPs to environments beyond tabular cases, e.g., MDPs with continuous features? Finally, can we prove sample complexity bounds for branch-andbound solvers of integer linear programs? The analysis will be much harder since the algorithm solves a relaxed linear program on each search tree node.

Acknowledgements

This work was supported by the Key Program of National Natural Science Foundation of China (No. 62034005).

Impact Statement

This paper presents theoretical studies on data-driven algorithm configuration of multiple-objective optimization. There are no potential societal consequences of our work that must be specifically highlighted here.

Balcan, M.-F. Data-driven algorithm design. ar Xiv preprint ar Xiv 2011.07177, 2020.

Balcan, M. F. and Sharma, D. Learning accurate and interpretable decision trees. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024.

Balcan, M.-F., Dick, T., Sandholm, T., and Vitercik, E. Learning to branch. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 344 353. PMLR, 2018a.

Balcan, M.-F., Dick, T., and Vitercik, E. Dispersion for data-driven algorithm design, online learning, and private optimization. In IEEE 59th Annual Symposium on Foundations of Computer Science, pp. 603 614, 2018b.

Balcan, M.-F., Dick, T., and Pegden, W. Semi-bandit optimization in the dispersed setting. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, volume 124, pp. 909 918, 2020a.

Balcan, M.-F., Sandholm, T., and Vitercik, E. Refined bounds for algorithm configuration: The knife-edge of dual class approximability. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 580 590, 2020b.

Balcan, M.-F., De Blasio, D., Dick, T., Kingsford, C., Sandholm, T., and Vitercik, E. How much data is sufficient to

learn high-performing algorithms? Generalization guarantees for data-driven algorithm design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 919 932, 2021a.

Balcan, M.-F., Sandholm, T., and Vitercik, E. Generalization in portfolio-based algorithm selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12225 12232, 2021b.

Balcan, M.-F., Prasad, S., Sandholm, T., and Vitercik, E. Structural analysis of branch-and-cut and the learnability of Gomory mixed integer cuts. In Advances in Neural Information Processing Systems, volume 35, pp. 33890 33903, 2022.

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning, pp. 41 47, 2008.

Bartlett, P., Indyk, P., and Wagner, T. Generalization bounds for data-driven numerical linear algebra. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178, pp. 2013 2040, 2022.

Bertsekas, D. P. and Tsitsiklis, J. Neuro-Dynamic Programming. Athena Scientific, 1996.

Blot, A., Hoos, H. H., Jourdan, L., Kessaci-Marmion, M.- E., and Trautmann, H. MO-Param ILS: A multi-objective automatic algorithm configuration framework. In Learning and Intelligent Optimization, pp. 32 47, 2016.

Blot, A., Marmion, M.-E., Jourdan, L., and Hoos, H. H. Automatic configuration of multi-objective local search algorithms for permutation problems. Evolutionary Computation, 27(1):147 171, 2019.

Chen, G. and Young, E. F. Y. SALT: Provably good routing topology by a novel Steiner shallow-light tree algorithm. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(6):1217 1230, 2020.

Chen, T.-C., Yuh, P.-H., Chang, Y.-W., Huang, F.-J., and Liu, T.-Y. MP-Trees: A packing-based macro placement algorithm for modern mixed-size designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(9):1621 1634, 2008.

Cheng, H., Khalife, S., Fiedorowicz, B., and Basu, A. Sample complexity of algorithm selection using neural networks and its applications to branch-and-cut. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Learning Configurations for Data-Driven Multi-Objective Optimization

Chu, C. and Wong, Y.-C. FLUTE: Fast lookup table based rectilinear Steiner minimal tree algorithm for VLSI design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(1):70 83, 2008.

C ordova, J. and Lee, Y.-H. A heuristic algorithm for the rectilinear Steiner arborescence problem. 1994.

Feige, U. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634 652, 1998.

Goldie, A. and Mirhoseini, A. Placement optimization with deep reinforcement learning. In Proceedings of the 2020 International Symposium on Physical Design, pp. 3 7, 2020.

Gupta, R. and Roughgarden, T. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992 1017, 2017.

Herzel, A., Ruzika, S., and Thielen, C. Approximation methods for multiobjective optimization problems: A survey. INFORMS Journal on Computing, 33(4):1284 1299, 2021.

Hu, K.-S., Lin, I.-J., Huang, Y.-H., Chi, H.-Y., Wu, Y.-H., and Shen, C.-F. C. ICCAD 2022 CAD contest problem B: 3D placement with D2D vertical connections. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2022.

Johnson, D. S., Papadimitriou, C. H., and Yannakakis, M. How easy is local search? Journal of Computer and System Sciences, 37(1):79 100, 1988.

Khodak, M., Chow, E., Balcan, M. F., and Talwalkar, A. Learning to relax: Setting solver parameters across a sequence of linear system instances. In The Twelfth International Conference on Learning Representations, 2024.

Kim, M.-C. and Hu, J. Incremental timing-driven placement and benchmark suite. Available at https://www.iccadcontest.org/2015/problem c/default.html. 2015.

Li, W., Qu, Y., Chen, G., Ma, Y., and Yu, B. Tree Net: Deep point cloud embedding for routing tree construction. In Asia and South Pacific Design Automation Conference, pp. 164 169, 2021.

Lin, Y., Dhar, S., Li, W., Ren, H., Khailany, B., and Pan, D. Z. DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement. In Proceedings of the 56th Annual Design Automation Conference, pp. 1 6, 2019.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. The MIT Press, 2nd edition, 2018.

Rakhlin, A., Sridharan, K., and Tewari, A. Online learning: Stochastic, constrained, and smoothed adversaries. In Advances in Neural Information Processing Systems, volume 24, 2011.

Ravi, R. and Goemans, M. X. The constrained minimum spanning tree problem (Extended abstract). In Proceedings of the 5th Scandinavian Workshop on Algorithm Theory, pp. 66 75, 1996.

R oglin, H. and Teng, S.-H. Smoothed analysis of multiobjective optimization. In 50th Annual IEEE Symposium on Foundations of Computer Science, pp. 681 690, 2009.

Sakaue, S. and Oki, T. Sample complexity of learning heuristic functions for greedy-best-first and A* search. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

Schede, E., Brandt, J., Tornede, A., Wever, M., Bengs, V., Hullermeier, E., and Tierney, K. A survey of methods for automated algorithm configuration. Journal of Artificial Intelligence Research, 75:425 487, 2022.

Sechen, C. and Sangiovanni-Vincentelli, A. Timber Wolf3.2: A new standard cell placement and global routing package. In Proceedings of the 23rd Annual Design Automation Conference, pp. 432 439, 1986.

Sener, O. and Koltun, V. Multi-task learning as multiobjective optimization. In Advances in Neural Information Processing Systems, pp. 525 536, 2018.

Spielman, D. A. and Teng, S.-H. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385 463, 2004.

Wang, J., Zheng, Y., Zhang, Z., Peng, H., and Wang, H. A novel multi-state reinforcement learning-based multiobjective evolutionary algorithm. Information Sciences, 688:121397, 2025.

Wolsey, L. A. and Nemhauser, G. L. Integer and Combinatorial Optimization. Wiley-Interscience, 1999.

Xue, K., Xu, J., Yuan, L., Li, M., Qian, C., Zhang, Z., and Yu, Y. Multi-agent dynamic algorithm configuration. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

Yang, L., Sun, G., and Ding, H. Towards timing-driven routing: An efficient learning based geometric approach. In International Conference on Computer Aided Design, pp. 1 9, 2023.

Ye, Y. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed

Learning Configurations for Data-Driven Multi-Objective Optimization

discount rate. Mathematics of Operations Research, 36: 593 603, 2011.

Yu, H.-C., Lin, Y.-H., Chen, Z., Li, B., Huang, X., Schlichtmann, U., Ho, T.-Y., and Yao, H. Contamination-aware synthesis for programmable microfluidic devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):5016 5029, 2022.

Yu, X., Xu, P., Wang, F., and Wang, X. Reinforcement learning-based differential evolution algorithm for constrained multi-objective optimization problems. Engineering Applications of Artificial Intelligence, 131:107817, 2024.

Zhang, T., Georgiopoulos, M., and Anagnostopoulos, G. C. SPRINT multi-objective model racing. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1383 1390, 2015.

Zhang, X., Lin, X., Xue, B., Chen, Y., and Zhang, Q. Hypervolume maximization: A geometric view of Pareto set learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Zhao, Y., Wang, L., Yang, K., Zhang, T., Guo, T., and Tian, Y. Multi-objective optimization by learning space partition. In International Conference on Learning Representations, 2022.

Learning Configurations for Data-Driven Multi-Objective Optimization

A. Related Work

In the following, we briefly survey related work to ours, and discuss comparisons with them.

Data-driven algorithm design. Gupta & Roughgarden (2017) initially establish the framework of data-driven algorithm design, where both statistical learning and online learning settings are considered. For statistical learning, this framework has been applied to learning branch policies for branch-and-bound (Balcan et al., 2018a), learning cuts for branch-andcut (Balcan et al., 2022; Cheng et al., 2024), parameter tuning for string alignment and mechanism design (Balcan et al., 2021a), learning embeddings for A* search (Sakaue & Oki, 2022), data-driven numerical linear algebra (Bartlett et al., 2022), tuning decision tree models (Balcan & Sharma, 2024), to name a few. For online learning, this framework has been applied to tuning greedy heuristics (Gupta & Roughgarden, 2017; Balcan et al., 2018b), tuning clustering algorithms (Balcan et al., 2020a), tuning linear system solvers (Khodak et al., 2024), etc. Our work focuses on statistical learning of multi-objective optimization algorithms.

Our main technical tool is smoothed analysis, which reduces the intrinsic complexity of the parameterized multi-objective optimization algorithm family. Note that smoothed analysis is also applied in online learning of data-driven algorithm design (Gupta & Roughgarden, 2017; Balcan et al., 2018b) (i.e., the dispersion property in Balcan et al. (2018b)). In fact, their regret bounds can be turned into statistical ones by applying online-to-batch results. However, we emphasize that our method is different from dispersion. Dispersion is proposed to overcome the impossibility of no-regret online learning for worst-case instances. For a family of parameterized algorithms with K discontinuities in the piecewise structure, the dispersion property says that these K discontinuities do not concentrate tightly in the smoothed analysis setting, and thus no-regret online learning is possible by making a discretization-based method. However, the dispersion property does not present tools for bounding the number of discontinuities K. Our work shows that the number of discontinuities in the piecewise structure can be greatly reduced (i.e., from exponential to polynomial) using smoothed analysis. Therefore, our technique is orthogonal to dispersion.

Balcan et al. (2020b) also propose a method to reduce the intrinsic complexity of the piecewise structures. They use a surrogate function with a simpler structure to approximate the original performance metric function. This approach yields a data-dependent numerical bound for the generalization error. In contrast, our work considers asymptotic bounds without accessing the concrete training data.

Finally, we remark that smoothed analysis is also applied to interpolate between distributional and adversarial settings of classic learning problems, such as online prediction (e.g., Rakhlin et al. (2011)). Our technique is different from classic learning problems since our analysis focuses on the combinatorial structure of multi-objective optimization algorithm configuration.

Algorithm configuration for multi-objective optimization. Algorithm configuration is concerned with the problem of searching parameter configurations of a parameterized algorithm. Several work (Zhang et al., 2015; Blot et al., 2016; 2019; Xue et al., 2022; Yu et al., 2024; Wang et al., 2025) have been devoted to algorithm configuration with multiple objectives. We refer readers to a comprehensive survey of algorithm configuration (Schede et al., 2022). These work are generally empirical, and the theoretical foundation of multi-objective algorithm configuration is lacking.

Machine learning for multi-objective optimization. Previous work also explores the integration of machine learning techniques into multi-objective optimization. Zhao et al. (2022) combine learning models with Monte-Carlo tree search to improve multi-objective black-box optimization algorithms. Zhang et al. (2023) consider modeling the Pareto set using neural networks. These work are different from ours since we focus on tuning the parameter of multi-objective optimization algorithms.

Smoothed analysis of algorithms. Smoothed analysis is proposed by Spielman & Teng (2004) to explain the efficiency of the Simplex method for linear programs in practice. They show that, with a small Gaussian noise on the linear programming instance, the expected time complexity of Simplex is polynomial, breaking the exponential lower bound in worst-case analysis. Smoothed analysis has then been applied to go beyond worst-case analysis of many algorithms.

A related work to ours is due to R oglin & Teng (2009), which shows that a multi-objective binary integer program has only a polynomial number of Pareto-optimal solutions under smoothed analysis. However, their work is different from ours: (1) They consider the optimal Pareto solution set of the problem, while we focus on the behavior of algorithms (which may not produce optimal solutions); (2) Their analysis focuses on binary integer programs, while our framework can be applied to other problems.

Learning Configurations for Data-Driven Multi-Objective Optimization

B. Preliminaries

In this section, we introduce preliminaries on multi-objective optimization and statistical learning theory, which serve as fundamental tools for our analysis. We refer readers to, e.g., Mohri et al. (2018) for more details of learning theory.

B.1. Preliminaries on multi-objective optimization

For a maximization problem3 with d objectives, we use an objective vector o Rd + to denote the d objectives of a solution. An objective vector o is said to be dominated by another o if oi o i for any 1 i d. We use o o to denote domination. Definition B.1 (Pareto volume). For a set of solutions S = o(1), . . . , o(n) , the Pareto volume Vol(S) is the (hyper-)volume of the set that contains points dominated by at least one point in S, i.e.,

o Rd + I( o S, o o )do.

B.2. Preliminaries on learning theory

Definition B.2 (VCand pseudo-dimension). Let H denote a set of functions from some domain X to R. A finite set S = {x1, . . . , xm} X is shattered by H, if there exist witnesses r1, . . . , rm such that, for each of the 2m subsets T S, there exists h H such that h(xi) > ri if and only if xi T. The pseudo-dimension Pdim(H) is the cardinality of the largest subset of X shattered by H. Specially, if h H are binary functions (h : X { 1, 1}), the pseudo-dimension is also called the VC-dimension VCdim(H).

By definition, we have the following fact. Fact B.3. Let H be a function class from some domain X to R, and H = {h | h : (x, r) 7 sgn(h(x) r), h H, x X, r R}. Then, we have VCdim(H) = Pdim(H ). Lemma B.4 (Sauer). Let H be a function class from some domain X to {1, 1} with VCdim(H) = d. For any S = {x1, . . . , xm} X, we have

|{(h(x1), . . . , h(xm)) | h H}|

Definition B.5 (Rademacher complexity). Let H denote a set of functions from some domain X to [0, H]. The empirical Rademacher complexity of H for a finite set S = {x1, . . . , xm} X is

ˆRS(H) = Eσ { 1,1}m

i=1 σih(xi)

where each σi is independently and uniformly sampled from { 1, 1}. The Rademacher complexity of H is RD(H) = ES Dm[ ˆRS(H)] where each xi S is independently sampled from some distribution D. Lemma B.6 (Massart). For a function class H from X to [0, H], and S = {x1, . . . , xm} X, we have

where A = {(h(x1), . . . , h(xm)) | h H} is the set of all possible values of h H on S. Theorem B.7 (Uniform convergence). Given m i.i.d. samples S = {x1, . . . , xm} from a distribution D, for any δ (0, 1), with probability at least 1 δ, for any h H, we have 1 m

i=1 h(xi) Ex D[h(x)]

complexity(H) + H

where complexity(H) is the Rademacher complexity RD(H), the empirical Rademacher complexity ˆRS(H), or a pseudo-

dimension-based term H q

3For minimization problems, we can transform it into a maximization problem by selecting a baseline with objectives (o 1, . . . , o d), and consider the Pareto volume of (o 1 o1, . . . , o d od) for objectives (o1, . . . , od).

Learning Configurations for Data-Driven Multi-Objective Optimization

C. Omitted Proofs

C.1. Proofs in Section 3.1

Proof of Theorem 3.2. Let U = {u(P, ) | P Pk}. By Theorem B.7, it suffices to bound the pseudo-dimension Pdim(U) = O(k log t).

Suppose Pdim(U) = m. Let S = {I1, . . . , Im} be m instances that can be shattered by U and r1, . . . , rm R be the witnesses. For every S S, there exists PS Pk such that Ii S if and only if u(P, Ii) > ri. Let M = {PS | S S} and |M| = 2m.

By Definition 3.1, the parameter space P can be partitioned into t intervals such that in each interval, A( , I) is constant. Combining the intervals for m instances, we can partition P into m t intervals such that the function is constant on all m instances. Since P Pk, the vector (u(P, I1), . . . , u(P, Im)) can take at most (mt)k values. Therefore, we have 2m = |M| (mt)k. Solving for m yields m = O(k log t).

Proof of Theorem 3.3. For simplicity, we use EI D[ ] to denote E I DEI I[ ]. Given instances I[m] = {I1, . . . , Im}, let ˆRI[m](U) = Eσ sup P Pk 1

m Pm i=1 σiu(P, Ii) be the empirical Rademacher complexity of U = {u(P, ) | P Pk}. Let R(U) = EI[m] Dm[ ˆRI[m](U)] be the Rademacher complexity. Our goal is to bound the Rademacher complexity of U.

Notice that R(U) = E I[m] Dm E I[m] I[m] [ ˆRI[m](U)] sup I[m] E I[m] I[m] [ ˆRI[m](U)].

For any m smoothed instances I1, . . . , Im, we define the smoothed empirical Rademacher complexity as R I[m](U) =

EI[m] I[m][ ˆRI[m](U)]. Combining the t intervals for each instances, and considering the choice of k parameters, we have Pm i=1 u(P, Ii) takes at most (mt)k values. By Massart s lemma, we have

R I[m](U) O(1) E I[m] I[m]

= O(1) E I[m] I[m]

k(log t + log m)

k(log EI I[t] + log m)

where the last inequality is due to Jensen s inequality. This yields the desired result by Theorem B.7.

C.2. Proofs in Section 3.2

Proof of Lemma 3.5. It is obvious that Vol is monotone. We only prove the submodularity. For any T1 T2 S, and u0 S \ T2, we have

(Vol(T1 {u0}) Vol(T1)) (Vol(T2 {u0}) Vol(T2))

u (I(( u T1, u u ) (u u0)) I(( u T2, u u ) (u u0)))du

u I(( u T1, u u ) ( u T2 \ T1, u u ) (u u0))du

which satisfies Definition 3.4.

Proof of Theorem 3.6. Since A(ρ, I) is piecewise-constant, it suffices to consider selecting ρ from at most t m pieces. Note that the average of Vol is also monotone and submodular. We can use the greedy algorithm to approximately maximize the empirical Pareto volume.

Fact C.1 (Wolsey & Nemhauser (1999)). The greedy algorithm that selects k elements to maximize a monotone and submodular function yields (1 1

e)-approximation to the optimal solution.

Learning Configurations for Data-Driven Multi-Objective Optimization

Let P denote the optimal set of parameters that maximize EI D[u(P, I)], ˆP denote the optimal solution to problem (1), and ˆP denote the solution found by the greedy algorithm in Fact C.1. We have, with probability at least 1 δ,

EI D[u( ˆP, I)] 1

i=1 u( ˆP, Ii) O

1 m (k log t + log δ 1)

i=1 u( ˆP , Ii) O

1 m (k log t + log δ 1)

i=1 u(P , Ii) O

1 m (k log t + log δ 1)

EI D[u(P , I)] 2 1

1 m (k log t + log δ 1)

which is the desired result by plugging in m. The first and the last inequality comes from Theorem 3.2, the second inequality comes from Fact C.1, and the third inequality is due to the optimality of ˆP .

Proof of Theorem 3.7. Proof by reduction from maximum k-set-cover. The max k-cover problem is defined as follows: Given a universe U = {a1, . . . , am}, t subsets S1, . . . , St U, and an integer k. We aim to select k sets from S1, . . . , St, such that the union of the k sets has maximum cardinality.

Theorem C.2 (Theorem 5.3 of Feige (1998)). For any ε > 0, max k-cover cannot be approximated in polynomial time within a ratio of (1 1

e + ε), unless P = NP.

Given a max k-cover instance, we construct an instance of problem (1). Each element in the universe corresponds to a training sample Ii. The algorithm A( , Ii) contains t pieces, and we let the partition of pieces be the same for each sample xi. In piece j of Ii, if ai Sj, we let A( , Ii) = (1, 1) (i.e., with Pareto volume 1). Otherwise, we let A( , Ii) = (0, 0) (i.e., with Pareto volume 0). Therefore, selecting k subsets from S1, . . . , St is equivalent to selecting k pieces to maximize the Pareto volume. Since the reduction preserves the approximation ratio, our result follows from Theorem C.2.

C.3. Proofs in Section 4

Proof of Lemma 4.1. As the value of ρ varies from 0 to + , the output of SALT changes as some point v becomes or is no longer a breakpoint. Such discontinuities happen as ρ = dist(v) r v 1 1. Fix the point v. The value of r v 1 is a constant, and the distance from the root r to v on the tree dist(v) is in the form of r v 1 + d T (v , v) for some v S, where d T (v , v) is the distance between v and v on the initial Steiner minimum tree. Therefore, dist(v) can take at most n 1 values. Combining the breakpoints for each v S, the number of discontinuities is at most n(n 1) + 1 n2.

Proof of Lemma 4.3. It is easy to notice that the behavior of the algorithm is fixed if z is fixed (see Figure 1). Therefore, it suffices to consider the number of pieces of L(z) (the red segments in Figure 1).

The following property is crucial to our analysis.

Lemma C.3 (Lemma 3.2 of Ravi & Goemans (1996)). Suppose T and T are two spanning trees corresponding to two adjacent segments of L(z). Then, T and T differ by only a single edge swap (i.e., there exist e T and e T such that T = (T e) e ).

Let T1, . . . , Tn denote the spanning tree that minimizes L(z) as z varies from 0 to + . We use (ei, e i) E2 to denote the edge swap between Ti and Ti+1 in Lemma C.3 for i [n 1]. We claim that each pair of edges (e, e ) E2 only appears at most once in {(ei, e i) | i [n 1]}.

Proof by contradiction. Suppose there exist i = j such that (ei, e i) = (ej, e j). We consider the intersection point of W1 + W2 z for Ti, Ti+1 and Tj, Tj+1. The line W1(Ti) + W2(Ti) z intersects W1(Ti+1) + W2(Ti+1) z at z = |W1(Ti) W1(Ti+1)|

|W2(Ti) W2(Ti+1)| = |w1(ei)+w1(e i)| |w2(ei)+w2(e i)|. Similarly, W1(Tj) + W2(Tj) z also intersects W1(Tj+1) + W2(Tj+1) z at

Learning Configurations for Data-Driven Multi-Objective Optimization

𝑐2(𝑆) 𝑐1(𝑆 )

Figure 4. The discontinuity point of local search for two states S and S .

z = |w1(ej)+w1(e j)| |w2(ej)+w2(e j)|. Both points have the same z coordinate. However, it is impossible that Ti, Ti+1, Tj, Tj+1 all lie on the

minimum segments of L(z). This is a contradiction.

C.4. Proof of Lemma 5.1

Proof. (Upper bound.) Proof by induction. We claim that after iteration t, the algorithmic behavior is piecewise-constant with at most (n + 1)t pieces.

Initially, t = 0, and there is only one interval [0, 1]. Suppose our claim holds for iteration t, and consider the next iteration t + 1. Fix an interval [a, b] from the total of (n + 1)t sub-intervals after iteration t. If ρ [a, b], after t iterations, the algorithm finds a fixed state S, and explores the neighborhood N(S). Consider the weighted sum of costs for states in N(S). The cost c(S ) = (1 ρ) c1(S ) + ρ c2(S ) for each S {S} N(S) is a linear function with respect to ρ. In iteration t + 1, the algorithm finds the state S with the least cost c(S ). (If c(S) is the minimum, the algorithm terminates, and no more sub-intervals are created.) Note that the minimum of n + 1 lines is a piecewise-linear function with at most n + 1 pieces. Therefore, the interval [a, b] is divided into at most n + 1 sub-intervals such that in each sub-interval the algorithmic behavior is fixed in iteration t + 1. Combining the sub-intervals for each interval [a, b] yields a total of at most (n + 1) (n + 1)t = (n + 1)t+1 sub-intervals for iteration t + 1, which is the desired result.

(Lower bound.) We construct the instance as follows. The states and the neighborhood relations form a rooted tree. The root is the initial state S0, and the tree is a full (n 1)-ary tree with depth T. Each non-leaf node has n 1 children.

The cost of the root is c1(S0) = c2(S0) = 1, and corresponds to the whole interval [0, 1]. We recursively set the costs of each node. Suppose a node p corresponding to state S and interval [a, b] is not a leaf, and c1(S) = C1, c2(S) = C2. We let the n 1 children p1, . . . , pn 1 of p correspond to states S1, . . . , Sn 1, respectively.

n 1 and γ > 0 be a real number to be determined later. We set c2(Si) = C2 (n i) γ for any i [n 1]. Then,

we set c1(S1) = C1 γ(a+ )

1 a , c1(S2) = c1(S1) γ(a+2 )

1 a 2 , , c1(Sn 1) = c1(Sn 2) γ(a+(n 1) )

1 a (n 1) . We can let γ small enough such that c1(Sn 1) > 0 and c2(S1) > 0. It is easy to verify that the solution of (1 ρ) c1(Si) + ρ c2(Si) = (1 ρ) c1(Si+1) + ρ c2(Si+1) is ρ = a + i . Since C1 > c1(S1) > > c1(Sn 1) and c2(S1) < c2(S2) < < c2(Sn 1) < C2, we have p1, . . . , pn 1 correspond to intervals [a, a + ], [a + , a + 2 ], . . . , [b , b], respectively.

Therefore, the tree has a total of (n 1)T leaves, and each leaf corresponds to an interval of length 1/(n 1)T . This gives the (n 1)T n T/2 = nΩ(T ) lower bound.

C.5. Proof of Theorem 5.2

To prove Theorem 5.2, we first prove a few lemmas. We fix a smoothed instance I and I I. Let E(α) denote the event that A(α, I) = A(ρ, I) for any ρ [0, α). That is to say, if E(α) happens, we only need to consider ρ [α, 1]. The value of α is to be determined later.

Lemma C.4. For any α < 0.5, we have PI[E(α)] 1 2(n + 1)2Tκα.

Proof. Let Et(α) denote the event that the behavior of the local search algorithm is fixed for ρ [0, α] in the t-th iteration. We first bound the probability P[Et(α)].

Fix t [T] and consider the iteration t. Let Si denote the state after iteration i (i [t]). By the principle of deferred

Learning Configurations for Data-Driven Multi-Objective Optimization

decisions, we assume the costs c(S ) for any S Spast are fixed, where Spast = t 1 i=0(N(Si) {Si}) contains the states the algorithm has already accessed. Notice that any state S Spast satisfies c(S ) c(St 1). Let Snext = {S | S N(St 1), S / Spast}. In iteration t, the algorithm chooses the state S Snext with the minimal cost c(S ) and updates the state. Therefore, the behavior of the algorithm changes as ρ passes a discontinuity point satisfying c(S) = c(S ), i.e.,

(1 ρ) c1(S) + ρ c2(S) = (1 ρ) c1(S ) + ρ c2(S ), (2)

for some S, S {St} Snext, S = S . If ρ [0, 1], the discontinuity point is ρ = λ1 λ1+λ2 , where λ1 = |c1(S) c1(S )|, and λ2 = |c2(S) c2(S )| (see Figure 4). If ρ [0, α], we have 0 λ1 α 1 α λ2 2α. Since λ1 is 2κ-bounded (due to the absolute value), we have P[Et(α)] 1 2(n + 1)2ακ using an union bound over choices of S, S .

Finally, we consider the variance of t. A union bound over t = 1, . . . , T gives P[E(α)] 1 2(n + 1)2Tακ.

Lemma C.5. For a κ-smoothed instance I and I I, we can partition [α, 1] into #int intervals, such that in each interval,

A( , I) is fixed. We have EI[#int] = O n2T κ

Proof. We bound the expectation of the interval number by induction. Let #inti denote the number of intervals in [α, 1] after iteration i. Similarly to Lemma C.4, we fix t [T] and consider the iteration t. We follow the notations St, Spast, Snext as in the proof of Lemma C.4.

Before iteration t, there are #intt 1 intervals. Each interval Q [α, 1] is divided into at most (n + 1)2 sub-intervals if the value of ρ satisfying (2) lies in Q. As shown in the proof of Lemma C.4, the value of the discontinuity point ρ = λ1 λ1+λ2 (Figure 4), where λ1 and λ2 are random variables with support on [0, 1] and probability density at most 2κ. By the principle of deferred decisions, we fix the value of λ1. We claim that the probability density of the discontinuity point ρ is at most O(κα 2), due to the following claim.

Claim C.6. If a real random variable X has density at most κ, and Y = z/(X + z ) satisfies Y α > 0 almost surely. Then, Y has density at most O(|z|α 2κ).

Proof. Let p X(x) and p Y (y) denote the probability density function of X and Y . We have X = g(Y ) = z/Y z . Therefore, p Y (y) = p X(g(y)) |g (y)| κ |z|

α2 , which is the desired result.

Suppose the length of interval Q is l. In expectation, Q produces at most O(ln2κα 2) new sub-intervals. Combining all intervals Q, the total number of new intervals after iteration t in expectation is at most O n2κα 2 .

Therefore, we have E[#intt] = E[#intt 1] + O n2κα 2 , which means E[#int] = O n2T κ

Now we are ready to prove Theorem 5.2. The general idea is to show that with high probability, the event E(α) happens for all m instances. The total number of possible values for Pm i=1 u(P, Ii) is upper bounded by (m #int)k, which gives an O( p

k/m) bound for the smoothed Rademacher complexity.

Proof of Theorem 5.2. Given m κ-smoothed instances I1, . . . , Im, we aim to bound the smoothed Rademacher complexity R I[m](U), where U denotes the function class {u([ρ1, . . . , ρk], ) | 0 ρi 1}.

By Lemma C.4 and a union bound over m instances, we have with probability at least 1 2m(n + 1)2Tκα,

i=1 u([ρ1, . . . , ρk], Ii) =

i=1 u([α, . . . , α], Ii), 0 ρ1, . . . , ρk α. (3)

Then, it suffices to consider ρ [α, 1]. We use E[m] to denote the event that (3) happens. To make the probability that E[m] happens high enough, we set α = 1 m(n+1)2T 2κ. Therefore, we have P[E[m]] 1 1

T . We also have E[#int] = O(n6T 5κ3),

which is exponentially smaller than the worst-case n O(T ) bound. This gives a high-probability logarithmic generalization error bound.

Learning Configurations for Data-Driven Multi-Objective Optimization

Let #int[0,1] denote the total number of intervals for ρ [0, 1]. By Massart s lemma, we have

R I[m](U) O(1) E

k log(m #int[0,1])

k log(m #int)

P(E[m]) + E

k log(m #int[0,1])

Recall that in the worst-case, #int[0,1] (n + 1)T . Since P[E[m]] > 1 1

T , we have

R I[m](U) O(1)

k log(m #int)

k T log(mn)

k log(m #int)

k log(m E[#int])

k m(log T + log n + log κ + log m)

The second inequality is due to

E[X | E[m]] = E[X I(E[m])]

P[E[m]] 1 1 1/T E[X] 1 + O 1

for any random variable X. The third inequality is due to Jensen s inequality for concave functions.

C.6. Proof of Theorem 5.3

Proof. The proof is similar to Theorem 5.2. Fix a smoothed instance I and I I. Let E(α) denote the event that A(α, I) = A(ρ, I) for any ρ [0, α). Let #intt denote the number of intervals after iteration i. Initially, #int0 = 1. We study the recursion on #intt.

For any t [T], let S denote the state after iteration t and S be the sampled neighbor. If S has not been visited, the behavior of the algorithm changes as ρ passes a discontinuity satisfying

exp (1 ρ) c1(S) + ρ c2(S) (1 ρ) c1(S ) ρ c2(S )

Let λ1 = c1(S ) c1(S), λ2 = c2(S ) c2(S). Solving for ρ yields

ρ = τt log prob(t) + λ1

Lemma C.7. PI[E(α)] 1 4Tκα.

Proof. Note that since 0 c1(S), c2(S) 1 for any S, we have

τt log prob(t) = (1 ρ) c1(S) + ρ c2(S) (1 ρ) c1(S ) ρ c2(S ) [ 1, 1]

as long as ρ [0, 1].

Since |λ1 λ2| 2, we have ρ > α as long as |τt log prob(t) + λ1| > 2α. Since λ1 is κ-bounded, we have P[|τt log prob(t) + λ1| 2α] 4ακ. A union bound over t = 1, . . . , T proves Lemma C.7.

Learning Configurations for Data-Driven Multi-Objective Optimization

Lemma C.8. We can partition [α, 1] into #int intervals, such that A( , I) is constant in each interval. We have EI[#int] = O T κ

Proof. Let #intt denote the number of intervals in [α, 1] after iteration t. By Claim C.6, the probability density of ρ on [α, 1] is at most O(κα 2). Thus, the expected number of new intervals for iteration t is at most O(κα 2), i.e., E[#intt] = E[#intt 1] + O(κα 2). Solving the recurrence on t shows that E[#int] = O(Tκα 2).

Therefore, following the proof of Theorem 5.2, we can bound the smoothed Rademacher complexity by O( p

k/m) similarly and obtain the desired result.

C.7. Proof of Lemma 6.5

Proof. To prove this lemma, we first introduce a few notations. We use Q to denote the polyhedron of the dual MDP. Let Q(R1, R2) = (v0, e1, v1, . . . , el, vl) denote the shadow path of Q. Moreover, we use Qs,a = {µ | µ P, µ(s, a) = 0} to denote the polyhedron P, but fixing µ(s, a) = 0. We also use Qs,a(R1, R2) = (vs,a 0 , es,a 1 , . . . , vs,a ls,a) to denote the shadow path of P s,a.

The following claim can be easily verified.

Claim C.9. The optimal solution to the dual MDP is µ (s, a) = P

t 0 γt P[St = s, At = a], where P[St = s, At = a] denotes the probability of St = s and At = a at time t with the initial state S0 following P[S0 = s] = d(s) and the policy being the optimal policy π .

By Claim C.9, for each state s, the optimal solution µ (s, a) = 0 if a = π (s), since the optimal policy π : S A is deterministic. (Note that in the smoothed analysis setting, the optimal policy is unique with probability 1.) Given a solution µ, we use inactive(µ) = {(s, a) | µ(s, a) = 0} to denote the set of actions that is not used by the corresponding policy πµ : s 7 arg maxa µ(s, a).

Suppose the optimal solution moves from vi to vi+1 through ei as ρ increases. There must exist some action (s, a) such that (s, a) inactive(vi) and (s, a) / inactive(vi+1) by Claim C.9. Thus, we have

|Q(R1, R2)|

a A I[(s, a) inactive(vi) (s, a) / inactive(vi+1)]. (4)

Interchange the order of summations and fix a pair of (s, a). We bound the expectation of Pl 1 i=0 I[(s, a) inactive(vi) (s, a) / inactive(vi+1)].

Since vi 1, ei, vi, ei+1, vi+1 are on the shadow path Q(R1, R2) (assuming 1 i < l), we have slope(ei) > slope(ei+1). If (s, a) inactive(vi), then, vi P s,a, i.e., vi is also on the shadow path Qs,a(R1, R2). Suppose vs,a j = vi for some j. Note that Qs,a Q, we have slope(es,a j ) slope(ei) and slope(ei+1) slope(es,a j+1). Thus, we have

I[(s, a) inactive(vi) (s, a) / inactive(vi+1)] I[vi P s,a slope(es,a j ) slope(ei+1) slope(es,a j+1)].

Plugging this bound into (4) yields

|Q(R1, R2)| X

j=1 I[slope(ei+1) [slope(es,a j+1), slope(es,a j )]]

where ei+1 depends on j. Note that P s,a fixes µ(s, a) = 0. Therefore, the interval [slope(es,a j+1), slope(es,a j )] does not depend on the value of R1(s, a) and R2(s, a). By the principle of deferred decisions, we fix the reward on actions other than (s, a) and consider the randomness of R1(s, a) and R2(s, a). Note that

slope(ei+1) = R2, vi R2, vi+1

R1, vi R1, vi+1 ,

Learning Configurations for Data-Driven Multi-Objective Optimization

where R, µ = P

s,a R(s, a)µ(s, a) is the inner product. Let µ and µ denote the solution of vi and vi+1, respectively. Recall that µ(s, a) = 0 and µ (s, a) 0. We can hence represent slope(ei+1) as

µ (s, a) R2(s, a) + deterministic term µ (s, a) R1(s, a) + deterministic term.

Note that we set d(s) = 1/|S| for any s S. By Claim C.9, we have µ (s, a) 1/|S|. Since R2(s, a) is κ-bounded, and the absolute value of the denominator is upper bounded by O(1/(1 γ)), we claim that the probability density of

slope(ei+1) is upper bounded by O |S|κ 1 γ . Therefore, taking the expectation of (5) yields

E|Q(R1, R2)| O(1) X

j=1 (slope(es,a j ) slope(es,a j+1)) |S|κ

O(1) + |S|κ

1 γ (slope(es,a 0 ) slope(es,a ls,a)) .

By symmetry, we can break the shadow path into two parts. The slopes of the segments in the first part lie in [ 1, 0), and the ones in the second part lie in ( , 1). The expected sizes of the two parts are asymptotically equivalent since we can swap R1 and R2 without loss of generality. Therefore, we assume slope(es,a 0 ) slope(es,a ls,a) 1. We thus have

E|Q(R1, R2)| O |S|2|A|κ

which is the desired bound.

D. Extensions of our Framework

D.1. Feature-based parameter predictor

A natural extension of algorithm configuration is to train a machine learning model to predict the algorithm parameter based on the features of the instance. Balcan et al. (2021b) present generalization bounds for selecting parameters from a portfolio of finite algorithm configurations. Cheng et al. (2024) study the sample complexity of predicting parameters using neural networks. In the following, we extend our framework for multi-objective optimization to the setting of predicting k parameters using any PAC-learnable (i.e., with finite pseudo-dimension) machine learning models.

Let D be a distribution of (maybe smoothed) instances, A(ρ, I) be the output of the algorithm with parameter ρ on instance I, and u(P, I) be the performance metric for a set P = {ρ1, . . . , ρk} of parameter values of ρ. Instead of directly learning P, we learn a predictor pred(θ, I) : Θ I Pk to predict P. For example, if k = 1 and x I is the feature vector of instance I, we may choose a linear model pred : (θ, I) 7 θTx I to predict the parameter ρ. We study the sample complexity of learning θ Θ for performance metric u(pred( , I), I). We assume the prediction model has a finite pseudo-dimension so that sample complexity bounds are possible: Let pred(θ, I) = (pred1(θ, I), . . . , predk(θ, I)) be the predicted parameters. We assume Pdim ({predi(θ, ) | θ Θ}) d for any i [k].

Theorem D.1. Suppose A( , I) is piecewise-constant with at most or in expectation t pieces (in worstor smoothed-case).

The sample complexity of u(pred( , I), I) is O kd log t

Proof. Fix an instance I. With a little abuse of notations, we use I to denote either an instance or a sample from a smoothed instance. By definition, we can partition the parameter space P into t intervals, such that the algorithmic behavior is fixed on I in each interval. Let ρ0 < ρ1 < < ρt denote the terminals of these intervals. Let sgn(predi(θ, I) ρj) denote the sign of predi(θ, I) ρj for some 1 i k, 1 j < t. As we vary the prediction model parameter θ, u(pred( , I), I) is fixed if

Φ = (sgn(predi(θ, I) ρj))1 i k, 1 j<t { 1, 1}kt

is fixed due to the piecewise-constant structure of A. Since the pseudo-dimension of predi is at most d, by Fact B.3, the VC-dimension of {sgn(predi(θ, I) ρj) | θ Θ} is also at most d. Therefore, by Sauer s lemma, the number of values of Φ is at most O(tkd).

Learning Configurations for Data-Driven Multi-Objective Optimization

Now, we vary the fixed instance I. Given m instances, the number of values of

(sgn(predi(θ, Il) ρj))1 i k, 1 j t m,1 l m

is at most O((mt)kd). Here, ρj denotes the union of all interval terminals for m instances. Using Massart s lemma, we

obtain that the Rademacher complexity is at most O q

. This proves the desired result.

D.2. Extension to more than two objectives

Previous results focus on learning configurations for multi-objective optimization problems with two objectives. Now, we consider the case of three or more objectives. We show that we can prove similar worst-case sample complexity bounds. For example, we have the following result for local search.

Theorem D.2. Suppose D is a distribution over local search instances with d objectives c1, . . . , cd. We use local search to optimize the weighted sum objective c = Pd i=1 ρi ci, with Pd i=1 ρi = 1. Let A(ρ, I) denote the result on instance I with parameter ρ = (ρ1, . . . , ρd). The sample complexity of u( , I) is O kd T

Proof. Fix an instance I. Since each state has at most n neighbors, there are at most n T trajectories of the state in T local search iterations.

Fix a trajectory S0, S1, . . . , ST , where Si denote the state after iteration i. We show that the set of parameters ρ lie in the union of T n half-spaces. For each t = 1, . . . , T, the algorithm selects St N(St 1) in iteration t if and only if for any S {St 1} N(St) \ {St}, we have

i=1 ρi ci(S )

i=1 ρi ci(St),

which is a half-space with respect to ρ. Altogether, there are T n half-spaces. Combining the half-spaces for n T different trajectories yields a total number of Tn T +1.

Now, we vary the instance I. Given m instances I1, . . . , Im, there are at most m Tn T +1 half-spaces partitioning the simplex = {ρ | Pd i=1 ρi = 1} such that in each region, A(ρ, Ii) is fixed for any i [m]. Note that t half-spaces in Rd

partition the space into at most O(td) sub-regions. Therefore, we can partition k into at most O((m Tn T +1)kd) regions such that u(P, Ii) is fixed for any i [m]. Similar to the proof of Theorem 3.2, this yields a pseudo-dimension bound Pdim(U) = O(kd T log n), thus leading to O kd T

ε2 sample complexity.

Similar results also hold for simulated annealing and linear programs. However, the smoothed analysis technique cannot be directly extended to 3 or more objectives. The reason is that in the 2-objective case, we can directly bound the probability that a discontinuity point lies in a length-l interval by O(poly(κ) l). However, this approach does not work for a multidimensional space. Even if a set of half-spaces are close to each other, the (hyper-)volume of the union of half-spaces can be large. We left the smoothed analysis of multi-dimensional parameter spaces to future work.

E. Details of Experiments

In this section, we present the implementation details of our experiments in Section 7.

Local search. We consider the VLSI circuit partitioning problem for 3D chips (Hu et al., 2022). Given a circuit netlist (represented by a hypergraph), we aim to partition the netlist into two parts. The first and second parts correspond to the top and bottom dies, respectively. If a net (i.e., a hyperedge in the hypergraph) crosses both dies, a cross-die terminal is required to connect both dies. We need to minimize the number of terminals (corresponding to the routability and the cost of the chip), as well as the total wirelength after circuit placement (corresponding to the performance of the chip).

A direct method for circuit partitioning is local search. We initialize an arbitrary partition, and improve the solution by locally moving a cell (i.e., a node in the hypergraph) from the current die to the other. We optimize a weighted sum of both objectives, where the number of terminals can be directly computed and the total wirelength is evaluated by running a VLSI placer. Notice that VLSI placers (e.g., Lin et al. (2019)) are randomized algorithms, and the outcomes of the

Learning Configurations for Data-Driven Multi-Objective Optimization

wirelength naturally contain random perturbations. This shows that the smoothed analysis setting is indeed reasonable. In our experiments, we fix the random seed of the placer to make the placement result deterministic.

We evaluate the generalization guarantee for local search on ICCAD-22 benchmark (Hu et al., 2022). Since the benchmark consists of only several testcases, we randomly sample subgraphs of the original circuit to obtain instances with various sizes.

Markov decision processes. We consider a resource-gathering Markov decision process in previous work on multi-objective reinforcement learning (Barrett & Narayanan, 2008). The environment is a grid graph with a home base. The agent starts from the home base, gathers resources in the environment, and takes them back to the home. There are multiple kinds of resources, R1, R2, etc. If the agent reaches a grid node with resource Ri, it picks up one unit of Ri. The agent can simultaneously carry at most one unit for each kind of resource. When the agent reaches the home base, it gets one reward for each kind of resource it carries. Some grid nodes have enemies. If the agent passes a node with enemies, with a particular probability, it receives a penalty, loses the resources it carries, and is reset to the home base. The objective is to maximize the reward and minimize the penalty.

We set the discount factor γ = 0.9, and find the optimal policy by optimizing a weighted sum of the reward and the penalty. We randomly synthesize the environment with different sizes to test the generalization ability. We synthesize 100 instances for each problem size (since finding all pieces is time-consuming), with 50 being the training set, and the rest being the test set.

For both local search and MDP, we compute the number of pieces in the piecewise-constant structure. To compute this number, starting from ρ = 0, we use binary search to find the next value of ρ such that the algorithm outcome varies until ρ = 1.

Shallow-light trees. The open-source code of SALT (Chen & Young, 2020) and the ICCAD-15 benchmark (Kim & Hu, 2015) for timing-driven VLSI placement are used. The ICCAD-15 benchmark consists of 9.9 106 shallow-light tree instances, in which about 4.8 104 instances are large-degree instances, i.e., with more than 32 nodes. The experiments are conducted on large-degree instances, since the tradeoff between shallowness and lightness is not significant for small-degree nets. We randomly select 80% nets as the training set and the rest as the test set. We learn a set of parameter values ρ to optimize the Pareto volume of shallowness and lightness.

Let α and β denote the shallowness and lightness of a solution. Since the shallow-light tree is a minimization problem, we define the Pareto volume as the area of domination for (αFLUTE α, βCL β), where αFLUTE and βCL are baselines for single-objective Steiner trees in the original paper (Chen & Young, 2020).