# Asymptotic Risk of Bézier Simplex Fitting

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Akinori Tanaka, RIKEN AIP, Keio University, akinori.tanaka@riken.jp
Akiyoshi Sannai, RIKEN AIP, Keio University, akiyoshi.sannai@riken.jp
Ken Kobayashi, Fujitsu Laboratories LTD., RIKEN AIP, Tokyo Tech, ken-kobayashi@fujitsu.com
Naoki Hamada, Fujitsu Laboratories LTD., RIKEN AIP, hamada-naoki@fujitsu.com

The Bézier simplex fitting is a novel data modeling technique which utilizes geometric structures of data to approximate the Pareto set of multi-objective optimization problems. There are two fitting methods based on different sampling strategies: the inductive skeleton fitting employs a stratified subsampling from the skeletons of a simplex, whereas the all-at-once fitting uses a non-stratified sampling which treats a simplex as a single object. In this paper, we analyze the asymptotic risks of these Bézier simplex fitting methods and derive the optimal subsample ratio for the inductive skeleton fitting. It is shown that the inductive skeleton fitting with the optimal ratio has a smaller risk when the degree of a Bézier simplex is less than three. These results are verified numerically under small to moderate sample sizes. In addition, we provide two complementary applications of our theory: a generalized location problem and a multi-objective hyper-parameter tuning of the group lasso. The former can be represented by a Bézier simplex of degree two, where the inductive skeleton fitting outperforms; the latter can be represented by a Bézier simplex of degree three, where the all-at-once fitting gets the advantage.

## 1 Introduction

Given functions $f_1, \dots, f_M : X \to \mathbb{R}$ on a subset $X$ of a Euclidean space $\mathbb{R}^N$, consider the multi-objective optimization problem

$$ \text{minimize } f(x) := (f_1(x), \dots, f_M(x)) \quad \text{subject to } x \in X\ (\subseteq \mathbb{R}^N) $$

with respect to the Pareto ordering defined as follows:

$$ x \preceq y \ \overset{\text{def}}{\Longleftrightarrow}\ \forall i\, [f_i(x) \le f_i(y)] \ \wedge\ \exists j\, [f_j(x) < f_j(y)]. $$

The goal is to find the Pareto set and its image, called the Pareto front, which are denoted by

$$ X^*(f) := \{\, x \in X \mid \nexists y \in X\ [y \preceq x] \,\}, \qquad f(X^*(f)) := \{\, f(x) \in \mathbb{R}^M \mid x \in X^*(f) \,\}, $$

respectively.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Most numerical optimization approaches (e.g., goal programming (Miettinen 1999; Eichfelder 2008), evolutionary computation (Deb 2001; Zhang and Li 2007; Deb and Jain 2014), homotopy methods (Hillermeier 2001; Harada et al. 2007), and Bayesian optimization (Hernández-Lobato et al. 2016; Yang et al. 2019)) give a finite number of points as an approximation of the Pareto set or front. Since the Pareto set and front usually contain infinitely many points, such a point approximation cannot reveal their complete shapes. In order to gain richer information, we consider in this paper a fitting problem of the Pareto set and front.

It is known that the Pareto set and front often have skeleton structures that can be used to enhance fitting accuracy. An $M$-objective problem is *simplicial* if the Pareto set and front are homeomorphic to an $(M-1)$-dimensional simplex and each $m$-dimensional subsimplex corresponds to the Pareto set of an $(m+1)$-objective subproblem for all $0 \le m \le M-1$ (see (Hamada et al. 2019) for a precise definition; an example is shown in Figure 1). Many practical problems are simplicial: location problems (Kuhn 1967) and a phenotypic divergence model in evolutionary biology (Shoval et al. 2012) have been shown to be simplicial, and an airplane design (Mastroddi and Gemma 2013) and a hydrologic modeling (Vrugt et al. 2003) have numerical solutions which imply that those problems are simplicial.
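As a concrete illustration of the Pareto ordering above, the following minimal sketch extracts the non-dominated points from a finite sample of objective vectors. The function names `dominates` and `pareto_front` are ours, not from the paper:

```python
import numpy as np

def dominates(fx, fy):
    """True iff fx Pareto-dominates fy: fx <= fy in every objective
    and fx < fy in at least one (the ordering defined above)."""
    fx, fy = np.asarray(fx), np.asarray(fy)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def pareto_front(points):
    """Finite-sample Pareto front: keep the objective vectors that no
    other sampled vector dominates."""
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not any(dominates(q, p) for j, q in enumerate(pts) if j != i)]
    return pts[keep]
```

As the paper notes, such a finite set of non-dominated points is only a point approximation; the fitting methods below aim to recover the continuous shape behind it.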
The Pareto set and front of any simplicial problem can be approximated with arbitrary accuracy by a Bézier simplex of an appropriate degree (Kobayashi et al. 2019). There are two fitting algorithms for Bézier simplices: the all-at-once fitting is a naïve extension of the Borges-Pastva algorithm for Bézier curves (Borges and Pastva 2002), and the inductive skeleton fitting (Kobayashi et al. 2019) exploits the skeleton structure of simplicial problems discussed above.

An important problem class which is (generically) simplicial is that of strongly convex problems. It has been shown that many practical problems can be made strongly convex via appropriate transformations preserving the essential problem structure, i.e., the Pareto ordering and the topology (Hamada et al. 2019). For example, the multi-objective location problem (Kuhn 1967) becomes strongly convex by squaring each objective function. The resulting problem has a Pareto set that can be represented by a Bézier simplex of degree two (Hamada et al. 2019). As we will show in this paper, the group lasso (Yuan and Lin 2006) can be reformulated as a multi-objective simplicial problem. It has a twice-curving Pareto set that requires a Bézier simplex of degree three.

Figure 1: A simplicial problem $f = (f_1, f_2, f_3) : \mathbb{R}^3 \to \mathbb{R}^3$, shown in three panels: (a) the simplex $\Delta^2$, (b) the Pareto set $X^*(f)$, and (c) the Pareto front $f(X^*(f))$. An $M$-objective problem $f$ is simplicial if the following conditions are satisfied: (i) there exists a homeomorphism $\Phi : \Delta^{M-1} \to X^*(f)$ such that $\Phi(\Delta_I) = X^*(f_I)$ for all $I \subseteq \{1, \dots, M\}$; (ii) the restriction $f|_{X^*(f)} : X^*(f) \to \mathbb{R}^M$ is a topological embedding (and thus so is $f \circ \Phi : \Delta^{M-1} \to \mathbb{R}^M$).
The same reformulation can also be applied to a broad range of sparse modeling methods, including the (original) lasso (Tibshirani 1996), the fused lasso (Tibshirani et al. 2005), the smooth lasso (Hebiri and van de Geer 2011), and the elastic net (Zou and Hastie 2005). Since the required degree is observed to be problem-dependent, we need to understand the performance of the two Bézier simplex fittings with respect to the degree.

Moreover, use cases of the Bézier simplex fitting are not limited to post-optimal analysis; it can be applied to general data modeling problems as well. In the field of evolutionary biology, (Shoval et al. 2012) showed that the phenotype of a species distributes like a curved simplex. Such a distribution can be modeled by a Bézier simplex for a better understanding of biological phenomena.

In this paper, we study the asymptotic risk of the two fitting methods of the Bézier simplex, the all-at-once fitting and the inductive skeleton fitting, and compare their performance with respect to the degree. While asymptotics on a Euclidean space (having no boundary) is well studied, the Bézier simplex fitting is a regression method on a simplex (having a complex boundary, i.e., the skeleton), and its asymptotics have not been studied before. Our contributions are as follows:

- We have evaluated the asymptotic $\ell^2$-risk, as the sample size tends to infinity, of two Bézier simplex fitting methods: the all-at-once fitting and the inductive skeleton fitting.
- In terms of minimizing the asymptotic risk, we have derived the optimal ratio of subsample sizes for the inductive skeleton fitting.
- We have shown that the inductive skeleton fitting with the optimal ratio outperforms the all-at-once fitting when the degree of a Bézier simplex is two, whereas the all-at-once fitting has the advantage at degree three.
- We have demonstrated that the location problem and the group lasso are transformed into strongly convex problems, and their Pareto sets and fronts are approximated by a Bézier simplex, which numerically verifies the asymptotic results.

The rest of this paper is organized as follows: Section 2 describes the problem definition. Section 3 analyzes the asymptotic risks of the all-at-once fitting and the inductive skeleton fitting; for the inductive skeleton fitting, the optimal subsample ratio in terms of minimizing the risk is derived. Those analyses are verified in Section 4 via numerical experiments. Section 5 concludes the paper and addresses future work.

## 2 Problem definition

Let $M$ be a positive integer. The standard $(M-1)$-simplex is denoted by

$$ \Delta^{M-1} := \left\{ (t_1, \dots, t_M) \in \mathbb{R}^M \,\middle|\, \sum_{m=1}^M t_m = 1,\ t_m \ge 0 \right\}. $$

We define the $I$-subsimplex for an index set $I \subseteq \{1, \dots, M\}$ by

$$ \Delta^{M-1}_I := \{\, (t_1, \dots, t_M) \in \Delta^{M-1} \mid t_m = 0\ (m \notin I) \,\}. $$

In addition, the $m$-skeleton of $\Delta^{M-1}$ for an integer $0 \le m \le M-1$ is defined by

$$ \Delta^{(m)} := \bigcup_{I \subseteq \{1,\dots,M\} \text{ s.t. } |I| = m+1} \Delta^{M-1}_I. $$

### 2.1 Bézier simplex and its fitting methods

We denote the set of non-negative integers (including zero) by $\mathbb{N}$. Let $M, D, L$ be arbitrary integers in $\mathbb{N}$ and

$$ \mathbb{N}^M_D := \left\{ (d_1, \dots, d_M) \in \mathbb{N}^M \,\middle|\, \sum_{m=1}^M d_m = D \right\}. $$

As shown in Figure 2, an $(M-1)$-Bézier simplex of degree $D$ is a mapping $b : \Delta^{M-1} \to \mathbb{R}^L$ determined by control points $p_d \in \mathbb{R}^L$ $(d \in \mathbb{N}^M_D)$ as follows:

$$ b(t) := \sum_{d \in \mathbb{N}^M_D} \binom{D}{d} t^d p_d, $$

where $\binom{D}{d} := \frac{D!}{d_1! d_2! \cdots d_M!}$ is a multinomial coefficient and $t^d := t_1^{d_1} t_2^{d_2} \cdots t_M^{d_M}$ is a monomial (not a vector) for each $t := (t_1, \dots, t_M) \in \Delta^{M-1}$ and $d := (d_1, \dots, d_M) \in \mathbb{N}^M_D$.

Figure 2: A Bézier simplex for $M = 3$, $D = 3$.

(Kobayashi et al. 2019) proposed two Bézier simplex fitting algorithms: the all-at-once fitting and the inductive skeleton fitting. They differ not only in the fitting algorithm but also in the sampling strategy. The all-at-once fitting requires a training set

$$ S_N := \{\, (t_n, x_n) \in \Delta^{M-1} \times \mathbb{R}^L \mid n = 1, \dots, N \,\} $$

and adjusts all control points at once by minimizing the OLS loss

$$ \frac{1}{N} \sum_{n=1}^N \| x_n - b(t_n) \|^2. $$
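The defining sum over multi-indices can be sketched directly. A minimal evaluation routine, assuming control points are stored in a dict keyed by multi-index (the helper names are illustrative, not from the paper):

```python
import math
from itertools import product
import numpy as np

def multi_indices(M, D):
    """Enumerate N^M_D = { d in N^M : d_1 + ... + d_M = D }."""
    return [d for d in product(range(D + 1), repeat=M) if sum(d) == D]

def bezier_simplex(t, control_points, D):
    """Evaluate b(t) = sum_{d in N^M_D} binom(D, d) t^d p_d,
    with control_points a dict mapping multi-index d -> p_d."""
    t = np.asarray(t, dtype=float)
    b = np.zeros_like(next(iter(control_points.values())), dtype=float)
    for d in multi_indices(len(t), D):
        coef = math.factorial(D) // math.prod(math.factorial(k) for k in d)
        monomial = math.prod(tm ** dm for tm, dm in zip(t, d))  # 0**0 == 1
        b = b + coef * monomial * np.asarray(control_points[d], dtype=float)
    return b
```

For $D = 1$ the sum reduces to linear interpolation of the corner control points, which gives a quick sanity check of the multinomial coefficients.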
The inductive skeleton fitting, on the other hand, requires skeleton-wise sampled training sets

$$ S_{N^{(m)}} := \{\, (t^{(m)}_n, x^{(m)}_n) \in \Delta^{(m)} \times \mathbb{R}^L \mid n = 1, \dots, N^{(m)} \,\} \quad (m = 0, \dots, M-1). $$

It also partitions the control points into groups $p_{d^{(m)}}$ such that $d^{(m)}$ has exactly $m+1$ non-zero elements; such $p_{d^{(m)}}$ determine the $m$-skeleton of a Bézier simplex. The inductive skeleton fitting inductively adjusts $p_{d^{(m)}}$ from $m = 0$ to $M-1$ by minimizing the OLS loss on the $m$-skeleton:

$$ \frac{1}{N^{(m)}} \sum_{n=1}^{N^{(m)}} \| x^{(m)}_n - b(t^{(m)}_n) \|^2. $$

### 2.2 The $\ell^2$-risk

In this paper, we consider the following fitting problem. As Figure 3 illustrates, a sample point $(t, x) \in \Delta^{M-1} \times \mathbb{R}^L$ is taken from an unknown Bézier simplex $b : \Delta^{M-1} \to \mathbb{R}^L$ with additive Gaussian noise $\varepsilon \sim N(0, \sigma^2 I)$, that is, $x = b(t) + \varepsilon$. For the all-at-once fitting, $S_N = \{ (t_n, x_n) \}$ follows the uniform distribution on the domain of the Bézier simplex: $t_n \sim U(\Delta^{M-1})$ and $x_n = b(t_n) + \varepsilon_n$. For the inductive skeleton fitting, $S_{N^{(m)}} = \{ (t^{(m)}_n, x^{(m)}_n) \}$ follows the uniform distribution on the $m$-skeleton of the domain: $t^{(m)}_n \sim U(\Delta^{(m)})$ and $x^{(m)}_n = b(t^{(m)}_n) + \varepsilon^{(m)}_n$.

A Bézier simplex estimated from $S_N$ is denoted by $\hat b(t|S_N)$. For both methods, we asymptotically evaluate, as $N \to \infty$, the $\ell^2$-risk

$$ \mathbb{E}_{t \sim U(\Delta^{M-1})}\, \mathbb{E}_{S_N} \left\| b(t) - \hat b(t|S_N) \right\|^2. \tag{2} $$

For the inductive skeleton fitting, we put $S_N = S_{N^{(0)}} \cup \dots \cup S_{N^{(M-1)}}$ subject to $N = N^{(0)} + \dots + N^{(M-1)}$.

Figure 3: An illustration of taking a sample point on the true Bézier simplex with additive noise.

## 3 Asymptotic risk of Bézier simplex fitting

Let us first note that the difference inside the $\ell^2$-norm in (2) can itself be written as a Bézier simplex:

$$ b(t) - \hat b(t|S_N) = \sum_{A=1}^{|\mathbb{N}^M_D|} \binom{D}{d_A} t^{d_A} \tilde p_{d_A}. \tag{3} $$

Each $A$-th control point $\tilde p_{d_A}$ of this Bézier simplex is defined as the difference between the target control point $p_{d_A}$ and the model control point $\hat p_{d_A}(S_N)$, i.e., $\tilde p_{d_A} = p_{d_A} - \hat p_{d_A}(S_N)$.
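The sampling model of the all-at-once fitting can be sketched in a few lines, using the fact that the uniform distribution on the standard simplex is the Dirichlet distribution with all concentration parameters equal to 1 (the function name is illustrative):

```python
import numpy as np

def sample_training_set(b, M, L, N, sigma, rng):
    """Draw t_n ~ U(Delta^{M-1}) and x_n = b(t_n) + eps_n with
    eps_n ~ N(0, sigma^2 I_L): the sampling model for the all-at-once
    fitting described above."""
    T = rng.dirichlet(np.ones(M), size=N)   # N uniform points on the simplex
    X = np.stack([b(t) for t in T]) + sigma * rng.standard_normal((N, L))
    return T, X
```

Sampling on the $m$-skeleton for the inductive fitting would proceed the same way, drawing uniformly on a randomly chosen $I$-subsimplex with $|I| = m+1$.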
To simplify the summation notation, we introduce a $|\mathbb{N}^M_D| \times L$ matrix $P$ composed of the $l$-th elements of the $A$-th control points, and a column vector $z$:

$$ (P)_{Al} = (\tilde p_{d_A})_l, \qquad z = \left( \binom{D}{d_1} t^{d_1}, \dots, \binom{D}{d_{|\mathbb{N}^M_D|}} t^{d_{|\mathbb{N}^M_D|}} \right)^{\!\top}. \tag{4} $$

Note that $t^d = t_1^{d_1} t_2^{d_2} \cdots t_M^{d_M}$ is a scalar, and $z$ is a vector. Then the Bézier simplex (3) can be represented by the column vector $P^\top z$, and its squared norm equals $\| P^\top z \|^2 = z^\top P P^\top z$, or $\sum_{A,B} z_A z_B (P P^\top)_{AB}$ in components. The risk (2) is the expectation of this norm; $\mathbb{E}_t$ acts only on $z$ and $\mathbb{E}_{S_N}$ acts only on $P$. Therefore, we arrive at

$$ \sum_{d_A, d_B \in \mathbb{N}^M_D} \Sigma_{AB}\, \mathbb{E}_{S_N}\!\left[ (P P^\top)_{AB} \right], \tag{5} $$

$$ \Sigma_{AB} := \mathbb{E}_t[z_A z_B] = \binom{D}{d_A} \binom{D}{d_B} \mathbb{E}_t\!\left[ t^{d_A + d_B} \right]. \tag{6} $$

We can obtain a closed form of $\Sigma_{AB}$ by performing the integral $\mathbb{E}_t$ explicitly. The following theorem provides the result.

**Theorem 1** The matrix element $\Sigma_{AB}$ is given by

$$ \Sigma_{AB} = \binom{D}{d_A} \binom{D}{d_B} \bigg/ \left[ \binom{2D + M - 1}{M - 1} \binom{2D}{d_A + d_B} \right]. \tag{7} $$

The proof is provided in the supplementary materials.¹ Equation (5) shows that the asymptotic value of the risk function depends only on the choice of the matrix $P$.

¹ A longer version of this paper including the appendix is available at https://arxiv.org/abs/1906.06924

### 3.1 All-at-once fitting

The matrix $P$ determined by the all-at-once fitting algorithm, $P_{\text{AAO}}$, minimizes the OLS loss

$$ \frac{1}{N} \sum_{n=1}^N \left\| b(t_n) + \varepsilon_n - \hat b(t_n) \right\|^2 = \frac{1}{N} \sum_{n=1}^N \left\| P^\top z_n + \varepsilon_n \right\|^2 = \frac{1}{N} \left\| Z P + Y \right\|_F^2, \tag{8} $$

where $\|\cdot\|_F$ is the Frobenius norm. Here we introduced an $N \times |\mathbb{N}^M_D|$ matrix $Z$ and an $N \times L$ matrix $Y$:

$$ Z = [z_1\, z_2\, \cdots\, z_N]^\top, \qquad Y = [\varepsilon_1\, \varepsilon_2\, \cdots\, \varepsilon_N]^\top. \tag{9} $$

Minimizing (8) is a classical least-squares problem, and we get $P_{\text{AAO}} = -(Z^\top Z)^{-1} Z^\top Y$. Note that $Z$ consists of $N$ sample points on $\Delta^{M-1}$ and $Y$ is a set of $N$ noise vectors in $\mathbb{R}^L$. These are all independent, so the expectation $\mathbb{E}_{S_N}$ factorizes as $\mathbb{E}_Z \mathbb{E}_Y$.

**Calculation of the asymptotics.** We need to calculate the expectation of the matrix

$$ P_{\text{AAO}} P_{\text{AAO}}^\top = (Z^\top Z)^{-1} Z^\top Y Y^\top Z (Z^\top Z)^{-1} $$

over $Z$ and $Y$. As is easily checked, $\mathbb{E}_Y[Y Y^\top] = \sigma^2 L\, I_{N \times N}$, so we get

$$ \mathbb{E}_{S_N}\!\left[ P_{\text{AAO}} P_{\text{AAO}}^\top \right] = \sigma^2 L\, \mathbb{E}_Z\!\left[ (Z^\top Z)^{-1} \right]. \tag{10} $$
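The closed form in Theorem 1 can be cross-checked against the standard moment formula for the uniform distribution on the simplex, $\mathbb{E}_t[t^d] = (M-1)!\, \prod_m d_m! / (|d| + M - 1)!$. A small sketch, with helper names of our own choosing:

```python
import math
from itertools import product

def multinom(n, ks):
    """Multinomial coefficient n! / (k_1! ... k_M!)."""
    out = math.factorial(n)
    for k in ks:
        out //= math.factorial(k)
    return out

def sigma_theorem1(dA, dB, D, M):
    """Sigma_AB via the closed form of Theorem 1."""
    dAB = tuple(a + b for a, b in zip(dA, dB))
    return multinom(D, dA) * multinom(D, dB) / (
        math.comb(2 * D + M - 1, M - 1) * multinom(2 * D, dAB))

def sigma_moment(dA, dB, D, M):
    """Sigma_AB via definition (6) and the uniform-simplex moment
    E[t^d] = (M-1)! * prod(d_m!) / (|d| + M - 1)!."""
    dAB = [a + b for a, b in zip(dA, dB)]
    moment = (math.factorial(M - 1)
              * math.prod(math.factorial(d) for d in dAB)
              / math.factorial(sum(dAB) + M - 1))
    return multinom(D, dA) * multinom(D, dB) * moment
```

Both routes agree for every pair of multi-indices, which is exactly the content of Theorem 1.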
Now the matrix $\frac{1}{N} Z^\top Z$ is an average over the sample,

$$ \frac{1}{N} (Z^\top Z)_{AB} = \frac{1}{N} \sum_{n=1}^N \binom{D}{d_A} \binom{D}{d_B} t_n^{d_A + d_B}, \tag{11} $$

and by the law of large numbers it converges in probability, as $N \to \infty$, to the matrix $\Sigma_{AB}$ defined in (6) and (7):

$$ \frac{1}{N} (Z^\top Z)_{AB} \xrightarrow{\ p\ } \Sigma_{AB}. $$

To substitute this into (10), however, we need to guarantee that $\Sigma_{AB}$ has an inverse. We can show this by the following theorem.

**Theorem 2** Let $V_{M,D}$ be the vector space spanned by the monomials $\{\, t^d \mid d = (d_1, \dots, d_M) \in \mathbb{N}^M_D \,\}$. Then the map

$$ L : V_{M,D} \times V_{M,D} \to \mathbb{R}, \qquad (P, Q) \mapsto \int_{\Delta^{M-1}} P(t)\, Q(t)\, dt, $$

is a non-degenerate bilinear form. Moreover, the matrix corresponding to this bilinear form is $\Sigma_{AB}$ in (7). In particular, for any $D$ and $M$, the matrix $\Sigma_{AB}$ is non-singular.

The precise proof is given in the supplementary materials. In summary, our formula for the asymptotic form of the risk of the all-at-once fitting is

$$ R_{\text{AAO}} \simeq \frac{\sigma^2 L}{N} \sum_{A,B} \Sigma_{AB}\, (\Sigma^{-1})_{AB}. $$

We can further simplify the result by using

$$ \sum_{A,B} \Sigma_{AB}\, (\Sigma^{-1})_{AB} = |\mathbb{N}^M_D| = \binom{D + M - 1}{D}, $$

which is relatively easy to show (see the supplementary materials).

### 3.2 Inductive skeleton fitting

So far, we have not imposed any explicit order on the control point indices $A$. From now on, let us take the specific order

$$ P = \begin{bmatrix} P^{(0)} \\ P^{(1)} \\ \vdots \\ P^{(M-1)} \end{bmatrix}, \tag{12} $$

where $P^{(m)}$ is the submatrix of $P$ composed of the control points on $\Delta^{(m)}$. Similarly, we order the control point indices $d_A$ as

$$ [d^{(0)}_1, \dots, d^{(0)}_{n_0},\ d^{(1)}_1, \dots, d^{(1)}_{n_1},\ \dots,\ d^{(M-1)}_1, \dots, d^{(M-1)}_{n_{M-1}}], $$

where $d^{(m)}_n$ is the $n$-th index of the control points on the $m$-skeleton and $n_m$ is the number of control points on the $m$-skeleton.

The inductive skeleton fitting determines the control point matrices $P^{(m)}$ inductively from low to high $m = 0, 1, \dots, M-1$. That is, it first fits the vertices of a Bézier simplex by moving the control points of the lowest dimension ($P^{(0)}$); then it fits the edges by moving the control points of the second lowest dimension ($P^{(1)}$); this process continues with increasing dimension and finishes at the highest dimension ($P^{(M-1)}$). In the $m$-th step, sample points $t^{(m)}$ on $\Delta^{(m)}$ are given.
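The simplification $\sum_{A,B} \Sigma_{AB} (\Sigma^{-1})_{AB} = |\mathbb{N}^M_D|$ can also be checked numerically by assembling $\Sigma$ from the closed form of Theorem 1. A sketch (helper names are ours); since $\Sigma$ is symmetric, the elementwise sum equals $\operatorname{tr}(\Sigma \Sigma^{-1})$:

```python
import math
from itertools import product
import numpy as np

def multinom(n, ks):
    """Multinomial coefficient n! / (k_1! ... k_M!)."""
    out = math.factorial(n)
    for k in ks:
        out //= math.factorial(k)
    return out

def sigma_matrix(M, D):
    """Assemble Sigma over all index pairs from the Theorem 1 closed form."""
    idx = [d for d in product(range(D + 1), repeat=M) if sum(d) == D]
    norm = math.comb(2 * D + M - 1, M - 1)
    S = np.empty((len(idx), len(idx)))
    for A, dA in enumerate(idx):
        for B, dB in enumerate(idx):
            dAB = tuple(a + b for a, b in zip(dA, dB))
            S[A, B] = multinom(D, dA) * multinom(D, dB) / (norm * multinom(2 * D, dAB))
    return S

M, D = 3, 2
S = sigma_matrix(M, D)
# elementwise sum Sigma_AB * (Sigma^{-1})_AB = tr(Sigma Sigma^{-1}) = |N^M_D|
total = float(np.sum(S * np.linalg.inv(S)))
```

For $M = 3$, $D = 2$ this recovers $\binom{D+M-1}{D} = \binom{4}{2} = 6$, matching the simplified risk formula.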
The corresponding $z^{(m)}$ defined in (4) has the following block form:

$$ z^{(m)} = [z^{(m)}[0]\ \ z^{(m)}[1]\ \ \cdots\ \ z^{(m)}[m]\ \ 0\ \cdots\ 0], $$

where

$$ z^{(m)}[k] = \left( \binom{D}{d^{(k)}_1} (t^{(m)})^{d^{(k)}_1}, \dots, \binom{D}{d^{(k)}_{n_k}} (t^{(m)})^{d^{(k)}_{n_k}} \right), $$

because for $k > m$ the monomial $(t^{(m)})^{d^{(k)}}$ contains a factor $0^{d_j}$ with $d_j \ge 1$ and hence vanishes. Thanks to these zeros, the OLS loss reduces as follows:

$$ \frac{1}{N^{(m)}} \sum_{n=1}^{N^{(m)}} \Big\| (P^{(m)})^\top z^{(m)}[m]_n + \sum_{k<m}