# Asymptotic Risk of Bézier Simplex Fitting

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Akinori Tanaka, RIKEN AIP, Keio University, akinori.tanaka@riken.jp
Akiyoshi Sannai, RIKEN AIP, Keio University, akiyoshi.sannai@riken.jp
Ken Kobayashi, Fujitsu Laboratories LTD., RIKEN AIP, Tokyo Tech, ken-kobayashi@fujitsu.com
Naoki Hamada, Fujitsu Laboratories LTD., RIKEN AIP, hamada-naoki@fujitsu.com

The Bézier simplex fitting is a novel data modeling technique which utilizes geometric structures of data to approximate the Pareto set of multi-objective optimization problems. There are two fitting methods based on different sampling strategies: the inductive skeleton fitting employs a stratified subsampling from the skeletons of a simplex, whereas the all-at-once fitting uses a non-stratified sampling which treats a simplex as a single object. In this paper, we analyze the asymptotic risks of these Bézier simplex fitting methods and derive the optimal subsample ratio for the inductive skeleton fitting. It is shown that the inductive skeleton fitting with the optimal ratio has a smaller risk when the degree of a Bézier simplex is less than three. These results are verified numerically under small to moderate sample sizes. In addition, we provide two complementary applications of our theory: a generalized location problem and a multi-objective hyper-parameter tuning of the group lasso. The former can be represented by a Bézier simplex of degree two, where the inductive skeleton fitting outperforms; the latter can be represented by a Bézier simplex of degree three, where the all-at-once fitting gets the advantage.

## 1 Introduction

Given functions $f_1, \dots, f_M : X \to \mathbb{R}$ on a subset $X$ of a Euclidean space $\mathbb{R}^N$, consider the multi-objective optimization problem

$$ \text{minimize } f(x) := (f_1(x), \dots, f_M(x)) \quad \text{subject to } x \in X\ (\subseteq \mathbb{R}^N) $$

with respect to the Pareto ordering defined as follows:

$$ x \preceq y \ \overset{\text{def}}{\Longleftrightarrow}\ \forall i\, [f_i(x) \le f_i(y)] \ \wedge\ \exists j\, [f_j(x) < f_j(y)]. $$

The goal is to find the Pareto set and its image, called the Pareto front, which are denoted by

$$ X^*(f) := \{\, x \in X \mid \nexists y \in X\ [y \preceq x] \,\}, \qquad f(X^*(f)) := \{\, f(x) \in \mathbb{R}^M \mid x \in X^*(f) \,\}, $$

respectively.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Most numerical optimization approaches (e.g., goal programming (Miettinen 1999; Eichfelder 2008), evolutionary computation (Deb 2001; Zhang and Li 2007; Deb and Jain 2014), homotopy methods (Hillermeier 2001; Harada et al. 2007), and Bayesian optimization (Hernández-Lobato et al. 2016; Yang et al. 2019)) give a finite number of points as an approximation of the Pareto set or front. Since the Pareto set and front usually contain infinitely many points, such a point approximation cannot reveal their complete shapes. In order to gain richer information, we consider in this paper a fitting problem of the Pareto set and front.

It is known that the Pareto set and front often have skeleton structures that can be used to enhance fitting accuracy. An $M$-objective problem is *simplicial* if the Pareto set and front are homeomorphic to an $(M-1)$-dimensional simplex and each $m$-dimensional subsimplex corresponds to the Pareto set of an $(m+1)$-objective subproblem for all $0 \le m \le M-1$ (see (Hamada et al. 2019) for a precise definition; an example is shown in Figure 1). Many practical problems are simplicial: location problems (Kuhn 1967) and a phenotypic divergence model in evolutionary biology (Shoval et al. 2012) have been shown to be simplicial, and an airplane design (Mastroddi and Gemma 2013) and a hydrologic modeling (Vrugt et al. 2003) have numerical solutions which imply that those problems are simplicial.
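As a concrete illustration of the Pareto ordering above, the following minimal sketch extracts the non-dominated points from a finite sample of objective vectors. The function names `dominates` and `pareto_front` are ours, not from the paper:

```python
import numpy as np

def dominates(fx, fy):
    """True iff fx Pareto-dominates fy: fx <= fy in every objective
    and fx < fy in at least one (the ordering defined above)."""
    fx, fy = np.asarray(fx), np.asarray(fy)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def pareto_front(points):
    """Finite-sample Pareto front: keep the objective vectors that no
    other sampled vector dominates."""
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not any(dominates(q, p) for j, q in enumerate(pts) if j != i)]
    return pts[keep]
```

As the paper notes, such a finite set of non-dominated points is only a point approximation; the fitting methods below aim to recover the continuous shape behind it.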
The Pareto set and front of any simplicial problem can be approximated with arbitrary accuracy by a Bézier simplex of an appropriate degree (Kobayashi et al. 2019). There are two fitting algorithms for Bézier simplices: the all-at-once fitting is a naïve extension of the Borges-Pastva algorithm for Bézier curves (Borges and Pastva 2002), and the inductive skeleton fitting (Kobayashi et al. 2019) exploits the skeleton structure of simplicial problems discussed above.

An important problem class which is (generically) simplicial is that of strongly convex problems. It has been shown that many practical problems can be made strongly convex via appropriate transformations preserving the essential problem structure, i.e., the Pareto ordering and the topology (Hamada et al. 2019). For example, the multi-objective location problem (Kuhn 1967) becomes strongly convex by squaring each objective function. The resulting problem has a Pareto set that can be represented by a Bézier simplex of degree two (Hamada et al. 2019). As we will show in this paper, the group lasso (Yuan and Lin 2006) can be reformulated as a multi-objective simplicial problem. It has a twice-curving Pareto set that requires a Bézier simplex of degree three.

Figure 1: A simplicial problem $f = (f_1, f_2, f_3) : \mathbb{R}^3 \to \mathbb{R}^3$, shown in three panels: (a) the simplex $\Delta^2$, (b) the Pareto set $X^*(f)$, and (c) the Pareto front $f(X^*(f))$. An $M$-objective problem $f$ is simplicial if the following conditions are satisfied: (i) there exists a homeomorphism $\Phi : \Delta^{M-1} \to X^*(f)$ such that $\Phi(\Delta_I) = X^*(f_I)$ for all $I \subseteq \{1, \dots, M\}$; (ii) the restriction $f|_{X^*(f)} : X^*(f) \to \mathbb{R}^M$ is a topological embedding (and thus so is $f \circ \Phi : \Delta^{M-1} \to \mathbb{R}^M$).
The same reformulation can also be applied to a broad range of sparse modeling methods, including the (original) lasso (Tibshirani 1996), the fused lasso (Tibshirani et al. 2005), the smooth lasso (Hebiri and van de Geer 2011), and the elastic net (Zou and Hastie 2005). Since the required degree is observed to be problem-dependent, we need to understand the performance of the two Bézier simplex fittings with respect to the degree.

Moreover, use cases of the Bézier simplex fitting are not limited to post-optimal analysis; it can be applied to general data modeling problems as well. In the field of evolutionary biology, (Shoval et al. 2012) showed that the phenotype of a species distributes like a curved simplex. Such a distribution can be modeled by a Bézier simplex for a better understanding of biological phenomena.

In this paper, we study the asymptotic risk of the two fitting methods of the Bézier simplex, the all-at-once fitting and the inductive skeleton fitting, and compare their performance with respect to the degree. While asymptotics on a Euclidean space (having no boundary) is well studied, the Bézier simplex fitting is a regression method on a simplex (having a complex boundary, i.e., the skeleton), and its asymptotics have not been studied before. Our contributions are as follows:

- We have evaluated the asymptotic $\ell^2$-risk, as the sample size tends to infinity, of two Bézier simplex fitting methods: the all-at-once fitting and the inductive skeleton fitting.
- In terms of minimizing the asymptotic risk, we have derived the optimal ratio of subsample sizes for the inductive skeleton fitting.
- We have shown that the inductive skeleton fitting with the optimal ratio outperforms the all-at-once fitting when the degree of a Bézier simplex is two, whereas the all-at-once fitting has the advantage at degree three.
- We have demonstrated that the location problem and the group lasso are transformed into strongly convex problems, and their Pareto sets and fronts are approximated by a Bézier simplex, which numerically verifies the asymptotic results.

The rest of this paper is organized as follows: Section 2 describes the problem definition. Section 3 analyzes the asymptotic risks of the all-at-once fitting and the inductive skeleton fitting; for the inductive skeleton fitting, the optimal subsample ratio in terms of minimizing the risk is derived. Those analyses are verified in Section 4 via numerical experiments. Section 5 concludes the paper and addresses future work.

## 2 Problem definition

Let $M$ be a positive integer. The standard $(M-1)$-simplex is denoted by

$$ \Delta^{M-1} := \left\{ (t_1, \dots, t_M) \in \mathbb{R}^M \,\middle|\, \sum_{m=1}^M t_m = 1,\ t_m \ge 0 \right\}. $$

We define the $I$-subsimplex for an index set $I \subseteq \{1, \dots, M\}$ by

$$ \Delta^{M-1}_I := \{\, (t_1, \dots, t_M) \in \Delta^{M-1} \mid t_m = 0\ (m \notin I) \,\}. $$

In addition, the $m$-skeleton of $\Delta^{M-1}$ for an integer $0 \le m \le M-1$ is defined by

$$ \Delta^{(m)} := \bigcup_{I \subseteq \{1,\dots,M\} \text{ s.t. } |I| = m+1} \Delta^{M-1}_I. $$

### 2.1 Bézier simplex and its fitting methods

We denote the set of non-negative integers (including zero) by $\mathbb{N}$. Let $M, D, L$ be arbitrary integers in $\mathbb{N}$ and

$$ \mathbb{N}^M_D := \left\{ (d_1, \dots, d_M) \in \mathbb{N}^M \,\middle|\, \sum_{m=1}^M d_m = D \right\}. $$

As shown in Figure 2, an $(M-1)$-Bézier simplex of degree $D$ is a mapping $b : \Delta^{M-1} \to \mathbb{R}^L$ determined by control points $p_d \in \mathbb{R}^L$ $(d \in \mathbb{N}^M_D)$ as follows:

$$ b(t) := \sum_{d \in \mathbb{N}^M_D} \binom{D}{d} t^d p_d, $$

where $\binom{D}{d} := \frac{D!}{d_1! d_2! \cdots d_M!}$ is a multinomial coefficient and $t^d := t_1^{d_1} t_2^{d_2} \cdots t_M^{d_M}$ is a monomial (not a vector) for each $t := (t_1, \dots, t_M) \in \Delta^{M-1}$ and $d := (d_1, \dots, d_M) \in \mathbb{N}^M_D$.

Figure 2: A Bézier simplex for $M = 3$, $D = 3$.

(Kobayashi et al. 2019) proposed two Bézier simplex fitting algorithms: the all-at-once fitting and the inductive skeleton fitting. They differ not only in the fitting algorithm but also in the sampling strategy. The all-at-once fitting requires a training set

$$ S_N := \{\, (t_n, x_n) \in \Delta^{M-1} \times \mathbb{R}^L \mid n = 1, \dots, N \,\} $$

and adjusts all control points at once by minimizing the OLS loss

$$ \frac{1}{N} \sum_{n=1}^N \| x_n - b(t_n) \|^2. $$
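The defining sum over multi-indices can be sketched directly. A minimal evaluation routine, assuming control points are stored in a dict keyed by multi-index (the helper names are illustrative, not from the paper):

```python
import math
from itertools import product
import numpy as np

def multi_indices(M, D):
    """Enumerate N^M_D = { d in N^M : d_1 + ... + d_M = D }."""
    return [d for d in product(range(D + 1), repeat=M) if sum(d) == D]

def bezier_simplex(t, control_points, D):
    """Evaluate b(t) = sum_{d in N^M_D} binom(D, d) t^d p_d,
    with control_points a dict mapping multi-index d -> p_d."""
    t = np.asarray(t, dtype=float)
    b = np.zeros_like(next(iter(control_points.values())), dtype=float)
    for d in multi_indices(len(t), D):
        coef = math.factorial(D) // math.prod(math.factorial(k) for k in d)
        monomial = math.prod(tm ** dm for tm, dm in zip(t, d))  # 0**0 == 1
        b = b + coef * monomial * np.asarray(control_points[d], dtype=float)
    return b
```

For $D = 1$ the sum reduces to linear interpolation of the corner control points, which gives a quick sanity check of the multinomial coefficients.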
The inductive skeleton fitting, on the other hand, requires skeleton-wise sampled training sets

$$ S_{N^{(m)}} := \{\, (t^{(m)}_n, x^{(m)}_n) \in \Delta^{(m)} \times \mathbb{R}^L \mid n = 1, \dots, N^{(m)} \,\} \quad (m = 0, \dots, M-1). $$

It also partitions the control points into groups $p_{d^{(m)}}$ such that $d^{(m)}$ has exactly $m+1$ non-zero elements; such $p_{d^{(m)}}$ determine the $m$-skeleton of a Bézier simplex. The inductive skeleton fitting inductively adjusts $p_{d^{(m)}}$ from $m = 0$ to $M-1$ by minimizing the OLS loss on the $m$-skeleton:

$$ \frac{1}{N^{(m)}} \sum_{n=1}^{N^{(m)}} \| x^{(m)}_n - b(t^{(m)}_n) \|^2. $$

### 2.2 The $\ell^2$-risk

In this paper, we consider the following fitting problem. As Figure 3 illustrates, a sample point $(t, x) \in \Delta^{M-1} \times \mathbb{R}^L$ is taken from an unknown Bézier simplex $b : \Delta^{M-1} \to \mathbb{R}^L$ with additive Gaussian noise $\varepsilon \sim N(0, \sigma^2 I)$, that is, $x = b(t) + \varepsilon$. For the all-at-once fitting, $S_N = \{ (t_n, x_n) \}$ follows the uniform distribution on the domain of the Bézier simplex: $t_n \sim U(\Delta^{M-1})$ and $x_n = b(t_n) + \varepsilon_n$. For the inductive skeleton fitting, $S_{N^{(m)}} = \{ (t^{(m)}_n, x^{(m)}_n) \}$ follows the uniform distribution on the $m$-skeleton of the domain: $t^{(m)}_n \sim U(\Delta^{(m)})$ and $x^{(m)}_n = b(t^{(m)}_n) + \varepsilon^{(m)}_n$.

A Bézier simplex estimated from $S_N$ is denoted by $\hat b(t|S_N)$. For both methods, we asymptotically evaluate, as $N \to \infty$, the $\ell^2$-risk

$$ \mathbb{E}_{t \sim U(\Delta^{M-1})}\, \mathbb{E}_{S_N} \left\| b(t) - \hat b(t|S_N) \right\|^2. \tag{2} $$

For the inductive skeleton fitting, we put $S_N = S_{N^{(0)}} \cup \dots \cup S_{N^{(M-1)}}$ subject to $N = N^{(0)} + \dots + N^{(M-1)}$.

Figure 3: An illustration of taking a sample point on the true Bézier simplex with additive noise.

## 3 Asymptotic risk of Bézier simplex fitting

Let us first note that the difference inside the $\ell^2$-norm in (2) can itself be written as a Bézier simplex:

$$ b(t) - \hat b(t|S_N) = \sum_{A=1}^{|\mathbb{N}^M_D|} \binom{D}{d_A} t^{d_A} \tilde p_{d_A}. \tag{3} $$

Each $A$-th control point $\tilde p_{d_A}$ of this Bézier simplex is defined as the difference between the target control point $p_{d_A}$ and the model control point $\hat p_{d_A}(S_N)$, i.e., $\tilde p_{d_A} = p_{d_A} - \hat p_{d_A}(S_N)$.
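The sampling model of the all-at-once fitting can be sketched in a few lines, using the fact that the uniform distribution on the standard simplex is the Dirichlet distribution with all concentration parameters equal to 1 (the function name is illustrative):

```python
import numpy as np

def sample_training_set(b, M, L, N, sigma, rng):
    """Draw t_n ~ U(Delta^{M-1}) and x_n = b(t_n) + eps_n with
    eps_n ~ N(0, sigma^2 I_L): the sampling model for the all-at-once
    fitting described above."""
    T = rng.dirichlet(np.ones(M), size=N)   # N uniform points on the simplex
    X = np.stack([b(t) for t in T]) + sigma * rng.standard_normal((N, L))
    return T, X
```

Sampling on the $m$-skeleton for the inductive fitting would proceed the same way, drawing uniformly on a randomly chosen $I$-subsimplex with $|I| = m+1$.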
To simplify the summation notation, we introduce a $|\mathbb{N}^M_D| \times L$ matrix $P$ composed of the $l$-th elements of the $A$-th control points, and a column vector $z$:

$$ (P)_{Al} = (\tilde p_{d_A})_l, \qquad z = \left( \binom{D}{d_1} t^{d_1}, \dots, \binom{D}{d_{|\mathbb{N}^M_D|}} t^{d_{|\mathbb{N}^M_D|}} \right)^{\!\top}. \tag{4} $$

Note that $t^d = t_1^{d_1} t_2^{d_2} \cdots t_M^{d_M}$ is a scalar, and $z$ is a vector. Then the Bézier simplex (3) can be represented by the column vector $P^\top z$, and its squared norm equals $\| P^\top z \|^2 = z^\top P P^\top z$, or $\sum_{A,B} z_A z_B (P P^\top)_{AB}$ in components. The risk (2) is the expectation of this norm; $\mathbb{E}_t$ acts only on $z$ and $\mathbb{E}_{S_N}$ acts only on $P$. Therefore, we arrive at

$$ \sum_{d_A, d_B \in \mathbb{N}^M_D} \Sigma_{AB}\, \mathbb{E}_{S_N}\!\left[ (P P^\top)_{AB} \right], \tag{5} $$

$$ \Sigma_{AB} := \mathbb{E}_t[z_A z_B] = \binom{D}{d_A} \binom{D}{d_B} \mathbb{E}_t\!\left[ t^{d_A + d_B} \right]. \tag{6} $$

We can obtain a closed form of $\Sigma_{AB}$ by performing the integral $\mathbb{E}_t$ explicitly. The following theorem provides the result.

**Theorem 1** The matrix element $\Sigma_{AB}$ is given by

$$ \Sigma_{AB} = \binom{D}{d_A} \binom{D}{d_B} \bigg/ \left[ \binom{2D + M - 1}{M - 1} \binom{2D}{d_A + d_B} \right]. \tag{7} $$

The proof is provided in the supplementary materials.¹ Equation (5) shows that the asymptotic value of the risk function depends only on the choice of the matrix $P$.

¹ A longer version of this paper including the appendix is available at https://arxiv.org/abs/1906.06924

### 3.1 All-at-once fitting

The matrix $P$ determined by the all-at-once fitting algorithm, $P_{\text{AAO}}$, minimizes the OLS loss

$$ \frac{1}{N} \sum_{n=1}^N \left\| b(t_n) + \varepsilon_n - \hat b(t_n) \right\|^2 = \frac{1}{N} \sum_{n=1}^N \left\| P^\top z_n + \varepsilon_n \right\|^2 = \frac{1}{N} \left\| Z P + Y \right\|_F^2, \tag{8} $$

where $\|\cdot\|_F$ is the Frobenius norm. Here we introduced an $N \times |\mathbb{N}^M_D|$ matrix $Z$ and an $N \times L$ matrix $Y$:

$$ Z = [z_1\, z_2\, \cdots\, z_N]^\top, \qquad Y = [\varepsilon_1\, \varepsilon_2\, \cdots\, \varepsilon_N]^\top. \tag{9} $$

Minimizing (8) is a classical least-squares problem, and we get $P_{\text{AAO}} = -(Z^\top Z)^{-1} Z^\top Y$. Note that $Z$ consists of $N$ sample points on $\Delta^{M-1}$ and $Y$ is a set of $N$ noise vectors in $\mathbb{R}^L$. These are all independent, so the expectation $\mathbb{E}_{S_N}$ factorizes as $\mathbb{E}_Z \mathbb{E}_Y$.

**Calculation of the asymptotics.** We need to calculate the expectation of the matrix

$$ P_{\text{AAO}} P_{\text{AAO}}^\top = (Z^\top Z)^{-1} Z^\top Y Y^\top Z (Z^\top Z)^{-1} $$

over $Z$ and $Y$. As is easily checked, $\mathbb{E}_Y[Y Y^\top] = \sigma^2 L\, I_{N \times N}$, so we get

$$ \mathbb{E}_{S_N}\!\left[ P_{\text{AAO}} P_{\text{AAO}}^\top \right] = \sigma^2 L\, \mathbb{E}_Z\!\left[ (Z^\top Z)^{-1} \right]. \tag{10} $$
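The closed form in Theorem 1 can be cross-checked against the standard moment formula for the uniform distribution on the simplex, $\mathbb{E}_t[t^d] = (M-1)!\, \prod_m d_m! / (|d| + M - 1)!$. A small sketch, with helper names of our own choosing:

```python
import math
from itertools import product

def multinom(n, ks):
    """Multinomial coefficient n! / (k_1! ... k_M!)."""
    out = math.factorial(n)
    for k in ks:
        out //= math.factorial(k)
    return out

def sigma_theorem1(dA, dB, D, M):
    """Sigma_AB via the closed form of Theorem 1."""
    dAB = tuple(a + b for a, b in zip(dA, dB))
    return multinom(D, dA) * multinom(D, dB) / (
        math.comb(2 * D + M - 1, M - 1) * multinom(2 * D, dAB))

def sigma_moment(dA, dB, D, M):
    """Sigma_AB via definition (6) and the uniform-simplex moment
    E[t^d] = (M-1)! * prod(d_m!) / (|d| + M - 1)!."""
    dAB = [a + b for a, b in zip(dA, dB)]
    moment = (math.factorial(M - 1)
              * math.prod(math.factorial(d) for d in dAB)
              / math.factorial(sum(dAB) + M - 1))
    return multinom(D, dA) * multinom(D, dB) * moment
```

Both routes agree for every pair of multi-indices, which is exactly the content of Theorem 1.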
Now the matrix $\frac{1}{N} Z^\top Z$ is an average over the sample,

$$ \frac{1}{N} (Z^\top Z)_{AB} = \frac{1}{N} \sum_{n=1}^N \binom{D}{d_A} \binom{D}{d_B} t_n^{d_A + d_B}, \tag{11} $$

and by the law of large numbers it converges in probability, as $N \to \infty$, to the matrix $\Sigma_{AB}$ defined in (6) and (7):

$$ \frac{1}{N} (Z^\top Z)_{AB} \xrightarrow{\ p\ } \Sigma_{AB}. $$

To substitute this into (10), however, we need to guarantee that $\Sigma_{AB}$ has an inverse. We can show this by the following theorem.

**Theorem 2** Let $V_{M,D}$ be the vector space spanned by the monomials $\{\, t^d \mid d = (d_1, \dots, d_M) \in \mathbb{N}^M_D \,\}$. Then the map

$$ L : V_{M,D} \times V_{M,D} \to \mathbb{R}, \qquad (P, Q) \mapsto \int_{\Delta^{M-1}} P(t)\, Q(t)\, dt, $$

is a non-degenerate bilinear form. Moreover, the matrix corresponding to this bilinear form is $\Sigma_{AB}$ in (7). In particular, for any $D$ and $M$, the matrix $\Sigma_{AB}$ is non-singular.

The precise proof is given in the supplementary materials. In summary, our formula for the asymptotic form of the risk of the all-at-once fitting is

$$ R_{\text{AAO}} \simeq \frac{\sigma^2 L}{N} \sum_{A,B} \Sigma_{AB}\, (\Sigma^{-1})_{AB}. $$

We can further simplify the result by using

$$ \sum_{A,B} \Sigma_{AB}\, (\Sigma^{-1})_{AB} = |\mathbb{N}^M_D| = \binom{D + M - 1}{D}, $$

which is relatively easy to show (see the supplementary materials).

### 3.2 Inductive skeleton fitting

So far, we have not imposed any explicit order on the control point indices $A$. From now on, let us take the specific order

$$ P = \begin{bmatrix} P^{(0)} \\ P^{(1)} \\ \vdots \\ P^{(M-1)} \end{bmatrix}, \tag{12} $$

where $P^{(m)}$ is the submatrix of $P$ composed of the control points on $\Delta^{(m)}$. Similarly, we order the control point indices $d_A$ as

$$ [d^{(0)}_1, \dots, d^{(0)}_{n_0},\ d^{(1)}_1, \dots, d^{(1)}_{n_1},\ \dots,\ d^{(M-1)}_1, \dots, d^{(M-1)}_{n_{M-1}}], $$

where $d^{(m)}_n$ is the $n$-th index of the control points on the $m$-skeleton and $n_m$ is the number of control points on the $m$-skeleton.

The inductive skeleton fitting determines the control point matrices $P^{(m)}$ inductively from low to high $m = 0, 1, \dots, M-1$. That is, it first fits the vertices of a Bézier simplex by moving the control points of the lowest dimension ($P^{(0)}$); then it fits the edges by moving the control points of the second lowest dimension ($P^{(1)}$); this process continues with increasing dimension and finishes at the highest dimension ($P^{(M-1)}$). In the $m$-th step, sample points $t^{(m)}$ on $\Delta^{(m)}$ are given.
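The simplification $\sum_{A,B} \Sigma_{AB} (\Sigma^{-1})_{AB} = |\mathbb{N}^M_D|$ can also be checked numerically by assembling $\Sigma$ from the closed form of Theorem 1. A sketch (helper names are ours); since $\Sigma$ is symmetric, the elementwise sum equals $\operatorname{tr}(\Sigma \Sigma^{-1})$:

```python
import math
from itertools import product
import numpy as np

def multinom(n, ks):
    """Multinomial coefficient n! / (k_1! ... k_M!)."""
    out = math.factorial(n)
    for k in ks:
        out //= math.factorial(k)
    return out

def sigma_matrix(M, D):
    """Assemble Sigma over all index pairs from the Theorem 1 closed form."""
    idx = [d for d in product(range(D + 1), repeat=M) if sum(d) == D]
    norm = math.comb(2 * D + M - 1, M - 1)
    S = np.empty((len(idx), len(idx)))
    for A, dA in enumerate(idx):
        for B, dB in enumerate(idx):
            dAB = tuple(a + b for a, b in zip(dA, dB))
            S[A, B] = multinom(D, dA) * multinom(D, dB) / (norm * multinom(2 * D, dAB))
    return S

M, D = 3, 2
S = sigma_matrix(M, D)
# elementwise sum Sigma_AB * (Sigma^{-1})_AB = tr(Sigma Sigma^{-1}) = |N^M_D|
total = float(np.sum(S * np.linalg.inv(S)))
```

For $M = 3$, $D = 2$ this recovers $\binom{D+M-1}{D} = \binom{4}{2} = 6$, matching the simplified risk formula.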
The corresponding $z^{(m)}$ defined in (4) has the following block form:

$$ z^{(m)} = [z^{(m)}[0]\ \ z^{(m)}[1]\ \ \cdots\ \ z^{(m)}[m]\ \ 0\ \cdots\ 0], $$

where

$$ z^{(m)}[k] = \left( \binom{D}{d^{(k)}_1} (t^{(m)})^{d^{(k)}_1}, \dots, \binom{D}{d^{(k)}_{n_k}} (t^{(m)})^{d^{(k)}_{n_k}} \right), $$

because for $k > m$ the monomial $(t^{(m)})^{d^{(k)}}$ contains a factor $0^{d_j}$ with $d_j \ge 1$ and hence vanishes. Thanks to these zeros, the OLS loss reduces as follows:

$$ \frac{1}{N^{(m)}} \sum_{n=1}^{N^{(m)}} \Big\| (P^{(m)})^\top z^{(m)}[m]_n + \sum_{k<m}