# on_lphyperparameter_learning_via_bilevel_nonsmooth_optimization__74ff884a.pdf

Journal of Machine Learning Research 22 (2021) 1-47 Submitted 7/18; Revised 6/21; Published 9/21

On ℓp-hyperparameter Learning via Bilevel Nonsmooth Optimization

Takayuki Okuno TAKAYUKI.OKUNO.KS@RIKEN.JP Center for Advanced Intelligence Project, RIKEN Tokyo 103-0027, Japan

Akiko Takeda TAKEDA@MIST.I.U-TOKYO.AC.JP Graduate School of Information Science and Technology, The University of Tokyo Tokyo 113-8656, Japan; Center for Advanced Intelligence Project, RIKEN Tokyo 103-0027, Japan

Akihiro Kawana KAWANA.AK.PP@GMAIL.COM Department of Industrial Engineering and Economics, Tokyo Institute of Technology Tokyo 152-8550, Japan (This research was conducted when he was a student at Tokyo Institute of Technology, and is completely irrevalent to the present company.)

Motokazu Watanabe

MWATANABE@G.ECC.U-TOKYO.AC.JP Department of Mathematical Informatics, The University of Tokyo Tokyo 113-8656, Japan; Present Address: Tokio Marine & Nichido Fire Insurance Co., Ltd., Tokyo, Japan (This research was conducted when he was a student at The University of Tokyo, and is completely irrevalent to the present company.)

Editor: Ryota Tomioka

We propose a bilevel optimization strategy for selecting the best hyperparameter value for the nonsmooth ℓp regularizer with 0 < p 1. The concerned bilevel optimization problem has a nonsmooth, possibly nonconvex, ℓp-regularized problem as the lower-level problem. Despite the recent popularity of nonconvex ℓp-regularizer and the usefulness of bilevel optimization for selecting hyperparameters, algorithms for such bilevel problems have not been studied because of the difﬁculty of ℓp-regularizer. Our contribution is the proposal of the ﬁrst algorithm equipped with a theoretical guarantee for ﬁnding the best hyperparameter of ℓp-regularized supervised learning problems. Speciﬁcally, we propose a smoothing-type algorithm for the above mentioned bilevel optimization problems and provide a theoretical convergence guarantee for the algorithm. Indeed, since optimality conditions are not known for such bilevel optimization problems so far, new necessary optimality conditions, which are called the SB-KKT conditions, are derived and it is shown that a sequence generated by

c 2021 Takayuki Okuno, Akiko Takeda, Akihiro Kawana, and Motokazu Watanabe.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/18-485.html.

OKUNO, TAKEDA, KAWANA, AND WATANABE

the proposed algorithm actually accumulates at a point satisfying the SB-KKT conditions under some mild assumptions. The proposed algorithm is simple and scalable as our numerical comparison to Bayesian optimization and grid search indicates. Keywords: Hyperparameter optimization, bilevel optimization, ℓp-regularizer, smoothing method,

1. Introduction

Hyperparameters are parameters that are set manually outside of a learning algorithm in the context of machine learning. Hyperparameters often play important roles in exhibiting a high prediction performance. For example, a regularization parameter controls a trade-off between the regularization (that is, model complexity) and the training set error (that is, empirical error). If the hyperparameters are tuned properly, the predictive performance of learning algorithms will be increased. Hyperparameter optimization or learning is the task of ﬁnding (near) optimal values of hyperparameters. There are mainly a few methods currently in use for supervised learning. The most popular one would be grid search. The method is to divide the space of possible hyperparameter values into regular intervals (a grid), train a learning model using training data for all values on the grid sequentially or preferably in parallel, and choose the best one with the highest prediction accuracy tested on validation, for example, by using cross validation. There are other techniques for hyperparameter optimization; random search that evaluates learning models for randomly sampled hyperparameter values or more sophisticated method called Bayesian optimization (Mockus et al., 1978). To ﬁnd a classiﬁer/regressor with good prediction performance, it is reasonable to minimize the validation error in terms of hyperparameters. However we do not know the explicit form of the validation error function, say f, represented with hyperparameter, while we are able to compute the validation error of a classiﬁer/regressor obtained with given hyperparameters λ, namely, f(λ). For such a black-box (meaning unknown) objective function f, Bayesian optimization algorithms use previous observational values f(λ) at some hyperparameter values λ to determine the next point λ+ to evaluate f(λ+). This is based on the assumption that the function f is described by a Gaussian process as a prior. There are still essential questions unresolved; how to select a kernel for the Gaussian process, how to select the range of values to search in, and lots of implementation details. As regards a comprehensive survey of hyperparameter optimization, we refer to the article (Feurer and Hutter, 2019). Bilevel optimization is a more direct approach for ﬁnding a best set of hyperparameter values. A bilevel optimization problem consists of two-level optimization problems; the upper-level problem minimizes the validation error in terms of hyperparameters and the lower-level problem ﬁnds a best ﬁt line for training data combined with a regularizer using given hyperparameter values. Actually, the spirit of bilevel optimization underlies the methods introduced above. As mentioned below, some existing works pointed out the usefulness of the bilevel formulation for some classes of hyperparameters. However, there is still a lot of room to pursue the bilevel approach further, in particular, for hyperparameter optimization of nonsmooth regularizers.

1.1 Our Contribution

The purpose of this paper is to provide a bilevel optimization approach for ﬁnding a best set of hyperparameter values for the nonsmooth ℓp (p 1) regularizer. The nonsmooth bilevel optimization approaches examined here are entirely novel in the ﬁeld of mathematical optimization too. In recent years, research on sparse optimization using nonconvex nonsmooth regularizers has been actively

BILEVEL ℓp-HYPERPARAMETER LEARNING

conducted in machine learning (Gong et al., 2013; Hu et al., 2017), signal/image processing (Chen et al., 2010; Hintermüller and Wu, 2013; Wen et al., 2017; Marjanovic and Solo, 2013), and continuous optimization (Ge et al., 2011; Lai and Wang, 2011; Bian and Chen, 2013; Chen et al., 2014; Bian et al., 2015; Bian and Chen, 2017). In particular, for the purpose of ﬁnding a sparse solution, the ℓp-regularizer with 0 < p < 1 is reported to be effective in wide applications such as matrix completion (Marjanovic and Solo, 2012), de-noising (Marjanovic and Solo, 2013), compressing sensing (Zheng et al., 2016; Wen et al., 2016; Weng et al., 2016), CT (computed tomography) reconstruction (Miao and Yu, 2016), machine learning (Xu et al., 2012) and so on. Refer to the survey article (Wen et al., 2018) concerning nonconvex regularizers including the ℓp-regularizer, and also see references therein. In spite of plenty of researches supporting the efﬁciency of the ℓp-regularizer, there exist fewer studies on bilevel optimization approaches to hyperparameter learning or optimization of this regularizer. A possible speciﬁc reason is that tractable optimality conditions for the arising bilevel problem have not been developed yet because of the ℓp-regularizer s nonsmoothness or high nonconvexity when p < 1. Moreover, there are no practical algorithms ensured of convergence to a meaningful point for that nonsmooth bilevel problem.

Our contribution in this paper is the proposal of an algorithm with a theoretical convergence guarantee for solving an ℓp-hyperparameter optimization problem, namely, the problem of ﬁnding the best hyperparameter of an ℓp-regularized supervised learning problem. Speciﬁcally, we ﬁrst formulate it as a bilevel optimization problem with a nonsmooth and nonconvex lower-level problem having the ℓp-regularizer. Since no optimality conditions have been explored adequately for such bilevel problems so far, we develop new optimality conditions, named scaled bilevel KKT (SB-KKT) conditions. As a matter of fact, the SB-KKT conditions can be cast as an extension of the scaled ﬁrst-order optimality conditions for some class of non-Lipschitz optimization problems originally given in the articles (Chen et al., 2010, 2013; Bian and Chen, 2017). We prove that these conditions are nothing but necessary optimality conditions for the one-level optimization problem acquired by replacing the lower-level problem with its scaled ﬁrst-order optimality conditions. We moreover propose an iterative algorithm for solving the bilevel optimization problem with a nonsmooth and nonconvex lower-level problem. One natural way for tackling such a problem would be formulating it as a one-level problem by replacing its lower-level-problem constraints with the ﬁrst-order optimality condition formed by the subdifferential (that is, the set of subgradients) of the ℓp-regularizer. However, it is still nontrivial how we solve the resulting one-level problem with point-to-set mapping constraints. To avoid this difﬁculty, we apply a smoothing technique for the ℓp-regularizer, which enables us to make use of a gradient of the smoothed regularizer. As a result, we obtain a one-level problem whose constraints are represented in terms of only smooth equations and inequalities. In the presented smoothing algorithm, we generate a sequence of KKT solutions of the smoothed problems while we control the degree of smoothing approximation. We will prove that a sequence generated by this algorithm accumulates at a point satisfying the SB-KKT conditions under some mild assumptions. Numerical experiments support the scalability of our algorithm compared to Bayesian optimization and grid search. Finally, we discuss extension of the SB-KKT conditions and the proposed algorithm to other regularizers such as SCAD and MCP.

1.2 Related Work on Bilevel Approach

Most existing bilevel optimization models assume convexity and/or smoothness for all functions or at least once differentiability for the lower-level objective functions. If it is not once differentiable,

OKUNO, TAKEDA, KAWANA, AND WATANABE

we need to overcome the difﬁculty of selecting a subgradient to guarantee descent of the upper-level gradient when solving such a problem.

Bilevel Formulations for Hyperparameter Opt. There are no existing works on bilevel hyperparameter optimization approach for our model and existing works are restricted to smooth and convex machine learning models. A pioneer work in the line was by Bennett et al. (2006, 2008). They formulated the selection technique of cross-validation for support vector regression as a bilevel optimization problem, equivalently transformed it into a one-level nonconvex optimization problem whose constraints are the Karush-Kuhn-Tucker (KKT) optimality conditions of the lower-level problem and proposed two approaches to solve the nonconvex problem. Moore et al. (2009, 2011) gave a bilevel optimization formulation for a nonsmooth and convex machine learning model, support vector regression (SVR), while their proposed algorithms assume that the lower-level objective functions are at least once differentiable. Pedregosa (2016) gave a bilevel optimization formulation for more general supervised learning problems, but the assumption of differentiability has been still imposed for all functions. More recently, Franceschi et al. (2018) gave a uniﬁed bilevel perspective on hyperparameter optimization and meta learning, and presented a gradient-based algorithm using automatic differentiation techniques, which assumes the smoothness of the lower-level objective function.

Bilevel Optimization Algorithms As far as we investigated, the convergence analysis for bilevel problems with nonconvex nonsmooth regularizers has not been studied before. Many studies on bilevel optimization in optimization community transform bilevel optimization problems into the one-level formulations via the ﬁrst-order optimality conditions for lower-level problems by assuming the differentiability of the functions, and focus on investigating theoretical properties for constraint qualiﬁcations and optimality conditions. See, for example, (Ye and Zhu, 1995; Dempe et al., 2006; Dempe and Zemkoho, 2011, 2013; Dempe et al., 2015). Recently, Ochs et al. (2016) proposed techniques for solving bilevel optimization problems with non-smooth convex lower level problems. They considered a gradient-based method for the optimization problem obtained by substituting a smoothly approximated solution mapping of the lower-level problem into the upper level problem. However, theoretical analysis concerning the limiting behavior of the derivatives of the approximated solution mappings was left to future work and the proposed method was written to be heuristic in the paper. Kunisch and Pock (2013) and Rosset (2009) considered bilevel optimization problems having the ℓp-regularizer, which are similar to our problem, but the p was mainly restricted to 1 or 2. Especially, the case of p = 0.5 only appears in the numerical experiments in Kunisch and Pock (2013) without any theoretical support, though some convergence analysis is shown for the semismooth Newton algorithms for the case of p = 1. Another stream of bilevel algorithms is based on the reformulation as one-level problem by replacing the lower-level problem with a dynamical system, which arises in an iterative algorithm such as proximal gradient-type methods for solving the lower-level problem. The approach is employed for hyperparameter optimization by, for example, Lorraine et al. (2020), Franceschi et al. (2017, 2018), Maclaurin et al. (2015), and Shaban et al. (2019). In their theoretical analysis, nonconvex and nonsmooth functions, which we will handle in this paper, are not supposed to be contained by the lower-level objective one.

Notations. In this paper, we often denote a vector z Rd by z = (z1, z2, . . . , zd) and write limℓ L zℓ= z to represent that, given a sequence {zℓ}, a subsequence {zℓ}ℓ L with L

BILEVEL ℓp-HYPERPARAMETER LEARNING

{1, 2, . . . , } converges to z . The ℓ-th vector zℓ Rd is often represented as zℓ:= (zℓ 1, zℓ 2, . . . , zℓ d) . We also denote the d-dimensional non-negative (positive) orthants by Rd +(++) := {z Rd | zi (>)0 (i = 1, 2, . . . , d)}. For a set of vectors {vi}i I Rm with I := {i1, i2, . . . , ip}, we deﬁne (vi)i I := (vi1, vi2, . . . , vip) Rm p. We denote the sign function by sgn : R { 1, 0, +1}, that is, sgn(x) := 1 (x > 0), 0 (x = 0), and 1 (x < 0) for any x R. For a differentiable function h : Rn R, we denote the gradient function from Rn to Rn

by h, that is, h(x) := ( h(x)

x1 , . . . , h(x)

xn ) Rn for x Rn, where h(x)

xi stands for the partial differential of h with respect to xi for i = 1, 2, . . . , n. To express the gradient of h with respect to a sub-vector x := (xi) i I of x with I := {i1, i2, . . . , ip} {1, 2, . . . , n}, we write

xh(x) := h(x)

xi2 , . . . , h(x)

R|I|. We often write g(x)|x= x ( xg(x)|x= x) or h( x) ( xh( x)) to represent the (partial) gradient value of g at x = x. Moreover, when h is twice differentiable, we denote the Hessian of h by 2h : Rn Rn n, that is, 2h(x) := 2h(x) xi xj

1 i,j n Rn n.

Organization of the Paper The rest of this paper is organized as follows: In Section 2, we describe our problem setting precisely. In Section 3, we propose a smoothing algorithm for solving the targeted problem. In Section 4, we present new necessary optimality conditions of the problem, already refereed to as the SB-KKT conditions. We also conduct the convergence analysis of the proposed smoothing algorithm. In Section 5, we examine the efﬁciency of the proposed algorithm by means of numerical experiments using real data sets. In Section 6, we discuss extension of the proposed algorithm to other classes of problems. Finally, in Section 7, we conclude this paper. In Appendix, we provide some proofs omitted in the main part together with other supplementary materials.

2. Formulation

We consider the following bilevel optimization problem with a nonsmooth, possibly nonconvex, lower-level problem:

min w λ,λ f(w λ) s.t. w λ argmin w Rn

i=1 λi Ri(w)

Suppose that f : Rn R is once continuously differentiable, λ := (λ1, λ2, . . . , λr) Rr, R1(w) := w p p = Pn i=1 |wi|p (0 < p 1), and the functions R2, , Rr, and g are twice continuously differentiable functions. We call the whole problem (1) and minw Rn g(w) + Pr i=1 λi Ri(w) the upperand lower-level problem, respectively. To make our notation simple, we often use the function

G(w, λ) := g(w) +

i=2 λi Ri(w),

with λ := (λ2, . . . , λr) Rr 1 for expressing the lower-level problem as

min w Rn G(w, λ) + λ1R1(w).

Note that the function R1 is nonconvex when p < 1 and nonsmooth, though some differentiability is assumed for other terms. We also remark that the proposed smoothing algorithm can be tailored to

OKUNO, TAKEDA, KAWANA, AND WATANABE

problems that contain multiple nonsmooth regularizers as long as suitable smoothing functions (see Section 3 for the deﬁnition) are found. Nonetheless, it is unclear whether the theoretical analysis can be established under such a setting.

2.1 Examples of Functions g, Pr i=1 λi Ri, and f

When using the following loss function as the function g:

g(w) = Pmtr i=1( yi x i w)2 for training samples ( yi, xi) R Rn, i = 1, , mtr

g(w) = Pmtr i=1 log(1 + exp( yi x i w)) for training samples ( yi, xi) {+1, 1} Rn, i = 1, , mtr,

the lower-level optimization problem in (1) corresponds to minimizing the ℓ2-loss function for regression and the logistic-loss function for binary classiﬁcation, respectively, combined with some regularization including w p p for a given hyperparameter vector λ. This type of problem whose regularizer includes w p p is called a sparse optimization problem. Various well-known sparse regularizers can be expressed by Pr i=1 λi Ri(w). For example,

ℓ1 regularizer: λ1 w 1,

elastic net regularizer: λ1 w 1 + λ2 w 2 2,

nonconvex regularizer: λ1 w q q with 0 < q < 1.

What we want to do is to ﬁnd the best hyperparameter values of λ which lead to small validation error. The upper-level problem can ﬁnd such values for λ. By setting the same loss function with g for f but deﬁned by validation samples (ˆyj, ˆxj), j = 1, , mval, the upper-level problem ﬁnds the best hyperparameter values which minimize the validation error, which is deﬁned by f(w) = Pmval i=1 (ˆyi ˆx i w)2 for the ℓ2-loss or f(w) = Pmval i=1 log(1 + exp( ˆyi ˆx i w)) for the logistic-loss.

3. Smoothing Method for Nonconvex Nonsmooth Bilevel Program

For problem (1), one may think of the one-level problem obtained by replacing the lower problem constraint with its ﬁrst-order optimality condition (Rockafellar and Wets, 2009, 10.1 Theorem) represented in terms of (general) subgradient1, namely,

min w,λ f(w) s.t. 0 w(G(w, λ) + λ1R1(w)), λ 0. (2)

Notice that G(w, λ) + λ1R1(w) is not convex with respect to w generally. Hence, the feasible region of (2) can be larger than that of the original problem (1) because not only the global optimal solutions of the lower-level problem but also its local optimal solutions are feasible solutions for (2). In that sense, problem (2) is modiﬁed from the original one, but solving it may lead to better prediction performance because the best hyperparameter λ is searched in the wider space and above all, there is no way to solve the bilevel optimization problem (1) as it is.

1. For precise deﬁnitions of a subgradient of a nonconvex function, see Appendix A.2 in this paper or Chapter 8 of the book (Rockafellar and Wets, 2009).

BILEVEL ℓp-HYPERPARAMETER LEARNING

3.1 Smoothing method

In our approach for tackling problem (1), we will use the smoothing method, which is one of the most powerful methodologies developed for solving nonsmooth equations, nonsmooth optimization problems, and so on. Fundamentally, the smoothing method solves smoothed optimization problems or equations sequentially to produce a sequence converging to a point that satisﬁes some optimality conditions of the original nonsmooth problem. The smoothed problems solved therein are obtained by replacing the nonsmooth functions with so-called smoothing functions. Let ϕ0 : Rn R be a nonsmooth function. Then, we say that ϕ : Rn R+ R is a smoothing function of ϕ0 when (i) ϕ( , ) is continuous and ϕ( , µ) is continuously differentiable for any µ > 0; (ii) lim w w,µ 0+ ϕ( w, µ) = ϕ0(w) for any w Rn. In particular, we call µ 0 a smoothing parameter. For more details on smoothing methods, see the comprehensive survey article (Chen, 2012) and also relevant articles (Nesterov, 2005; Beck and Teboulle, 2012).

3.2 Our approach

We propose a smoothing based method for solving (1). In the method, we replace the nonsmooth, possibly nonconvex, term R1(w) = w p p in (1) by the following smoothing function:

i=1 (w2 i + µ2) p 2 .

We then have the following bilevel problem approximating the original one (1):

min w λ,λ f(w λ) s.t. w λ argmin w Rn G(w, λ) + λ1ϕµ(w) , λ 0

which naturally leads to the following one-level problem:

min w,λ f(w)

s.t. w G(w, λ) + λ1 ϕµ(w) = 0 λ 0.

Note that problem (3) is smooth since the function ϕµ is twice continuously differentiable 2 when µ = 0. Hence, we can consider the Karush-Kuhn-Tucker (KKT) conditions for this problem. Let us explain the proposed method in detail. To this end, for a parameter ˆε > 0, we deﬁne an ˆε-approximate KKT point for problem (3). We say that (w, λ, ζ, η) Rn Rr Rn Rr is an ˆεapproximate KKT point for (3) if there exists a vector (ε1, ε2, ε3, ε4, ε5) Rn R Rr 1 Rn R such that

f(w) + 2 ww G(w, λ) + λ1 2ϕµ(w) ζ = ε1, (4)

ϕµ(w) ζ η1 = ε2, (5)

Ri(w) ζ ηi = (ε3)i (i = 2, 3, . . . , r), (6)

w G(w, λ) + λ1 ϕµ(w) = ε4, (7)

0 λ, 0 η, λ η = ε5, (8)

2. Huber s function (Beck and Teboulle, 2012) is a popular smoothing function of R1( ), but is not twice continuously differentiable.

OKUNO, TAKEDA, KAWANA, AND WATANABE

Algorithm 1 Smoothing Method for Nonsmooth Bilevel Program

Require: Choose µ0 = 0, β1, β2 (0, 1) and ˆε0 0. Set k 0.

2: Find an ˆεk-approximate KKT point (wk+1, λk+1, ζk+1, ηk+1) for problem (3) with µ = µk.

3: Update the smoothing and error parameters by µk+1 β1µk and ˆεk+1 β2ˆεk.

4: k k + 1.

5: until convergence of (wk, λk, ζk, ηk).

and (ε1, ε2, ε3, ε4, ε5) ˆε,

where 2 ww G(w, λ) is the Hessian of G with respect to w. Notice that an ˆε-approximate KKT point is nothing but a KKT point3 for problem (3) if ˆε = 0. Hence, ζ Rn and η Rr

are regarded as approximate Lagrange multiplier vectors corresponding to the equality constraint w G(w, λ) + λ1 ϕµ(w) = 0 and the inequality constraints λ 0, respectively. The proposed algorithm produces a sequence of ˆε-approximate KKT points for problem (3) while decreasing the values of ˆε and µ to 0. Precisely, it is described as in Algorithm 1. Though, in Algorithm 1, we do not designate any means for computing an ˆε-approximate KKT point for (3), sequential quadratic programming (SQP) methods (Nocedal and Wright, 2006) are promising candidates. However, since such SQP methods are designed for solving general constrained problems, we may develop more efﬁcient algorithms by exploiting structure of individual problems. This issue will be discussed later in Section 5. See also Appendix B. As for practical stopping criteria of Algorithm 1, we make use of the scaled bilevel SB-KKT conditions studied in the subsequent section.

4. Theoretical Results

In this section, we will prove the global convergence of Algorithm 1 by investigating an accumulation point of a sequence generated by that algorithm. For this purpose, in Section 4.1, we ﬁrst present new optimality conditions for the original bilevel problem (1), named scaled bilevel KKT (SB-KKT) conditions. Moreover, in Section 4.2, we prove that any accumulation point of a sequence generated by Algorithm 1 actually satisﬁes the SB-KKT conditions under some assumptions. Throughout the section, we often use the following notations for w Rn:

I(w) := {i {1, 2, . . . , n} | wi = 0}, |w|p := (|w1|p, |w2|p, . . . , |wn|p) .

4.1 SB-KKT Conditions

Now, we deﬁne the SB-KKT conditions for problem (1):

Deﬁnition 1 We say that the scaled bilevel Karush-Kuhn-Tucker (SB-KKT) conditions hold at (w , λ ) Rn Rr for problem (1) when there exists a pair of vectors (ζ , η ) Rn Rr such

3. Note that (5) and (6) with (ε2, (ε3)2, . . . , (ε3)r) = 0 can be obtained from

λf(w) + λ w G(w, λ) + λ1 ϕµ(w) ζ η = 0.

BILEVEL ℓp-HYPERPARAMETER LEARNING

W 2 f(w ) + H(w , λ )ζ = 0, (9)

W w G(w , λ ) + pλ 1|w |p = 0, (10)

p P i/ I(w ) sgn(w i )|w i |p 1ζ i = η 1, (11)

ζ i = 0 (i I(w )), (12)

Ri(w ) ζ = η i (i = 2, 3, . . . , r), (13)

0 λ , 0 η , (λ ) η = 0, (14)

where W := diag(w ). Here, we write

H(w, λ) := W 2 2 w G(w, λ) + λ1p(p 1)diag(|w|p)

with W := diag(w) for w Rn and λ Rr. In particular, we call a point (w , λ ) Rn Rr

satisfying the above conditions (9) (14) an SB-KKT point for problem (1).

In fact, (0, λ ) is a trivial SB-KKT point for any λ . This can be checked by setting (ζ , η ) = (0, 0) in the conditions (9) (14). Experimentally, started from a point apart from such a trivial point, Algorithm 1 ﬁnds non-trivial SB-KKT points in many cases. We next prove that the SB-KKT conditions are necessary optimality conditions for a certain one-level problem. For this purpose, we derive the scaled ﬁrst-order necessary condition (Chen et al., 2010) for the lower-level problem in (1):

min w Rn G(w, λ) + λ1 w p p. (15)

We say that the scaled ﬁrst-order necessary condition of (15) holds at w if

W w G(w , λ) + pλ1|w |p = 0. (16)

Indeed, a local optimum w of (15) satisﬁes the above condition. This fact can be veriﬁed easily by following the proof of Theorem 2.1 in the article (Chen et al., 2010). The above scaled condition was originally presented in the articles (Chen et al., 2010, 2013; Bian and Chen, 2017) for some optimization problems admitting non-Lipschitz functions. As in deriving (2), we obtain the following one-level problem by replacing the lower problem in (1) with the scaled ﬁrst-order necessary condition (16):

min w,λ f(w) s.t. W w G(w, λ) + pλ1|w|p = 0, λ 0. (17)

As well as (2), the feasible region of (17) includes not only the global optimal solution of the lower-level problem in the original problem (1) but also its local solutions. Notice that the above problem is still nonsmooth due to the existence of |w|p. The following theorem states that the SB-KKT conditions are necessary optimality conditions for (17). Here, we just give an outline of the proof and defer its detail to Appendix A.1.

Theorem 2 Let (w , λ ) Rn Rr be a local optimum of (17). Then, (w , λ ) together with some vectors ζ Rn and η Rr satisﬁes the SB-KKT conditions (9) (14) under an appropriate constraint qualiﬁcation concerning the constraints G(w, λ)

wi + p sgn(wi)λ1|wi|p 1 = 0 (i / I(w )), wi = 0 (i I(w )), and λ 0.

OKUNO, TAKEDA, KAWANA, AND WATANABE

Sketch of the proof: Notice that (w , λ ) is a local optimum of the following problem:

min w,λ f(w)

s.t. G(w, λ)

wi + p sgn(wi)λ1|wi|p 1 = 0 (i / I(w ))

wi = 0 (i I(w )) λ 0.

This is because (w , λ ) is also feasible to (18) and the feasible region of (17) is larger than that of (18). Hence, the KKT conditions hold at (w , λ ) for (18) in the presence of a constraint qualiﬁcation. Finally, these KKT conditions can be equivalently transformed into the desired SB-KKT conditions. In the next section, we will study convergence analysis of Algorithm 1 to an SB-KKT point. Before proceeding to the convergence analysis, let us see the relationship between the two one-level problems (2) and (17). The following lemma concerns the feasible regions of (2) and (17).

Lemma 3 For w Rn and λ Rr +, if 0 w(G(w, λ) + λ1R1(w)), then W w G(w, λ) + pλ1|w|p = 0. In particular, when p < 1, the converse is also true.

Proof See Appendix A.2.

In view of the above lemma, we ﬁnd that the feasible region of (17) is larger than that of (2) in general. However, for the case of p < 1, we also see that these two regions are identical. These relationships are summarized as in the following diagram.

Feasible region of (1) Feasible region of (2) p=1 =p<1

Feasible region of (17)

From this observation and Theorem 2, we can derive the following theorem immediately:

Theorem 4 Let p < 1 and (w , λ ) Rn Rr be a local optimum of (2). Then, (w , λ ) together with some vectors ζ Rn and η Rr satisﬁes the SB-KKT conditions (9) (14) under an appropriate constraint qualiﬁcation concerning the constraints G(w, λ)

wi + p sgn(wi)λ1|wi|p 1 = 0 (i / I(w )), wi = 0 (i I(w )), and λ 0.

4.2 Convergence of Algorithm 1 to an SB-KKT Point

Hereafter, for convenience of explanation, we suppose that an ˆεk 1-approximate KKT point (wk, λk, ζk, ηk) is a solution satisfying conditions (4)-(8) with

(ε1, ε2, ε3, ε4, ε5) = (εk 1 1 , εk 1 2 , εk 1 3 , εk 1 4 , εk 1 5 ),

where {(εk 1, εk 2, εk 3, εk 4, εk 5)} is a sequence that converges to zero as k . Moreover, we suppose that the algorithm is well-deﬁned in the sense that an ˆεk-approximate KKT point of (3) is found in Step 2 at every iteration, and it generates an inﬁnite number of iteration points. In addition, we make the following assumptions:

Assumption A: Let {(wk, λk, ζk, ηk)} Rn Rr Rn Rr be a sequence produced by the proposed algorithm. Then, the following properties hold:

BILEVEL ℓp-HYPERPARAMETER LEARNING

A1: lim inf k λk 1 > 0.

A2: The sequence {(wk, λk, ζk, ηk)} is bounded.

A3: Let p = 1 and (w , λ ) be an arbitrary accumulation point of the sequence {(wk, λk)}. It then holds that λ 1 = G(w , λ )

for any i I(w ).

Assumption A1 means that the ℓp-regularization term, that is, the function R1 works effectively. We will discuss Assumption A2 at the end of this section. Speciﬁcally, we will prove that under certain conditions, the Lagrange multiplier part {ζk, ηk)} is actually bounded. Assumption A3 is a technical assumption for the case of p = 1. It indicates that, for all i I(w ), zero is not situated on the boundary of the subdifferential of G(w, λ) + λ w 1 w.r.t. wi. Interestingly, for the case of p < 1, we can establish the convergence of Algorithm 1 in the absence of A3. Under the presence of these assumptions, our goal is to prove the following convergence theorem, which motivates us to make a stopping criterion of the algorithm based on the SB-KKT conditions in the numerical experiments we will conduct later.

Theorem 5 Suppose that Assumptions A1 A3 hold. Then, any accumulation point of {(wk, λk, ζk, ηk)} generated by Algorithm 1 satisﬁes the SB-KKT conditions (9) (14) for problem (1).

We will prove this theorem by showing that passing the approximate KKT conditions (4)-(8) to the limit yields the SB-KKT conditions. For this purpose, in particular, we have to examine how ϕµk(wk) and 2ϕµk(wk) behave in the limit. We remark that each component of ϕµ(w) and each diagonal one of 2ϕµ(w) are expressed as

( ϕµ(w))i = pwi(w2 i + µ2) p 2 1, (19)

( 2ϕµ(w))ii = p(w2 i + µ2) p 2 1 + p(p 2)w2 i (w2 i + µ2) p 2 2 (20)

for i = 1, 2, . . . , n, µ > 0, and w Rn. Note that all the off-diagonal components of 2ϕµ(w) are zeros. We present the following proposition, whose proof is given in Appendix A.3.

Proposition 6 Let w be the point deﬁned in A3. Then, we have

lim k Wk ϕµk 1(wk) = p|w |p, (21)

lim k W 2 k 2ϕµk 1(wk) = p(p 1)diag(|w |p), (22)

where Wk := diag(wk) for each k.

Next, we prove that, for i I(w ), the i-th diagonal component of ( 2ϕµk 1(wk))ii diverges. The key of its proof is the approach speed of wi (i I(w )) towards zeros compared with that of the smoothing parameter µk 1. Actually, according to the next lemma, µk 1 gradually approaches 0 with the speed not faster than maxi I(w ) |wk i | 1 2 p . The proof will be given in Appendix A.4. Remarkably, when p < 1, it holds true in the absence of Assumption A3.

Lemma 7 Suppose that Assumptions A1 A3 hold. Let (w , λ ) be an arbitrary accumulation point of {(wk, λk)} and {(wk, λk)}k K( {(wk, λk)}) be an arbitrary subsequence converging to (w , λ ). Then, there exists some γ > 0 such that

µ2 k 1 γ|wk i | 2 2 p (i I(w ))

for all k K sufﬁciently large.

OKUNO, TAKEDA, KAWANA, AND WATANABE

From the above lemma, we can derive the following proposition.

Proposition 8 Suppose that Assumptions A1 A3 hold. Let w be an arbitrary accumulation point of the sequence {wk} and {wk}k K( {wk}) be an arbitrary subsequence converging to w . Then, for any i I(w ), lim k K

( 2ϕµk 1(wk))ii = .

Proof Choose i I(w ) arbitrarily. Note that

lim k K |wk i | = w i = 0. (23)

By Lemma 7, there is some γ > 0 such that

µ2 k 1 γ|wk i | 2 2 p (24)

for all k K sufﬁciently large. In view of this fact, we have, for all k K large enough,

µ2 k 1 + (p 1)(wk i )2 γ|wk i | 2 2 p + (p 1)|wk i |2

= |wk i | 2 2 p (γ + (p 1)|wk i |2 2 2 p )

where the second inequality can be veriﬁed by noting that γ + (p 1)|wk i |2 2 2 p > 0 holds for all

k K sufﬁciently large because γ > 0 and (p 1) limk K |wk i |2 2 2 p = (p 1)|w i |2 2 2 p = 0. Furthermore, notice that |wk i | 2 2 p |wk i |2

holds for all k K sufﬁciently large because 1 < 2 2 p 2 and |wk i | < 1 for all k K large enough by (23). Relation (24) then implies µ2 k 1 γ|wk i |2. (26)

From expression (20), it follows that ( 2ϕµk 1(wk))ii = p ((wk i )2 + µ2 k 1) p 2 2 (wk i )2 + µ2 k 1 + (p 2)(wk i )2

= p ((wk i )2 + µ2 k 1) p 2 2(µ2 k 1 + (p 1)(wk i )2)

= p((wk i )2 + µ2 k 1) p 2 2 µ2 k 1 + (p 1)(wk i )2

2 2 µp 4 k 1 µ2 k 1 + (p 1)(wk i )2

2 2 µp 2 k 1

1 + (p 1)(wk i )2

where the third equality follows from (25) and the ﬁrst inequality comes from (26) and p

Moreover, by (24), we see µ2(2 p) k 1 /γ2 p (wk i )2 and thus have

µ2 2p k 1 γ2 p (wk i )2

BILEVEL ℓp-HYPERPARAMETER LEARNING

By this inequality, it holds that

( = 0 (p < 1) 1

Thus, we obtain

lim k K (p 1)

which together with (27) and limk µp 2 k 1 = implies

( 2ϕµk 1(wk))ii = .

Since i I(w ) was arbitrarily chosen, the proof is complete.

Now, we are ready to prove Theorem 5 using Propositions 6 and 8.

Proof of Theorem 5

Consider the ˆεk 1-approximate KKT conditions (4), (6), and (7) with (w, λ, ζ, η) = (wk, λk, ζk, ηk) and (ε1, ε3, ε5) = (εk 1 1 , εk 1 3 , εk 1 5 ). Let (w , λ , ζ , η ) be an arbitrary accumulation point of {(wk, λk, ζk, ηk)}. By taking a subsequence if necessary, without loss of generality, we can suppose that lim k (wk, λk, ζk, ηk) = (w , λ , ζ , η ). (28)

To show the desired result, it sufﬁces to prove that (w , λ , ζ , η ) satisﬁes (9) (14). Now, we give the proof by three blocks as follows:

Proof of conditions (9), (10), (13) and (14): As for (9) and (10), multiplying (4) and (7) with (w, λ, ζ, η) = (wk, λk, ζk, ηk) by W 2 k and Wk on the left, respectively, we obtain

W 2 k f(wk) + W 2 k 2 ww G(wk, λk) + λk 1 2ϕµk 1(wk) ζk = W 2 k εk 1 1 ,

Wk w G(wk, λk) + λk 1Wk ϕµk 1(wk) = Wkεk 1 4 .

Note that the functions f, 2 ww G, and w G are continuous and let k in the above equations. Then, using (21) and (22) in Proposition 6 together with µk 1 0, (εk 1 1 , εk 1 4 ) (0, 0) as k , we get (9) and (10), that is to say,

W 2 f(w ) + H(w , λ )ζ = 0,

W w G(w , λ ) + pλ 1|w |p = 0.

Conditions (13) and (14) are obtained by driving k to in (6) and (8) with (w, λ, ζ, η) = (wk, λk, ζk, ηk).

Proof of condition (12): Choose i I(w ) arbitrarily. Note that the continuity of the functions 2 ww G and f. Then, from (28) and condition (4) with (w, λ, ζ) = (wk, λk, ζk), {λk 1 2ϕµk 1(wk)

ii ζk i } is bounded. On the other hand, recall that {( 2ϕµk 1(wk))ii} is unbounded from Proposition 8 and

OKUNO, TAKEDA, KAWANA, AND WATANABE

limk λk 1 = λ 1 > 0 from Assumption A1. Thus, we get limk ζk i = 0. Since the index i was chosen from I(w ) arbitrarily, we conclude condition (12).

Proof of condition (11): We begin with proving

i I(w ) wk i ((wk i )2 + µ2 k 1) p 2 1ζk i = 0. (29)

Choose i I(w ) arbitrarily again. Note that by Lemma 7, there exists some γ > 0 such that

µ2 k 1 γ|wk i | 2 2 p (30)

for all k sufﬁciently large. In what follows, we consider sufﬁciently large k so that the inequality (30) holds. Then, by 0 < p 1, we get

µ2 p k 1 γ 2 p

We then have wk i ((wk i )2 + µ2 k 1) p 2 1ζk i wk i µ 2( p

2 1) k 1 ζk i

µ2 p k 1 γ 2 p

2 1) k 1 ζk i

= γ p 2 1 ζk i . (31)

Relation (31) and condition (12), which was proved above, imply

wk i ((wk i )2 + µ2 k 1) p 2 1ζk i = 0

and hence summing up this equation over I(w ) gives the desired expression (29). Next, by using (28), µk 1 0 (k ), w i = 0 (i / I(w )), and w i = sgn(w i )|w i |, we obtain

i/ I(w ) wk i ((wk i )2 + µ2 k 1) p 2 1ζk i

i/ I(w ) sgn(w i )|w i |p 1ζ i . (32)

Combining (29) and (32) with (19) yields

lim k ϕµk 1(wk) ζk

i=1 wk i ((wk i )2 + µ2 k 1) p 2 1ζk i

i I(w ) wk i ((wk i )2 + µ2 k 1) p 2 1ζk i + X

i/ I(w ) wk i ((wk i )2 + µ2 k 1) p 2 1ζk i

i/ I(w ) sgn(w i )|w i |p 1ζ i ,

BILEVEL ℓp-HYPERPARAMETER LEARNING

which together with driving k to in condition (5) with (w, λ, ζ, η) = (wk, λk, ζk, ηk) implies

i/ I(w ) sgn(w i )|w i |p 1ζ i = η 1,

where we use η 1 = limk ηk 1. This is nothing but condition (11). Consequently, the proof of Theorem 5 is complete.

Boundedness of the Lagrange-multiplier Sequence

In Assumption A2, we suppose that the Lagrange multiplier sequence {(ζk, ηk)} is bounded. One may ask when this condition holds. In many optimization algorithms, boundedness properties of relevant Lagrange multiplier sequences are shown to be true under suitable constraint qualiﬁcations. In fact, we can verify the boundedness of {(ζk, ηk)} under the presence of linearly constraint-like qualiﬁcations as follows:

A4: Let (w , λ ) Rn Rr be an arbitrary accumulation point of the sequence {(wk, λk)}. Let

I(λ ) := {i {1, 2, . . . , r} | λ i = 0}.

Then, the linearly independent constraint qualiﬁcation (LICQ) holds at (w, λ) = (w , λ ) for the constraints Φi(w, λ) := G(w, λ)

wi +p sgn(wi)λ1|wi|p 1 = 0 (i / I(w )), wi = 0 (i I(w )), and λ 0, that is to say, the gradient vectors for the active constraints n { Φi(w , λ )}i/ I(w ) , (w,λ)wi|w=w

i I(w ) , (w,λ)λi|λ=λ

are linearly independent.

Proposition 9 Suppose that Assumptions A1, A3, and A4 hold. Additionally, suppose that the sequence {(wk, λk)} is bounded. Let {(ζk, ηk)} Rn Rr be a sequence of the accompanying Lagrange multiplier vectors which satisfy the KKT conditions (4) (8). Then, {(ζk, ηk)} is bounded.

Proof We derive contradiction by supposing that {(ζk, ηk)} is unbounded. For details, see Appendix A.5.

5. Numerical Experiments

In this section, we investigate the performance of Algorithm 1 through comparison to other hyperparameter learning methods such as Bayesian optimization (Mockus et al., 1978) and gridsearch. All the experiments are conducted on a personal computer with Intel Core i7-8559U CPU @ 2.70GHz, 16.00 GB memory. Algorithm 1 and the other competitor algorithms are implemented with MATLAB R2020a. Two kinds of bilevel problems relevant to linear regression with real data are solved. The ﬁrst problem handles a single hyperparameter related to the ℓp-regularizer, while the second one does multiple hyperparameters.

OKUNO, TAKEDA, KAWANA, AND WATANABE

5.1 Linear regression bilevel problem with a single ℓp hyperparameter

In this section, we solve the following bilevel problem regarding squared linear regression problem with a single ℓp hyperparameter.

min w,λ Avalw bval 2 2

s.t. w argmin ˆ w

Atr ˆw btr 2 2 + exp(λ1) ˆw p p , (33)

where p (0, 1] and (Atxt, btxt) Rmtxt n Rmtxt for txt {val, tr}. Notice that the form of the above problem slightly differs from that of (1) in that λi is replaced with exp(λi) for each i. This is because positive hyperparameters, in particular λ1, are actually desirable as outputs. With this manipulation, the nonnegative constraint exp(λi) 0, which corresponds to λi 0 in (1), is clearly fulﬁlled and thus removed. For the sake of examining the accuracy of solutions obtained by solving the above problem, we use Atew bte 2 2 as a test error function, where Ate Rmte n and bte Rmte. The data matrices and vectors A{val,tr,te}, b{val,tr,te} are taken from UCI machine learning repository (Lichman et al., 2013): Facebook Comment Volume ( m = 40949, n = 53), Insurance Company Benchmark ( m = 9000, n = 85), Student Performance for a math exam ( m = 395, n = 272)4, Body Fat ( m = 336, n = 14), and Cpu Small ( m = 8192, n = 12) are from UCI machine learning repository Lichman et al. (2013). The m samples are divided into 3 groups (training, validation and test samples) with the same sample size m/3 . Hence, m{val,tr,te} = m/3 .

5.1.1 EXPERIMENTAL CONDITIONS

Method for solving the smoothed subproblem (3): Algorithm 1 requires an ˆε-approximate KKT point of (3) in Step 2 at every iteration. To compute such a point, we present an algorithm using implicit functions. Several past works also employed similar approaches based on implicit functions for hyperparameter optimization. For example, see the articles (Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2018). We only explain the algorithmic framework of the proposed implicit function based method, leaving the precise description to Algorithm B.1 in Appendix B.1. At every iteration, Algorithm B.1 locally represents problem (3) as problem having the hyperparameter λ as variables by means of implicit functions deﬁned over the λ-space. The implicit function, say w( ), is deﬁned on some open set U and expresses a solution set for the smoothed lower-level problem minw Rn G(w, λ) + λ1ϕµ(w). Namely, we have w G(w(λ), λ) + λ1 ϕµ(w(λ)) = 0 for all λ U. We then solve the reformulated problem (3) that is described in terms of λ by the quasi-Newton method (Nocedal and Wright, 2006). Note that, though it is difﬁcult in general to know the concrete form of the implicit function w( ), we can compute the gradient w in virtue of the implicit function theorem, which enables us to perform gradient based methods like the quasi-Newton method for solving problems that are described in terms of w( ). Actually, the efﬁciency of this approach relies on how rapidly and accurately w(λ) is computed for a given λ. To this end, we employ a certain modiﬁed Newton method, which was originally proposed by Lai and Wang (2011). See Appendix B.2 for details.

4. The original dataset has n = 32, but the feature size is increased by adding new features: interaction effects generated by pairwise products among some features for each sample.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Moreover, as a starting point of the algorithm, we use a solution of the smoothed problem (3) at the previous iteration of Algorithm 1, aiming for the so-called hot-start effect.

Other algorithms for comparison: For the sake of comparison, we also implement Bayesian optimization and the gridsearch method. We use bayesopt in MATLAB with Max Objective Evaluations=30 for Bayesian optimization. In gridsearch, we search for the best value of Avalw bval 2 2 among 30 grids λ = 10 4, 10 4+ 8

29 , , 104 8

29 , 104 for problem (33). At each iteration of bayesopt and gridsearch, we make use of Matlab built-in solver fmincon so as to solve the lower-level problem of (33) with a given λ.

Parameter setting and termination criteria: The smoothing parameter in Algorithm 1 is initialized as µ0 = 1 and updated by µk+1 = min(0.9µk, 10µ1.3 k ). The smoothed subproblem (3) is solved as exactly as possible by ﬁxing (ˆε0, β0) to (10 6, 1). As for the termination criteria of Algorithm 1, writing a resulting solution as w , we stop it if the SB-KKT conditions (9), (10) and (11) are within the error of ϵ := 10 3. We also check whether the other SB-KKT conditions (12)-(14) are satisﬁed. The default setting of bayesopt is employed. Time limits of all the algorithms are set to 600 seconds.

5.1.2 NUMERICAL RESULTS FOR PROBLEM (33) WITH FIXED DATA SIZE

We ﬁrst show the results of applying the three algorithms to problem (33) with p = 1, 0.8, 0.5. The algorithms are run for 5 times from different starting points (λ0, w0) generated in the manner that λ0 is set to 0 and w0 is chosen randomly from [ 5, 5]n. All the results are summarized in Table 1, where the value Errte indicates the averaged values of Atew bte 2 over 5 runs. The value Errval stands for the averaged value of Avalw bval 2. The value sparsity means the ratio of zero elements in the obtained solution w Rn, namely, sparsity = |{i | wi = 0}| /n and hence, the solution with sparsity 1 is very sparse. We denote by time(sec) the spent time from the start to the termination. In the experiments, for each i, we regarded wi as zero if |wi| 10 4 max1 i n |wi|. The best (smallest) values of Err{te,val} and time(sec), among the three algorithms are displayed in bold. From the table, there are signiﬁcant differences in time(sec) of the three algorithms, while Errte and Errval seem comparative. In particular, Algorithm 1 tends to be the fastest. Indeed, it attains the best values in time(sec) for 10 out of 15 problem-instances, seven of which moreover achieve the best values in Errval. For example, for Facebook with p = 1, it computes a solution with Errval = 6.474 by about 17 seconds, while bayesopt and gridsearch do solutions with Errval 6.476 after spending more than 40 seconds. Thus, Algorithm 1 is likely to be the most effective among the three on seeking (w, λ) with good Errval. As pointed out by a reviewer, bayesopt actually found the ﬁnal solutions or close solutions earlier than the recorded time on the table. Nonetheless, in many instances, Algorithm 1 reached the ﬁnal solution more quickly than bayesopt found such a solution. We refer readers to Table C.1 in Appendix C, which shows the ﬁrst time of bayesopt for ﬁnding a solution which attains the ﬁnal best observed objective value, namely, validation value. Also see Figure C.1 that depicts the time-series of the best observed objective value of bayesopt for the problem organized with the data sets of Student and Cpu Small. From the values of sparsity, the problems with smaller p tends to output sparser solutions. For example, Algorithm 1 outputs a solution with sparsity 0.9 for Facebook with p = 0.5, while sparsity 0.3 for p = 0.8, 1. Nevertheless, sparsity of Algorithm 1 is 0.00 for Body Fat with

OKUNO, TAKEDA, KAWANA, AND WATANABE

p = 0.8, while sparsity = 0.7, 0.07 for p = 0.5, 1, respectively. In this case, Algorithm 1 might fall into a local optimum with small Eval at the expense of sparsity.

5.1.3 PERFORMANCE WITH VARIED DATA SIZE

Changing the data sizes of Student and Facebook datasets, we make comparison of the performances of Algorithm 1, bayesopt, and gridsearch. We ﬁrst examine how Errte, Errval, and time(sec) of the three algorithms change against the sample size, ˆm, of Facebook. The sample size ˆm is increased from 1

2 m to m by 1 10 m with m 40000. We apply the algorithms using ℓ0.8 regularizer to these problems with a varied sample size. The obtained results are depicted in Figure 1. Figures 1a and 1b indicates that the computed test and validation values Errte and Errval behave analogously as ˆm increases. There are no crucial differences among those values. However, from Figure 1c, computation time, time(sec), of bayesopt grows more rapidly than the others. This may be because bayesopt has to search a wider region as the sample size grows. In contrast, the values of time(sec) of Algorithm 1 and gridsearch grow moderately. In particular, Algorithm 1 is the fastest for most cases.

(a) Errte vs scaled sample size ˆm/ m (b) Errval vs scaled sample size ˆm/ m

(c) time(sec) vs scaled sample size ˆm/ m

Figure 1: Performance of Algorithm 1, bayesopt, and gridsearch using ℓ0.8 regularizer for Facebook with ﬁxed feature size n = 53 and varying sample size ˆm ( m = 40949); Alg.1: Algorithm 1, bayes: bayesopt, grid: gridsearch

Next, we investigate the performances of the algorithms by varying the feature size, ˆn, of Student. As in the above experiment, we use ℓ0.8 regularizer. The feature size ˆn is increased from 1

BILEVEL ℓp-HYPERPARAMETER LEARNING

n by 1 10n with n = 272. The obtained results are shown in Figure 2. Algorithm 1 successfully attains better values in all time(sec), Errte, and Errval than bayesopt and gridsearch for n 0.7ˆn. From Figures 2a and 2b, bayesopt seems stuck in a local optimum with larger Errval and Errte for n 0.8ˆn. From Figure 2c, time(sec) for bayesopt and gridsearch grow more rapidly than ours as ˆn increases. The above two experiments suggest that, compared with gridsearch and Bayesian optimization, Algorithm 1 is unlikely to be affected by growth of the data size.

(a) Errte vs scaled feature size ˆn/n (b) Errval vs scaled feature size ˆn/n

(c) time(sec) vs scaled feature size ˆn/n

Figure 2: Performance of Algorithm 1, bayesopt, and gridsearch using ℓ0.8 regularizer for Student with varying feature size ˆn (n = 272) and ﬁxed sample size m = 395; Alg.1: Algorithm 1, bayes: bayesopt, grid: gridsearch

OKUNO, TAKEDA, KAWANA, AND WATANABE

Table 1: Comparison of Algorithm 1, bayesopt in MATLAB and gridsearch in terms of squared errors (validation error Errval and test error Errte), CPU times (time(sec)), and sparsities (sparsity). Here, sparsity := |{i | wi = 0}| /n and hence a solution is sparser as sparsity is closer to 1. The best values in Errval, Errte, and time(sec) are displayed in bold.

Data Algorithm 1 bayesopt in MATLAB grid search

name p Errte Errval time (sec) sparsity Errte Errval time (sec) sparsity Errte Errval time (sec) sparsity

Facebook 1 5.399 6.474 17.399 0.057 5.401 6.476 146.144 0.249 5.394 6.483 43.738 0.264 0.8 5.431 6.512 22.242 0.245 5.411 6.499 137.764 0.143 5.439 6.545 37.021 0.113 0.5 5.455 6.550 16.820 0.925 5.452 6.535 93.967 0.596 5.704 6.747 16.514 0.925

Insurance 1 87.837 95.764 33.077 0.035 87.842 95.764 55.304 0.435 87.843 95.765 19.779 0.435 0.8 87.891 95.676 32.465 0.188 87.882 95.634 75.384 0.287 87.872 95.806 19.388 0.329 0.5 88.625 95.562 44.904 0.859 88.101 95.526 45.359 0.675 88.211 96.592 5.453 0.871

Student 1 1.127 0.778 10.586 0.625 1.147 0.785 217.415 0.032 1.147 0.785 81.766 0.066 0.8 1.083 0.724 2.348 0.996 1.082 0.724 409.382 0.004 1.100 0.731 241.298 0.004 0.5 1.082 0.724 3.618 0.996 1.125 0.772 152.826 0.829 2.848 1.120 9.520 0.974

Body Fat 1 0.277 0.209 0.068 0.071 0.276 0.210 10.253 0.714 0.279 0.216 0.820 0.714 0.8 0.288 0.179 0.203 0.000 0.280 0.184 10.567 0.571 0.287 0.182 0.808 0.429 0.5 0.582 0.267 0.395 0.714 0.353 0.229 7.279 0.286 0.316 0.256 0.931 0.286

Cpu Small 1 132326 130981 11.299 0.083 131834 131123 21.334 0.917 132261 130983 1.164 0.917 0.8 132339 130982 0.741 0.250 131770 131187 19.909 0.917 132205 130991 1.240 0.750 0.5 132093 131058 0.672 0.250 131754 131234 17.619 0.917 132127 131059 1.514 0.750

BILEVEL ℓp-HYPERPARAMETER LEARNING

5.1.4 PERFORMANCE AS THE SMOOTHING PARAMETER µ DECREASES

We examine impact of the smoothing parameter µ on test error of solutions of the smoothed subproblems (3). Figures 3a-3c depict the growth behavior of the test error in the ﬁnal stage of Algorithm 1 for the problems of Facebook, Body Fat, and Insurance. From the ﬁgures, the test errors for the three problems do not vary signiﬁcantly. Taking into account that the smoothed subproblem (3) is more difﬁcult as µ becomes smaller, it may be good strategy to stop the algorithm earlier than convergence to a SB-KKT point. This is also indicated by the fact that, around µ = 0.01, the algorithm attains solutions with better test errors than the solutions upon termination for Facebook and Body Fat.

(a) Facebook

(b) Body Fat

(c) Insurance

Figure 3: Change of test error for ℓ0.8 regularizer as µ decreases; The horizontal axis: a smoothing parameter µ; The vertical axes: test error

5.2 Linear regression problem with multiple hyperparameters

Next, we solve the following problem that possesses multiple hyperparameters:

min w,λ Avalw bval 2 2

s.t. w argmin ˆ w

Atr ˆw btr 2 2 + exp(λ1) ˆw p p + ˆw C( λ) ˆw , (34)

where C( λ) := Diag(exp(λi))n+1 i=2 being positive deﬁnite and A{val,tr,te} and b{val,tr,te} are the ones used in the previous experiments in Section 5.1.

OKUNO, TAKEDA, KAWANA, AND WATANABE

5.2.1 EXPERIMENTAL CONDITIONS

We set p = 0.5 in problem (34). The experimental conditions are almost the same as those for problem (33). The main differences are as follows: We make comparison of Bayesian Optimization bayesopt with Max Objective Evaluations=300 and Algorithms 1 using the two algorithms for solving subproblems (3): The ﬁrst one is the implicit function approach as in the previous experiments and the second one is fmincon, where we opt for the SQP method and set Max Iterations= 107 . Though we also implemented gridsearch seeking a solution over 30n grids λ {10 4, 10 4+ 8

29 , , 104 8

29 , 104}n, the obtained results were actually quite poor because the number of grids, which was larger than 3010, was too huge to search. We thus omit those results with gridsearch. Finally, according to the observation in the last experiment in Subsection 5.1.4, we terminate Algorithm 1 when µk 0.01.

5.2.2 NUMERICAL RESULTS FOR PROBLEM (34)

Table 2 summarizes the obtained results of applying the two types of Algorithms 1 and bayesopt to problem (34). In the table, we denote by λ the number of hyperparameters λ in each problem. Moreover, Alg.1-A stands for Algorithm 1 using the implicit function approach and Alg.1-B does the one using fmincon. The hyphens in the line of Facebook for Alg.1-B indicate that fmincon, which is used in Alg.1-B, terminates with an infeasible solution of the smoothed problem (3). From the table, compared with the results for the single-hyperparameter bilevel problem (33), bayesopt does not work well. For four out of the ﬁve problems, it cannot terminate within the time-limit 600 seconds. The qualities of the output solutions of bayesopt upon termination are also not good in values of Errval and Errte. For the sake of completeness, as well as the previous experiment, we examined the ﬁrst time when the best observed objective values, that is, validation values of bayesopt were found. Refer to Table C.2 in Appendix C. Also see Figure C.2 for graphs depicting the time-series of the best observed objective values concerning the data sets of Student and Cpu Small. In contrast to bayesopt, the two Algorithms 1 with different subroutines show better performance. There are notable differences between performances of Alg.1-A and Alg.1-B. While Alg.1-A seems to fall into non-sparse solutions, but with good Err{te,val} for Insurance, Body Fat, and Cpu Small, Alg.1-B ﬁnds good solutions balancing in sparsity, Errte, and Errval for all the same data sets. This may be because Alg.1-A computes a solution of problem (3) by remaining in the feasible set, namely, the solution set of the smoothed lower level problem. This behavior may cause Alg.1-A to miss a chance of broadly seeking sparse solutions. Meanwhile, Alg.1-B using the SQP method in fmincon tends to approach a solution of (3) from the outside of the feasible set, which often leads to a good solution even in sparsity.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Table 2: Comparison of bayesopt and Algorithm s 1 using the implicit function approach and fmincon as subroutines for solving subproblem (3) in Step 2 in terms of squared errors (validation error Errval and test error Errte), CPU times (time(sec)), and sparsities (sparsity). Here, sparsity := |{i | wi = 0}| /n and hence a solution is sparser as sparsity is closer to 1. Alg.1-A stands for Algorithm 1 using the implicit function approach and Alg.1-B does the one using fmincon. The notation λ stands for the number of hyperparameters in each problem. Moreover, the best values in Errval, Errte, and time(sec) are displayed in bold. The hyphens for Facebook indicate that fmincon, which is used in Alg.1-B, terminates with an infeasible solution of the smoothed problem (3). The algorithms with 600 seconds stopped at the time-limit.

Data Alg.1-A Alg.1-B bayesopt in MATLAB

name λ Errte Errval time (sec) sparsity Errte Errval time (sec) sparsity Errte Errval time (sec) sparsity

Facebook 54 5.404 6.478 20.247 0.038 7.629 8.780 600.000 0.000

Insurance 86 87.920 95.694 50.714 0.000 88.340 94.604 4.473 0.988 98.000 107.000 600.000 0.000

Student 273 1.132 0.771 1.451 0.368 1.142 0.786 71.775 0.670 21.002 18.324 600.000 0.000

Body Fat 15 0.531 0.243 0.072 0.000 0.286 0.130 0.695 0.933 46.675 48.815 455.230 0.000

Cpu Small 13 132780 131130 6.222 0.000 131940 128540 0.658 0.692 156420 151670 600.000 1.000

OKUNO, TAKEDA, KAWANA, AND WATANABE

6. Discussion on extension to other nonsmooth regularizers

In this section, we discuss the extension of the SB-KKT conditions from hyperparameter optimization of ℓp-regularizers to those of other nonsmooth and nonconvex regularizers. Let Θ : [0, ) [0, ) be a function such that it is concave and continuously differentiable on (0, ) and Θ(0) = 0. Many sparse regularizers are representable in terms of Θ. Indeed, Pn i=1 Θ(|wi|) reduces to the ℓpand log-regularizers, SCAD, and MCP by selecting Θ appropriately as follows. Here, a > 1, b > 0, and γ > 0 are preﬁxed parameters and x 0. The continuous differentiability of Θ for SCAD and MCP can be conﬁrmed in view of the formula of Θ :

ℓp-regularizer (p 1) if Θ(x) := xp;

log-regularizer (Candes et al., 2008) if Θ(x) := 1 log(1 + γ) log(1 + γx);

SCAD (Fan and Li, 2001) if

x2 2abx + b2

2(a 1) if b x ab

2 otherwise.

The ﬁrst-order derivative is

a 1 if b x ab

0 otherwise;

MCP (Zhang et al., 2010) if

2 otherwise.

The ﬁrst-order derivative is

0 otherwise.

Consider the following extended formulation from problem (1):

min w λ,λ f(w λ) s.t. w λ argmin w Rn

G(w, λ) + λ1

i=1 Θ(|wi|)

Any local optimum w of the lower-level problem in the above satisﬁes

wi + sgn(wi)λ1Θ (|wi|) = 0 for i {1, 2, . . . , n} \ I(w), (35)

wi = 0 for i I(w), (36)

BILEVEL ℓp-HYPERPARAMETER LEARNING

which actually correspond with the scaled ﬁrst order condition (16) after transformations when Θ is chosen to be the ℓp-regularizer. We then obtain the following one-level problem that corresponds to (17) min w,λ f(w) s.t. λ 0, w satisﬁes (35) and (36). (37)

The following theorem states the SB-KKT conditions for (37), which are the extended version of Theorem 2. Since its proof can be obtained via straightforward extension of that of Theorem 2 by noting the relation (Θ(|x|), Θ (|x|), Θ (|x|)) = |x|p, sgn(x)p|x|p 1, p(p 1)|x|p 2 for x = 0 when Θ(x) = xp for x 0, we omit it.

Theorem 10 Let (w , λ ) Rn Rr be a local optimum of (37). Suppose that Θ is twice continuously differentiable at |w i | for i / I(w ). Then, (w , λ ) together with some vectors ζ Rn and η Rr satisﬁes the following conditions under an appropriate constraint qualiﬁcation concerning the constraints G(w, λ)

wi + λ1sgn(wi)Θ (|wi|) = 0 (i / I(w )), wi = 0 (i I(w )) , and λ 0:

f(w ) + P i/ I(w ) Hi(w , λ )ζ i = 0, (38)

wi + λ 1sgn(w i )Θ (|w i |) = 0 (i / I(w )), (39) P i/ I(w ) sgn(w i )Θ (|w i |)ζ i = η 1,

ζ i = 0 (i I(w )), (40)

Ri(w ) ζ = η i (i = 2, 3, . . . , r),

0 λ , 0 η , (λ ) η = 0,

Hi(w, λ) := w

+ λ1sgn(wi)Θ (|wi|)ei Rn (i / I(w )).

When Θ(x) = xp with x 0 and 0 < p 1, conditions (38) and (39) premultiplied by diag(w )2

and diag(w ), respectively, are equivalent to (9) and (10) under the presence of (40) and w i = 0 (i I(w )). The above theorem is different from Theorem 2 in that Θ is additionally assumed to be C2 at |w i | for i / I(w ). This is due to the existence of the term Θ in Hi. If Θ is chosen to correspond to ℓp or log-regularizer, this assumption always holds. In contrast, if Θ is selected to correspond to SCAD (resp., MCP), it is equivalent to w i = b, ab (resp., w i = ab) for i / I(w ) and thus may fail to hold in general. Though it is expected to hold in many instances, we need to do a further research so as to remove or weaken it. It is easy to tailor the proposed smoothing method to problems having other regularizers such as SCAD and MCP. Convergence properties similar to the case of using the ℓp-regularizer hold expectedly. However, proofs for the global convergence to an SB-KKT point in the sense of Theorem 10 may differ signiﬁcantly from that in Section 4, because our analysis for the ℓp-regularizer actually relies on the speciﬁc forms of the smoothing function ϕµ(w) = Pn i=1(w2 i + µ2) p 2 and its ﬁrstand second-order derivatives.

Further extension: Besides the above, there are other directions for extending our results. One direction is extension to structured sparse regularizers like the group Lasso model (Yuan and Lin, 2006). Such a model often contains regularizers of the composite form Pl i=1 Θ(θi(w)) with

OKUNO, TAKEDA, KAWANA, AND WATANABE

θi : Rn R (i = 1, 2, . . . , l). For instances of θi, we can set θi(w) := w ai with ai Rn or θi(w) := w Kiw with Ki Rn n being a symmetric positive deﬁnite matrix. Another interesting direction is extension to problems with matrix variables. Marjanovic and Solo (2012) considered regularized least square optimization for matrix completion. For X Rr1 r2, the regularization term that appears there takes the form of X p p := Pmin(r1,r2) i=1 σi(X)p with 0 < p 1, where σi(X) (i = 1, 2, . . . , min(r1, r2)) are the singular values of X. If r1 = r2 and X is a diagonal matrix, X p p reduces to the ℓp-regularizer we have considered. In (Marjanovic and Solo, 2012), in order to ﬁnd the best-qualiﬁed recovered matrix model, the authors iteratively solved problems involving X p p as a regularizer while varying hyperparameters. A bilevel approach may help to recover a matrix with higher quality faster.

7. Conclusions

We have proposed a bilevel optimization approach for selecting the best hyperparameter (regularization parameter) of the ℓp-regularizer. The bilevel optimization problem that appears in our approach has a nonsmooth and possibly nonconvex ℓp-regularized problem as the lower-level problem. For this problem, we have developed the scaled bilevel KKT (SB-KKT) conditions and proposed a smoothing-type method. Furthermore, we have made analysis on convergence of the proposed algorithm to an SB-KKT point. Numerical experiments imply that it exhibited performance superior to Bayesian optimization and grid search especially in computational time. The method/theoretical guarantee can be applicable to hyperparameter learning for classiﬁcation. As a future work, we would like to make the algorithm more practical. For this purpose, we may need to integrate some stochastic technique into the proposed algorithm. For example, approximate KKT points computed by approximate gradient and Hessians can be used. In the stochastic setting, we expect that the SB-KKT conditions will play a signiﬁcant role in convergence analysis.

Acknowledgments

We thank three anonymous reviewers and the action editor Tomioka for their valuable comments and suggestions. This research was supported by JSPS KAKENHI Grant Numbers 20K19748 and 19H04069 and by JST ERATO Grant Number JPMJER1903.

Appendix A. Omitted Proofs

In this section, we provide proofs of some lemmas and propositions.

A.1 Proof of Theorem 2

Firstly, notice that (w , λ ) is also a local optimum of the following problem:

min w,λ f(w)

s.t. G(w, λ)

wi + p sgn(wi)λ1|wi|p 1 = 0 (i / I(w ))

wi = 0 (i I(w )) λ 0.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Actually, this fact is easily conﬁrmed by noting that (w , λ ) is also feasible to (A.1) and the feasible region of (17) is larger than that of (A.1). Hence, under an appropriate constraint qualiﬁcation such as the linearly independent constraint qualiﬁcation associated to (A.1), the KKT conditions for (A.1) hold at (w , λ ), namely, there exist some vectors ˆζ := (ˆζ 1, ˆζ 2, . . . , ˆζ n) Rn and η Rr such that

wi wj + p(p 1)λ1|wi|p 2 ˆζ j = 0 (i / I(w )), (A.2)

wi wj ˆζ j + ˆζ i = 0 (i I(w )),

λf(w ) η + X

wi + p sgn(wi)λ1|wi|p 1 ˆζ i = 0, (A.3)

wi + p sgn(w i )λ 1|w i |p 1 = 0 (i / I(w )), (A.4)

w i = 0 (i I(w )), (A.5)

0 λ , 0 η , (λ ) η = 0, (A.6)

where ˆζ i (i I(w )), ˆζ i (i / I(w )), and η are Lagrange multipliers corresponding to the constraints wi = 0 (i I(w ) and G(w, λ)

wi + p sgn(wi)λ1|wi|p 1 = 0 (i / I(w )), and λ 0, respectively. To derive the ﬁrst equality above, we made use of the fact

wi = (p 1)sgn(wi)|wi|p 2

at wi = 0. Noting the relations λf(w) = 0, G(w, λ)/ λ1 = 0, 2G(w, λ)/ λi wi = Ri(w, λ)/ wi (i = 2, 3, . . . , r), we can rewrite condition (A.3) as

i/ I(w ) p sgn(w i )|w i |p 1ˆζ i = η 1, (A.7)

wj ˆζ j = η i (i = 2, . . . , r). (A.8)

Next, deﬁne ζ Rn as the vector with ζ i = 0 (i I(w )) and ζ i = ˆζ i (i / I(w )). Let us show that (w , λ , ζ , η ) satisﬁes the targeted conditions (9) (14). For each i {1, 2, . . . , n}, we have

(w i )2 f(w )

wi + (w i )2 n X

wi wj ζ j + λ 1p(p 1)|w i |pζ i

= (w i )2 f(w )

wi + (w i )2 X

wi wj ζ j + λ 1p(p 1)|w i |pζ i

OKUNO, TAKEDA, KAWANA, AND WATANABE

where the ﬁrst equality follows from ζ j = 0 (j I(w )) and the second one can be proved by cases; when i I(w ), the desired equality is obviously true because of (A.5); when i / I(w ), it is obtained from multiplying (A.2) by (w i )2 and using ζ i = ˆζ i . Therefore, we conﬁrm (9). Similarly, we can deduce (10) and (13) from (A.4) and (A.8) along with the deﬁnition of ζ , respectively. The remaining conditions (11), (12), and (14) are derived from (A.7), ζ i = 0 (i I(w )), and (A.6), respectively. Putting all the above results together, we conﬁrm that (w , λ , ζ , η ) satisﬁes (9) (14). Consequently, we have the desired result.

A.2 Proof of Lemma 3

We will give a proof of Lemma 3. Firstly, we review the deﬁnition and several properties for a subgradient of a given function from Rockafellar and Wets (2009). We ﬁnally give a proof of Lemma 3. Let us deﬁne regular and general subgradients for a given function according to Deﬁnition 8.3(a),(b) of Rockafellar and Wets (2009). For simplicity, we conﬁne ourselves to a continuous function f : Rn R.

Deﬁnition A.1 For vectors v Rn and x Rn,

1. we say that v is a regular subgradient of f at x, written v ˆ xf( x), if f(x) f( x) + v (x x) + o( x x ).

2. We say that v is a (general) subgradient of f at x, written v xf( x), if there are sequences {xν} Rn converging to x and {vν} Rn converging to v such that vν ˆ f(xν) for each ν.

We often simply write ˆ x and x as ˆ and , respectively.

Obviously, it holds that ˆ f(x) f(x). The following propositions are useful:

Proposition A.2 (Rockafellar and Wets, 2009, 8.8(c) Exercise) Let fi : Rn R (i = 0, 1) be continuous. Let f := f0 + f1. If f0 is continuously differentiable around x, then ˆ f( x) = f0( x) + ˆ f1( x) and f( x) = f0( x) + f1( x).

Proposition A.3 (Rockafellar and Wets, 2009, 8.5 Proposition) Let f : Rn R be continuous. Then, v ˆ f(x) if and only if, on some neighborhood of x, there exists a differentiable function g : Rn R such that g( x) = v, g(x) f(x), and g( x) = f( x). Moreover, g can be taken to be continuously differentiable with g(x) < f(x) for all x = x near x.

We next prove the following proposition associated with x p p (0 < p 1).

Proposition A.4 For x Rn, let I(x) := {i | xi = 0} and g(x) := λ x p p with 0 < p 1 and λ 0. Then, for 0 < p < 1 and x Rn, we have

g( x) = v | vi = λp sgn( xi)| xi|p 1 (i / I( x)), vi R (i I( x)) . (A.9)

On the other hand, for p = 1, we have

g( x) = {v | vi = λ sgn( xi) (i / I( x)), vi [ λ, λ] (i I( x))} . (A.10)

BILEVEL ℓp-HYPERPARAMETER LEARNING

Proof For convenience of expression, let ˆg(x) := λ P i I( x) |xi|p. Note that g(x) = ˆg(x) + λ P i/ I( x) |xi|p and λ P i/ I( x) |xi|p is continuously differentiable around x. Then, by Proposition A.2, we have g( x) = λ X

i/ I( x) p sgn( xi)| xi|p 1ei + ˆg( x), (A.11)

where ei Rn is the vector such that the i-th element is one and the others are zeros. Supposing I( x) = , we next describe xˆg( x) precisely. First, consider the case of 0 < p < 1. For any v Rn

with vi = 0 (i / I( x)), we can show that λ P i I( x) |xi|p λ P i I( x) vixi holds on a sufﬁciently small neighborhood of x since λ 0. Then, Proposition A.3 implies

ˆ ˆg( x) {v | vi = 0 (i / I( x))} . (A.12)

We next show the converse implication for the above. To this end, choose a regular subgradient v ˆ ˆg( x) = ˆ λ P i I( x) |xi|p x= x arbitrarily. Then, according to Proposition A.3, there exists some

differentiable function h such that h(x) λ P i I( x) |xi|p near x, h( x) = λ P i I( x) | xi|p = 0, and h( x) = v. Then, for arbitrarily chosen j / I( x), h( x + sej) λ P i I( x) | xi|p = 0 for any s R sufﬁciently small. From this fact along with h( x) = 0, we see that s = 0 is a local maximizer of maxs R h( x + sej), and thus vj = h( x)/ xj = h( x + sej)/ s|s=0 = 0. Hence, since the index j I( x) was arbitrarily chosen, we obtain the converse implication for (A.12). Using this fact and (A.12), we have ˆ ˆg( x) = {v | vi = 0 (i / I( x))} . (A.13)

We next prove that ˆg( x) {v | vi = 0 (i / I( x))} . (A.14)

Choose v ˆg( x) arbitrarily. Then, there exist sequences {xν} and {vν} such that limν xν = x, limν vν = v, and vν ˆ ˆg(xν) for any ν. For an arbitrary j / I( x), it is not difﬁcult to verify vν j = 0 for all ν sufﬁciently large. Therefore, we obtain vj = 0 for any j / I( x). Thus, we conclude (A.14) which together with the facts of ˆ ˆg( x) ˆg( x) and (A.13) implies

ˆg( x) = {v | vi = 0 (i / I( x))} .

Finally, from this equality and (A.11), we obtain the desired result (A.9). For the case where p = 1, it is easy to show the desired result (A.10) using the fact of ˆg( x) = λ P i I( x) x |xi||x= x . We omit the detailed proof.

We are now ready to show Lemma 3.

Proof of Lemma 3: We ﬁrst note that, since G is continuously differentiable and R1(w) = w p p and λ1 0, we have

w G(w, λ) + λ1R1(w)

= w G(w, λ) + w(λ1R1(w))

n v | vi = G(w, λ)

wi + λ1p sgn(wi)|wi|p 1 (i / I(w)), vi R (i I(w)) o (p < 1) n v | vi = G(w, λ)

wi + λ1sgn(wi) (i / I(w)), vi G(w, λ)

wi + [ λ1, λ1] (i I(w)) o (p = 1),

OKUNO, TAKEDA, KAWANA, AND WATANABE

where the ﬁrst equality follows from Proposition A.2 and the second equality comes from Proposition A.4. Now, let us show the ﬁrst claim. Suppose 0 w(G(w, λ) + λ1R1(w)). Then, by (A.15), we have

wi = 0 (i I(w)), (A.16)

wi + p sgn(wi)λ1|wi|p 1 = 0 (i / I(w)), (A.17)

which readily imply W w G(w, λ) + pλ1|w|p = 0. Hence, we obtain the ﬁrst claim. We next show the latter claim for the case of p < 1. Suppose that W w G(w, λ)+pλ1|w|p = 0. Then, we see that (A.16) and (A.17) hold. In view of this fact together with (A.15) for p < 1, we obtain 0 w(G(w, λ) + λ1R1(w)). Thus, we conclude the latter claim.

A.3 Proof of Proposition 6

Denote wk = (wk 1, wk 2, . . . , wk n) for each k. We ﬁrst show (21). Note that it follows from (19) that

wk i ( ϕµk 1(wk))i = p(wk i )2((wk i )2 + µ2 k 1) p 2 1 (A.18)

for each i {1, 2, . . . , n}. Then, for the index i / I(w ), we have w i = 0 and thus get

lim k wk i ( ϕµk 1(wk))i = p(w i )2|w i |p 2

= p|w i |p. (A.19)

We next choose i I(w ) arbitrarily and divide the index set K := {1, 2, . . . , } into the following two sets: Ui 1 := {k K | wk i = 0}, Ui 2 := {k K | wk i = 0}.

Then, for k Ui 1, equation (A.18) together with p/2 1 < 0 and wk i = 0 yields that

wk i ( ϕµk 1(wk))i p|wk i |2|wk i |2( p

= p|wk i |p. (A.20)

Since wk i ( ϕµk 1(wk))i 0 holds for each k Ui 1 in view of the right-hand of (A.18) and limk µk 1 = 0, letting k Ui 1 in (A.20) implies

lim k Ui 1 wk i ( ϕµk 1(wk))i = p|w i |p = 0. (A.21)

Similarly, for all k Ui 2, we have wk i ( ϕµk 1(wk))i = 0 because of wk i = 0 (k Ui 2) and (19). This fact together with (A.21) yields

lim k wk i ( ϕµk 1(wk))i = p|w i |p. (A.22)

Combining this with (A.19), we conclude (21).

BILEVEL ℓp-HYPERPARAMETER LEARNING

We next show (22). In view of (20), we have

(wk i )2( 2ϕµk 1(wk))ii

= p(wk i )2((wk i )2 + µ2 k 1) p 2 1

+ p(p 2)(wk i )4((wk i )2 + µ2 k 1) p 2 2

1 + (p 2)(wk i )2

(wk i )2 + µ2 k 1

! wk i ϕ(wk)

for any i = 1, 2, . . . , n, where the last equality is due to (19). For the case of i / I(w ), we obtain

lim k (wk i )2

(wk i )2 + µ2 k 1 = 1,

which together with (A.19) and (A.23) implies

lim k (wk i )2( 2ϕµk 1(wk))ii = p(p 1)|w i |p. (A.24)

In turn, let us focus on the case of i I(w ). Then, the sequence {(wk i )2/ (wk i )2 + µ2 k 1 } is bounded since |(wk i )2/ (wk i )2 + µ2 k 1 | < 1 follows from µk 1 > 0 for all k. Hence, using (A.22), we derive from (A.23) that

lim k (wk i )2( 2ϕµk 1(wk))ii = 0 = p(p 1)|w i |p,

where the last equality is due to w i = 0 for i I(w ). By this equation together with (A.24), we conclude (22). The proof is complete.

A.4 Proof of Lemma 7

Choose i I(w ) arbitrarily. We show the claim for the case where wk i = 0 for all k K. It is not difﬁcult to extend the argument to the general case where wk i = 0 occurs for inﬁnitely many k. Also, we may assume λk 1 > 0 for all k K because of Assumption A1. For simplicity, denote

Fi(wk, λk) := G(wk, λk)

for each k K. From (19) and the i-th element of condition (7) with (w, λ, ε4) = (wk, λk, εk 1 4 ), we have, for each k K,

Fi(wk, λk) + pλk 1wk i ((wk i )2 + µ2 k 1) p 2 1 = (εk 1 4 )i, (A.25)

which together with the assumption wk i = 0 and λk 1 = 0 (k K) implies

Fi(wk, λk) (εk 1 4 )i = 0 (k K).

Recall that εk 1 4 0 as k . Noting this fact and (A.25), we get

Fi(wk, λk) (εk 1 4 )i

p λk 1|wk i | 2 p 2 (wk i )2,

OKUNO, TAKEDA, KAWANA, AND WATANABE

where p := p 2 p 2 , λk 1 := (λk 1) 2 p 2 .

Then, it follows that

|wk i | 2 2 p

µ2 k 1 = p λk 1 Fi(wk, λk) (εk 1 4 )i

2 p 2 p λk 1|wk i |2+ 2 p 2 . (A.26)

To show the desired result, it sufﬁces to prove that n |wk i | 2 2 p /µ2 k 1 o

k K is bounded from above.

To this end, we ﬁrst consider the case of p = 1. By substituting p = 1 for (A.25), we get

Fi(wk, λk) (εk 1 4 )i + λk 1 wk i q

(wk i )2 + µ2 k 1 = 0. (A.27)

Moreover, by substituting p = 1 for (A.26), we have

µ2 k 1 = (λk 1) 2 Fi(wk, λk) (εk 1 4 )i 2 (λk 1) 2

Fi(wk, λk) (εk 1 4 )i 2

(λk 1)2 Fi(wk, λk) (εk 1 4 )i 2 . (A.28)

From equation (A.27), it is not difﬁcult to see that |Fi(wk, λk) (εk 1 4 )i|2 |λk 1|2. In this inequality, let k K . Then, Assumption A3 together with Fi(w , λ ) = G(w , λ )/ wi yields

(λ 1)2 |Fi(w , λ )|2 > 0. (A.29)

Letting k K in equation (A.28) and noting (A.29), we readily derive that

lim k K |wk i | 2 2 p

Fi(w , λ ) 2

(λ 1)2 Fi(w , λ ) 2 < . (A.30)

We next consider the case of p < 1. By using (A.26) again, it holds that

lim k K |wk i | 2 2 p

µ2 k 1 = p λ 1 Fi(w , λ ) 2 p 2 p λ 1|w i |2+ 2 p 2

= p λ 1 Fi(w , λ ) 2 p 2

where λ 1 := (λ 1) 2 p 2 > 0 and the second equality follows from 2 + 2 p 2 > 0 and w i = 0 because of i I(w ). Particularly, note that the last strict inequality is true due to 2/(p 2) < 0 even if Fi(w , λ ) = 0. Finally, by (A.30) and (A.31), we conclude the desired result.

BILEVEL ℓp-HYPERPARAMETER LEARNING

A.5 Proof of Proposition 9

We prepare the following lemma.

Lemma A.5 Suppose that Assumption A4 holds and let (w , λ ) be an arbitrary accumulation

point of the sequence {(wk, λk)}. Recall that we write wh(w) := h(w)

wi1 , . . . , h(w)

for a function h : Rn R and the index set {i1, i2, . . . , ip} := {1, 2, . . . , n} \ I(w ). Moreover, denote w := (wi)i/ I(w ) and

( w,λ)Φi(w, λ) := wΦi(w, λ) λΦi(w, λ)

Rn |I(w )|+r (i / I(w )), (A.32)

( w,λ)λi := wλi λλi

Rn |I(w )|+r (i I(λ )). (A.33)

Then, the vectors n ( w,λ)Φi(w , λ )

i/ I(w ) , ( w,λ)λi|λ=λ

are linearly independent.

Proof Notice that ( w,λ)λi is the vector such that the (n |I(w )| + i)-th entry is 1 and the others are 0s. Under Assumption A4, we see that the matrix

M := h ( Φi(w , λ ))i/ I(w ) , ( (w,λ)wi|w=w )i I(w ), ( (w,λ)λi|λ=λ )i I(λ ) i R(n+r) (n+|I(λ )|)

is of full-column rank. Since the matrix

" zeros(|I(w )|, n |I(w )|) E|I(w )| zeros(|I(w )|, |I(λ )|) ( w,λ)Φi(w , λ )

i/ I(w ) zeros(n |I(w )| + r, |I(w )|) ( ( w,λ)λi|λ=λ )i I(λ )

R(n+r) (n+|I(λ )|),

where Es denotes the s s identity matrix and zeros(s, t) stands for the zero matrix in Rs t, is obtained by applying appropriate elementary column and row operations to M, we ﬁnd that N is of full-column rank. Hence, the desired result is obtained.

Proof of Proposition 9: For simplicity, let

ξk := ((ζk) , (ηk) ) , ˆζk := ζk

ξk , ˆηk := ηk

for each k. Suppose to the contrary that {ξk} is unbounded. Choosing an arbitrary accumulation point (w , λ ) of the sequence {(wk, λk)}, without loss of generality, we can assume that (wk, λk) (w , λ ) and ξk as k , if necessary, by taking a subsequence. Let us denote an arbitrary accumulation point of {ξk/ ξk } by ˆξ := ((ˆζ ) , ˆη ) , where ˆζ and ˆη are accumulation points of {ˆζk} and {ˆηk}, respectively. Again, without loss of generality, we can suppose limk ˆξk = ˆξ .

OKUNO, TAKEDA, KAWANA, AND WATANABE

Notice that ˆξ = 1. By dividing both sides of (4), (5), (6), and (8) with w = wk, λ = λk, ζ = ζk, η = ηk and (ε1, ε2, ε3, ε4, ε5) = (εk 1 1 , εk 1 2 , εk 1 3 , εk 1 4 , εk 1 5 ) by ξk , we have, for each k, f(wk)

i ξk + 2 ww G(wk, λk)ˆζk

i + λk 1( 2ϕµk(wk))iiˆζk i = (εk 1 1 )i ξk (i = 1, 2, . . . , n),

ϕµk(wk) ˆζk ˆηk 1 = εk 1 2 ξk , (A.35)

Ri(wk) ˆζk ˆηk i = (εk 1 3 )i ξk (i = 2, 3, . . . , r), (A.36)

λk i ˆηk i εk 1 5 ξk , λk i 0, ˆηk i 0 (i = 1, 2, . . . , r), (A.37)

where the last conditions are deduced by componentwise decomposition of (8). Note that εk 1 1 / ξk , εk 1 2 / ξk , εk 1 3 / ξk , and εk 1 5 / ξk converge to 0 as k . By driving k in (A.37) for i = 1 and using limk λk 1 = λ 1 > 0 from Assumption A1, we have

ˆη 1 = 0. (A.38)

In a similar manner, we can get ˆη i = 0 (i / I(λ )), (A.39)

where I(λ ) = {i {1, 2, . . . , r} | λ i = 0} as is deﬁned in Assumption A4. Expressions (A.38) and (A.39) together with ˆξ = 1, that is, ˆξ 2 = ˆζ 2 + Pr i=1 |ˆη i |2 = 1 imply

i I(λ ) |ˆη i |2 = 1. (A.40)

Next, let k in (A.34). By the boundedness of { 2 ww G(wk, λk)ˆζk} and limk f(wk)/ ξk = 0, we ﬁnd that λk 1 2ϕµk(wk)

ii ζk i / ξk is bounded for each i. Using this fact, limk λk 1 = λ 1 > 0, and limk |( 2ϕµk(wk))ii| for i I(w ) by Proposition 8 yield

ˆζ i = 0 (i I(w )). (A.41)

We next show that X

i/ I(w ) sgn(w i )|w i |p 1ˆζ i = 0. (A.42)

For proving (A.42), it sufﬁces to show

lim k ϕµk 1(wk) ˆζk = X

i/ I(w ) p sgn(w i )|w i |p 1ˆζ i . (A.43)

Indeed, we can derive (A.42) from (A.43) by taking the limit of (A.35), (A.41), and (A.38) into account. Choose i I(w ) arbitrarily. By Lemma 7, there exists some γ > 0 such that

µ2 k 1 γ|wk i | 2 2 p (A.44)

BILEVEL ℓp-HYPERPARAMETER LEARNING

for all k sufﬁciently large. In what follows, we consider sufﬁciently large k so that inequality (A.44) holds. Then, by 0 < p 1, we get µ2 p k 1 γ 2 p

which implies

1 p( ϕµk 1(wk))iˆζk i = wk i ((wk i )2 + µ2 k 1) p 2 1ˆζk i

wk i µ 2( p

2 1) k 1 ˆζk i

µ2 p k 1 γ 2 p

2 1) k 1 ˆζk i

=γ p 2 1 ˆζk i . (A.45)

From relation (A.45) and expression (A.41) we obtain limk ( ϕµk 1(wk))iˆζk i = 0. Since i I(w ) was arbitrarily chosen, it holds that

i I(w ) ( ϕµk 1(wk))iˆζk i = 0. (A.46)

It then follows that

lim k ϕµk 1(wk) ˆζk = lim k

i ˆζk i + X

i/ I(w ) p sgn(w i )|w i |p 1ˆζ i ,

where the second equality follows from (A.46) and the last equality is due to the relation

lim k ( ϕµk 1(wk))i = p sgn(w i )|w i |p 1 (i / I(w )), (A.47)

which can be derived from (19). Therefore, we conclude the desired expression (A.43) and thus (A.42). In addition to (A.47), for i / I(w ), we obtain from (20) that

lim k ( 2ϕµk 1(wk))ii = p(p 1)|w i |p 2.

Then, forcing k in (A.34) yields

w G(w, λ) ˆζ

(w,λ)=(w ,λ )

+ λ 1p(p 1)|w i |p 2ˆζ i = 0 (i / I(w )),

OKUNO, TAKEDA, KAWANA, AND WATANABE

which can be transformed by using (A.41) into

P j / I(w ) G(w, λ)

(w,λ)=(w ,λ )

+ λ 1p P j / I(w ) sgn(wj)|wj|p 1ˆζ j

= 0 (i / I(w )).

Put w := (wi)i/ I(w ). Letting k in (A.36), we get Ri(w ) ˆζ ˆη i = 0 (i = 2, . . . , r), which together with (A.41) implies X

wj ˆζ j ˆη i = 0 (i = 2, . . . , r). (A.48)

Now, let Ψ := (Ψ i ) i/ I(w ) Rn |I(w )| with

Ψ i := P j / I(w ) G(w, λ)

(w,λ)=(w ,λ )

+ λ 1p P j / I(w ) sgn(wj)|wj|p 1ˆζ j

w=w (A.49) and ej Rr be the vector such that the j-th element is 1 and others are 0s. In addition, Φi (i / I(w )), ( w,λ)Φi (i / I(w )), and ( w,λ)λi (i I(λ )) are the functions deﬁned in Assumption A4, (A.32), and (A.33) in Lemma A.5, respectively. Then, it follows that X

j / I(w ) ( w,λ)Φj(w , λ )ˆζ j X

j I(λ ) ( w,λ)λj|λ=λ ˆη j

wΦj(w , λ ) λΦj(w , λ )

zeros(n |I(w )|, 1) ˆη j ej

( P j / I(λ ) Φj(w,λ)ˆζ j ) wi

(w,λ)=(w ,λ )

i/ I(w ) ˆη + P j / I(w ) λΦj(w , λ )ˆζ j

Ψ P j / I(w ) ˆζ j p sgn(w j)|w j|p 1

P j / I(w ) R2(w )

wj ˆζ j ˆη 2 ... P j / I(w ) Rr(w )

wj ˆζ j ˆη r

= 0, (A.50)

where zeros(n |I(w )|, 1) denotes the zero matrix in Rn |I(w )|, the second equality follows from (A.39), the third one is from (A.38), deﬁnition (A.49) of Ψ , and easy calculation, and the last one is derived from (A.42), (A.48), and (A.49). Expression (A.50) together with Lemma A.5 entails ˆζ i = 0 (i / I(w )) and ˆη i = 0 (i I(λ )). Hence, by (A.41), we obtain ˆζ 2 + P i I(λ ) |ˆη i |2 = 0. However, it contradicts (A.40). Therefore, the sequence {(ζk, ηk)} is bounded.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Appendix B. Description of the algorithm for solving the smoothed problem (3)

In this section, we explain the algorithm used for solving problem (3) in the numerical experiment.

B.1 Implicit function based method

In this section, we describe the algorithm that is used for solving the following problem arising by smoothing problems (33) and (34) in the numerical experiments in Section 5:

min (w,λ) Rn Rn+1 fval(w) := Avalw bval 2 2

s.t. w argmin ˆ w

φµ( ˆw, λ) := Atr ˆw btr 2 + eλ1 n X

i=1 ( ˆw2 i + µ2) p 2 + ν ˆw C( λ) ˆw

where ν {0, 1} and C( λ) := Diag(exp(λi))n+1 i=2 . The above problems with ν = 0 and 1 correspond to problems (33) and (34), respectively. Our goal is to compute a KKT triplet (w, λ, η) Rn Rn+1 Rn of the above problem with the constraint replaced by the equality constraint wφµ(w, λ) = 0. Namely, we compute (w, λ, η) which satisﬁes

Θ(w, η) := fval(w) 0

+ 2 wwφµ(w, λ) 2 wλφµ(w, λ)

η = 0, wφµ(w, λ) = 0, (B.2)

where 2 wλφµ(w, λ) = λ ( wφµ(w, λ)) R(n+1) n. Given eλ and µ, let e w be a stationary point of the smoothed lower-level problem minw φµ(w, λ). According to the standard implicit function theorem, if 2 wwφµ( e w, eλ) is of full rank, there exist some open neighborhood Ueλ of eλ and a twice continuously differentiable implicit function w( ) : Ueλ Rn such that

e w = w(eλ), wφµ(w(λ), λ) = 0 (λ Ueλ).

In Ueλ, we may regard problem (B.1) with the constraint replaced by wφµ(w, λ) = 0 as

F(λ) := Avalw(λ) bval 2 2 . (B.3)

By the implicit function theorem again, we then have

w(eλ) = 2 wλφµ(w(eλ), eλ) 2 wwφµ(w(eλ), eλ) 1 ,

and hereby the gradient of the objective of problem (B.3) at eλ is expressed as follows:

F(eλ) = λ Avalw(eλ) bval 2 2

= 2 w(eλ)A val Avalw(eλ) bval

= 2 2 wλφµ(w(eλ), eλ) 2 wwφµ(w(eλ), eλ) 1 A val Avalw(eλ) bval .

OKUNO, TAKEDA, KAWANA, AND WATANABE

Algorithm B.1 Implicit function based quasi-Newton method for the smoothed subproblem

Require: λ0 Rn+1, ϵ 0, α, β (0, 1), B0 Sn+1 ++ (Sn+1 ++ : The set of (n + 1) (n + 1) real symmetric positive deﬁnite matrices); Set k 0.

1: while F(λk) ϵ do

2: Find wk satisfying wφµ(wk, λk) = 0.

3: Set dλ B 1 k F(λk).

4: Find the smallest integer ℓk 0 satisfying

F(λk + βℓkdλ) F(λk) + αβℓk F(λk) dλ. (B.4)

Set tk βℓk.

5: λk+1 λk + tkdλ.

6: Set Bk+1 Sn+1 ++ .

8: end while

9: Set ( w, λ) (wk, λk).

10: Solve Θ( w, λ, η) = 0 for η to obtain a Lagrange multiplier η. output ( w, λ, η)

By computing the above gradient at each iterate, we can preform the quasi-Newton method (Nocedal and Wright, 2006) for problem (B.3) to have a solution λ with F(λ ) = 0. Once λ is gained together with w(λ ), we substitute them into the ﬁrst equation Θ(w, η) = 0 in (B.2) and solve the resultant linear equation Θ(w(λ ), η) = 0 for η to have a solution, say η . The triplet (w(λ ), λ , η ) is then nothing but the desired KKT triplet. The overall algorithm is described as in Algorithm B.1. For the algorithm to work, the fullrankness of 2 wwφµ( e w, eλ) is necessary to ensure the existence of the implicit function w( ). This is expected to hold in many instances, although it cannot be guaranteed generally. We must solve the lower-level problem in Line 2 every time ℓk is updated while performing linesearch (B.4), and thus how we solve the smoothed lower-level problem affects the overall efﬁciency of Algorithm B.1. In the subsequent section, we will present a certain Newton-type method for solving the smoothed lower-level problem. Next, we make a remark on the linesearch procedure in Algorithm B.1. As mentioned previously, we need to solve the smoothed lower-level problem minw φµ(w, λk + βℓkdλ) so as to evaluate F(λk + βℓkdλ) every time ℓk is incremented. Actually, to compute F(λk + βℓkdλ), we need to know the value of w(λk + βℓkdλ) by solving the equation wφµ(w, λk + βℓkdλ) = 0. However, the smoothed lower-level problem minw φµ(w, λk + βℓkdλ) is nonconvex when p < 1 and thus the set of solutions of wφµ(w, λk + βℓkdλ) = 0 is not singleton in general5. This fact yields that applying a numerical method to this equation may not return w(λk + βℓkdλ). Nevertheless, in practice, we expect w(λk + βℓkdλ) to be computed successfully by applying, for example, a Newton-type method with w(λk) as a starting point to the equation, because w(λk +βℓkdλ) actually gets closer to w(λk) as ℓk is increased in the linesearch procedure.

5. When p = 1, minw φµ(w, λk + βℓkdλ) is strongly convex minimization in virtue of the term Pn i=1(w2 i + µ2) p 2 with µ > 0, and thus its solution set is singleton.

BILEVEL ℓp-HYPERPARAMETER LEARNING

The convergence analysis of Algorithm B.1 can be mostly done in a manner similar to that of the standard quasi-Newton method. Indeed, we can show that any accumulation point of {(wk, λk)} is a KKT point of the smoothed subproblem under the following two sets of assumptions:

Assumption B.1 Let {λk} be a sequence produced by Algorithm B.1. Then, the following properties hold:

1. The sequence {λk} is bounded.

2. There exist some α1, α2 (0 < α1 α2) such that

for all k, where E is the identity matrix with the same size with Bk, and for symmetric matrices X, Y , X Y stands for Y X is positive semideﬁnite.

The above assumptions are often made in convergence analysis of the quasi-Newton method, whereas the following assumption is speciﬁc to our setting.

Assumption B.2 2 wwφµ(wk, λk) is of full rank for each k, and so is 2 wwφµ(w , λ ) even at an arbitrary accumulation point (w , λ ).

Assumption B.2 ensures that the implicit function w( ) exists at each iterate and even at an arbitrary accumulation point. The following theorem holds under Assumptions B.1 and B.2. As the proof is similar to that for the quasi-Newton method, we omit it here.

Theorem B.3 Suppose that Assumptions B.1 and B.2 hold. Then, any accumulation point of {λk} satisﬁes F(λ) = 0.

B.2 Newton-type method for solving the smoothed lower-level problem

In this section, we describe the modiﬁed Newton-type algorithm used for solving the smoothed lower-level problem minw φµ(w, λ) in problem (B.1). For brevity, the algorithm is presented in the form pertaining to the following problem:

min w ψµ(w) := 1

2 Kw f 2 + η

i=1 (w2 i + µ2) p 2 , (B.5)

where η R is positive, K Rm n, and f Rm. Note that by setting K and f appropriately, the function ψµ above reduces to φµ. We begin with the update-formula of the standard Newton method for problem (B.5) at the r-th iterate wr Rn:

wr+1 wr B(wr) 1 ψµ(wr), where

B(w) := K K + pηDiag

(w2 i + µ2) p 2 1 + p 2

2 w2 i (w2 i + µ2) p 2 2 | {z } negative

OKUNO, TAKEDA, KAWANA, AND WATANABE

However, the matrix B(wr) is not necessarily nonsingular because of the above negative part, and thus the Newton method may not work.6 As a remedy, in the spirit of the modiﬁed Newton method, we modify B(wr) to the following matrix e B(wr) by deleting the negative part:

e B(w) := K K + pηDiag (w2 i + µ2) p 2 1 n

Now, the presented algorithm is described formally as in Algorithm B.2. In fact, the algorithm is identical to the one that is proposed by Lai and Wang (2011, Section 2), which gives the following theorem:

Theorem B.4 (Lai and Wang, 2011, Theorem 2.1) Let {wr} be a sequence generated by Algorithm B.2 with ϵ = 0. It is bounded and its arbitrary accumulation point satisﬁes ψµ(w) = 0.

It is worthwhile to note that Algorithm B.2 does not request a linesearch procedure for the global convergence, which is often costly.

Algorithm B.2 Modiﬁed Newton-type method for minw ψµ(w)

Require: w0 Rn, r 0, ϵ > 0

1: while ψµ(wr) > ϵ do

2: wr+1 wr e B(wr) 1 ψµ(wr),

3: r r + 1.

4: end while

5: Set w wr

Appendix C. Supplementary tables and ﬁgures of bayesopt for the numerical experiments

This section provides the supplementary Tables C.1 and C.2 that show the ﬁrst time when bayesopt found the best observed objective value. These results were recorded in a single run of bayesopt for each problem, thus differ from the averaged results shown in Tables 1 and 2. In addition, it also gives Figures C.1 and C.2 that depict how the best observed objective value of bayesopt varies over time. In order to monitor the change of values in a long period, we extended the time limit of bayesopt to 1200 seconds from 600 seconds that was employed for making Tables C.1 and C.2. These ﬁgures were obtained by solving the problems organized from the data sets of Cpu Small and Student.

6. In fact, when p = 1, B(w) is nonsingular even in the presence of the negative part, because

pηDiag (w2 i + µ2) p 2 1 + p 2

2 w2 i (w2 i + µ2) p 2 2 n

i=1 = pηDiag µ2 + p

2w2 i (w2 i + µ2) p 2 1 n

which turns out to be positive deﬁnite.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Table C.1: The ﬁrst time of bayesopt for ﬁnding a solution of problem (33), which attains the ﬁnal best observed objective value, that is, validation value (Those results of bayesopt were recorded in a single run, thus differ from the averaged results over 5 runs shown in Table 1. For the sake of comparison, the results of Algorithm 1 are also shown, which are the same as those in Table 1. f.time (sec) stands for the ﬁrst time in seconds where the best objective value is observed. The best values in f.time (sec) and Errval are displayed in bold.)

Data bayesopt Algorithm 1

name p Errval f.time (sec) Errval time (sec)

Facebook 1 6.476 42.256 6.474 17.399 0.8 6.504 122.158 6.512 22.242 0.5 6.536 66.589 6.550 16.820 Insurance 1 95.764 49.538 95.764 33.077 0.8 95.737 63.580 95.676 32.465 0.5 95.604 13.960 95.562 44.904 Student 1 0.777 12.405 0.778 10.586 0.8 0.724 339.980 0.724 2.348 0.5 0.731 147.118 0.724 3.618 Body Fat 1 0.209 4.839 0.209 0.068 0.8 0.180 3.899 0.179 0.203 0.5 0.212 1.160 0.267 0.395 Cpu Small 1 131124 1.202 130981 11.299 0.8 131187 1.853 130982 0.741 0.5 131234 1.712 131058 0.672

Table C.2: The ﬁrst time of bayesopt for ﬁnding a solution of problem (34), which attains the ﬁnal best observed objective value, that is, validation value (Those results of bayesopt were recorded in a single run, thus differ from the averaged results over 5 runs shown in Table 2. For the sake of comparison, the results of Algorithm 1 (Alg.1-A, Alg.1-B) are also shown, which are the same as those in Table 2. f.time (sec) stands for the ﬁrst time in seconds where the best objective value is observed. The best values in f.time (sec) and Errval are displayed in bold.)

Data bayesopt Alg.1-A Alg.1-B

name λ Errval f.time (sec) Errval time (sec) Errval time (sec)

Facebook 54 8.780 1.518 6.478 20.247 Insurance 86 107.000 4.288 95.694 50.714 94.604 4.473 Student 273 18.324 26.756 0.771 1.451 0.786 71.775 Body Fat 15 46.815 0.337 0.243 0.072 0.130 0.695 Cpu Small 13 151394 531.837 131130 6.222 128540 0.658

OKUNO, TAKEDA, KAWANA, AND WATANABE

Figure C.1: Best observed objective value (validation value) vs running time in seconds (bayesopt for problem (33) with a single ℓ0.8 hyperparameter)

(a) Student (The number of hyperparameters is 1; the proposed bilevel algorithm found a solution with 0.724 in 2 seconds.)

(b) Cpu Small (The number of hyperparameters is 1; the proposed bilevel algorithm found a solution with 1.3098 105 in 0.7 seconds.)

BILEVEL ℓp-HYPERPARAMETER LEARNING

Figure C.2: Best observed objective value (validation value) vs running time in seconds (bayesopt for problem (34) with multiple hyperparameters)

(a) Student (The number of hyperparameters is 273; the proposed bilevel algorithms found solutions with 0.771 in 1.45 seconds (Alg.1-A) and 0.786 in 71.78 seconds (Alg.1-B).)

(b) Cpu Small (The number of hyperparameters is 13; the proposed bilevel algorithms found solutions with 1.31 105 in 6.22 seconds (Alg.1-A) and 1.29 105 in 0.66 seconds (Alg.1-B).)

OKUNO, TAKEDA, KAWANA, AND WATANABE

Amir Beck and Marc Teboulle. Smoothing and ﬁrst order methods: A uniﬁed framework. SIAM Journal on Optimization, 22(2):557 580, 2012.

Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection via bilevel optimization. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 1922 1929, 2006.

Kristin P. Bennett, Gautam Kunapuli, Jing Hu, and Jong-Shi Pang. Bilevel optimization and machine learning. Computational Intelligence: Research Frontiers. WCCI 2008. Lecture Notes in Computer Science, vol. 5050, Berlin, Heidelberg, 2008. Springer.

Wei Bian and Xiaojun Chen. Worst-case complexity of smoothing quadratic regularization methods for non-Lipschitzian optimization. SIAM Journal on Optimization, 23(3):1718 1741, 2013.

Wei Bian and Xiaojun Chen. Optimality and complexity for constrained optimization problems with nonconvex regularization. Mathematics of Operations Research, 42(4):1063 1084, 2017.

Wei Bian, Xiaojun Chen, and Yinyu Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1-2):301 327, 2015.

Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier analysis and applications, 14(5-6):877 905, 2008.

Xiaojun Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1):71 99, 2012.

Xiaojun Chen, Fengmin Xu, and Yinyu Ye. Lower bound theory of nonzero entries in solutions of ℓ2-ℓp minimization. SIAM Journal on Scientiﬁc Computing, 32(5):2832 2852, 2010.

Xiaojun Chen, Lingfeng Niu, and Yaxiang Yuan. Optimality conditions and a smoothing trust region Newton method for non Lipschitz optimization. SIAM Journal on Optimization, 23(3):1528 1552, 2013.

Xiaojun Chen, Dongdong Ge, Zizhuo Wang, and Yinyu Ye. Complexity of unconstrained L2-Lp minimization. Mathematical Programming, 143(1-2):371 383, 2014.

Stephan Dempe, Joydeep Dutta, and Sebastian Lohse. Optimality conditions for bilevel programming problems. Optimization, 55(5-6):505 524, 2006.

Stephan Dempe and Alain B. Zemkoho. The generalized Mangasarian-Fromowitz constraint qualiﬁcation and optimality conditions for bilevel programs. Journal of Optimization Theory and Applications, 148(1):46 68, 2011.

Stephan Dempe and Alain B. Zemkoho. The bilevel programming problem: reformulations, constraint qualiﬁcations and optimality conditions. Mathematical Programming, 138:447 473, 2013.

Stephan Dempe, Vyacheslav Kalashnikov, Gerardo A Pérez-Valdés, and Nataliya Kalashnykova. Bilevel programming problems. Energy Systems. Springer, Berlin, 2015.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348 1360, 2001.

Matthias Feurer and Frank Hutter. Hyperparameter optimization. In Automated Machine Learning, pages 3 33. Springer, 2019.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1165 1173, 2017.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimilano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1568 1577, 2018.

Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of Lp minimization. Mathematical Programming, 129(2):285 299, 2011.

Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning, volume 37, pages 37 45, 2013.

Michael Hintermüller and Tao Wu. Nonconvex TVq-models in image restoration: Analysis and a trust-region regularization based superlinearly convergent solver. SIAM Journal on Imaging Sciences, 6(3):1385 1415, 2013.

Yaohua Hu, Chong Li, Kaiwen Meng, Jing Qin, and Xiaoqi Yang. Group sparse optimization via ℓp,q regularization. Journal of Machine Learning Research, 18:1 52, 2017.

Karl Kunisch and Thomas Pock. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938 983, 2013.

Ming-Jun Lai and Jingyue Wang. An unconstrained ℓq minimization with 0 < q 1 for sparse solution of underdetermined linear systems. SIAM Journal on Optimization, 21(1):82 101, 2011.

Moshe Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci. edu/ml.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In Proceedings of the Twenty Third International Conference on Artiﬁcial Intelligence and Statistics, pages 1540 1552, 2020.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2113 2122, 2015.

Goran Marjanovic and Victor Solo. On ℓq optimization and matrix completion. IEEE Transactions on Signal Processing, 60(11):5714 5724, 2012.

Goran Marjanovic and Victor Solo. On exact ℓq denoising. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6068 6072, 2013.

OKUNO, TAKEDA, KAWANA, AND WATANABE

Chuang Miao and Hengyong Yu. Alternating iteration for ℓp (0 < p 1) regularized CT reconstruction. IEEE Access, 4:4355 4363, 2016.

Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimization, 2:117 129, 1978.

Gregory Moore, Charles Bergeron, and Kristin P. Bennett. Model selection for primal SVM. Machine Learning, 85(1):175 208, 2011.

Gregory M Moore, Charles Bergeron, and Kristin P Bennett. Nonsmooth bilevel programming for hyperparameter selection. In 2009 IEEE International Conference on Data Mining Workshops, pages 374 381, 2009.

Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1): 127 152, 2005.

Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

Peter Ochs, René Ranftl, Thomas Brox, and Thomas Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175 194, 2016.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 737 746, 2016.

R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Saharon Rosset. Bi-level path following for cross validated solution of kernel quantile regression. Journal of Machine Learning Research, 10:2473 2505, 2009.

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artiﬁcial Intelligence and Statistics, volume 89, pages 1723 1732, 2019.

Fei Wen, Peilin Liu, Yipeng Liu, Robert C. Qiu, and Wenxian Yu. Robust sparse recovery for compressive sensing in impulsive noise using ℓp-norm model ﬁtting. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4643 4647, 2016.

Fei Wen, Lasith Adhikari, Ling Pei, Roummel F. Marcia, Peilin Liu, and Robert C Qiu. Nonconvex regularization-based sparse recovery and demixing with application to color image inpainting. IEEE Access, 5:11513 11527, 2017.

Fei Wen, Lei Chu, Peilin Liu, and Robert C. Qiu. A survey on nonconvex regularization-based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access, 6: 69883 69906, 2018.

Haolei Weng, Le Zheng, Arian Maleki, and Xiaodong Wang. Phase transition and noise sensitivity of ℓp-minimization for 0 p 1. In IEEE International Symposium on Information Theory, pages 675 679, 2016.

BILEVEL ℓp-HYPERPARAMETER LEARNING

Zongben Xu, Xiangyu Chang, Fengmin Xu, and Hai Zhang. L1/2 regularization: A thresholding representation theory and a fast solver. IEEE Transactions on neural networks and learning systems, 23(7):1013 1027, 2012.

Jane J. Ye and D. L. Zhu. Optimality conditions for bilevel programming problems. Optimization, 33(1):9 27, 1995.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49 67, 2006.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2):894 942, 2010.

Le Zheng, Arian Maleki, Quanhua Liu, Xiaodong Wang, and Xiaopeng Yang. An ℓp-based reconstruction algorithm for compressed sensing radar imaging. In 2016 IEEE Radar Conference (Radar Conf), pages 1 5, 2016.