# Inductive Two-layer Modeling with Parametric Bregman Transfer

Vignesh Ganapathiraman¹, Zhan Shi¹, Xinhua Zhang¹, Yaoliang Yu²

¹Department of Computer Science, University of Illinois at Chicago, USA. ²School of Computer Science, University of Waterloo, Canada. Correspondence to: Xinhua Zhang.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract. Latent prediction models, exemplified by multilayer networks, employ hidden variables that automate abstract feature discovery. They typically pose nonconvex optimization problems, and effective semi-definite programming (SDP) relaxations have been developed to enable global solutions (Aslan et al., 2014). However, these models rely on nonparametric training of layer-wise kernel representations, and are therefore restricted to transductive learning, which slows down test prediction. In this paper, we develop a new inductive learning framework for parametric transfer functions using matching losses. The result for ReLU utilizes completely positive matrices, and the inductive learner not only delivers superior accuracy but also offers an order-of-magnitude speedup over SDP with constant approximation guarantees.

## 1. Introduction

The past decade has witnessed advances of deep learning in a broad range of application areas such as game playing (Silver et al., 2016), natural language processing (Sutskever et al., 2014), and image processing and computer vision (He et al., 2016). Its effectiveness is often attributed to the automated learning of latent representations, in that salient and discriminative features are highly beneficial for the overall learning task. With abstract and semantic features synthesized, the predictive relations between observations can be captured with more ease despite possible complications in the correlations. In unsupervised learning, latent models have been widely used for clustering (Banerjee et al., 2005), dimensionality reduction (Lawrence, 2005), and transformation-invariant visual data analysis (Ranzato et al., 2012).

The focus of this paper is conditional modeling for supervised learning, where latent variables are learned in the context of output information, so that accurate reconstruction of outputs can be facilitated through predictive intervening features. Such features can characterize latent clusters (Tishby et al., 1999), sparse coding (Elad & Aharon, 2006), and invariant representations (Rifai et al., 2011), amongst others.

Despite their advantages in modeling and success in applications, latent models remain hard to train. The key challenge originates from the coupling of model parameter learning and latent variable inference, which in general leads to a nonconvex optimization problem. Although empirical performance has been the major focus of deep learning, substantial progress has recently been made towards the analysis of global training and the structure of the optimization problem. For example, Choromanska et al. (2014) and Dauphin et al. (2014) showed that the lowest critical values of the random loss function are close to the global minimum, and Kawaguchi (2016) showed, under certain assumptions, that every local minimum is a global minimum for the expected loss function of a deep nonlinear neural network.
Similar global trainability results have been derived for gradient descent on two-node ReLU networks (Tian, 2017), quadratic activations (Soltanolkotabi et al., 2017), and one-hidden-layer non-overlapping convolutional nets (Brutzkus & Globerson, 2017). The global minima in over-parameterized settings were characterized for deep and wide nets and convolutional nets (Nguyen & Hein, 2017a;b). However, most analyses are still limited, especially with assumptions on the model and data distribution that are hard to verify in practice.

Along a different line of methodology, reformulations of latent models have been studied which admit tractable global solutions. Examples include boosting (Bengio et al., 2005), spectral methods (Anandkumar et al., 2014; Zhong et al., 2017), kernel methods (Zhang et al., 2016; 2017), polynomial networks and sum-product networks (Livni et al., 2014; Gens & Domingos, 2012), and semidefinite relaxations (Fogel et al., 2015). Unfortunately, they either impose restrictions on the model space (e.g. polynomial networks, recursive inverse kernels), or require tractability of underlying oracles, or rely on realizability assumptions.

A framework based on reformulation that accommodates more general latent variable structures was proposed by Aslan et al. (2013; 2014), where each pair of adjacent layers is conjoined through a prediction loss that favors nonlinear connections. A similar approach was designed by Carreira-Perpinnan & Wang (2014), which introduced auxiliary coordinates to allow deviation from layer-wise outputs with a penalty. In order to achieve a convex model, Aslan et al. (2013; 2014) further represent each layer's output as a kernel matrix, and the loss over adjacent kernels is relaxed in a jointly convex fashion, retaining nonlinear transformations that allow a rich set of salient latent features to be captured.

However, these models assume that all latent layers behave as a multi-label classifier, and the latent kernels are learned nonparametrically, i.e. there is no explicit parametric transfer function and nonlinear relations are introduced only through the loss functions between layers. This is more restrictive than state-of-the-art deep learners, where the activation functions are parametric and continuously valued, with popular choices such as ReLU. As a result, the model is restricted to a transductive setting, in that training examples are required to establish the data-dependent context of nonparametric kernel learning. This restriction significantly slows down predictions at test time, which is more important than the training cost. Such a challenge in efficiency is exacerbated as the kernel-based learning leads to an expensive semi-definite program (SDP), whose computational cost limited their experiments to only 200 examples.

The goal of this paper, therefore, is to develop an inductive and efficient learning strategy for two-layer conditional models with global optimality guarantees. This allows predictions to be made as efficiently as a feedforward neural network (FFNN) does, obviating retraining at test time. It is achieved by directly constructing a convex relaxation based on a parametric transfer function (e.g. ReLU) specified a priori. In particular, we first make a new observation that no inter-layer loss satisfying nonlinear recovery and grounding can be jointly convex (§2).
However, by using the matching loss, the non-convexity can be encapsulated entirely by a bilinear term, facilitating a convex relaxation for ReLU based on completely positive (CP) cones (§3). The result provides a direct initialization of an FFNN for finer tuning, which yields, inductively, more accurate predictions than baseline training methods (§5). Different from the SDP used by Aslan et al. (2013; 2014), our CP-based model allows us to develop a new efficient algorithm using low-rank approximation, scaling up the size of solvable problems by an order of magnitude (§4). A new constant approximation guarantee is also proved.

## 2. Matching Loss for Transfer Functions

Two-layer neural networks are composed of two nonlinear conditional models. The latent layer is characterized by a nonlinear transfer function f: R^h → R^h, which converts the linear transformation Wx into φ = f(Wx). Here x ∈ R^n is the raw input feature, and W ∈ R^{h×n} holds the hidden layer weights. We use regular lowercase letters for scalars, bold lowercase letters for vectors, and capital letters for matrices. The resulting φ is further multiplied with the output layer weights U ∈ R^{h×m}, and the product is measured against the given label y ∈ R^m via a loss function ℓ(U⊤φ, y), where U⊤ is the transpose of U.

Typical losses include the binary hinge loss ℓ(z, y) = [1 − yz]_+ with m = 1, where y ∈ {−1, 1} and [z]_+ := max{0, z}. For multiclass problems with C classes, y encodes a class c with the canonical vector e_c. Then m = C and the hinge loss is ℓ(z, y) = max{1 − y + z − (y⊤z)1}, where the max is taken over the components of its vector argument and 1 is the vector of all ones. The multiclass logistic loss is ℓ(z, y) = −y⊤z + log Σ_k e^{z_k}.

There are several popularly used transfer functions. The simplest options are elementwise, i.e. f(z) = (f(z_1), ..., f(z_h))⊤, where all z_i are applied separately to the same function f: R → R. ReLU uses f_r(z) = [z]_+, and variants include the leaky rectifier, which uses f_l(z) = max{z, az} where a > 0 is a small positive number, and the bounded hard tanh, which uses f_h(z) = max{−1, min{z, 1}}. Transfers that are not piecewise linear are also available, e.g. the sigmoid f_s(z) = (1 + e^{−z})^{−1}. These transfers are illustrated in Figure 1. Non-elementwise transfers are also available, e.g. the soft-max function with f(z) = (e^{z_1}, ..., e^{z_h})⊤ / Σ_{k=1}^h e^{z_k}.

A major source of non-convexity in neural networks is the nonlinear transfer function. To cope with it, a natural approach is to replace the exact connection φ = f(z) by a loss function that penalizes the deviation between φ and f(z). Formally, it attempts to construct a loss L(φ, z) that would (ideally) satisfy three conditions:

- Unique recovery: arg min_φ L(φ, z) = f(z) for all z, with the arg min attained uniquely.
- Joint convexity: L is jointly convex over φ and z. This is required if we choose to build a jointly convex deep model by directly using L to connect the input and output of adjacent layers.
- Grounding: min_φ L(φ, z) = 0 for all z, so that there is no bias towards any value of z.

Unfortunately, it can be shown that such a loss does not exist unless f is affine (see the proof in Appendix A):

Theorem 1. There exists a loss L that satisfies all three conditions if, and only if, f is affine.

This result motivates us to resort to weaker versions of loss.
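Before turning to the matching loss, a minimal numerical sketch (ours, not code from the paper) of the four elementwise transfers in Figure 1 and their potentials F with f = F′, verified by finite differences; the dictionary keys are illustrative names.

```python
import numpy as np

# The four elementwise transfers of Figure 1 and their potentials F (with f = F');
# the names and the finite-difference check below are ours, not the paper's.
transfers = {
    "relu":      (lambda z: np.maximum(z, 0.0),
                  lambda z: 0.5 * np.maximum(z, 0.0) ** 2),
    "leaky":     (lambda z, a=0.05: np.maximum(z, a * z),
                  lambda z, a=0.05: np.where(z > 0, 0.5 * z ** 2, 0.5 * a * z ** 2)),
    "hard_tanh": (lambda z: np.clip(z, -1.0, 1.0),
                  lambda z: np.where(np.abs(z) <= 1, 0.5 * z ** 2, np.abs(z) - 0.5)),
    "sigmoid":   (lambda z: 1.0 / (1.0 + np.exp(-z)),
                  lambda z: np.logaddexp(0.0, z)),     # F(z) = log(1 + e^z)
}

z = np.linspace(-4, 4, 2001)
eps = 1e-5
for name, (f, F) in transfers.items():
    grad = (F(z + eps) - F(z - eps)) / (2 * eps)       # numerical F'
    print(name, np.max(np.abs(grad - f(z))) < 1e-4)    # True for all four transfers
```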
Figure 1. Four examples of transfer function f and the corresponding potential function F: (a) linear rectifier (ReLU), (b) leaky rectifier (ϵ = .05), (c) hard tanh, (d) sigmoid.

Interestingly, the matching loss (Auer et al., 1996) meets the first and third conditions, and satisfies a weakened version of convexity under a very mild condition on f. In particular, we assume that the transfer function is the gradient of a strictly convex function F: f = ∇F, with F: R^h → R. If f is elementwise, this just means the constituent f is continuous and strictly increasing. As a result, the inverse of f also exists, and it is well known that f^{−1} = ∇F*, where F* is the Fenchel conjugate of F. Although the ReLU f_r(z) is not strictly increasing on the negative half line, it can be approximated arbitrarily closely via max{ϵz, z} for infinitesimally small ϵ > 0. Similar alterations can be applied to the hard tanh f_h(z) by allowing a tiny slope ϵ for |z| ≥ 1. The F corresponding to the above-mentioned transfers f are also shown in Figure 1. In the case that f is not elementwise, this assumption of F implies: 1) f is strictly increasing in the vector sense: (x − y)⊤(f(x) − f(y)) > 0, and 2) the Jacobian of f is symmetric (as the Hessian of F): Jf = (Jf)⊤, provided f is differentiable.

Under this assumption, we adopt the following loss function based on the Bregman divergence:

L(φ, z) = D_{F*}(φ, f(z)) = F*(φ) + F(z) − φ⊤z,    (1)

where D_{F*} is the Bregman divergence induced by F*. Obviously L meets the conditions of recovery and grounding, but it is not jointly convex. However, the only nonconvex part is the bilinear term −φ⊤z, while both F and F* are convex. Such a decoupling of the nonconvex terms from the transfer functions is the key enabler for our convex reformulation.

## 3. Convex Two-layer Modeling

Suppose we have t training pairs {(x_j, y_j)}_{j=1}^t, stacked in two matrices X = (x_1, ..., x_t) ∈ R^{n×t} and Y = (y_1, ..., y_t) ∈ R^{m×t}. The corresponding set of latent layer outputs is stacked into Φ = (φ_1, ..., φ_t) ∈ R^{h×t}. The regularized risk minimization objective can be written as

min_{W,Φ,U,b} Σ_{j=1}^t [ D_{F*}(φ_j, f(Wx_j)) + ℓ(U⊤φ_j + b, y_j) ] + ½‖W‖² + ½‖U‖²
= min_{W,U,b,Φ} Σ_{j=1}^t { F*(φ_j) − φ_j⊤Wx_j + F(Wx_j) + ℓ_j(U⊤φ_j + b) } + ½‖W‖² + ½‖U‖²,    (2)

where ℓ_j(U⊤φ_j + b) := ℓ(U⊤φ_j + b, y_j). We introduced regularization via Frobenius norms. The weight of both regularization terms can be tuned by any model selection method, e.g. cross validation, and here we set it to 1 to simplify the presentation. We also assume that dom ℓ_j is the entire space. To keep our notation neat, we write vector-input functions on matrices, representing the sum of the function values applied to each column, e.g. F*(Φ) = Σ_j F*(φ_j). Now we can rewrite the objective compactly as

min_{Φ,W,U,b} F*(Φ) − tr(Φ⊤WX) + F(WX) + ℓ(U⊤Φ + b1⊤) + ½‖W‖² + ½‖U‖².    (3)

It is bi-convex in the two groups of variables (Φ, b) and (W, U), i.e. fixing one group it is convex in the other.
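As a concrete illustration of the bi-convex objective (3) with the ReLU transfer, the sketch below (ours; the function name is hypothetical and the output loss ℓ is instantiated as the multiclass logistic loss) evaluates the matching-loss term, the output loss, and the regularizers. Setting Φ = f(WX) makes the matching term vanish, reflecting the grounding property of (1).

```python
import numpy as np

def relu_matching_objective(W, U, b, Phi, X, Y):
    """Evaluate the bi-convex objective (3) for the ReLU transfer.

    A minimal sketch: F(Z) = 0.5*||[Z]_+||^2, so F*(Phi) = 0.5*||Phi||^2 on
    Phi >= 0, and the inter-layer term is F*(Phi) - tr(Phi' W X) + F(W X).
    The output loss used here is the multiclass logistic loss.
    """
    Z = W @ X                                         # pre-activations, h x t
    matching = (0.5 * np.sum(Phi ** 2) - np.sum(Phi * Z)
                + 0.5 * np.sum(np.maximum(Z, 0.0) ** 2))
    S = U.T @ Phi + b[:, None]                        # output scores, m x t
    lse = np.log(np.exp(S - S.max(0)).sum(0)) + S.max(0)
    out_loss = np.sum(lse - (Y * S).sum(0))           # -y'z + logsumexp(z), summed over examples
    reg = 0.5 * (np.sum(W ** 2) + np.sum(U ** 2))
    return matching + out_loss + reg

# toy usage
rng = np.random.default_rng(0)
n, h, m, t = 4, 3, 2, 5
X = rng.standard_normal((n, t))
Y = np.eye(m)[:, rng.integers(0, m, t)]               # one-hot labels, m x t
W = rng.standard_normal((h, n))
U = rng.standard_normal((h, m))
b = np.zeros(m)

# grounding: choosing Phi = f(WX) = [WX]_+ makes the matching term vanish
Z = W @ X
Phi = np.maximum(Z, 0.0)
matching = 0.5 * np.sum(Phi ** 2) - np.sum(Phi * Z) + 0.5 * np.sum(np.maximum(Z, 0.0) ** 2)
print(np.isclose(matching, 0.0))                      # True
print(relu_matching_objective(W, U, b, Phi, X, Y))    # full objective value
```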
In order to derive a jointly convex reformulation, we first note that ℓ(U⊤Φ + b1⊤) = max_R { tr(R⊤(U⊤Φ + b1⊤)) − ℓ*(R) }, where ℓ* is the Fenchel conjugate of ℓ and R ∈ R^{m×t}. For the binary hinge loss, ℓ*(r) = yr for r ∈ [min{0, −y}, max{0, −y}], and +∞ otherwise. For the multiclass hinge loss, ℓ*(r) = y⊤r if r + y ∈ Δ_m := {x ∈ R^m_+ : 1⊤x = 1}, and +∞ otherwise. For the multiclass logistic loss, ℓ*(r) = Σ_i (r_i + y_i) log(r_i + y_i) if r + y ∈ Δ_m, and +∞ otherwise. Similarly, F(WX) = max_Λ { tr(Λ⊤WX) − F*(Λ) }. So we can rewrite (2) into

min_{W,U,b,Φ} max_{R,Λ} F*(Φ) − tr(Φ⊤WX) + tr(Λ⊤WX) − F*(Λ) + tr(R⊤(U⊤Φ + b1⊤)) − ℓ*(R) + ½‖W‖² + ½‖U‖²
= min_Φ max_{R,Λ} min_{W,U,b} F*(Φ) − tr(Φ⊤WX) + tr(Λ⊤WX) − F*(Λ) + tr(R⊤(U⊤Φ + b1⊤)) − ℓ*(R) + ½‖W‖² + ½‖U‖²
= min_Φ max_{R1=0, Λ} F*(Φ) − ½‖(Φ − Λ)X⊤‖² − ½‖ΦR⊤‖² − F*(Λ) − ℓ*(R).    (4)

The optimal W and U at the last equality are W = (Φ − Λ)X⊤ and U = −ΦR⊤. The first equality swaps min_{W,U,b} with max_{R,Λ}. Such a strong duality is indeed not trivial, because the celebrated Sion minimax lemma requires that the domain of (W, U) be compact, which is not assumed here. However, the conclusion is still correct, as we formalize here.

Theorem 2. For any W, U, b, denote L(Φ, R) = F*(Φ) − tr(Φ⊤WX) + tr(R⊤(U⊤Φ + b1⊤)) − ℓ*(R). Then min_Φ max_R L(Φ, R) = max_R min_Φ L(Φ, R).

To prove it, just use Proposition 2.2 (p. 173) of (Ekeland & Temam, 1999). There, take R = Λ = 0 (i.e. p_0 = 0), and then L diverges when (W, U) diverges. Note that b disappears since R = 0.

### 3.1. Convex relaxation

We now derive a convex relaxation for (4). To be concrete, consider the ReLU transfer with F_r(Z) = ½‖[Z]_+‖². Its Fenchel dual is F_r*(Φ) = ½‖Φ‖² for Φ ≥ 0 (elementwise), and +∞ otherwise. Therefore (4) can be specialized into

min_{Φ≥0} max_{R1=0, Λ≥0} ½‖Φ‖² − ½‖(Φ − Λ)X⊤‖² − ½‖ΦR⊤‖² − ½‖Λ‖² − ℓ*(R).    (5)

Notice that both Φ and Λ are constrained to the positive orthant, and they are both sized h × t. Since t ≥ h in general, their ranks are h and their column spaces have full rank. As a result, we may perform a change of variable via Λ = ΦA, where A ∈ R^{t×t}_+ is not necessarily symmetric. So we can rewrite (5) as

min_{Φ≥0} max_{R1=0, A≥0} ½‖Φ‖² − ½tr(Φ⊤Φ(I − A)X⊤X(I − A)⊤) − ½tr(Φ⊤ΦR⊤R) − ½tr(Φ⊤ΦAA⊤) − ℓ*(R).

Although this is still not convex, all occurrences of Φ are now in the form of Φ⊤Φ, leading to the natural idea of optimizing over Φ⊤Φ directly. Denote T := Φ⊤Φ ∈ R^{t×t}, and then we finally arrive at

min_{T ∈ T_h} max_{R1=0, A≥0} ½tr(T) − ½tr(T(I − A)X⊤X(I − A)⊤) − ½tr(TR⊤R) − ½tr(TAA⊤) − ℓ*(R),

where T_h := {Φ⊤Φ : Φ ∈ R^{h×t}_+} ⊆ {T ∈ R^{t×t}_+ : T ⪰ 0}, and T ⪰ 0 means T is positive semi-definite (PSD). Now, given T, the maximization over R and A is concave because T ⪰ 0. Indeed, A and R are decoupled, making the inner optimization efficient. The objective function is also convex in T, because a maximization over linear functions of T gives a convex function. The only challenge left is the nonconvexity of T_h.

The set T_h is obviously a cone. In fact, if we relax the fixed value of h, we obtain the well-known completely positive (CP) matrix cone T (Berman & Shaked-Monderer, 2003). More interestingly, it is not hard to show that T is the tightest convex relaxation of T_h, i.e. the convex hull of T_h, for any h. Replacing T_h with T yields our final objective

min_{T ∈ T} max_{R1=0, A≥0} ½tr(T) − ½tr(T(I − A)X⊤X(I − A)⊤) − ½tr(TR⊤R) − ½tr(TAA⊤) − ℓ*(R).    (6)

It turns out that the convex relaxation does not require pre-specifying the number of hidden nodes; h can be figured out automatically through the rank of the optimal T. We will see in the sequel that the formulation does implicitly favor a low-rank solution through a gauge regularizer (Lemma 1), although a manual assignment of h can always be incorporated through truncation after optimization.
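The change of variable underlying (6) can be checked numerically: with Λ = ΦA and T = Φ⊤Φ, every Φ-dependent term in (5) depends on Φ only through T. A small sketch (ours, on random data):

```python
import numpy as np

rng = np.random.default_rng(0)
t, h, n, m = 6, 3, 4, 2
Phi = rng.random((h, t))                    # Phi >= 0
A = rng.random((t, t))                      # A >= 0, defining Lambda = Phi A
R = rng.standard_normal((m, t))
X = rng.standard_normal((n, t))
Lam = Phi @ A
T = Phi.T @ Phi                             # completely positive by construction
I = np.eye(t)

# each quadratic term of (5), rewritten through T = Phi' Phi only
print(np.isclose(np.linalg.norm((Phi - Lam) @ X.T) ** 2,
                 np.trace(T @ (I - A) @ X.T @ X @ (I - A).T)))            # True
print(np.isclose(np.linalg.norm(Phi @ R.T) ** 2, np.trace(T @ R.T @ R)))  # True
print(np.isclose(np.linalg.norm(Lam) ** 2, np.trace(T @ A @ A.T)))        # True
```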
Generality of the convexification scheme. We note in passing that the above technique is general and can be extended beyond ReLU. For example, when using the hard tanh transfer, we have F_h*(Φ) = ½‖Φ‖² if the L∞ norm ‖Φ‖_∞ := max_{ij} |Φ_{ij}| ≤ 1, and +∞ otherwise. Then we get the same objective function as in (6), only with T_h changed into {Φ⊤Φ : ‖Φ‖_∞ ≤ 1} and the domain of A changed into {A : Σ_i |A_{ij}| ≤ 1 for all j}. Even more general extensions to non-elementwise transfer functions can also be developed in our framework. The details on convexifying the soft-max transfer (and hard tanh) are deferred to Appendix B, and the space saved will be devoted to the more important issue of efficiently optimizing the model, hence overcoming the key bottleneck that has much confined the applicability of (Aslan et al., 2014).

## 4. Optimization

Although the problem (6) is convex, the set T lacks a compact characterization in terms of linear/quadratic, PSD, or second-order conic constraints. Optimization over completely positive matrices is known to be hard (Berman & Shaked-Monderer, 2003), and even projection onto T is NP-hard (Dickinson & Gijben, 2014).¹ Therefore we resort to conditional gradient (Frank-Wolfe) methods that are free of projection (CG; Jaggi, 2013; Harchaoui et al., 2015). The key benefits of CG lie in the efficiency of optimizing a linear function over T (a.k.a. the polar operator), robustness to its inaccuracy (Freund & Grigas, 2016), and the low rank of intermediate solutions due to its greedy and progressive nature (hence efficient intermediate updates).

¹ In spite of the convexity, a convex function may itself be NP-hard to evaluate, or it can be NP-hard to project onto a convex set or to optimize a linear function over it.

In practice, however, CG still suffers from slow convergence, and its linearly-converging variants are typically subject to a large condition number (Lacoste-Julien & Jaggi, 2015). This is partly because at each step only the weights on the existing bases are optimized, while the bases themselves are not. To alleviate this problem, Zhang et al. (2012) proposed the Generalized Conditional Gradient algorithm (GCG), which simultaneously optimizes the bases. Despite the lack of a theoretical proof, it is much faster in practice. Furthermore, GCG is robust to inexactness in polar operators, and one of our key contributions below is to show that it can efficiently solve (6) with a multiplicative approximation bound of 1/4.

Since GCG operates on gauge regularized objectives, our first step is to take a nontrivial path of rewriting (6). Recall that given a convex bounded set C containing the origin, the gauge function induced by C evaluated at T is defined as γ_C(T) := min{γ ≥ 0 : γX = T, X ∈ C}. If no such (γ, X) meets the condition, then γ_C(T) := +∞. Since (6) does not contain a gauge function induced by a bounded set (T is unbounded), we first recast it into this framework. The simplest way to add a bound to T is via the trace norm, which is exactly tr(T) since T ⪰ 0:

S := T ∩ {T : tr(T) ≤ 1}    (7)
   = conv( T_1 ∩ {T : tr(T) ≤ 1} )    (8)
   = conv{ xx⊤ : x ∈ R^t_+, ‖x‖ ≤ 1 }.    (9)

Our key observation is the following lemma, which allows us to rewrite the problem in terms of a gauge regularized objective. In particular, the domain of the gauge implicitly enforces the constraint on T.

Lemma 1. S is convex, bounded, and closed. In addition,

γ_S(T) = tr(T) if T ∈ T, and +∞ otherwise.    (10)

The proof is relegated to Appendix A. In fact, it is easy to show that for any convex cone C, the gauge function of its intersection with a half-space tr(A⊤T) ≤ 1 is exactly tr(A⊤T) over C. The significance of Lemma 1 is that it provides the cornerstone for solving the problem (6) by GCG.
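A minimal numerical illustration of Lemma 1 (ours): for T = Φ⊤Φ with Φ ≥ 0, the scaling T/tr(T) decomposes as a convex combination of atoms xx⊤ with x ≥ 0 and ‖x‖ ≤ 1, so it lies in S; and since every element of S has trace at most 1, no smaller scaling works, giving γ_S(T) = tr(T).

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.random((3, 5))                    # Phi >= 0, so T is completely positive
T = Phi.T @ Phi
gamma = np.trace(T)                         # claimed gauge value from Lemma 1

# write T = sum_i phi_i phi_i' over the rows phi_i of Phi, then rescale each
# atom to unit trace; the weights are nonnegative and sum to 1, so T/gamma in S
atoms = [np.outer(p, p) / (p @ p) for p in Phi]
weights = np.array([p @ p for p in Phi]) / gamma
recon = sum(w * a for w, a in zip(weights, atoms))
print(np.allclose(recon, T / gamma), np.isclose(weights.sum(), 1.0))   # True True
```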
Indeed, (6) can be equivalently rewritten as

min_T J(T) := ½γ_S(T) + g(T), where    (11)
g(T) := max_{R1=0, A≥0} −½tr(T(I − A)X⊤X(I − A)⊤) − ½tr(TR⊤R) − ½tr(TAA⊤) − ℓ*(R).    (12)

This objective finally falls into the framework of GCG, sketched in Algorithm 1 (Zhang et al., 2012; Harchaoui et al., 2015). GCG proceeds in iterations, and at each step it seeks the steepest descent extreme point T_new (a.k.a. basis) of the set S with respect to the objective gradient (steps 3-4). After finding the optimal conic combination with the existing solution (step 5), it directly optimizes the underlying factor Φ, initialized by the value that corresponds to the current solution T (step 6). Although this last step is not convex (hence called local optimization), it offers significant practical efficiency because it allows all existing bases to be optimized along with their weights. We next provide details on the efficient computational strategies for the above operations in our problem.

Algorithm 1: General GCG algorithm
1. Randomly sample Φ_1 ∈ [0, 1]^{1×t}, and set T_1 = Φ_1⊤Φ_1.
2. while k = 1, 2, ... do
3.   Find ∇g(T_k) with T_k = Φ_k⊤Φ_k by solving the inner maximization problem in g(T_k) of (12).
4.   Polar operator: find a new basis via T_new = arg max_{T ∈ S} ⟨T, −∇g(T_k)⟩.
5.   Compute the optimal combination weights (α, β) := arg min_{α≥0, β≥0} J(αT_k + βT_new).
6.   Locally optimize T: Φ_{k+1} = arg min_{Φ≥0} J(Φ⊤Φ), with Φ initialized by the value corresponding to Φ⊤Φ = αT_k + βT_new (see Section 4.1).
7. Return T_{k+1}.

### 4.1. Polar operator and constant multiplicative approximation guarantee

Given the negative gradient G = −∇g(T_k) ∈ R^{t×t}, the polar operator of S tries to solve the following optimization problem by using the characterization of S in (9):

max_{T ∈ S} tr(G⊤T) ⟺ max_{x ∈ R^t_+, ‖x‖ ≤ 1} x⊤Gx.    (13)

Unfortunately, this problem is NP-hard. If it were solvable for any G, then we could use it to answer whether min_{x ≥ 0} x⊤(−G)x ≥ 0. But the latter is to check the copositivity of −G, which is known to be co-NP-complete (Murty & Kabadi, 1987). Usually problems like (13) are approached by semi-definite relaxations (SDP), and Nemirovski et al. (1999) showed that it can be approximately solved with a multiplicative bound of O(1/log t). As one of our major contributions, we next show that when G ⪰ 0, this bound can be tightened to a constant for (13), with a computational procedure that is much more efficient than SDP. Furthermore, our problem does satisfy G ⪰ 0. Before proceeding, we first recall the definition of a multiplicative α-approximate solution.

Definition 1. Let α ∈ (0, 1] and assume an optimization problem max_{x ∈ X} f(x) has a nonnegative optimal value. A solution x' ∈ X is called α-approximate if f(x') ≥ α · max_{x ∈ X} f(x) ≥ 0. Similarly, the condition becomes 0 ≤ f(x') ≤ (1/α) · min_{x ∈ X} f(x) for minimization problems.

Theorem 3. Assume G ⪰ 0. Then a 1/4-approximate solution to (13) can be found in O(t²) time.

Proof. Since G ⪰ 0, it can be decomposed into G = H⊤H, and the problem (13) becomes max_{x ∈ R^t_+, ‖x‖ ≤ 1} ‖Hx‖². Let v be the top eigenvector of G, corresponding to its greatest eigenvalue. Then v maximizes ‖Hx‖ over ‖x‖ ≤ 1. Decompose v = v_+ − v_−, where v_+ = [v]_+ collects the nonnegative components and v_− collects the negative components. Apparently we have ‖v_+‖ ≤ 1 and ‖v_−‖ ≤ 1. Without loss of generality assume ‖Hv_+‖² ≥ ‖Hv_−‖², and consequently let us use v_+ as an approximate maximizer, which we demonstrate is 1/4-approximate:

max_{x ∈ R^t_+, ‖x‖ ≤ 1} ‖Hx‖² ≤ ‖Hv‖² = ‖Hv_+ − Hv_−‖² ≤ 2(‖Hv_+‖² + ‖Hv_−‖²) ≤ 4‖Hv_+‖².

Obviously v_+/‖v_+‖ is an even better solution, which can also be used as an initializer for further local optimization. The computational bottleneck lies in the top eigenvector v of G, which costs O(t²).
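The construction in the proof is easy to implement. The sketch below (ours; the helper name approx_polar is hypothetical) computes the 1/4-approximate polar direction from the top eigenvector and compares it against a crude grid-search estimate of the optimum.

```python
import numpy as np

def approx_polar(G):
    """1/4-approximate solution to max_{x >= 0, ||x|| <= 1} x' G x for PSD G (Theorem 3).

    Take the top eigenvector v of G, split it into nonnegative and negative
    parts, and keep the part with the larger quadratic value (the WLOG step
    of the proof); renormalizing to unit norm can only improve the value.
    """
    _, V = np.linalg.eigh(G)          # symmetric eigendecomposition, eigenvalues ascending
    v = V[:, -1]                      # top eigenvector
    v_plus, v_minus = np.maximum(v, 0.0), np.maximum(-v, 0.0)
    x = v_plus if v_plus @ G @ v_plus >= v_minus @ G @ v_minus else v_minus
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else x

# toy check against a crude grid search on a small PSD matrix
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))
G = H.T @ H
x = approx_polar(G)
cand = rng.random((100000, 4))
cand /= np.linalg.norm(cand, axis=1, keepdims=True)      # random feasible points
grid_val = (cand @ G * cand).sum(axis=1).max()           # lower bound on the optimum
print(x @ G @ x >= 0.25 * grid_val - 1e-9)               # True, as Theorem 3 guarantees
```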
In the case that G is not PSD, it turns out to be very hard to extend this technique while retaining a constant bound. However, the SDP-based technique of (Nemirovski et al., 1999) still applies, and the bound remains O(1/log t). In hindsight, our choice of adding the Frobenius norm constraint on Φ when defining S in (7) is not arbitrary. It constitutes the most straightforward path that allows the polar operator to be approximated in a tractable fashion. Other choices, such as structured Frobenius norms, could be possible if we would like to enforce structured decompositions in the hidden representation. We leave such extensions of tractable approximation for future exploration.

Finally, although our algorithm for the polar operator requires that G be positive semi-definite, which is not satisfied in general, this requirement happens to be fulfilled by our particular problem (11). Notice that the gradient of g is simply

∇g(T) = −½(I − A)X⊤X(I − A)⊤ − ½R⊤R − ½AA⊤,    (14)

where R and A are the optimal solutions to the inner maximization. This is obviously negative semi-definite, providing the key cornerstone for the constant approximation bound of our approach.

Optimality of GCG and rates of convergence. We finally translate the bound on the polar operator to that of the original objective (11). As shown by Theorem 1 of (Cheng et al., 2016), any α-approximate polar operator allows GCG to converge to an α-approximate solution of the original problem, and the convergence rate is O(1/ϵ). Hence we are guaranteed to find a 1/4-approximate solution to (11). The overall method is summarized in Algorithm 2.

### 4.2. Accelerating local optimization by converting min-max into min-min

The computational bottleneck of applying GCG to our problem (11) is the step of local optimization: min_Φ J(Φ⊤Φ) over Φ ∈ R^{h×t}_+. Owing to the Φ⊤Φ term, this objective is not convex. However, it is often observed in practice that the overall optimization can be much accelerated if we solve it just locally (e.g. by BFGS), with Φ initialized based on the value of the convex optimization variable T (step 6 of Algorithm 1 or step 11 of Algorithm 2). Unfortunately, since g defined in (12) employs a nested maximization, we are now faced with a min-max problem.

Algorithm 2: Solve (6) for T by the GCG algorithm
1. Randomly sample Φ_1 ∈ [0, 1]^{1×t}, and set T_1 = Φ_1⊤Φ_1.
2. while k = 1, 2, ... do
3.   if k = 1 then
4.     (U_k, b_k) = optimal U and b in (15) for Φ_1.
5.     M_k = optimal M in (15) for Φ_1.
6.   Recover the optimal R: R_k = ∇ℓ(U_k⊤Φ_k + b_k 1⊤).
7.   Recover the optimal A by (17).
8.   Compute the gradient G_k of g_µ at T_k = Φ_k⊤Φ_k via (14), with R and A served by R_k and A_k, respectively.
9.   Compute a new basis x_k by approximately solving arg max_{x ∈ R^t_+, ‖x‖ ≤ 1} x⊤(−G_k)x (c.f. Theorem 3).
10.  Line search: (α, β) := arg min_{α≥0, β≥0} J(αT_k + βx_k x_k⊤).
11.  Set Φ_tmp = (√α Φ_k⊤, √β x_k)⊤.
12.  Local search: (Φ_{k+1}, U_{k+1}, b_{k+1}, M_{k+1}) := LocalOpt(Φ_tmp, U_k, b_k, M_k) by Algorithm 3.
13. Return T_{k+1}.

Algorithm 3: Local optimization used by GCG
1. Require (Φ_tmp, U_k, b_k, M_k) from the current step.
2. Initialize: Φ = Φ_tmp, U = U_k, b = b_k, M = M_k.
3. for t = 1, 2, ... do  // until the change is small
4.   (U, b) = arg min_{U,b} { ℓ(U⊤Φ + b1⊤) + ½‖U‖² }.
5.   M = arg min_{M ≥ 0} h(M, Φ).
6.   Φ = arg min_{Φ ≥ 0} { ℓ(U⊤Φ + b1⊤) + h(M, Φ) }.
7. Return (Φ, U, b, M).
Different from min-min optimizations min_x min_y f(x, y), which can be solved very efficiently by alternating between optimizing x and y, a min-max problem like min_x max_y f(x, y) cannot be solved by alternating (fixing x, solve for y; fixing y, solve for x). Instead, one needs to treat the objective as a function of x, and for each x solve the inner maximization in y exactly, before obtaining a gradient in x that is supplied to standard solvers such as BFGS. This is often much slower than alternating. To enable an efficient solution by alternating, we next develop a novel reformulation of g as a minimization, so that minimizing g becomes a min-min problem:

g(Φ⊤Φ) = max_{R1=0} { −½‖ΦR⊤‖² − ℓ*(R) } + max_{A≥0} { −½‖Φ(I − A)X⊤‖² − ½‖ΦA‖² }
        = min_{U,b} { ℓ(U⊤Φ + b1⊤) + ½‖U‖² } + min_{M≥0} h(M, Φ),    (15)

where

h(M, Φ) := max_A { −½‖Φ(I − A)X⊤‖² − ½‖ΦA‖² + tr(M⊤A) }.    (16)

As the key advantage achieved here, the local optimization min_{Φ≥0} J(Φ⊤Φ) = min_{Φ≥0} ½‖Φ‖² + g(Φ⊤Φ) can now be solved by alternating between (U, b), M, and Φ. The details are shown in Algorithm 3. The optimization over (U, b) is standard supervised learning. However, the optimization over M and Φ is trickier, because they require evaluating h, which in turn involves a nested optimization over A. Fortunately, h is quadratic in A, which allows us to design an efficient closed-form scheme by leveraging the celebrated Woodbury formula (Woodbury, 1950).

Given (M, Φ), the optimal A can be found by setting its gradient to zero: Φ⊤ΦA(X⊤X + I) = M + Φ⊤ΦX⊤X. Unfortunately, the rank of Φ⊤ΦA (hence the left-hand side) is at most h < t. So no A can satisfy the equality if the rank of the right-hand side is greater than h, and hence h(M, Φ) is finite only if the column space of (M + Φ⊤ΦX⊤X)(X⊤X + I)^{−1} is contained in that of Φ⊤. Such an implicit constraint between variables precludes the application of alternating. To address this problem, we introduce a small strongly convex regularizer on A in the definition of h(M, Φ) in (16), akin to the standard smoothing technique (Nesterov, 2005):

h_µ(M, Φ) := max_A { −½‖Φ(I − A)X⊤‖² − ½‖ΦA‖² + tr(M⊤A) − (µ/2) tr(A(X⊤X + I)A⊤) },

where µ > 0 is small. The new term −(µ/2) tr(A(X⊤X + I)A⊤) also needs to be added to the definition of g in (12), which we will denote as g_µ. Then the optimal A can be found by setting the gradient to zero:

A = (Φ⊤Φ + µI)^{−1}(M + Φ⊤ΦX⊤X)(X⊤X + I)^{−1}.    (17)

To efficiently compute A, we apply the Woodbury formula:

µA = (M + Φ⊤ΦX⊤X)(X⊤X + I)^{−1} − Φ⊤(µI + ΦΦ⊤)^{−1}Φ(M + Φ⊤ΦX⊤X)(X⊤X + I)^{−1}.

Computational complexity. Here (µI + ΦΦ⊤)^{−1} ∈ R^{h×h} can be computed efficiently, as h is not large (it is exactly the iteration index k in GCG Algorithm 2). Then the second line can be computed in O(ht²) time, as we can precompute (X⊤X + I)^{−1}. So the only challenge in computing A is the term M(X⊤X + I)^{−1}, which costs O(t³) time. However, if n ≪ t, then we may again save computation by applying the Woodbury formula: M(X⊤X + I)^{−1} = M − MX⊤(I + XX⊤)^{−1}X, which costs O(nt²) time. Overall, the complexity is (1/ϵ)·nt² multiplied by: i) the number of rounds of alternating in Algorithm 3, and ii) the number of iterations of L-BFGS in steps 4-6. In practice, with warm starts these two numbers are about 10 before the relative change becomes small.
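The Woodbury route for (17) only requires an h×h solve instead of a t×t one; the sketch below (ours) checks it against the direct formula on random data, together with the second Woodbury identity used when n ≪ t.

```python
import numpy as np

rng = np.random.default_rng(0)
t, h, n = 60, 4, 8
mu = 1e-3
Phi = rng.random((h, t))                    # nonnegative latent factor, h << t
X = rng.standard_normal((n, t))
M = rng.random((t, t))                      # nonnegative dual variable

PtP, XtX, It = Phi.T @ Phi, X.T @ X, np.eye(t)
B = (M + PtP @ XtX) @ np.linalg.inv(XtX + It)       # (M + Phi'Phi X'X)(X'X + I)^{-1}

# direct evaluation of (17): a t x t solve
A_direct = np.linalg.inv(PtP + mu * It) @ B

# Woodbury route: mu*A = B - Phi'(mu I + Phi Phi')^{-1} Phi B, only an h x h solve
A_wood = (B - Phi.T @ np.linalg.solve(mu * np.eye(h) + Phi @ Phi.T, Phi @ B)) / mu

# second identity, used when n << t: M (X'X + I)^{-1} = M - M X'(I + XX')^{-1} X
MX = M - M @ X.T @ np.linalg.solve(np.eye(n) + X @ X.T, X)

print(np.allclose(A_direct, A_wood), np.allclose(MX, M @ np.linalg.inv(XtX + It)))  # True True
```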
## 5. Experiments

We evaluated the proposed inductive training of the convexified two-layer model (CVX-IN) by comparing its generalization accuracy with 4 other baselines: FFNN, a two-layer feedforward neural network; Ker-CVX, the kernel-based convex model proposed by Aslan et al. (2014); LOCAL, a model obtained by alternating minimization of the two-layer objective (3); and CVX-TR, our model learned transductively (see below). SVM was not included since it was already shown inferior to Ker-CVX by Aslan et al. (2014).

Inductive learning. A key advantage of our method is the purely inductive setting, which obviates any retraining at test time, as opposed to a transductive setting. After completing the GCG optimization, CVX-IN directly obtains the optimal U and b thanks to the local minimization in Algorithm 3. The optimal W can be recovered by solving (3) with fixed (Φ, U, b), which is a simple convex problem. With this initialization, we finely tuned all parameters by backpropagation.

Transductive learning. As Ker-CVX is transductive, we also considered the following transductive variant of CVX-IN. The objective (11) was first trained with X being the combination of (X_train, X_test), and accordingly the intermediate representation Φ (along with the corresponding T) also consisted of the combination of (Φ_train, Φ_test). Since only Y_train was available for training, the loss function ℓ(U⊤Φ + b1⊤) was applied only to the training data. As a result, Φ_test was learned largely from the matching loss in the latent layer given by (16). After recovering the optimal U and b by local minimization (same as in CVX-IN), test data were labeled by Ŷ_test = U⊤Φ_test + b1⊤. Although CVX-TR bypasses the recovery of W, the optimization has to be redone from scratch when new test data arrive.

Comparison on smaller datasets. To enable comparison with Ker-CVX, which is highly expensive in computation, we first used smaller datasets, including a synthetic XOR dataset and three real-world datasets for binary classification: Letter (Lichman, 2013), CIFAR-SM (a binary classification dataset from (Aslan et al., 2013) based on CIFAR-100 (Krizhevsky & Hinton, 2009)), and G241N (Chapelle). All methods were applied to two different sizes of training and test data (X_train and X_test): 100/100 and 200/200, and the resulting test errors, averaged over 10 trials, are presented in Table 1 and Table 2, respectively.

Table 1. Mean test error for 100 training and 100 test examples.

|         | Letter    | G241N      | XOR        | CIFAR-SM   |
|---------|-----------|------------|------------|------------|
| CVX-IN  | 4.8 ± 0.8 | 24.2 ± 1.6 | 0          | 21.2 ± 1.2 |
| CVX-TR  | 4.9 ± 1.3 | 23.1 ± 0.7 | 0          | 22.4 ± 0.8 |
| FFNN    | 7.9 ± 0.8 | 31.9 ± 0.9 | 0          | 31.0 ± 1.1 |
| LOCAL   | 8.0 ± 1.2 | 34.0 ± 0.9 | 27.0 ± 1.5 | 25.0 ± 0.8 |
| Ker-CVX | 5.7 ± 2.9 | N/A        | 0          | 27.7 ± 5.5 |

Table 2. Mean test error for 200 training and 200 test examples.

|         | Letter     | G241N      | XOR        | CIFAR-SM   |
|---------|------------|------------|------------|------------|
| CVX-IN  | 5.1 ± 1.3  | 21.6 ± 0.9 | 0          | 22.6 ± 1.5 |
| CVX-TR  | 5.3 ± 0.8  | 22.0 ± 0.8 | 0          | 23.4 ± 1.5 |
| FFNN    | 5.5 ± 0.8  | 29.9 ± 0.4 | 0          | 32.9 ± 1.0 |
| LOCAL   | 10.5 ± 0.8 | 33.0 ± 0.6 | 25.0 ± 1.2 | 29.5 ± 0.5 |
| Ker-CVX | 4.5 ± 0.9  | N/A        | 0          | 23.3 ± 3.5 |

CVX-IN outperforms FFNN on G241N, Letter, and CIFAR-SM, and both delivered perfect classification on XOR. This corroborates the advantage of convex models, suggesting that predictive structures are preserved by the relaxation. CVX-IN also marginally outperforms or is comparable to CVX-TR on all the datasets, confirming that inductive learning saves computation at test time without sacrificing accuracy. Consistently poor performance is observed for the LOCAL method (used in a transductive fashion), and it does not work even for XOR. This implies that it does suffer seriously from local optimality. Ker-CVX (transductive only) performs competitively on 200 examples, especially on the Letter dataset, but its error on 100 examples is significantly
higher than that of CVX-IN and CVX-TR. It ran into computational issues on G241N, hence the N/A entries. On the CIFAR-SM dataset all methods produced a slightly higher error with 200 training examples than with 100, probably due to the small size of the training set and high variance. However, the comparative results between algorithms remain similar to the other datasets.

Comparison on larger datasets. Thanks to the fast local optimization enabled by the new min-min alternating (§4.2), our model enjoys a significant speedup compared with Aslan et al. (2013; 2014). To demonstrate this, we applied CVX-IN to Letter, XOR, and CIFAR-10 (Krizhevsky & Hinton, 2009) with 1000/1000 and 2000/2000 train/test examples, and to G241N with 1000/500 examples (the entire dataset only has 1500 examples). Details on data preprocessing are available in Appendix C. As Tables 3 and 4 show, CVX-IN again achieves significantly lower test error on these larger datasets than FFNN, CVX-TR, and LOCAL. The training time of CVX-IN is summarized in Table 5; it took 2.5 hours on CIFAR-10 with 2000 examples and 256 features. Although still expensive, it is substantially faster than Ker-CVX, which is completely incapable of scaling here (hence omitted). In contrast, the run time of FFNN and LOCAL is much lower; these are shown for comparison in Appendix D. Overall, CVX-IN scales quadratically in the number of examples t, which is consistent with our analysis in §4.2.

Table 3. Mean test error for 1000 training and 1000 test examples.

|        | Letter    | G241N      | XOR        | CIFAR-10   |
|--------|-----------|------------|------------|------------|
| CVX-IN | 2.7 ± 0.8 | 13.0 ± 0.8 | 0          | 27.6 ± 1.4 |
| CVX-TR | 2.7 ± 0.9 | 15.1 ± 0.9 | 0          | 27.9 ± 2.3 |
| FFNN   | 3.5 ± 0.7 | 24.5 ± 1.0 | 0          | 30.4 ± 0.9 |
| LOCAL  | 5.8 ± 0.7 | 21.4 ± 1.1 | 26.7 ± 0.5 | 32.3 ± 0.7 |

Table 4. Mean test error for 2000 training and 2000 test examples.

|        | Letter    | XOR        | CIFAR-10   |
|--------|-----------|------------|------------|
| CVX-IN | 1.0 ± 0.5 | 0          | 26.8 ± 1.6 |
| CVX-TR | 1.2 ± 0.7 | 0          | 27.0 ± 1.9 |
| FFNN   | 1.7 ± 0.3 | 0          | 30.0 ± 1.8 |
| LOCAL  | 2.3 ± 0.4 | 27.2 ± 0.3 | 33.0 ± 1.5 |

Table 5. Training times (in minutes) for CVX-IN on 100, 200, 1000, and 2000 training examples.

|          | 100  | 200 | 1000 | 2000  |
|----------|------|-----|------|-------|
| Letter   | 0.45 | 1.1 | 17.1 | 90.6  |
| G241N    | 0.68 | 1.5 | 27.3 | N/A   |
| XOR      | 0.45 | 1.0 | 42.0 | 144.2 |
| CIFAR-10 | 0.63 | 1.5 | 50.6 | 153.6 |

Figure 2. BOX and XOR datasets (subplots a and b) and their intermediate representations (subplots c and d). The representations were reduced to 2-D using standard PCA.

Intermediate representation. One of the key merits of our two-layer model is that the relaxation retains the necessary structure in the input data to make accurate predictions. To test this feature, we visualized the latent representation learned by CVX-IN. Figure 2 shows the original features of the input data X_train and the learned intermediate representation Φ_train for two datasets, BOX and XOR, which both exhibit a rich latent structure. Clearly, the convex relaxation was able to separate the two classes and preserve sufficient structure to outperform single-layer models.

## 6. Conclusions and Future Work

We developed a convex relaxation for parametric transfer functions such as ReLU based on the matching loss. An efficient optimization method was designed with a constant approximation bound. For future work we will explore other transfer functions and their influence. To the best of our knowledge, no nontrivial recovery properties are known for nonlinear CP or SDP relaxations. Although our empirical results demonstrate compelling promise, it will be interesting to rigorously establish theoretical guarantees.
## References

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.

Aslan, O., Cheng, H., Zhang, X., and Schuurmans, D. Convex two-layer modeling. In Neural Information Processing Systems, 2013.

Aslan, O., Zhang, X., and Schuurmans, D. Convex deep learning via normalized kernels. In Neural Information Processing Systems, 2014.

Auer, P., Herbster, M., and Warmuth, M. K. Exponentially many local minima for single neurons. Technical Report UCSC-CRL-96-1, Univ. of Calif. Computer Research Lab, Santa Cruz, CA, 1996. In preparation.

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., and Marcotte, P. Convex neural networks. In Neural Information Processing Systems, 2005.

Berman, A. and Shaked-Monderer, N. Completely Positive Matrices. World Scientific, 2003.

Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In Proc. Intl. Conf. Machine Learning, 2017.

Carreira-Perpinnan, M. and Wang, W. Distributed optimization of deeply nested systems. In Proc. Intl. Conference on Artificial Intelligence and Statistics, 2014.

Chapelle, O. http://olivier.chapelle.cc/sslbook/benchmarks.html.

Cheng, H., Yu, Y., Zhang, X., Xing, E., and Schuurmans, D. Scalable and sound low-rank tensor learning. In Proc. Intl. Conference on Artificial Intelligence and Statistics, 2016.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In Proc. Intl. Conference on Artificial Intelligence and Statistics, 2014.

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Neural Information Processing Systems, 2014.

Dickinson, P. J. C. and Gijben, L. On the computational complexity of membership problems for the completely positive cone and its dual. Computational Optimization and Applications, 57(2):403–415, 2014.

Ekeland, I. and Temam, R. Convex Analysis and Variational Problems. SIAM, 1999.

Elad, M. and Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

Fogel, F., Jenatton, R., Bach, F., and d'Aspremont, A. Convex relaxations for permutation problems. SIAM Journal on Matrix Analysis and Applications, 36(4):1465–1488, 2015.

Freund, R. and Grigas, P. New analysis and results for the Frank-Wolfe method. Mathematical Programming, 155(1):199–230, 2016.

Gens, R. and Domingos, P. Discriminative learning of sum-product networks. In Neural Information Processing Systems, 2012.

Harchaoui, Z., Juditsky, A., and Nemirovski, A. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152:75–112, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. Intl. Conf. Machine Learning, 2013.

Kawaguchi, K. Deep learning without poor local minima. In Neural Information Processing Systems, 2016.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Neural Information Processing Systems, 2015.

Lawrence, N. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Livni, R., Shalev-Shwartz, S., and Shamir, O. An algorithm for training polynomial networks. arXiv:1304.7045v2, 2014.

Murty, K. G. and Kabadi, S. N. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

Nemirovski, A., Roos, C., and Terlaky, T. On maximization of quadratic form over intersection of ellipsoids with common center. Mathematical Programming, Ser. A, 86:463–473, 1999.

Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In Proc. Intl. Conf. Machine Learning, 2017a.

Nguyen, Q. and Hein, M. The loss surface and expressivity of deep convolutional neural networks. arXiv:1710.10928, 2017b.

Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G., Dean, J., Le, Q. V., and Ng, A. Y. Building high-level features using large scale unsupervised learning. In Proc. Intl. Conf. Machine Learning, 2012.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In Proc. Intl. Conf. Machine Learning, 2011.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. In Proc. Intl. Conf. Machine Learning, 2017.

Steinberg, D. Computation of matrix norms with applications to robust optimization. PhD thesis, Faculty of Industrial Engineering and Management, Technion, 2005.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.

Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In Proc. Intl. Conf. Machine Learning, 2017.

Tishby, N., Pereira, F., and Bialek, W. The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control and Computing, 1999.

UCI. University of California Irvine: Machine Learning Repository, 1990.

Woodbury, M. A. Inverting modified matrices. Technical Report MR38136, Memorandum Rept. 42, Statistical Research Group, Princeton University, Princeton, NJ, 1950.

Zhang, X., Yu, Y., and Schuurmans, D. Accelerated training for matrix-norm regularization: A boosting approach. In Neural Information Processing Systems, 2012.

Zhang, Y., Lee, J., and Jordan, M. L1-regularized neural networks are improperly learnable in polynomial time. In Proc. Intl. Conf. Machine Learning, 2016.
Zhang, Y., Liang, P., and Wainwright, M. Convexified convolutional neural networks. In Proc. Intl. Conf. Machine Learning, 2017.

Zhong, K., Song, Z., Jain, P., Bartlett, P., and Dhillon, I. Recovery guarantees for one-hidden-layer neural networks. In Proc. Intl. Conf. Machine Learning, 2017.