# Convex Two-Layer Modeling with Latent Structure

Vignesh Ganapathiraman, Xinhua Zhang, Yaoliang Yu, Junfeng Wen

University of Illinois at Chicago, Chicago, IL, USA; University of Waterloo, Waterloo, ON, Canada; University of Alberta, Edmonton, AB, Canada

{vganap2, zhangx}@uic.edu, yaoliang.yu@uwaterloo.ca, junfengwen@gmail.com

## Abstract

Unsupervised learning of structured predictors has been a long-standing pursuit in machine learning. Recently a conditional random field auto-encoder has been proposed in a two-layer setting, allowing latent structured representations to be automatically inferred. Aside from being nonconvex, it also requires the demanding inference of normalization. In this paper, we develop a convex relaxation of a two-layer conditional model which captures latent structure and estimates model parameters, jointly and optimally. We further expand its applicability by resorting to a weaker form of inference: maximum a-posteriori (MAP). The flexibility of the model is demonstrated on two structures based on total unimodularity: graph matching and linear chains. Experimental results confirm the promise of the method.

## 1 Introduction

Over the past decade deep learning has achieved significant advances in many application areas [1]. By automating the acquisition of latent descriptive and predictive representations, these methods provide highly effective models for capturing the relationships between observed variables. Recently more refined deep models have been proposed for structured output prediction, where the several random variables to be predicted are statistically correlated [2-4]. Improved performance has been achieved in applications such as image recognition and segmentation [5] and natural language parsing [6], amongst others.

So far, most deep models for structured output are designed for supervised learning, where structured labels are available. Recently an extension has been made to unsupervised learning: [7] proposed a conditional random field auto-encoder (CRF-AE), a two-layer conditional model where, given the observed data $x$, the latent structure $y$ is first generated based on $p(y|x)$ and then used to reconstruct the observations via $p(x|y)$. The motivation is to find latent structure in the data that is predictive and discriminative (rather than common but irrelevant). Along similar lines, several other discriminative unsupervised learning methods are also available [8-11].

Extending auto-encoders $X \to Y \to X$ to general two-layer models $X \to Y \to Z$ is not hard. [12, 13] addressed transliteration between two languages, where $Z$ is the observed binary label indicating whether two words match, and higher accuracy can be achieved if we faithfully recover a letter-wise matching represented by the unobserved structure $Y$. In essence, their model optimizes $p(z \,|\, \arg\max_y p(y|x))$, uncovering the latent $y$ via its mode under the first-layer model. This is known as bi-level optimization because the arg max of an inner optimization is used. A soft variant adopts the mean of $y$ [14]. In general, conditional models yield more accurate predictions than generative models $X \leftarrow Y \to Z$ (e.g. multi-wing harmoniums/RBMs), unless the latter are trained in a discriminative fashion [15].

In computation, all methods require certain forms of tractability in inference. CRF-AE leverages marginal inference on $p(y|x)p(x|y)$ (over $y$) for EM. Contrastive divergence, instead, samples from $p(y|x)$ [11].
For some structures like graph matching, neither of these is tractable [16, 17] (unless a first-order Markovian assumption is made). In single-layer models, this challenge has been resolved by max-margin estimation, which relies only on the MAP of $p(y|x)$ [18]. This oracle is much less demanding than sampling or normalization, as finding the most likely state can be much easier than summing over all possible $y$. For example, MAP for graph matching can be solved by max-flow.

Unfortunately, a direct extension of max-margin estimation to two-layer modeling meets with immediate obstacles, because here one has to solve $\max_y p(y|x)p(z|y)$. In general, $p(z|y)$ depends on $y$ in a highly nonlinear form, making this augmented MAP inference intractable. This seems to leave the aforementioned bi-level optimization as the only option that depends solely on MAP. However, solving this optimization poses a substantial challenge when $y$ is discrete, because the mode of $p(y|x)$ is almost always invariant to small perturbations of the model parameters (i.e. the gradient is zero). In this paper we demonstrate that this optimization can be relaxed into a convex formulation while still preserving sufficient regularities to recover a non-trivial, nonlinear predictive model that supports structured latent representations.

Recently a growing body of research has investigated globally trainable deep models, but these approaches remain limited. [19] formulated convex conditional models using layer-wise kernels connected through nonlinear losses; however, these losses are data dependent, necessitating a transductive setting to retain the context. [20] used boosting, but the underlying oracle is generally intractable. Specific global methods were also proposed for polynomial networks [21] and sum-product networks [22]. None of these methods accommodate structure in the latent layers.

Our convex formulation is achieved by enforcing the first-order optimality conditions of the inner-level optimization via sublinear constraints. Using a semi-definite relaxation, we arrive at the first two-layer model that allows latent structure to be inferred concurrently with model optimization while still admitting globally optimal solutions (§3). To the best of our knowledge, this is the first algorithm in machine learning that directly constructs a convex relaxation for a bi-level optimization based on the inner optimality conditions. Unlike [19], it results in a truly inductive model, and its flexibility is demonstrated with two example structures in the framework of total unimodularity (§4). The only inference required is MAP on $p(y|x)$, and the overall scalability is further improved by a refined optimization algorithm (§5). Experimental results demonstrate its useful potential in practice.

## 2 Preliminaries and Background

We consider a two-layer latent conditional model $X \to Y \to Z$, where $X$ is the input, $Z$ is the output, and $Y$ is a latent layer composed of $h$ random variables $\{Y_i\}_{i=1}^h$. Instead of assuming no interdependency between the $Y_i$ as in [19], our major goal here is to model the structure in the latent layer $Y$. Specifically, we assume a conditional model for the first layer based on an exponential family:

$$p(y|x) = q_0(y)\exp\big(-y^\top Ux - \Omega(Ux)\big), \quad \text{where } q_0(y) = [\![y \in \mathcal{Y}]\!]. \tag{1}$$

Here $U$ is the first-layer weight matrix and $\Omega$ is the log-partition function. $q_0(y)$ is the base measure, with $[\![\cdot]\!] = 1$ if its argument is true and $0$ otherwise.
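As a concrete illustration of (1), the following minimal sketch (not taken from the paper; all names are ours) enumerates a tiny support set and shows that MAP inference needs only the linear scores $y^\top Ux$, never the log-partition $\Omega$:

```python
import numpy as np

# Toy illustration of the first-layer model (1): p(y|x) ∝ [[y in Y]] exp(-y^T U x).
# MAP inference only needs argmin_{y in Y} y^T U x; the normalizer Omega is never evaluated.
rng = np.random.default_rng(0)
h, n = 4, 6                              # latent dimension, input dimension
U = rng.normal(size=(h, n))              # first-layer weights
x = rng.normal(size=n)

Y = [np.eye(h)[i] for i in range(h)]     # support set small enough to enumerate (multiclass case)
scores = np.array([y @ U @ x for y in Y])
y_map = Y[int(np.argmin(scores))]        # mode of p(y|x) under the sign convention of (1)
print("MAP latent state:", y_map)
```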
The correlation among the $Y_i$ is instilled by the support set $\mathcal{Y}$, which plays a central role here. For example, when $\mathcal{Y}$ consists of all $h$-dimensional canonical vectors, $p(y|x)$ recovers the multiclass logistic model. In general, to achieve a tradeoff between computational efficiency and representational flexibility, we make the following assumption on $\mathcal{Y}$:

**Assumption 1 (PO-tractable).** We assume $\mathcal{Y}$ is bounded and admits an efficient polar operator. That is, for any vector $d \in \mathbb{R}^h$, $\min_{y\in\mathcal{Y}} d^\top y$ is efficiently solvable.

Note that the support set $\mathcal{Y}$ (hence the base measure $q_0$) is fixed and does not contain any additional parameters. PO-tractability is available in a variety of applications, and we give two examples here.

**Graph matching.** In a bipartite graph with two sets of vertices $\{a_i\}_{i=1}^n$ and $\{b_j\}_{j=1}^n$, each edge between $a_i$ and $b_j$ has a weight $T_{ij}$. The task is to find a one-to-one mapping (this can be extended) between $\{a_i\}$ and $\{b_j\}$ such that the sum of the weights on the selected edges is maximized. Denote the matching by $Y \in \{0,1\}^{n\times n}$, where $Y_{ij}=1$ iff the edge $(a_i,b_j)$ is selected. The optimal matching is then the mode of $p(Y) \propto [\![Y \in \mathcal{Y}]\!]\exp(\operatorname{tr}(Y^\top T))$, where the support is $\mathcal{Y} = \{Y \in \{0,1\}^{n\times n} : Y\mathbf{1} = Y^\top\mathbf{1} = \mathbf{1}\}$.

**Graphical models.** For simplicity, consider a linear chain model $V_1 - V_2 - \cdots - V_p$. Each $V_i$ can take one of $C$ possible values, which we encode using the $C$-dimensional canonical basis $v_i$. Suppose there is a node potential $m_i \in \mathbb{R}^C$ for each $V_i$, and each edge $(V_i, V_{i+1})$ has an edge potential $M_i \in \mathbb{R}^{C\times C}$. We could then directly define a distribution on $\{V_i\}$. Unfortunately, it would involve quadratic terms such as $v_i^\top M_i v_{i+1}$, and so a different parameterization is in order. Let $Y_i \in \{0,1\}^{C\times C}$ encode the values of $(V_i, V_{i+1})$ via the row and column indices of $Y_i$, respectively. Then the distribution on $\{V_i\}$ can be equivalently represented by a distribution on $\{Y_i\}$:

$$p(\{Y_i\}) \propto [\![\{Y_i\} \in \mathcal{Y}]\!]\exp\Big(\sum_{i=1}^p m_i^\top Y_i\mathbf{1} + \sum_{i=1}^{p-1}\operatorname{tr}(M_i^\top Y_i)\Big), \tag{2}$$

$$\text{where } \mathcal{Y} = \big\{\{Y_i\} : Y_i \in \{0,1\}^{C\times C}\big\} \cap \mathcal{H}, \quad \text{with } \mathcal{H} := \big\{\{Y_i\} : \mathbf{1}^\top Y_i\mathbf{1} = 1,\ Y_i^\top\mathbf{1} = Y_{i+1}\mathbf{1}\big\}. \tag{3}$$

The constraints in $\mathcal{H}$ encode the obvious consistency between overlapping edges. This model ultimately falls into our framework (1).

In both examples, the constraints in $\mathcal{Y}$ are totally unimodular (TUM), and therefore the polar operator can be computed by solving a linear program (LP), with the $\{0,1\}$ constraints relaxed to $[0,1]$. In §4.1 and §4.2 we will generalize $y^\top Ux$ to $y^\top d(Ux)$, where $d$ is an affine function of $Ux$ that allows for homogeneity in temporal models. For clarity, we first develop the general framework using $y^\top Ux$.
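To make Assumption 1 concrete, here is a sketch (our own, not from the paper; function names are hypothetical) of the polar operator for the two example structures: an assignment solver for the matching polytope and Viterbi-style dynamic programming for the linear chain.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Polar operator for graph matching: min_{Y in Y} <D, Y> over one-to-one matchings.
# By total unimodularity the LP relaxation is integral; an assignment solver gives
# the same answer directly.
def matching_polar(D):
    rows, cols = linear_sum_assignment(D)        # minimizes the sum of selected costs
    Y = np.zeros_like(D)
    Y[rows, cols] = 1.0
    return Y

# Polar operator for the linear chain: minimize sum_i node_cost[i, s_i]
# + sum_i edge_cost[i, s_i, s_{i+1}] by dynamic programming (Viterbi).
def chain_polar(node_cost, edge_cost):
    p, C = node_cost.shape                       # edge_cost has shape (p-1, C, C)
    dp = node_cost[0].copy()                     # best cost of a prefix ending in each state
    back = np.zeros((p, C), dtype=int)
    for i in range(1, p):
        trans = dp[:, None] + edge_cost[i - 1] + node_cost[i][None, :]
        back[i] = np.argmin(trans, axis=0)
        dp = trans.min(axis=0)
    states = [int(np.argmin(dp))]
    for i in range(p - 1, 0, -1):
        states.append(int(back[i, states[-1]]))
    return states[::-1]                          # minimizing state sequence
```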
**Output layer.** As for the output layer, we assume a conditional model from an exponential family:

$$p(z|y) = \exp\big(z^\top R^\top y - G(R^\top y)\big)\,q_1(z) = \exp\big(-D_{G^*}\!\big(z \,\|\, \nabla G(R^\top y)\big) + G^*(z)\big)\,q_1(z), \tag{4}$$

where $G$ is a smooth and strictly convex function, and $D_{G^*}$ is the Bregman divergence induced by the Fenchel dual $G^*$. Such a parameterization is justified by the equivalence between regular Bregman divergences and regular exponential families [23]. Thanks to the convexity of $G$, it is trivial to extend $p(z|y)$ to $y \in \operatorname{conv}\mathcal{Y}$ (the convex hull of $\mathcal{Y}$), and $G(R^\top y)$ will still be convex over $\operatorname{conv}\mathcal{Y}$ (for fixed $R$).

Finally, we highlight the assumptions we make and do not make. We only assume PO-tractability of $\mathcal{Y}$, hence tractability of MAP inference in $p(y|x)$. We do not assume it is tractable to compute the normalizer $\Omega$ or its gradient (the marginal distributions). We also do not assume that unbiased samples of $y$ can be drawn efficiently from $p(y|x)$. In general, PO-tractability is a weaker assumption. For example, in graph matching MAP inference is tractable while marginalization is NP-hard [16] and sampling requires MCMC [24]. Finally, we do not assume tractability of any sort for $p(y|x)p(z|y)$ (in $y$), and so it may be hard to solve $\min_{y\in\mathcal{Y}}\{d^\top y + G(R^\top y) - z^\top R^\top y\}$, as $G$ is generally not affine.

### 2.1 Training principles

At training time, we are provided with a set of feature-label pairs $(x,z) \sim \tilde p$, where $\tilde p$ is the empirical distribution. In the special case of an auto-encoder, $z$ is tied with $x$. The "bootstrapping"-style estimation [25] optimizes the joint likelihood with the latent $y$ imputed in an optimistic fashion:

$$\min_{U,R}\ \mathbb{E}_{(x,z)\sim\tilde p}\ \min_{y\in\mathcal{Y}}\ -\log p(y|x)p(z|y) \;=\; \min_{U,R}\ \mathbb{E}_{(x,z)\sim\tilde p}\ \min_{y\in\mathcal{Y}}\ \big\{y^\top Ux + \Omega(Ux) - z^\top R^\top y + G(R^\top y)\big\}.$$

This results in a hard EM estimation; a soft version can be obtained by adding entropic regularizers on $y$. Regularization can be imposed on $U$ and $R$, which we will make explicit later (e.g. bounding the $L_2$ norm). Since the log-partition function $\Omega$ in $p(y|x)$ is hard to compute, the max-margin approach replaces $\Omega(Ux)$ by $\max_{\hat y\in\mathcal{Y}} -\hat y^\top Ux$, leading to the surrogate loss

$$\min_{U,R}\ \mathbb{E}_{(x,z)\sim\tilde p}\ \min_{y\in\mathcal{Y}}\ \Big\{-z^\top R^\top y + G(R^\top y) + y^\top Ux - \min_{\hat y\in\mathcal{Y}}\hat y^\top Ux\Big\}. \tag{5}$$

However, the key disadvantage of this method is the augmented inference on $y$: we have only assumed the tractability of $\min_{y\in\mathcal{Y}} d^\top y$ for all $d$, not of $\min_{y\in\mathcal{Y}}\{y^\top d + G(R^\top y) - z^\top R^\top y\}$. In addition, this principle intrinsically determines the latent $y$ as a function of both the input and the output, while at test time the output itself is unknown and is the subject of prediction. Common practice therefore requires a joint optimization over $y$ and $z$ at test time, which is computationally costly.

The goal of this paper is to design a convex formulation in which the latent $y$ is completely determined by the input $x$, and both prediction and estimation rely only on the polar operator $\arg\min_{y\in\mathcal{Y}} y^\top Ux$. As a consequence of this goal, it is natural to postulate that the $y$ found this way renders an accurate prediction of $z$, or a faithful recovery of $x$ in auto-encoders. This idea, which has been employed by [e.g., 9, 26], leads to the following bi-level optimization problem:

$$\max_{U,R}\ \mathbb{E}_{(x,z)\sim\tilde p}\ \log p\Big(z \,\Big|\, \arg\max_{y\in\mathcal{Y}} p(y|x)\Big) \;=\; \max_{U,R}\ \mathbb{E}_{(x,z)\sim\tilde p}\ \log p\Big(z \,\Big|\, \arg\min_{y\in\mathcal{Y}} y^\top Ux\Big) \tag{6}$$

$$=\; \min_{U,R}\ \mathbb{E}_{(x,z)\sim\tilde p}\ \big[-z^\top R^\top y^*_x + G(R^\top y^*_x)\big], \quad \text{where } y^*_x = \arg\min_{y\in\mathcal{Y}} y^\top Ux. \tag{7}$$

Directly solving this optimization problem is challenging, because the optimal $y^*_x$ is almost surely invariant to small perturbations of $U$ (e.g. when $\mathcal{Y}$ is discrete). So a zero-valued gradient is witnessed almost everywhere, and a more carefully designed optimization algorithm is needed.
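A minimal sketch (ours; it assumes a Gaussian output layer $G(a)=\tfrac12\|a\|^2$ and a polar oracle passed in as a function) of evaluating the inner objective of (7) illustrates the difficulty just described: the loss is piecewise constant in $U$.

```python
import numpy as np

# Bi-level loss (7) with a Gaussian output layer, so that
# -z^T R^T y + G(R^T y) = 0.5*||R^T y - z||^2 - 0.5*||z||^2 (constant in the parameters).
def bilevel_loss(U, R, x, z, polar):
    y_star = polar(U @ x)                 # y*_x = argmin_{y in Y} y^T (U x) via the polar oracle
    return 0.5 * np.sum((R.T @ y_star - z) ** 2)

def simplex_polar(d):
    # Polar operator when Y is the set of canonical vectors (S = probability simplex).
    e = np.zeros_like(d)
    e[np.argmin(d)] = 1.0
    return e

# Because y_star is piecewise constant in U, finite differences of bilevel_loss with
# respect to U are zero almost everywhere -- the issue addressed in Section 3.
```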
## 3 A General Framework of Convexification

We propose addressing this bi-level optimization by a convex relaxation built upon the first-order optimality conditions of the inner-level optimization. First notice that the set $\mathcal{Y}$ participates in problem (7) only via the polar operator at $Ux$: $\arg\min_{y\in\mathcal{Y}} y^\top Ux$. If $\mathcal{Y}$ is discrete, this problem is equivalent to optimizing over $\mathcal{S} := \operatorname{conv}\mathcal{Y}$, because a linear function on a convex set always attains its optimum at an extreme point. Clearly, $\mathcal{S}$ is convex, bounded, closed, and PO-tractable. It is important to note that the origin is not necessarily contained in $\mathcal{S}$. To remove the potential non-uniqueness of the minimizer in (7), we next add a small proximal term to the polar operator problem ($\sigma$ is a small positive number):

$$\min_{w\in\mathcal{S}}\ w^\top Ux + \frac{\sigma}{2}\|w\|^2. \tag{8}$$

This leads to a small change in the problem and ensures that the minimizer is unique.¹ Adding strongly convex terms to the primal and dual objectives is a commonly used technique for accelerated optimization [27], and has been used in graphical model inference [e.g., 28]. We intentionally changed the symbol $y$ into $w$, because here the optimal $w$ is not necessarily in $\mathcal{Y}$. By the convexity of problem (8), and noting that the gradient of its objective is $Ux + \sigma w$, $w$ is optimal if and only if

$$w \in \mathcal{S}, \quad\text{and}\quad (Ux + \sigma w)^\top(\hat\theta - w) \ge 0 \quad \forall\,\hat\theta \in \mathcal{S}. \tag{9}$$

¹ If $p(y|x) \propto p_0(y)\exp(-y^\top Ux - \frac{\sigma}{2}\|y\|^2)$ (for any $\sigma > 0$), then there is no need to add this $\frac{\sigma}{2}\|w\|^2$ term, and all our subsequent developments apply directly. Therefore our approach applies to the broader setting where $L_2$ projection onto $\mathcal{S}$ is tractable, but here we focus on PO-tractability for clarity of presentation.

These optimality conditions can be plugged into the bi-level optimization problem (7). Introducing Lagrange multipliers $(\gamma, \hat\theta)$ to enforce the latter condition via a mini-max formulation, we obtain

$$\min_{\|U\|\le 1}\ \min_{\|R\|\le 1}\ \mathbb{E}_{(x,z)\sim\tilde p}\Big[\min_{w}\ \max_{\gamma\ge 0,\,\hat\theta\in\mathcal{S}}\ \max_{v}\ -z^\top R^\top w + v^\top R^\top w - G^*(v) + \iota_{\mathcal{S}}(w) + \gamma(Ux+\sigma w)^\top(w-\hat\theta)\Big], \tag{10-11}$$

where $\iota_{\mathcal{S}}$ is the $\{0,\infty\}$-valued indicator function of the set $\mathcal{S}$. Here we dualized $G$ via $G(R^\top w) = \max_v v^\top R^\top w - G^*(v)$, and made explicit the Frobenius norm constraints ($\le 1$) on $U$ and $R$.²

² To simplify the presentation we bound the radius by 1, while in practice it is a hyperparameter to be tuned.

Applying the change of variable $\theta = \gamma\hat\theta$, the constraints $\gamma \ge 0$ and $\hat\theta \in \mathcal{S}$ (a convex set) become $(\theta,\gamma) \in \mathcal{N} := \operatorname{cone}\{(\hat\theta, 1) : \hat\theta \in \mathcal{S}\}$, where $\operatorname{cone}$ stands for the conic hull (convex). Similarly we can dualize $\iota_{\mathcal{S}}(w) = \max_\pi \pi^\top w - \sigma_{\mathcal{S}}(\pi)$, where $\sigma_{\mathcal{S}}(\pi) := \max_{w\in\mathcal{S}} \pi^\top w$ is the support function of $\mathcal{S}$. Swapping $\min_w$ with all the subsequent max operations (strong duality), we arrive at a form in which $w$ can be minimized out analytically:

$$\min_{\|U\|\le1}\ \min_{\|R\|\le1}\ \mathbb{E}_{(x,z)\sim\tilde p}\Big[\max_{\pi}\ \max_{(\theta,\gamma)\in\mathcal{N}}\ \max_{v}\ \min_{w}\ -z^\top R^\top w + v^\top R^\top w - G^*(v) + \pi^\top w - \sigma_{\mathcal{S}}(\pi) + (Ux+\sigma w)^\top(\gamma w - \theta)\Big] \tag{12-13}$$

$$=\ \min_{\|U\|\le1}\ \min_{\|R\|\le1}\ \mathbb{E}_{(x,z)\sim\tilde p}\Big[\max_{\pi}\ \max_{(\theta,\gamma)\in\mathcal{N}}\ \max_{v}\ -G^*(v) - \sigma_{\mathcal{S}}(\pi) - \theta^\top Ux - \frac{1}{4\sigma\gamma}\big\|R(v-z)+\gamma Ux+\pi-\sigma\theta\big\|^2\Big]. \tag{14-15}$$

Given $(U, R)$, the optimal $(v, \pi, \theta, \gamma)$ can be solved efficiently through a concave maximization. However, the overall objective is not convex in $(U, R)$ because the quadratic term in (15) is subtracted. Fortunately, it is not hard to tackle this issue via a semi-definite programming (SDP) relaxation that linearizes the quadratic terms. In particular, let $I$ be the identity matrix and define

$$M := M(U,R) := (I,\,U,\,R)^\top(I,\,U,\,R) = \begin{pmatrix} I & U & R\\ U^\top & U^\top U & U^\top R\\ R^\top & R^\top U & R^\top R\end{pmatrix} =: \begin{pmatrix} M_1 & M_u & M_r\\ M_u^\top & M_{u,u} & M_{r,u}^\top\\ M_r^\top & M_{r,u} & M_{r,r}\end{pmatrix}. \tag{16}$$

Then $\theta^\top Ux$ can be replaced by $\theta^\top M_u x$, and the quadratic term in (15) can be expanded as

$$f(M,\pi,\theta,\gamma,v;x,z) := \operatorname{tr}\!\big(M_{r,r}(v-z)(v-z)^\top\big) + \gamma^2\operatorname{tr}\!\big(M_{u,u}xx^\top\big) + 2\gamma\operatorname{tr}\!\big(M_{r,u}x(v-z)^\top\big) + 2(\pi-\sigma\theta)^\top\big(M_r(v-z)+\gamma M_u x\big) + \|\pi-\sigma\theta\|^2. \tag{17}$$

Since, given $(\pi, \theta, \gamma, v)$, the objective becomes linear in $M$, the overall objective after maximizing out these variables is convex in $M$. Although this change of variable makes the objective convex, it shifts the intractability into the feasible region of $M$:

$$\mathcal{M}_0 := \underbrace{\{M \succeq 0 : M_1 = I,\ \operatorname{tr}(M_{u,u}) \le 1,\ \operatorname{tr}(M_{r,r}) \le 1\}}_{=:\,\mathcal{M}_1}\ \cap\ \{M : \operatorname{rank}(M) = h\}. \tag{18}$$

Here $M \succeq 0$ means $M$ is real symmetric and positive semi-definite. Due to the rank constraint, $\mathcal{M}_0$ is not convex. So a natural relaxation (the only relaxation we introduce besides the proximal term in (8)) is to drop this rank constraint and optimize over the resulting convex set $\mathcal{M}_1$.
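The following numerical sketch (ours, with arbitrary dimensions) builds $M(U,R)$ as in (16) and checks that the linearized expression $f$ in (17) coincides with the quadratic term of (15) when $M$ has this form.

```python
import numpy as np

rng = np.random.default_rng(1)
h, n, m = 3, 5, 4
sigma, gamma = 0.1, 0.7
U, R = rng.normal(size=(h, n)), rng.normal(size=(h, m))
x, z, v = rng.normal(size=n), rng.normal(size=m), rng.normal(size=m)
pi, theta = rng.normal(size=h), rng.normal(size=h)

A = np.hstack([np.eye(h), U, R])            # (I, U, R), h rows
M = A.T @ A                                 # rank-h Gram matrix with M_1 = I
Mu, Mr = U, R                               # blocks of the first block row
Muu = M[h:h+n, h:h+n]                       # = U^T U
Mrr = M[h+n:, h+n:]                         # = R^T R
Mru = M[h+n:, h:h+n]                        # = R^T U

f = (np.trace(Mrr @ np.outer(v - z, v - z))
     + gamma**2 * np.trace(Muu @ np.outer(x, x))
     + 2 * gamma * np.trace(Mru @ np.outer(x, v - z))
     + 2 * (pi - sigma * theta) @ (Mr @ (v - z) + gamma * Mu @ x)
     + np.sum((pi - sigma * theta)**2))
quad = np.sum((R @ (v - z) + gamma * U @ x + pi - sigma * theta)**2)
assert np.allclose(f, quad)                 # (17) linearizes the quadratic term of (15)
```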
Dropping the rank constraint leads to the final convex formulation:

$$\min_{M\in\mathcal{M}_1}\ \mathbb{E}_{(x,z)\sim\tilde p}\Big[\max_{\pi}\ \max_{(\theta,\gamma)\in\mathcal{N}}\ \max_{v}\ -G^*(v) - \sigma_{\mathcal{S}}(\pi) - \theta^\top M_u x - \frac{1}{4\sigma\gamma}f(M,\pi,\theta,\gamma,v;x,z)\Big]. \tag{19}$$

To summarize, we have achieved a convex model for two-layer conditional models in which the latent structured representation is determined by a polar operator. Instead of bypassing the bi-level optimization via the usual loss-based approach [e.g., 19, 29], we addressed it directly by leveraging the optimality conditions of the inner optimization. A convex relaxation is then achieved via SDP.

### 3.1 Inducing low-rank solutions of the relaxation

Although it is generally hard to provide theoretical guarantees for nonlinear SDP relaxations, it is interesting to note that the constraint set $\mathcal{M}_1$ effectively encourages low-rank solutions (hence tighter relaxations). As a key technical result, we next show that all extreme points of $\mathcal{M}_1$ have rank $h$ (the number of hidden nodes) for all $h \ge 2$. Recall that in sparse coding, the atomic norm framework [30] induces low-complexity solutions by setting up the optimization over the convex hull of atoms, or by penalizing via its gauge function. The characterization of the extreme points of $\mathcal{M}_1$ might therefore open up the possibility of analyzing our relaxation by leveraging results from sparse coding.

**Lemma 1.** Let $A_i$ be symmetric matrices. Consider the set

$$\mathcal{R} := \{X : X \succeq 0,\ \operatorname{tr}(A_i X) \lessgtr_i b_i,\ i = 1,\dots,m\}, \tag{20}$$

where $m$ is the number of linear (in)equality constraints and $\lessgtr_i$ may be any one of $\le$, $=$, or $\ge$. Then the rank $r$ of every extreme point of $\mathcal{R}$ is upper bounded by

$$r \le \Big\lfloor \tfrac{1}{2}\big(\sqrt{8m+1} - 1\big)\Big\rfloor. \tag{21}$$

This result extends [31] by accommodating inequalities in (20); its proof is given in Appendix A. We now show that the feasible region $\mathcal{M}_1$ defined by (18) has all extreme points of rank $h$.

**Theorem 1.** If $h \ge 2$, then all extreme points of $\mathcal{M}_1$ have rank $h$, and $\mathcal{M}_1$ is the convex hull of $\mathcal{M}_0$.

*Proof.* Let $M$ be an extreme point of $\mathcal{M}_1$. Noting that $M \succeq 0$ already encodes the symmetry of $M$, the linear constraints for $\mathcal{M}_1$ in (18) can be written as $\frac{1}{2}h(h+1)$ linear equality constraints and two linear inequality constraints, so in total $m = \frac{1}{2}h(h+1) + 2$. Plugging this into (21) of the above lemma,

$$\Big\lfloor \tfrac{1}{2}\big(\sqrt{8m+1}-1\big)\Big\rfloor = \Big\lfloor \tfrac{1}{2}\big(\sqrt{4h(h+1)+17}-1\big)\Big\rfloor = h + [\![h = 1]\!]. \tag{22}$$

Finally, the identity matrix in the top-left corner of $M$ forces $\operatorname{rank}(M) \ge h$. So $\operatorname{rank}(M) = h$ for all $h \ge 2$. It then follows that $\mathcal{M}_1 = \operatorname{conv}\mathcal{M}_0$. $\square$

## 4 Application in Machine Learning Problems

The framework developed above is generic. For example, when $\mathcal{Y}$ represents classification over $h$ classes by canonical vectors, $\mathcal{S} = \operatorname{conv}\mathcal{Y}$ is the $h$-dimensional probability simplex (entries sum to 1). Clearly $\sigma_{\mathcal{S}}(\pi) = \max_i \pi_i$, and $\mathcal{N} = \{(x,t) \in \mathbb{R}^{h+1}_+ : \mathbf{1}^\top x = t\}$. In many applications, $\mathcal{Y}$ can be characterized as $\{y \in \{0,1\}^h : Ay \le c\}$, where $A$ is TUM and all entries of $c$ are in $\{-1, 1, 0\}$.³ In this case, the convex hull $\mathcal{S}$ has all extreme points integral, and $\mathcal{S}$ admits an explicit form:

$$\mathcal{Y} = \{y \in \{0,1\}^h : Ay \le c\} \;\Longrightarrow\; \mathcal{S} = \operatorname{conv}\mathcal{Y} = \{w \in [0,1]^h : Aw \le c\}, \tag{23}$$

i.e., all binary constraints $\{0,1\}$ are replaced by the interval $[0,1]$. Clearly TUM is a sufficient condition for PO-tractability, because $\min_{y\in\mathcal{Y}} d^\top y$ is equivalent to $\min_{w\in\mathcal{S}} d^\top w$, an LP. Examples include the graph matching and linear chain models above. We will refer to $Aw \le c$ as the non-box constraints.

³ For simplicity, we write equality constraints (handled separately in practice) using two inequality constraints.
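Before turning to the concrete instantiations, the counting step (22) in the proof of Theorem 1 is easy to verify numerically; a small check (ours):

```python
import math

# With m = h(h+1)/2 + 2 constraints, the bound (21) gives exactly h + [h == 1],
# matching (22) in the proof of Theorem 1.
for h in range(1, 12):
    m = h * (h + 1) // 2 + 2
    bound = math.floor((math.sqrt(8 * m + 1) - 1) / 2)
    assert bound == h + (1 if h == 1 else 0)
```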
### 4.1 Graph matching

As the first concrete example, we consider a convex relaxation for latent graph matching. One task in natural language processing is transliteration [12, 32]. Suppose we are given an English word $e$ with $m$ letters and a corresponding Hebrew word $h$ with $n$ letters. The goal is to predict whether $e$ and $h$ are phonetically similar, a binary classification problem with $z \in \{-1, 1\}$. However, it obviously helps to find, as an intermediate step, the letter-wise matching between $e$ and $h$. The underlying assumption is that each letter corresponds to at most one letter in the word of the other language. So if we augment both $e$ and $h$ with a sink symbol * at the end (making their lengths $\hat m := m + 1$ and $\hat n := n + 1$ respectively), we would like to find a matching $Y \in \{0,1\}^{\hat m\times\hat n}$ that minimizes the cost

$$\sum_{i=1}^{\hat m}\sum_{j=1}^{\hat n} Y_{ij}\,u^\top\phi_{ij}, \quad\text{where } \mathcal{Y} = \{0,1\}^{\hat m\times\hat n} \cap \underbrace{\{Y : Y_{i,:}\mathbf{1} = 1\ \forall i \le m,\ \mathbf{1}^\top Y_{:,j} = 1\ \forall j \le n\}}_{=:\,\mathcal{G}}. \tag{24}$$

Here $Y_{i,:}$ is the $i$-th row of $Y$. $\phi_{ij} \in \mathbb{R}^p$ is a feature vector associated with the pair of the $i$-th letter in $e$ and the $j$-th letter in $h$, including the dummy *; our notation omits its dependency on $e$ and $h$. $u$ is a discriminative weight vector that is learned from data. After finding the optimal $Y^*$, [12] uses the optimal objective value of (24) to make the final binary prediction: $\operatorname{sign}\big(\sum_{ij} Y^*_{ij}u^\top\phi_{ij}\big)$.

To pose the problem in our framework, we first note that the non-box constraints $\mathcal{G}$ in (24) are TUM. Therefore $\mathcal{S}$ is simply $[0,1]^{\hat m\times\hat n} \cap \mathcal{G}$. Given the decoded $w$, the output-labeling principle above essentially duplicates $u$ as the output-layer weight. A key advantage of our method is that it allows the weights of the two layers to be decoupled: using a weight vector $r \in \mathbb{R}^p$, we define the output score as $r^\top\Phi w$, where $\Phi$ is the $p$-by-$\hat m\hat n$ matrix whose $(i,j)$-th column is $\phi_{ij}$ (so $\Phi$ depends on $e$ and $h$). Overall, our model follows by instantiating (12) as

$$\min_{\|u\|\le1}\ \min_{\|r\|\le1}\ \mathbb{E}_{(e,h,z)\sim\tilde p}\Big[\max_{\pi}\ \max_{(\theta,\gamma)\in\mathcal{N}}\ \max_{v\in\mathbb{R}}\ \min_{w}\ -zr^\top\Phi w + vr^\top\Phi w - G^*(v) + \pi^\top w - \sigma_{\mathcal{S}}(\pi) + \sum_{ij}(u^\top\phi_{ij} + \sigma w_{ij})(\gamma w_{ij} - \theta_{ij})\Big]. \tag{25-26}$$

Once more we can minimize out $w$, which gives rise to the quadratic $\|(v-z)\Phi^\top r + \gamma\Phi^\top u + \pi - \sigma\theta\|^2$. It is again amenable to SDP relaxation, where $(M_{u,u}, M_{r,u}, M_{r,r})$ correspond to $(uu^\top, ru^\top, rr^\top)$ respectively.
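A sketch of the §4.1 inference step (ours; the helper name and the use of scipy's LP solver are our choices): decode the latent matching by solving the relaxed LP over $\mathcal{S} = [0,1]^{\hat m\times\hat n}\cap\mathcal{G}$, which is integral by total unimodularity, then form the decoupled output score $r^\top\Phi w$.

```python
import numpy as np
from scipy.optimize import linprog

# Polar operator for (24): minimize <cost, w> over S = [0,1]^{m_hat x n_hat} ∩ G.
# Rows/columns belonging to the sink symbol * are left unconstrained, as in G.
def matching_polar_lp(cost, m, n):
    m_hat, n_hat = cost.shape                      # m_hat = m + 1, n_hat = n + 1
    A_eq, b_eq = [], []
    for i in range(m):                             # each real letter of e matches exactly once
        row = np.zeros((m_hat, n_hat)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(1.0)
    for j in range(n):                             # each real letter of h matches exactly once
        col = np.zeros((m_hat, n_hat)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(1.0)
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0.0, 1.0)] * (m_hat * n_hat), method="highs")
    return res.x.reshape(m_hat, n_hat)             # a vertex solution, integral by TUM

# Usage sketch: with Phi of shape (p, m_hat*n_hat), first-layer weights u and
# decoupled output weights r,
#   w = matching_polar_lp((u @ Phi).reshape(m_hat, n_hat), m, n).ravel()
#   score_ours, score_tied = r @ (Phi @ w), u @ (Phi @ w)   # ours vs. the tied score of [12]
```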
### 4.2 Homogeneous temporal models

A variety of structured output problems are formulated with graphical models. We highlight the gist of our technique using a concrete example: unsupervised structured learning for inpainting. Suppose we are given images of handwritten words, each segmented into $p$ letters, and the latent representation is the corresponding letter sequence. Since letters are correlated in their appearance in words, the recognition problem has long been addressed using linear-chain conditional random fields. Now imagine that no ground-truth letter label is available and, instead of predicting labels, we are given images in which a random small patch is occluded; our goal is then to inpaint the patches.

To cast the problem in our two-layer latent structure model, let each letter image in the word be denoted by a vector $x_i \in \mathbb{R}^n$, and the reconstructed image by $z_i \in \mathbb{R}^m$ ($m = n$ here). Let $Y_i \in \{0,1\}^{h\times h}$ ($h = 26$, playing the role of $C$ in §2) encode the labels of the letter pair at positions $i$ and $i+1$ (as the rows and columns of $Y_i$, respectively). Let $U_v \in \mathbb{R}^{h\times n}$ be the letter-wise discriminative weights and $U_e \in \mathbb{R}^{h\times h}$ the pairwise weights. Then by (2), MAP inference can be reformulated as (cf. the definition of $\mathcal{H}$ in (3))

$$\min_{\{Y_i\}\in\mathcal{Y}}\ \sum_{i=1}^p \mathbf{1}^\top Y_i^\top U_v x_i + \sum_{i=1}^{p-1}\operatorname{tr}(U_e^\top Y_i), \quad\text{where } \mathcal{Y} = \big\{\{Y_i\} : Y_i\in\{0,1\}^{h\times h}\big\}\cap\mathcal{H}. \tag{27}$$

Since the non-box constraints in $\mathcal{H}$ are TUM, the problem can be cast in our framework with $\mathcal{S} = \operatorname{conv}\mathcal{Y} = \{\{Y_i\} : Y_i\in[0,1]^{h\times h}\}\cap\mathcal{H}$. Finally, to reconstruct the image for each letter, we assume that each letter $j$ has a basis vector $r_j \in \mathbb{R}^m$. So given $W_i$, the reconstruction output is $R^\top W_i\mathbf{1}$, where $R = (r_1,\dots,r_h)^\top$. To summarize, our model can be instantiated from (12) as

$$\min_{\|U\|\le1}\ \min_{\|R\|\le1}\ \mathbb{E}_{(x,z)\sim\tilde p}\Big[\max_{\Pi}\ \max_{(\Theta,\gamma)\in\mathcal{N}}\ \max_{v}\ \min_{W}\ \sum_{i=1}^p\big\{(v_i-z_i)^\top R^\top W_i\mathbf{1} - G^*(v_i)\big\} + \operatorname{tr}(\Pi^\top W) - \sigma_{\mathcal{S}}(\Pi) + \sum_{i=1}^p\operatorname{tr}\big((U_v x_i\mathbf{1}^\top + [\![i\neq p]\!]U_e + \sigma W_i)^\top(\gamma W_i - \Theta_i)\big)\Big]. \tag{28}$$

Here $z_i$ is the inpainted image in the training set; if no training image is occluded, then $z_i$ is simply set to $x_i$. The constraints on $U$ and $R$ can be refined, e.g. bounding $\|U_v\|$, $\|U_e\|$, and $\|r_j\|$ separately. As before, we can derive a quadratic term $\|R(v_i-z_i)\mathbf{1}^\top + \gamma U_v x_i\mathbf{1}^\top + \gamma U_e + \Pi_i - \sigma\Theta_i\|^2$ by minimizing out $W_i$, which again leads to an SDP relaxation. We may even allow each letter to employ a set of principal components whose combination yields the reconstruction (Appendix B). Besides modeling flexibility, our method also accommodates problem-specific simplification. For example, the dimension of $w$ is often much higher than the number of non-box constraints; Appendix C shows that for the linear chain, the dimension of $w$ can be reduced from $C^2$ to $C$ via a partial Lagrangian.

## 5 Optimization

The key advantage of our convex relaxation (19) is that inference depends on $\mathcal{S}$ (equivalently $\mathcal{Y}$) only through the polar operator. Our overall optimization scheme is to perform projected SGD on the objective as a function of $M$. This requires: (a) given $M$, computing the objective value and gradient; and (b) projecting onto $\mathcal{M}_1$. We next detail the former, relegating the latter to Appendix D.

Given $M$, we optimize over $(\pi, \theta, \gamma, v)$ by projected LBFGS [33]. The objective is easy to compute thanks to PO-tractability (for the $\sigma_{\mathcal{S}}(\pi)$ term). The only nontrivial part is to project a point $(\theta_0, \gamma_0)$ onto $\mathcal{N}$, which is amenable to conditional gradient (CG). Formally it requires solving

$$\min_{\theta,\gamma}\ \tfrac{1}{2}\|\theta-\theta_0\|^2 + \tfrac{1}{2}(\gamma-\gamma_0)^2, \quad\text{s.t. } \theta = \gamma s,\ \gamma\in[0,C],\ s\in\mathcal{S}. \tag{29}$$

W.l.o.g. we manually introduce an upper bound⁴ $C := \gamma_0 + \sqrt{\|\theta_0\|^2 + \gamma_0^2}$ on $\gamma$. At each iteration, CG queries the gradient $g_\theta$ in $\theta$ and $g_\gamma$ in $\gamma$, and solves the polar operator problem on $\mathcal{N}$:

$$\min_{\theta\in\gamma\mathcal{S},\,\gamma\in[0,C]}\ \theta^\top g_\theta + \gamma g_\gamma \;=\; \min_{s\in\mathcal{S},\,\gamma\in[0,C]}\ \gamma(s^\top g_\theta + g_\gamma) \;=\; \min\Big\{0,\ C\min_{s\in\mathcal{S}}(s^\top g_\theta + g_\gamma)\Big\}. \tag{30}$$

So it boils down to the polar operator on $\mathcal{S}$, and is hence tractable. If the optimal value in (30) is nonnegative, then the current iterate is already optimal. Otherwise we add the basis $(s^*, 1)$ to the ensemble, and a totally corrective update can be performed by CG. More details are available in [34]. After finding the optimal $\hat M$, we recover the optimal $w$ for each training example based on the optimal $w$ in (12). Using it as the initial point, we locally optimize the two-layer model weights $U$ and $R$ based on (14).

⁴ For $\gamma$ to be optimal, we require $(\gamma-\gamma_0)^2 \le \|\theta-\theta_0\|^2 + (\gamma-\gamma_0)^2 \le \|0-\theta_0\|^2 + (0-\gamma_0)^2$, i.e., $\gamma \le C$.
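The projection step (29)-(30) can be sketched with a plain Frank-Wolfe loop (our simplification of the totally corrective variant used in the paper); here $\mathcal{S}$ is assumed to be the probability simplex so that its polar operator is a coordinate minimum.

```python
import numpy as np

# Project (theta0, gamma0) onto N = cone{(s, 1): s in S}, truncated at gamma <= C,
# by Frank-Wolfe.  The linear oracle is exactly (30): either the apex (0, 0) or the
# scaled atom (C * s*, C) with s* = argmin_{s in S} s^T g_theta.
def project_to_cone(theta0, gamma0, iters=500):
    C = gamma0 + np.sqrt(np.sum(theta0**2) + gamma0**2)    # upper bound from footnote 4
    theta, gamma = np.zeros_like(theta0), 0.0
    for t in range(iters):
        g_theta, g_gamma = theta - theta0, gamma - gamma0  # gradient of (29)
        i = int(np.argmin(g_theta))                        # polar operator on the simplex
        if C * (g_theta[i] + g_gamma) < 0.0:
            a_theta = np.zeros_like(theta0); a_theta[i] = C
            a_gamma = C
        else:                                              # the apex of the cone is the best atom
            a_theta, a_gamma = np.zeros_like(theta0), 0.0
        if np.allclose(theta, a_theta) and np.isclose(gamma, a_gamma):
            break                                          # zero Frank-Wolfe gap: optimal
        step = 2.0 / (t + 2.0)
        theta = (1 - step) * theta + step * a_theta
        gamma = (1 - step) * gamma + step * a_gamma
    return theta, gamma
```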
## 6 Experimental Results

To empirically evaluate our convex method (henceforth referred to as CVX), we compared it with state-of-the-art methods on two prediction problems with latent structure.

**Transliteration.** The first experiment is based on the English-Hebrew corpus [35]. It consists of 250 positive transliteration pairs for training and 300 pairs for testing; on average there are 6 characters per word in each language. All these pairs are considered positive examples, and for negative examples we followed [12] and randomly sampled $t \in \{50, 75, 100\}$ pairs from the $250^2 - 250$ mismatched pairings (i.e. 20%, 30%, and 40% of 250, respectively). We did not use many negative examples because, as per [12], our test performance measure depends mainly on the highest few discriminative values, which are learned largely from the positive examples.

Given a pair of words $(e, h)$, the feature representation $\phi_{ij}$ for the $i$-th letter in $e$ and the $j$-th letter in $h$ is the unigram feature: an $n$-dimensional vector with all zeros except a single one in the $(e_i, h_j)$-th coordinate. In this dataset there are $n = 655$ possible letter pairs (* included). Since our primary objective is to determine whether the convex relaxation of a two-layer model with latent structure can outperform locally trained models, we adopted this simple but effective feature representation (rather than delving into heuristic feature engineering).

Our test evaluation measure is the Mean Reciprocal Rank (MRR), the average of the reciprocal of the rank of the correct answer. In particular, for each English word $e$, we calculated the discriminative score of the respective method when $e$ is paired with each Hebrew word in the test set, and then found the rank of the correct word (1 for the highest). The reciprocal of the rank is averaged over all test pairs, giving the MRR. A higher value is therefore preferred, and 50% means that on average the true Hebrew word is the runner-up. For our method, the discriminative score is simply $f := r^\top\Phi w$ (using the symbols in (25)), and for [12] it is $f := \max_{Y\in\mathcal{Y}} u^\top\Phi\operatorname{vec}(Y)$ (the vectorization of $Y$).

We compared our method (with $\sigma = 0.1$) against the state-of-the-art approach in [12]. It is a special case of our model with the second-layer weight $r$ tied to the first-layer weight $u$. They trained it using a local optimization method, and we refer to it as Local. Both methods employ the output loss $\max\{0, -yf\}^2$ with $y \in \{+1, -1\}$, and both contain only one parameter, the bound on $\|u\|$ (and $\|r\|$); we simply tuned it to optimize the performance of Local.

The test MRR is shown in Figure 1, where the number of negative examples is varied over 50, 75, and 100. Local was trained with random initialization, and we repeated the random selection of the negative examples 10 times, yielding 10 dots in each scatter plot. It is clear that CVX generally delivers significantly higher MRR than Local, with the dots lying above or close to the diagonal. Since this dataset is not big, the randomness of the negative set leads to notable variations in the performance of both methods.

[Figure 1: MRR of Local versus CVX. Panels: (a) 50 negative examples, (b) 75 negative examples, (c) 100 negative examples; x-axis: MRR of Local.]

| | $k = 2$ | $k = 3$ | $k = 4$ |
|---|---|---|---|
| CRF-AE | 0.29 ± 0.01 | 0.80 ± 0.01 | 1.31 ± 0.02 |
| CVX | 0.27 ± 0.01 | 0.79 ± 0.01 | 1.28 ± 0.02 |

Table 1: Total inpainting error as a function of the size of the occluded patch ($k \times k$), with $p = 8$.

| | $p = 4$ | $p = 6$ | $p = 8$ |
|---|---|---|---|
| CRF-AE | 1.33 ± 0.04 | 1.30 ± 0.02 | 1.31 ± 0.03 |
| CVX | 1.29 ± 0.04 | 1.27 ± 0.02 | 1.28 ± 0.03 |

Table 2: Total inpainting error as a function of the sequence length $p$, with $k = 4$.

**Inpainting for occluded images.** Our second experiment used the structured latent model to inpaint images. We generated 200 sequences of images for training, each with $p \in \{4, 6, 8\}$ digits. In order to introduce structure, each sequence is either odd (all digits are 1 or 3) or even (all digits are 2 or 4), so $C = 4$. Given the digit label, the corresponding image ($x \in [0,1]^{196}$) was sampled from the MNIST dataset, downsampled to 14-by-14. 200 test sequences were also generated.
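The data generation just described might look like the following sketch (entirely ours; `images_by_digit`, mapping a digit to its stack of 28x28 MNIST images, is a hypothetical helper, and the 2x2 average-pooling downsampling is our assumption since the paper only states the 14-by-14 size).

```python
import numpy as np

def make_sequences(images_by_digit, num_seq=200, p=6, seed=0):
    """Generate parity-structured digit sequences with 14x14 images in [0,1]^196."""
    rng = np.random.default_rng(seed)
    sequences = []
    for _ in range(num_seq):
        digits = [1, 3] if rng.random() < 0.5 else [2, 4]      # odd or even sequence
        labels = rng.choice(digits, size=p)
        imgs = []
        for d in labels:
            pool = images_by_digit[int(d)]
            img = pool[rng.integers(len(pool))].astype(float) / 255.0
            img = img.reshape(14, 2, 14, 2).mean(axis=(1, 3))  # 28x28 -> 14x14 (our choice)
            imgs.append(img.ravel())
        sequences.append((labels, np.stack(imgs)))
    return sequences
```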
In the test data, we randomly set a $k\times k$ patch of each image to 0 as occluded ($k \in \{2,3,4\}$), and the task is to inpaint it. This setting is entirely unsupervised, with no digit label available for training. It falls in the framework of $X \to Y \to Z$, where $X$ is the occluded input, $Y$ is the latent digit sequence, and $Z$ is the recovered image. In our convex method, we tied $U_v$ with $R$, so we still have a 3-by-3 block matrix $M$, corresponding to $I$, $U_v$, and $U_e$. We set $\sigma$ to $10^{-1}$ and $G(\cdot) = \frac{1}{2}\|\cdot\|^2$ (Gaussian). $Y$ was predicted using the polar operator, based on which $Z$ was predicted with the Gaussian mean.

For comparison, we used CRF-AE, which was proposed very recently by [7]. Although it ties $X$ and $Z$, the extension to our setting is trivial by computing the expected value of $Z$ given $X$. Here $P(Z|Y)$ is assumed Gaussian, whose mean is learned by maximizing $P(Z = x|X = x)$, and we initialized all model parameters with a unit Gaussian. For ease of comparison, we introduced regularization by constraining the model parameters to $L_2$-norm balls rather than penalizing the squared $L_2$ norm. For both methods, the radius bound was simply chosen as the maximum $L_2$ norm of the images, which produced consistently good results. We did not use larger $k$ because the images are of size 14-by-14.

The inpainting error of the two methods is shown in Table 1, where we varied the size of the occluded patch with $p$ fixed to 6, and in Table 2, where the length of the sequence $p$ was varied while $k$ was fixed to 4. Each number is the sum of squared errors in the occluded patch, averaged over 5 random generations of the training and test data (hence producing the mean and standard deviation). We can see that CVX gives lower error than CRF-AE. Unsurprisingly, the error grows almost quadratically in $k$. When the length of the sequence grows, the errors of both CVX and CRF-AE fluctuate non-monotonically. This is probably because with more images in each sequence the total error is summed over more images, while the error per image decays thanks to the structure.

## 7 Conclusion

We have presented a new formulation of two-layer models with latent structure that maintains a jointly convex training objective. Its effectiveness is demonstrated by superior empirical performance over local training, along with a low-rank characterization of the extreme points of the feasible region. An interesting direction for future investigation is the case where the latent layer employs submodularity, with its base polytope mirroring the support set $\mathcal{S}$.

## References

[1] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.
[2] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.
[3] V. Mnih, H. Larochelle, and G. E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. In AISTATS, 2011.
[4] M. Ratajczak, S. Tschiatschek, and F. Pernkopf. Sum-product networks for structured prediction: Context-specific deep conditional random fields. In Workshop on Learning Tractable Probabilistic Models, 2014.
[5] K. Sohn, X. Yan, and H. Lee. Learning structured output representation using deep conditional generative models. In NIPS, 2015.
[6] R. Collobert. Deep learning for efficient discriminative parsing. In ICML, 2011.
[7] W. Ammar, C. Dyer, and N. A. Smith. Conditional random field autoencoders for unsupervised structured prediction. In NIPS, 2014.
[8] L. Xu, D. Wilkinson, F. Southey, and D. Schuurmans. Discriminative unsupervised learning of structured predictors. In ICML, 2006.
[9] H. Daumé III. Unsupervised search-based structured prediction. In ICML, 2009.
[10] N. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, 2005.
[11] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[12] M.-W. Chang, D. Goldwasser, D. Roth, and V. Srikumar. Discriminative learning over constrained latent representations. In NAACL, 2010.
[13] M.-W. Chang, V. Srikumar, D. Goldwasser, and D. Roth. Structured output learning with indirect supervision. In ICML, 2010.
[14] N. Chen, J. Zhu, F. Sun, and E. P. Xing. Large-margin predictive latent subspace learning for multiview data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2365-2378, 2012.
[15] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.
[16] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In ICML, 2012.
[17] A. Gane, T. Hazan, and T. Jaakkola. Learning with random maximum a-posteriori perturbation models. In AISTATS, 2014.
[18] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In ICML, 2005.
[19] O. Aslan, X. Zhang, and D. Schuurmans. Convex deep learning via normalized kernels. In NIPS, 2014.
[20] Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.
[21] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In ICML, 2014.
[22] R. Livni, S. Shalev-Shwartz, and O. Shamir. An algorithm for training polynomial networks, 2014. arXiv:1304.7045v2.
[23] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.
[24] A. Gotovos, H. Hassani, and A. Krause. Sampling from probabilistic submodular models. In NIPS, 2015.
[25] G. Haffari and A. Sarkar. Analysis of semi-supervised learning with the Yarowsky algorithm. In UAI, 2007.
[26] L. Xu, M. White, and D. Schuurmans. Optimal reverse prediction: A unified perspective on supervised, unsupervised and semi-supervised learning. In ICML, 2009.
[27] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[28] O. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and strong: MAP inference with linear convergence. In NIPS, 2015.
[29] G. Druck, C. Pal, X. Zhu, and A. McCallum. Semi-supervised classification with hybrid generative/discriminative methods. In KDD, 2007.
[30] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.
[31] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339-358, 1998.
[32] D. Goldwasser and D. Roth. Transliteration as constrained optimization. In EMNLP, 2008.
[33] http://www.cs.ubc.ca/~schmidtm/Software/minConf.html
[34] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
[35] https://cogcomp.cs.illinois.edu/page/resource_view/2