# structural_maxent_models__2f24cd6d.pdf Structural Maxent Models Corinna Cortes CORINNA@GOOGLE.COM Google Research, 111 8th Avenue, New York, NY 10011 Vitaly Kuznetsov VITALY@CIMS.NYU.EDU Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012 Mehryar Mohri MOHRI@CIMS.NYU.EDU Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012 Umar Syed USYED@GOOGLE.COM Google Research, 111 8th Avenue, New York, NY 10011 We present a new class of density estimation models, Structural Maxent models, with feature functions selected from a union of possibly very complex sub-families and yet benefiting from strong learning guarantees. The design of our models is based on a new principle supported by uniform convergence bounds and taking into consideration the complexity of the different sub-families composing the full set of features. We prove new data-dependent learning bounds for our models, expressed in terms of the Rademacher complexities of these sub-families. We also prove a duality theorem, which we use to derive our Structural Maxent algorithm. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments demonstrating that its performance improves on that of existing L1-norm regularized Maxent algorithms. We further similarly define conditional Structural Maxent models for multi-class classification problems. These are conditional probability models also making use of a union of possibly complex feature subfamilies. We prove a duality theorem for these models as well, which reveals their connection with existing binary and multi-class deep boosting algorithms. Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s). 1. Introduction Maximum entropy models, commonly referred to as Maxent models, are density estimation methods used in a variety of tasks in natural language processing (Berger et al., 1996; Rosenfeld, 1996; Pietra et al., 1997; Malouf, 2002; Manning & Klein, 2003; Ratnaparkhi, 2010) and in many other applications, including species habitat modeling (Phillips et al., 2004; 2006; Dud ık et al., 2007; Elith et al., 2011). Maxent models are based on the density estimation principle advocated by Jaynes (1957), which consists of selecting the distribution that is the closest to the uniform distribution or to some other prior, while ensuring that the average value of each feature matches its empirical value. When closeness is measured in terms of the relative entropy, Maxent models are known to coincide with Gibbs distributions, as in the original Boltzmann models in statistical mechanics. One key benefit of Maxent models is that they allow the use of diverse features that can be selected and augmented by the user, while in some other popular density estimation techniques such as n-gram modeling in language processing, the features are inherently limited. The richness of the features used in many tasks as well as small sample sizes have motivated the use of regularized Maxent models where the L1-norm (Kazama & Tsujii, 2003) or the L2-norm (Chen & Rosenfeld, 2000; Lebanon & Lafferty, 2001) of the parameter vector defining the Gibbs distribution is controlled. This can be shown to be equivalent to the introduction of a Laplacian or Gaussian prior over the parameter vector in a Bayesian interpretation (Williams, 1994; Goodman, 2004). Group sparsity regularizations can also be used with for example L1and L2-norms: the parameter vector is partitioned into blocks with L2-norm used Structural Maxent Models within blocks and L1-norm for combining blocks (Huang & Zhang, 2010). An extensive theoretical study of these regularizations and the introduction of other more general ones were presented by Dud ık et al. (2007). The generalization guarantees for regularized Maxent models depends on the sample size and the complexity of the family of features used. This dependency suggests using relatively simpler feature families, such as threshold functions over the input variables. However, feature functions selected from simpler families could excessively limit the expressiveness of the model. This paper introduces and studies a new family of density estimation models, Structural Maxent models, which offers the flexibility of selecting features out of complex families, while benefiting from strong learning guarantees. Let H1, . . . , Hp be p families of feature functions with increasing complexity. Hk could be for example the family of regression trees of depth k or that of monomials of degree k based on the input variables. The main idea behind the design of our model is to allow the use of features from the family of very deep trees or other rich or complex families (that is, Hks with relatively large k), but to reserve less total model parameter weight for such features than for those chosen from simpler families (Hks with smaller k). We call our Maxent models structural since they exploit the structure of H as a union of Hks. Note, however, that the main idea behind the design of our models is distinct from that of structural risk minimization (SRM) (Vapnik, 1998): while SRM seeks a single Hk with an optimal trade-off between empirical error and complexity, structural Maxent allocates different model weights to features in different Hks to achieve an even better trade-off based on multiple Hks. In the following, we first define a new Structural Maxent principle for the general scenario where feature functions may be selected from multiple families (Section 2.1). Our new principle takes into consideration the different complexity of each of these families and is supported by data-dependent uniform convergence bounds. This principle guides us to design our new Structural Maxent models, whose regularization depends on the data-dependent complexity of each of the feature families. We study the optimization problem for Structural Maxent models and present a duality theorem showing the equivalence of the primal and dual problems, which can be viewed as a counterpart of the duality theorem for standard Maxent (Pietra et al., 1997) (see also (Dud ık et al., 2007)) (Section 2.2). Next, we present data-dependent learning guarantees for our Structural Maxent models in terms of the Rademacher complexity of each of the feature families used (Section 2.3). The amount of total parameter weight apportioned to each family in our models is quantitatively determined by our new Maxent principle and further jus- tified by these learning guarantees. In Section 2.4, we describe in detail our Struct Maxent algorithm, including the details of its derivation and its pseudocode. Our algorithm consists of applying coordinate descent to the dual objective function proceeding in the same way as (Dud ık et al., 2007) for L1-norm regularized Maxent. We derive two versions of our algorithm differing only by the definition of the step, which is based on minimizing different upper bounds. The first version of our algorithm uses the same upper bounding technique as Dud ık et al. (2007) for L1-norm regularized Maxent, and, as with the algorithm of Dud ık et al. (2007), is subject to several assumptions. The second version of our algorithm does not require any assumption and leads to a simpler analysis, at the cost of only slightly slower convergence. We prove convergence guarantees for both versions of our algorithm. In Section 3, we further extend our ideas to the scenario of multi-class classification. We present a new conditional Maxent principle for the case where multiple feature families are used (Section 3.1), which leads to the definition of our conditional Structural Maxent models. These are conditional probability models that admit as a special case existing conditional Maxent models, or, equivalently, multinomial logistic regression models. We prove a duality theorem showing the equivalence of the primal and dual optimization problems for our conditional Structural Maxent models. This shows that these models precisely coincide with the Deep Boost algorithms of Cortes et al. (2014) and Kuznetsov et al. (2014) in the special case where the surrogate loss function used is the logistic function. Thus, our algorithm benefits from the data-dependent learning guarantees and empirical validation already presented for deep boosting. Conversely, our analysis and new conditional structural Maxent principle provide an alternative justification in support of deep boosting. In Section 4, we report the results of extensive experiments with data from various domains including community crimes, traffic and species habitat modeling. Our results show the advantage of our structural Maxent models for density estimation when compared to existing regularized Maxent models. 2. Structural Maxent models Let X denote the input space. We first consider the following problem of density estimation. Assume that a sample S = (x1, . . . , xm) X m of m points drawn from an unknown distribution D is given and that we have at our disposal a feature vector Φ(x) associated to each point x X. Then, the standard density estimation problem consists of using the sample S and the features to find a distribution p that forms a good estimate of D. We consider the general case of infinite families of feature functions. Note that even standard families of threshold Structural Maxent Models functions over a finite set of variables have infinite size. Let H1, . . . , Hp be p 1 families of functions mapping X to R with each feature function falling in one of these families. Assume that the Hks have increasing Rademacher complexities. For example, H1 may be composed of some simple feature functions φ, H2 the set of all products φψ with φ, ψ H1, and, more generally, Hk may contain all monomials of degree k over functions in H1. In fact, it is common in some applications where Maxent is used to design new features precisely in this manner, starting with some basic features, here H1. Similarly, Hk may be the set of regression trees of depth k with node questions based on threshold functions of some set of variables or, more complex functions of these variables as in splines. 2.1. Principle The key idea behind our new Maxent formulation is to take into consideration the distinct complexity of each family of feature functions Hk. Let Φ be the feature mapping from X to the space F, with F = F1 Fp. For any x X, Φ(x) can be decomposed as Φ(x) = (Φ1(x), . . . , Φp(x)), with Φk a mapping from X to Fk such that Φk Λ, for all k [1, p], and with each of its components Φk,j in Hk. By a standard Rademacher complexity bound (Koltchinskii & Panchenko, 2002) and the union bound, for any δ > 0, the following inequality holds with probability at least 1 δ over the choice of a sample of size m: E x D[Φk(x)] E x S[Φk(x)] 2Rm(Hk) + Λ for all k [1, p]. Here, we denote by Ex S[Φk(x)] the expected value of Φk with respect to the empirical distribution defined by the sample S. Let p0 be a distribution over X with p0[x] > 0 for all x X, typically chosen to be the uniform distribution. In view of (1), our extension of the maximum entropy principle (see Jaynes (1957; 1983)) consists of selecting p as the distribution that is the closest to p0 and that verifies for all k [1, p]: E x p[Φk(x)] E x S[Φk(x)] 2Rm(Hk) + β, where β > 0 is a parameter. Here, closeness is measured using the relative entropy. Let denote the simplex of distributions over X, then, our structural Maxent principle can be formulated as the following optimization problem: min p D(p p0), s.t. k [1, p] : (2) E x p[Φk(x)] E x S[Φk(x)] 2Rm(Hk) + β. Since the relative entropy, D, is convex with respect to its arguments and since the constraints are affine, this defines a convex optimization problem. The solution is in fact unique since the relative entropy is strictly convex. Since the empirical distribution is a feasible point, problem (2) is feasible. For any convex set K, let IK denote the function defined by IK(x) = 0 if x K, IK(x) = + otherwise. Then, the problem can be equivalently expressed as minp F(p) where F(p) = D(p p0) + I (p) + IC(E p[Φ]), (3) where is the simplex and where C is the convex set defined by C = {u: uk ES[Φk] βk, k [1, p]}, with βk = 2Rm(Hk) + β. 2.2. Dual problem As for the standard Maxent models with L1 constraints (Kazama & Tsujii, 2003) (see also (Dud ık et al., 2007)), we can derive an equivalent dual problem for (2) or (3) formulated as a regularized maximum likelihood problem over Gibbs distributions. Let G be the function defined for all w in the dual of F by i=1 log pw[xi] k=1 βk wk 1 , (4) with pw = p0[x]ew Φ(x) Zw and Zw = P x X p0[x]ew Φ(x). For simplicity, we assume that the dual of F is RN. Then, the following theorem gives a result similar to the duality theorem of (Pietra et al., 1997). Theorem 1. Problems (2) and (3) are equivalent to the dual optimization problem supw RN G(w): sup w RN G(w) = min p F(p). (5) Furthermore, let p = argminp F(p), then, for any ϵ > 0 and any w such that |G(w) supw RN G(w)| < ϵ, the following inequality holds: D(p pw) ϵ. The proof is given in Appendix A (Theorem 1). In view of the theorem, if w is an ϵ-solution of the dual optimization problem, then D(p pw) ϵ, which, by Pinsker s inequality implies that pw is 2ϵ-close in L1-norm to the optimal solution of the primal: p pw 1 2ϵ. Thus, the solution of our structural Maxent problem can be determined by solving the dual problem, which can be written equivalently as follows: inf w RN β w 1 +2 k=1 Rm(Hk) wk 1 1 i=1 log pw[xi]. (6) The difference in problem (6) with respect to the common L1-regularized Maxent problem is the remarkable new second term, which is defined by the Rademacher complexities of the Hks. This term penalizes more the norm of a weight vector wk associated to a feature vector selected from a complex hypothesis set Hk. Structural Maxent Models 2.3. Generalization bound Let LD(w) denote the log-loss of the distribution pw with respect to a distribution D, LD(w) = Ex D[ log pw[x]], and similarly LS(w) its log-loss with respect to the empirical distribution defined by a sample S. Theorem 2. Fix δ > 0. Let bw be a solution to the opti- mization (6) with β = Λ q δ 2m . Then, with probability at least 1 δ over the sample S, LD(bw) is bounded by inf w LD(w) + 2 2Rm(Hk) + Λ Proof. Using the definition of LD(w) and LS(w), H older s inequality, and inequality (1), with probability at least 1 δ, the following holds: LD(bw) LS(bw) = bw [E S[Φ] E D[Φ]] k=1 bwk [E S[Φk] E D[Φk]] k=1 bwk 1 E S[Φk] E D[Φk] k=1 bwk 1(β + 2Rm(Hk)). Thus, since bw is a minimizer, we can write for any w LD(bw) LD(w) = LD(bw) LS(bw) + LS(bw) LD(w) k=1 bwk 1(β + 2Rm(Hk)) + LS(bw) LD(w) k=1 wk 1(β + 2Rm(Hk)) + LS(w) LD(w) k=1 wk 1(β + 2Rm(Hk)), where we used in the last step the left inequality counterpart of inequality (1). This concludes the proof. This bound suggests that learning with feature functions selected from highly complex families (relatively large Rm(Hk)) can benefit from favorable learning guarantees so long as the total weight assigned to these features by the model is relatively small. 2.4. Algorithm Our algorithm consists of applying coordinate descent to the objective function of (6). Ignoring the constant term m Pm i=1 log p0[xi], the optimization problem (6) can be rewritten as infw F(w) with k=1 βk wk 1 w E S[Φ]+log X x X p0[x]ew Φ(x) , and βk = β + 2Rm(Hk) for k [1, p]. Function F is not differentiable but it is convex and admits a subdifferential at any point. For notational simplicity, we will assume that each Fk is finite. 2.4.1. DIRECTION Let wt 1 denote the weight vector defined after (t 1) iterations. At each iteration t [1, T], the direction ek,j, (k, j) [1, p] [1, Nk], with Nk being the size of Fk, considered by coordinate descent is δF(wt 1, ek,j). If wt 1,k,j = 0, then F admits a directional derivative along ek,j given by F (wt 1, ek,j) = βk sgn(wt 1,k,j) + ϵt 1,k,j. where ϵt 1,k,j = Epwt 1[Φk,j] ES[Φk,j]. If wt 1,k,j = 0, F admits right and left directional derivatives along ek,j: F +(wt 1, ek,j) = βk + ϵt 1,k,j and F (wt 1, ek,j) = βk + ϵt 1,k,j. Thus, in summary, we can write, δF(wt 1, ek,j) = βk sgn(wt 1,k,j) + ϵt 1,k,j if (wt 1,k,j = 0) 0 else if ϵt 1,k,j βk βk sgn(ϵt 1,k,j) + ϵt 1,k,j otherwise. The coordinate descent algorithm selects the direction ek,j with the largest numeric value of δF(wt 1, ek,j). 2.4.2. TWO ALTERNATIVE STEP SIZES Given the direction ek,j, the optimal step value η is given by argminη F(wt 1 + η ek,j). η can be found via a line search or other numerical methods. We can also derive a closed-form solution for the step by minimizing an upper bound on F(wt 1 + η ek,j). We present closed-form solutions based on two different but related upper bounds. The full argument is given in Appendix C. In what follows, we highlight the main steps of these derivations. Observe that F(wt 1 + η ek,j) F(wt 1) (7) = βk(|wk,j + η| |wk,j|) η E S[Φk,j] + log h E pwt 1 [eηΦk,j] i . STEP SIZE STRUCTMAXENT1 Since Φk,j [ Λ, +Λ], by the convexity of x 7 eηx, we can write eηΦk,j Λ Φk,j 2Λ e ηΛ + Φk,j+Λ Structural Maxent Models Taking the expectation and the log yields log E pwt 1 [eηΦk,j] log Φ + t 1,k,jeηΛ Φ t 1,k,je ηΛ where we used the following notation: Φ s t 1,k,j = E pwt 1 [Φk,j] + sΛ Φ s k,j = E S[Φk,j] + sΛ, for all (k, j) [1, p] [1, Nk] and s { 1, +1}. Plugging back this inequality in (7) and minimizing the resulting upper bound on F(wt 1 + η ek,j) F(wt 1) leads to the closed-form expression for the step size η. The full pseudocode of the algorithm using this closed-form solution for the step size, Struct Maxent1, is given in Appendix B. Note that the following condition on Φ, Λ and βk must be satisfied: ( E pwt 1 [Φk,j] { Λ, +Λ}) (| E S[Φk,j]| < Λ βk). (8) This first version of Struct Maxent uses the same upper bounding technique as the L1-norm regularized Maxent algorithm of Dud ık et al. (2007), which is subject to the same type of conditions as (8). STEP SIZE STRUCTMAXENT2 An alternative method is based on a somewhat looser upper bound for log Epwt 1[eηΦk,j] using Hoeffding s lemma. A key advantage is that the analysis is no more subject to conditions such as (8). Additionally, the Struct Maxent2 algorithm is much simpler. The price to pay is a slightly slower convergence but, as pointed out in Section 4, both algorithms exhibit an exponential convergence and lead to the same results. By Hoeffding s lemma, since Φk,j [ Λ, +Λ], we can write log E pwt 1 [eηΦk,j] η E pwt 1 [Φk,j] + η2Λ2 Combining this inequality with (7) and minimizing the resulting upper bound leads to an alternative closed-form solution for the step size η. Figure 1 shows the pseudocode of our algorithm using the closed-form solution for the step size just presented. 2.5. Convergence analysis The following result gives convergence guarantees for both versions of our Struct Maxent algorithm. Theorem 3. Let (wt)t be the sequence of parameter vectors generated by Struct Maxent1 or Struct Maxent2. Then, (wt)t converges to the optimal solution w of (6). STRUCTMAXENT2(S = (x1, . . . , xm)) 1 for t 1 to T do 2 for k 1 to p and j 1 to Nk do 3 if (wt 1,k,j = 0) then 4 dk,j βk sgn(wt 1,k,j) + ϵt 1,k,j 5 elseif |ϵt 1,k,j| βk then 6 dk,j 0 7 else dk,j βk sgn(ϵt 1,k,j) + ϵt 1,k,j 8 (k, j) argmax (k,j) [1,p] [1,Nk] |dk,j| 9 if (|wt 1,k,jΛ2 ϵt 1,k,j| βk) then 10 η wt 1,k,j 11 elseif (wt 1,k,jΛ2 ϵt 1,k,j > βk) then 12 η 1 Λ2 [ βk ϵt 1,k,j] 13 else η 1 Λ2 [βk ϵt 1,k,j] 14 wt wt 1 + ηek,j 15 pwt p0[x]ewt Φ(x) P x X p0[x]ewt Φ(x) 16 return pwt Figure 1. Pseudocode of the Struct Maxent2 algorithm. For all (k, j) [1, p] [1, Nk], βk = 2Rm(Hk) + β and ϵt 1,k,j = Epwt 1[Φk,j] ES[Φk,j]. Note that no technical assumption such as those of (8) are required here for the closed-form expressions of the step size. The full proof of this result is given in Appendix D. We also remark that if the step size is determined via a line search, then our algorithms benefit from an exponential convergence rate (Luo & Tseng, 1992). 3. Conditional Structural Maxent Models In this section, we extend the analysis presented in the previous section to that of conditional Maxent models, also known as multinomial logistic regression. We consider a multi-class classification problem with c 1 classes and denote by Y = {1, . . . , c} the output space and by D the distribution over X Y from which pairs (x, y) X Y are drawn i.i.d. As in standard supervised learning problems, the learner receives a labeled sample S = ((x1, y1), . . . , (xm, ym)) (X Y)m and, as in Section 2.1, we assume the learner has access to a feature function Φ: X Y F that can be decomposed as Φ = (Φ1, . . . , Φp), where for any k [1, p], Φk is a mapping from X Y to Fk with elements in Hk. 3.1. Principle Our conditional Structural Maxent, or structural logistic regression models can be defined in a way similar to that of density estimation. Here, the problem consists of learning a conditional probability p[ |x] for all x X. As in the density estimation case, we will denote by p0[ |x] a conditional Structural Maxent Models probability, often chosen to be the uniform distribution. As in the density estimation case, the definition of our algorithm is guided by a generalization bound expressed in terms of the Rademacher complexity of the families Hk. For any δ > 0, the following inequality holds with probability at least 1 δ over the choice of a sample of size m, for all k [1, p]: E x bp y D[ |x] [Φk(x, y)] E x bp y bp[ |x] [Φk(x, y)] βk, (9) where we denote by ES the expectation over the empirical distribution defined by sample S and βk = 2Rm(Hk) + q δ 2m . Let bp denote the empirical distribution of the input points. Our structural conditional Maxent model is defined by searching the conditional probabilities p[ |x] that are as close as possible to p0[ |x], while ensuring an inequality similar to (9), where closeness is defined via the conditional relative entropy based on bp (Cover & Thomas, 2006). This leads to the following convex optimization problem: x X bp[x] D p[ |x] p0[ |x] (10) s.t. E x bp h E y p[ |x][Φk(x, y)] i E (x,y) S[Φk(x, y)] 2Rm(Hk) + β, k [1, p], where we again denote by the simplex in Y. Let p denote the vector of conditional probabilities (p[ |xi])i [1,m]. Then, this problem can be equivalently expressed as minp e F(p) where e F(p)= E x bp D p[ |x] p0[ |x] +I p[ |x] +IC E x bp y p[ |x] [Φ] , (11) where is the simplex and C the convex set defined by C = {u: uk ES[Φk] βk, k [1, p]}, with βk = 2Rm(Hk) + β. 3.2. Dual problem Let e G be the function defined for all w RN by i=1 log pw[yi|xi] k=1 βk wk 1 , (12) with, for all x X, pw[y|x] = p0[x]ew Φ(x) Zw(x) and Zw(x) = P x X p0[x]ew Φ(x). Then, the following theorem gives a result similar to the duality theorem presented in the nonconditional case. Theorem 4. Problem (10) is equivalent to dual optimization problem supw RN e G(w): sup w RN e G(w) = min p e F(p). (13) Furthermore, let p = argminp e F(p). Then, for any ϵ > 0 and any w such that | e G(w) supw RN e G(w)| < ϵ, we have Ex bp h D p [ |x] p0[ |x] i ϵ. The proof of the theorem is given in Appendix A (Theorem 4). As in the non-conditional case, the theorem suggests that the solution of our structural conditional Maxent problem can be determined by solving the dual problem, which can be written equivalently as follows: inf w β w 1+2 k=1 Rm(Hk) wk 1 1 i=1 log pw[yi|xi] . (14) We note that problem (14) coincides with the optimization problem presented by Cortes et al. (2014) and Kuznetsov et al. (2014) for the Deep Boost algorithms in the particular case where the logistic function is used as a convex surrogate loss function. Our analysis and derivation starting from the conditional Maxent principle provide an alternative justification for these algorithms. In return, our conditional Struct Maxent algorithm benefits from the learning guarantees already given by deep boosting. 4. Experiments This section reports the results of our experiments with the Struct Maxent algorithm. We have fully implemented both Struct Maxent1 and Struct Maxent2 with diverse feature families and will make the software used in our experiments available as open-source. We do not report empirical results for the conditional Struct Maxent algorithm since, as already pointed out, the conditional algorithm coincides with the deep boosting algorithms that have been already extensively studied by Cortes et al. (2014). Our Struct Maxent algorithm can be applied with a variety of different families of feature functions Hk mapping the input space X to R. In our experiments, we used the union of two broad classes of feature maps: monomial features Hmono and tree features Htrees. Let Hmono 1 = {ψ1, . . . , ψd} denote the set of raw features mapping X to R. We define Hmono k = {ψj1 ψjk : jr [1, d]} as the set of all monomials of degree k derived from Hmono 1 and Hmono as the union of all these families: Hmono = k=1Hmono k . Similarly, we denote by Htrees k the set of all binary decision trees with k internal nodes, where the node questions are threshold functions 1ψj(x) θ with j [1, d] and θ R, and where the leaves are labeled with zero or one and define Htrees as the union of these families: Htrees = k=1Htrees k . Note that our monomial and tree features are strict generalizations of the product features and threshold features used by Phillips et al. (2004; 2006) and Dud ık et al. (2007). The hypothesis set H = Hmono Htrees is infinite and can be very large even for a bounded monomial degree and for Structural Maxent Models Table 1. Experimental comparison of Maxent algorithms. Dataset Maxent L1-Maxent Struct Maxent Dataset Maxent L1-Maxent Struct Maxent b. variegatus 19.95 15.67 13.36 m. minutus 17.41 12.20 10.25 (0.54) (0.33) (0.28) (0.87) (0.78) (0.30) arson 5.93 5.75 5.68 murders 5.38 5.23 5.17 (0.02) (0.01) (0.02) (0.02) (0.02) (0.02) rapes 6.42 6.22 6.16 robberies 5.14 5.00 4.94 (0.02) (0.01) (0.01) (0.01) (0.01) (0.01) burglary 6.04 5.85 5.78 assault 6.65 6.35 6.30 (0.01) (0.01) (0.01) (0.02) (0.01) (0.01) larceny 5.83 5.65 5.58 auto theft 5.93 5.75 5.68 (0.01) (0.01) (0.01) (0.02) (0.01) (0.02) traffic 14.72 13.85 13.00 (1.11) (0.24) (1.01) trees restricted to the finite training sample. Thus, exhaustively searching for the best monomial in Hmono and the best decision tree in Htrees is not tractable. Instead, we used the following greedy procedure: given the best monomial m(x) of degree k (starting with k = 0), we find the best monomial of degree k + 1 that can be obtained from m(x) by multiplying it with one of ψ Hmono 1 . Similarly, given the best decision tree of size k, we find the best decision tree of size k + 1 that can be obtained by splitting exactly one leaf of the given tree. Let Φ1, . . . , Φt 1 be features chosen by the algorithm after the first t 1 iterations. Then, on the t-th iteration, the algorithm selects a feature from {Φ1, . . . , Φt 1, t , m }, where t is the tree and m the monomial obtained by the procedure just described. Note that Struct Maxent requires the knowledge of the Rademacher complexities Rm(Hk), which in certain cases can be estimated from the data. For simplicity, in our experiments, we used the following upper bounds Rm(Hmono k ) Rm(Htrees k ) (4k + 2) log2(d + 2) log(m + 1) and we redefined βk as βk = λBk + β, where Bk is the appropriate choice of an upper bound in (15) and where the parameter λ is introduced to control the balance between the magnitude of Bk and β. The proof of these upper bounds is given in Appendix E. In our experiments, we determined both parameters λ and β through a validation procedure. Specifically, we optimized over λ, β {0.0001, 0.001, 0.01, 0.1, 0.5, 1, 2}. Remarkably, in all of our experiments the best value of λ was 0.1. We compared Struct Maxent with L1-regularized Maxent (Kazama & Tsujii, 2003; Phillips et al., 2004; 2006; Dud ık et al., 2007) which is the special case of Struct Maxent with λ = 0. L1-regularized Maxent is the only algorithm used for experiments in (Phillips et al., 2004; 2006; Dud ık et al., 2007). The parameter β of L1-Maxent was set in the same way as for the Struct Maxent algorithm. We compared the performance of Struct Maxent to that of L1-Maxent and to standard Maxent (λ = β = 0) which served as our baseline. Following Phillips et al. (2004), we ran each algorithm for 500 rounds, or until the change in the objective on a single round fell below 10 5. For our experiments, we used a number of different datasets related to species habitat modeling, traffic modeling, and to communities and crimes, which we describe in the following subsections. 4.1. Species habitat modeling Species habitat modeling is among the most prominent applications of Maxent models (Phillips et al., 2004; 2006; Dud ık et al., 2007; Elith et al., 2011). For our experiments, we used two data sets from (Phillips et al., 2006) which are accessible from http://www.cs.princeton.edu/ schapire/maxent. These datasets consist of a set of 648,658 geographical locations in South America which constitute the input space X. For each of these locations, we have a set of environmental variables such as temperature and precipitation, which constitute our raw features ψ1, . . . , ψd. For each location, we are also given the number of times a particular kind of species has been observed, which defines the sample S used as an input to Maxent algorithms. The first dataset consists of 116 observations of bradypus variegatus and the second dataset has 88 observations of microryzomys minutus. The task consists of estimating the geographical distribution of each species. For each species, we first randomly split the sample S into a training set S1 (70%) and a test set S2 (30%). We trained all algorithms on S1 and used the error on S2 to find the optimal value of the parameters λ and β. Next we again split the sample into a training set S 1 (70%) and a test set S 2 (30%). Using the best parameter values found on the previous step, each algorithm was trained on S 1 and the log-loss on S 2 was recorded. We repeated the last step 10 times and reported the average log-loss over these 10 runs. This experimental setup matches that of (Phillips et al., 2004; Structural Maxent Models 2006), with the exception of the validation step, which is omitted in both of these references. 4.2. Minnesota traffic modeling We also experimented with a Minnesota traffic dataset (Kwon, 2004) which is accessible from http://www.d. umn.edu/ tkwon/TMCdata/TMCarchive.html. This dataset contains traffic volume and velocity records collected by 769 road sensors over the period of 31 days with an interval of 30 seconds between observations. The input space X is the set of sensor locations and the task consists of estimating traffic density at each particular location based on the historical data. More precisely, we chose 11 times t1, . . . , t11 on the 31st day starting at midnight and separated by one hour. For computational efficiency, instead of using the whole history, we defined the raw features ψ1, . . . , ψd as the historical volume and velocity averages for each of the past 30 days, augmented with data from the past hour collected every 10 minutes. For any time tl, there are between 1,000 to 20,000 cars observed by all of 769 sensors. For each time tl, we randomly selected 70% of observations for training and the remaining observations were reserved for testing. Data from the first time t1 was used to determine the best parameters λ and β. The parameter values were then fixed and used for training on the remaining ten times t2, . . . , t11. We report log-loss on the test set averaged over these 10 time values. 4.3. Communities and crime We used the UCI Communities and Crime dataset as another test case for Struct Maxent algorithm. This dataset contains demographic and crime statistics for 2,215 US communities. Each community represents a point x in the input space X and we define base features ψj to be various demographic statistics. The goal is to model the likelihood of certain types of crimes based on the demographic statistics available. The data set includes 8 different types of crime: murder, rape, robbery, burglary, assault, larceny, auto theft and arson. For each type of crime, there are between 17,000 to 5,000,000 records. To speed up our experiments, we randomly sub-sampled 11 training sets of size 5,000 for each type of crime (with the remaining data used for testing). As before, the first training and test sets is used to determine the best values for the parameters λ and β, and we report the averaged log-loss on the test set when training with λ and β fixed at these values. 4.4. Results and discussion The results of our experiments are presented in Table 1. They show that Struct Maxent provides an improvement over L1-Maxent that is comparable to the improvement of L1-Maxent over standard Maxent models. All of our re- Table 2. Average AUC values. b. variegatus m. minutus Maxent 0.810 0.020 0.879 0.141 L1-Maxent 0.817 0.027 0.972 0.026 Struct Maxent 0.873 0.027 0.984 0.006 sults are statistically significant using paired t-test at 1% level. Furthermore, Struct Maxent outperforms other algorithms on each of the individual runs. Our experiments indicate that the performances of Struct Maxent1 and Struct Maxent2 are comparable both better than that of L1regularized and non-regularized Maxent in terms of logloss. Note that for the species habitat modeling experiments, Phillips et al. (2004; 2006) report only AUC (Area Under the ROC curve) values, which measure ranking quality, instead of the log-loss optimized for density estimation. They do so by treating the absence of any recorded observation at a particular location as a negative label. For completeness, we also report AUC results for our experiments in Table 2. Struct Maxent outperforms other methods in terms of AUC as well. The convergence of Struct Maxent2 is somewhat slower than that of Struct Maxent1, but both exhibit exponential convergence. Finally, note that the running time of our Struct Maxent algorithms is similar to that of L1-Maxent. 5. Conclusion We presented a new family of density estimation models, Structural Maxent models, which benefit from strong data-dependent learning guarantees and can be used even with complex feature families. Our experiments demonstrated the empirical advantage of these models. We also introduced a new family of conditional probability models for multi-class classification, structural conditional Maxent models, and showed them to coincide with deep boosting when using the logistic function as a surrogate loss. Our conditional structural Maxent principle provide additional support in favor of this family of algorithms. As with standard Maxent models (Lafferty, 1999), our structural Maxent models can be generalized by using an arbitrary Bregman divergence (Bregman, 1967) in place of the (unnormalized) relative entropy. Much of our analysis and theoretical guarantees straightforwardly extend to cover this generalization, modulo some additional assumptions on the properties of the Bregman divergence used. Acknowledgments We thank Tian Jiang for several discussions about this work and early experiments, and Slobodan Vucetic for facilitating our access to the Minnesota traffic dataset. This work was partly funded by the NSF award IIS-1117591 and the NSERC PGS D3 award. Structural Maxent Models Berger, Adam L., Pietra, Stephen Della, and Pietra, Vincent J. Della. A maximum entropy approach to natural language processing. Comp. Linguistics, 22(1), 1996. Boyd, Stephen P. and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004. Bregman, Lev M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200 217, 1967. Chen, Stanley F. and Rosenfeld, Ronald. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1), 2000. Cortes, Corinna, Mohri, Mehryar, and Syed, Umar. Deep boosting. In Proceedings of ICML, 2014. Cover, Thomas M. and Thomas, Joy M. Elements of Information Theory. Wiley-Interscience, 2006. Dud ık, Miroslav, Phillips, Steven J., and Schapire, Robert E. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 2007. Elith, Jane, Phillips, Steven J., Hastie, Trevor, Dud ık, Miroslav, Chee, Yung En, and Yates, Colin J. A statistical explanation of Max Ent for ecologists. Diversity and Distributions, 1, 2011. Goodman, Joshua. Exponential priors for maximum entropy models. In Proceedings of HLT-NAACL, 2004. Huang, Junzhou and Zhang, Tong. The benefit of group sparsity. The Annals of Statistics, 38(4):1978 2004, 08 2010. Jaynes, E. T. Information theory and statistical mechanics. Physical Review, 106(4):620 630, 1957. Jaynes, E. T. Papers on probability, statistics, and statistical physics. Synthese library. D. Reidel Pub. Co., 1983. Kazama, Jun ichi and Tsujii, Jun ichi. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of EMNLP, pp. 137 144, 2003. Koltchinskii, Vladmir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002. Kuznetsov, Vitaly, Mohri, Mehryar, and Syed, Umar. Multi-class deep boosting. In Proceedings of NIPS, 2014. Kwon, Taek M. TMC traffic data automation for Mn/DOT s traffic monitoring program. Univ. of Minnesota, Report no. Mn/DOT 2004-29, 2004. Lafferty, John D. Additive models, boosting, and inference for generalized divergences. In Proceedings of COLT, pp. 125 133, 1999. Lebanon, Guy and Lafferty, John D. Boosting and maximum likelihood for exponential models. In Proceedings of NIPS, pp. 447 454, 2001. Luo, Zhi-Quan and Tseng, Paul. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7 35, 1992. Malouf, Robert. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of Co NLL2002, pp. 49 55, 2002. Manning, Christopher D. and Klein, Dan. Optimization, maxent models, and conditional estimation without magic. In Proceedings of HLT-NAACL, 2003. Mansour, Yishay. Pessimistic decision tree pruning based on tree size. In Proceedings of ICML, 1997. Phillips, Steven J., Dud ık, Miroslav, and Schapire, Robert E. A maximum entropy approach to species distribution modeling. In Proceedings of ICML, 2004. Phillips, Steven J., Anderson, Robet P., and Schapire, Robert E. Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190, 2006. Pietra, Stephen Della, Pietra, Vincent J. Della, and Lafferty, John D. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell., 19(4), 1997. Ratnaparkhi, Adwait. Maximum entropy models for natural language processing. In Encyclopedia of Machine Learning, pp. 647 651. Springer, 2010. Rockafellar, R. Tyrrell. Convex analysis. Princeton University Press, 1997. Rosenfeld, Ronald. A maximum entropy approach to adaptive statistical language modelling. Computer Speech & Language, 10(3):187 228, 1996. Vapnik, Vladimir N. Statistical Learning Theory. Wiley Interscience, 1998. Williams, Peter M. Bayesian regularisation and pruning using a Laplace prior. Neural Computation, 7:117 143, 1994.