# Learning (from) Deep Hierarchical Structure among Features

Yu Zhang,¹ Lei Han²
¹HKUST, ²Tencent AI Lab
yu.zhang.ust@gmail.com, leihan.cs@gmail.com

## Abstract

Data features can usually be organized in a hierarchical structure that reflects the relations among them. Most previous studies that utilize this hierarchical structure to improve the performance of supervised learning tasks can only handle structures of a limited height, such as 2. In this paper, we propose a Deep Hierarchical Structure (DHS) method that handles hierarchical structures of arbitrary height with a convex objective function. The DHS method relies on the exponents of the edge weights in the hierarchical structure, but these exponents need to be given by users or set to be identical by default, which may be suboptimal. Based on the DHS method, we propose a variant that learns the exponents from data. Moreover, we consider the case where even the hierarchical structure is not available. Based on the DHS method, we propose a Learning Deep Hierarchical Structure (LDHS) method that can learn the hierarchical structure via a generalized fused-Lasso regularizer and a proposed sequential constraint. All the optimization problems are solved by proximal methods in which each subproblem has an efficient solution. Experiments on synthetic and real-world datasets show the effectiveness of the proposed methods.

## Introduction

Most previous studies that utilize the hierarchical structure among features, including the group Lasso (Yuan and Lin 2006) and the Hierarchical Penalization (HP) method (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007), can only handle hierarchical structures with a height of at most 2. In this paper, we aim to break this assumption by utilizing an available hierarchical structure of arbitrary height to help learn an accurate model. Moreover, in cases where the hierarchical structure is unavailable, we aim to learn such a hierarchical structure among features to improve the interpretability of the resulting learner.

Specifically, given a hierarchical structure that describes the relations among features, we propose a Deep Hierarchical Structure (DHS) method to utilize it. In the DHS method, each model parameter corresponding to a data feature is penalized by the product of the edge weights along the path from the root to the leaf node for that feature. Interestingly, when the exponents of the edge weights along the path from the root to each leaf node sum to 1, the proposed objective function can be proved to be convex regardless of the height of the hierarchical structure. Moreover, when all the exponents take the same value, we show that the proposed objective function is equivalent to a problem with a hierarchical group Lasso regularization term. To optimize the objective function of the DHS method, we adopt the FISTA algorithm (Beck and Teboulle 2009), each of whose subproblems has an efficient analytical solution. Moreover, in the proposed DHS method, the exponents of the edge weights need to be set based on a priori information. When this information is not available, by default we simply set them to be identical. Usually this strategy works, but it may be suboptimal.
To alleviate this problem, we propose a variant of the DHS method that learns the exponents from data. Moreover, we consider a more general case where the hierarchical structure is not available. A hierarchical structure can give us more insight into the relations among features, but learning it from data is a difficult problem. To the best of our knowledge, there is no prior work that directly learns the hierarchical structure among features. Here we give a first attempt based on the DHS method by proposing a Learning Deep Hierarchical Structure (LDHS) method. Given the height of the hierarchical structure, the LDHS method first assumes that the paths from the root to the leaf nodes corresponding to the data features do not share any node with each other, then uses a generalized fused-Lasso regularizer to encourage nodes to fuse at each height, and finally designs a sequential constraint to make the learned structure form a valid hierarchy. For optimization, we use the GIST algorithm (Gong et al. 2013) to solve the objective function of the LDHS method. Experiments on several synthetic and real-world datasets, with comparisons against several state-of-the-art baseline methods, show the effectiveness of the proposed models.

## Related Work

The Composite Absolute Penalties (CAP) method (Zhao, Rocha, and Yu 2009) learns from the hierarchical structure via group sparsity, but its objective function differs from those of the proposed methods. Moreover, the proposed LDHS method can learn the hierarchical structure while the CAP method cannot. The tree-guided group Lasso proposed in (Kim and Xing 2010) can learn from the hierarchical structure for multi-output regression, whose setting is different from ours. There are some methods (Bondell and Reich 2007; Hallac, Leskovec, and Boyd 2015; Figueiredo and Nowak 2016) that can learn the group structure among features but fail to learn the hierarchical structure, while the proposed LDHS method can do so.

## Notations

A hierarchical structure is said to be balanced if the paths from the root to every leaf node have the same length. An unbalanced hierarchical structure can easily be converted to a balanced one by adding internal nodes, as shown in Figure 1, so in this paper the hierarchical structure is always assumed to be balanced. For a hierarchical structure of height $m$, the root is at height 0, the children of the root are at height 1, and the leaf nodes are at height $m$. The number of nodes at height $i$ is denoted by $s_i$ and the nodes at height $i$ are labeled by 1 to $s_i$ from left to right. A node denoted by $N_j^i$ is the $j$th node at height $i$. The set of children of a node $N_j^i$ is denoted by $C_j^i$ and the number of children of $N_j^i$ is denoted by $d_j^{(i)}$. For each leaf node $N_i^m$, we define $d_i^{(m)} = 1$ for ease of notation. We define $F_j^i$ as the index of the parent node of $N_j^i$; that is, $F_j^i = k$ implies that $N_k^{i-1}$ is the parent node of $N_j^i$. The edge from a node $N_j^{i-1}$ to one of its children $N_k^i$ is denoted by $E_{j,k}^{(i)}$ and the weight of this edge is denoted by $\sigma_{j,k}^{(i)}$, where the superscript denotes the height at which the child node lies and the subscripts denote the indices of the parent and child nodes. The path from the root to the $i$th leaf node $N_i^m$ is denoted by a sequence of $m+1$ integers $P_i = \{i_0, \ldots, i_m\}$, where $i_0 = 1$, $i_m = i$, and node $N_{i_j}^j$ is on the path for $j = 0, \ldots, m$; we define $P_i^j = i_j$ as the index of the node at height $j$ on the path $P_i$.
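To make these notations concrete, the following sketch (ours, not from the paper; the dictionary layout, the helper name `build_paths`, and the leaves not specified in the Figure 1 example are assumptions) stores a balanced hierarchy by listing the children of every node and recovers the parent indices $F_j^i$ and the root-to-leaf paths $P_i$.

```python
def build_paths(children, m):
    """Given children[i][j] = indices (at height i+1) of the children of node N_j^i,
    return the parent indices F[i][j] (= F_j^i) and the root-to-leaf paths P[leaf]."""
    F = {i: {} for i in range(1, m + 1)}
    for i in range(m):
        for j, kids in children.get(i, {}).items():
            for k in kids:
                F[i + 1][k] = j

    P = {}  # P[i] = (i_0, ..., i_m) with i_0 = 1 (root) and i_m = i (leaf)
    for leaf in F[m]:
        path = [leaf]
        for h in range(m, 0, -1):
            path.append(F[h][path[-1]])
        P[leaf] = tuple(reversed(path))
    return F, P


# A small balanced hierarchy of height m = 2, consistent with the example
# discussed below: the root N_1^0 has children {1, 2, 3} and N_2^1 has
# children {3, 4, 5}; the remaining leaves are assumed for illustration only.
children = {
    0: {1: [1, 2, 3]},
    1: {1: [1, 2], 2: [3, 4, 5], 3: [6]},
}
F, P = build_paths(children, m=2)
print(F[1][2], F[2][3])   # 1 2   (F_2^1 = 1, F_3^2 = 2)
print(P[5])               # (1, 2, 5), i.e., P_5 = {1, 2, 5}
```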
In the bottom figure of Figure 1, we have $C_1^0 = \{1, 2, 3\}$, $d_1^{(0)} = 3$, $C_2^1 = \{3, 4, 5\}$, $d_2^{(1)} = 3$, $F_2^1 = 1$, and $F_3^2 = 2$. The path from the root to the leaf node $N_5^2$ is $P_5 = \{1, 2, 5\}$, where $P_5^0 = 1$, $P_5^1 = 2$, and $P_5^2 = 5$.

[Figure 1: Illustration of the hierarchical structure. The top figure shows a hierarchical structure and the bottom figure shows the equivalent balanced structure.]

## Learning from Deep Hierarchical Structure

Most existing works, such as the group Lasso and the HP method (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007), can only operate on a hierarchical structure of limited height. However, in many applications the hierarchical structure is much more complex. To improve the applicability, we present the proposed DHS method in this section.

### The Objective Function

Suppose the training dataset is denoted by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the $i$th data instance and $y_i$ is its label, and the linear learning function is denoted by $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$. Suppose that the features are organized in a balanced hierarchical structure of height $m$ where $m \geq 2$. Based on a loss function $l(\cdot, \cdot, \cdot)$, the objective function of the DHS method is formulated as

$$\min_{\mathbf{w}, \sigma}\ \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\sigma^{(j)}_{P_i^{j-1}, P_i^j}\big)^{\theta^{(j)}_{P_i^{j-1}, P_i^j}}} + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$$
$$\text{s.t.}\ \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \ \forall i \in [m], \qquad \sigma^{(i)}_{F_j^i, j} \geq 0\ \ \forall i, j, \tag{1}$$

where $w_i$ is the $i$th entry of $\mathbf{w}$, an edge weight $\sigma^{(j)}_{c,d}$ is defined in the previous section, $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector, $a/b$ for two scalars $a$ and $b$ is defined by continuation at zero as $a/0 = \infty$ if $a \neq 0$ and $0/0 = 0$, $[m]$ denotes the integer set from 1 to $m$, and the exponent $\theta^{(j)}_{P_i^{j-1}, P_i^j}$ can be viewed as the importance of the edge weight $\sigma^{(j)}_{P_i^{j-1}, P_i^j}$. The summand in the second term of the objective function in problem (1) penalizes each $w_i$ based on all the weights of the edges on the path from the root node to the $i$th leaf node as well as the exponents. Hence two coefficients $w_i$ and $w_j$ tend to receive similar penalizations if they share many edges in the hierarchical structure. As we will see in Theorem 3, when the exponents are identical to each other, this term is related to the group Lasso regularizer, which enforces group sparsity. The equality constraint in problem (1) restricts the scale of $\sigma$. To preserve the convexity of problem (1), as we will see in the next section, it is required that $\theta^{(j)}_{P_i^{j-1}, P_i^j} \geq 0\ \forall i, j$ and $\sum_{j=1}^m \theta^{(j)}_{P_i^{j-1}, P_i^j} = 1\ \forall i$, which means that the nonnegative exponents of all the edges along the path from the root to each leaf node sum to 1.

We first introduce a new family of convex functions, with the proof in the supplementary material.

**Theorem 1** $f(w, \mathbf{z}) = \frac{w^2}{\prod_{i=1}^m z_i^{\theta_i}}$ is jointly convex with respect to $w \in \mathbb{R}$ and $\mathbf{z} = (z_1, \ldots, z_m)^T \in \mathbb{R}^m$, where the $z_i$'s are required to be positive, given that $\theta_i \geq 0$ for $i = 1, \ldots, m$ and $\sum_{i=1}^m \theta_i = 1$.

When $m = 1$, Theorem 1 asserts that $f(w, z) = w^2/z$ is jointly convex with respect to $w$ and $z$ when $z > 0$, which is a well-known result (p. 72, Boyd and Vandenberghe 2004). When $m = 2$ and $\theta_1 = \theta_2 = 1/2$, Theorem 1 recovers Proposition 1 in (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007). Theorem 1 is more general since $m$ can be any positive integer and different $\theta_i$'s can take different values. Based on Theorem 1, we can prove the convexity of problem (1) in the following theorem.

**Theorem 2** Given that the loss function $l(\mathbf{x}, y, \mathbf{w})$ is convex with respect to $\mathbf{w}$, problem (1) is jointly convex with respect to $\mathbf{w}$ and $\sigma$.
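To illustrate how the regularizer in problem (1) is evaluated, here is a small sketch (ours, not the paper's implementation) that computes the second term of problem (1) given the edge weights, exponents, and root-to-leaf paths; the container layout, with `w`, `sigma`, and `theta` keyed by feature indices and edges, is a hypothetical choice.

```python
def dhs_penalty(w, sigma, theta, paths):
    """Evaluate sum_i w_i^2 / prod_j (sigma^{(j)}_{P_i^{j-1}, P_i^j}) ** theta^{(j)}_{P_i^{j-1}, P_i^j}.
    Assumed layout: w[i] is the coefficient of feature i, paths[i] is the tuple
    (i_0, ..., i_m) for leaf i (e.g., from the build_paths sketch above), and
    sigma[j][(parent, child)] / theta[j][(parent, child)] hold the weight and
    exponent of the edge entering height j."""
    total = 0.0
    for i, path in paths.items():
        denom = 1.0
        for j in range(1, len(path)):
            edge = (path[j - 1], path[j])
            denom *= sigma[j][edge] ** theta[j][edge]
        if w[i] == 0.0:
            total += 0.0                 # the paper's convention 0 / 0 = 0
        elif denom == 0.0:
            total += float("inf")        # the paper's convention a / 0 = inf for a != 0
        else:
            total += w[i] ** 2 / denom
    return total
```

Two coefficients whose paths share many edges share most factors of their denominators, which is exactly how problem (1) couples related features.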
To see the effect of the regularizer in the second term of problem (1), we investigate two special cases, where $m$ equals 2 or 3. When $m = 2$, problem (1) degenerates to the HP method (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007), which shows that when all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s equal $\frac{1}{2}$, problem (1) is equivalent to the $\ell_{4/3,1}$-regularized group Lasso. When $m = 3$, we can derive an equivalent formulation of problem (1) as follows.

**Theorem 3** When $m = 3$ and all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s equal $\frac{1}{3}$, problem (1) is equivalent to the following problem:

$$\min_{\mathbf{w}}\ \lambda_1 \bigg(\sum_{i=1}^{s_1} \big(d_i^{(1)}\big)^{\frac{1}{6}} \Big(\sum_{j \in C_i^1} \big(d_j^{(2)}\big)^{\frac{1}{5}} \big\|\mathbf{w}_{C_j^2}\big\|_{\frac{3}{2}}^{\frac{6}{5}}\Big)^{\frac{5}{6}}\bigg)^2 + \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2, \tag{2}$$

where $\mathbf{w}_{C_j^2}$ denotes the subvector of $\mathbf{w}$ corresponding to the leaf nodes in $C_j^2$.

According to Theorem 3, the second term in the objective function of problem (1) can be converted into the first term of problem (2), which places the $\ell_{3/2}$ norm on the model parameters corresponding to the leaf nodes sharing the same parent node, then places the weighted $\ell_{6/5}$ norm on the internal nodes at height 2 having the same parent node, and finally computes the squared weighted sum over the internal nodes at height 1. This regularizer can be viewed as a hierarchical group Lasso where, at each height, the weights corresponding to nodes sharing the same parent node are combined via some norm. In general, for any positive integer $m$, when all the exponents of the different $\sigma^{(i)}_{j,k}$'s take the same value (i.e., $1/m$), we can always find the explicit form of the second term in the objective function of problem (1) in a way similar to the proof of Theorem 3.

### Optimization

Since problem (1) is convex, we use the FISTA method (Beck and Teboulle 2009) to solve it. We use a variable $\phi$ to denote the concatenation of $\mathbf{w}$ and $\sigma$. We define

$$f(\phi) = \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\sigma^{(j)}_{P_i^{j-1}, P_i^j}\big)^{\theta^{(j)}_{P_i^{j-1}, P_i^j}}} \quad \text{and} \quad g(\phi) = \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2,$$

and define the set of constraints on $\phi$ as

$$S_\phi = \Big\{\phi \,\Big|\, \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \forall i \in [m],\ \sigma^{(i)}_{F_j^i, j} \geq 0\ \forall i, j\Big\}.$$

The FISTA algorithm does not minimize the original composite objective function $F(\phi) = f(\phi) + g(\phi)$ directly but instead solves a surrogate problem:

$$q_r(\hat\phi) = \arg\min_{\phi \in S_\phi} Q_r(\phi, \hat\phi), \quad \text{where}\quad Q_r(\phi, \hat\phi) = g(\phi) + f(\hat\phi) + (\phi - \hat\phi)^T \nabla_\phi f(\hat\phi) + \frac{r}{2}\|\phi - \hat\phi\|_2^2$$

and $\nabla_\phi f(\hat\phi)$ denotes the derivative of $f(\phi)$ with respect to $\phi$ at $\phi = \hat\phi$. Hence, in the FISTA algorithm, we just need to minimize $Q_r(\phi, \hat\phi)$ with respect to $\phi \in S_\phi$. Specifically, we need to solve the following problem:

$$\min_{\mathbf{w}, \sigma}\ \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \frac{r}{2}\|\mathbf{w} - \tilde{\mathbf{w}}\|_2^2 + \frac{r}{2}\|\sigma - \tilde\sigma\|_2^2 \quad \text{s.t.}\ \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \forall i \in [m],\ \sigma^{(i)}_{F_j^i, j} \geq 0\ \forall i, j, \tag{3}$$

where $r$ is a step size that can be determined by the FISTA algorithm, $\tilde{\mathbf{w}} = \hat{\mathbf{w}} - \frac{1}{r}\nabla_{\mathbf{w}} f(\hat{\mathbf{w}})$, and $\tilde\sigma = \hat\sigma - \frac{1}{r}\nabla_{\sigma} f(\hat\sigma)$. It is easy to see that the solution for $\mathbf{w}$ in problem (3) is $\mathbf{w} = \frac{r}{\lambda_2 + r}\tilde{\mathbf{w}}$. For $\sigma$ in problem (3), the corresponding problem can be decomposed into independent subproblems, one for each height, with each one solving a problem with respect to $\{\sigma^{(i)}_{F_j^i, j}\}_{j=1}^{s_i}$, and these subproblems share the same formulation:

$$\min_{\rho}\ \|\rho - \hat\rho\|_2^2 \quad \text{s.t.}\ \rho \succeq \mathbf{0},\ \mathbf{a}^T\rho = 1, \tag{4}$$

where $\mathbf{0}$ denotes a zero vector of appropriate size, $\succeq$ denotes the elementwise inequality between two vectors, and $\mathbf{a}$ is a constant vector with all entries positive. Problem (4) is a quadratic programming (QP) problem and can be handled by an off-the-shelf QP solver. To accelerate the optimization, we propose a more efficient solution for problem (4) by solving its dual form; the detailed procedure is provided in the supplementary material.

### A Variant to Learn Exponents

In the DHS method, we need to manually set the exponents $\{\theta^{(j)}_{P_i^{j-1}, P_i^j}\}$.
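The paper's efficient dual procedure for problem (4) is in the supplementary material; as a simple stand-in, the sketch below (ours, not the paper's procedure) solves the same QP by bisection on the multiplier $\mu$ of the equality constraint, using the KKT stationarity condition $\rho_i = \max(0, \hat\rho_i - \mu a_i / 2)$.

```python
import numpy as np

def solve_subproblem_4(rho_hat, a, tol=1e-10):
    """Solve  min_rho ||rho - rho_hat||_2^2  s.t.  rho >= 0,  a^T rho = 1,
    with a > 0 elementwise, by bisection on the multiplier mu of the equality
    constraint; the KKT conditions give rho_i = max(0, rho_hat_i - mu * a_i / 2)."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    a = np.asarray(a, dtype=float)

    def weighted_sum(mu):
        return a @ np.maximum(0.0, rho_hat - 0.5 * mu * a)

    # a^T rho(mu) is continuous and non-increasing in mu, so bracket then bisect.
    lo, hi = -1.0, 1.0
    while weighted_sum(lo) < 1.0:
        lo *= 2.0
    while weighted_sum(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_sum(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(0.0, rho_hat - 0.5 * hi * a)

# Example: project an arbitrary point onto {rho >= 0, a^T rho = 1}.
rho = solve_subproblem_4([0.9, -0.2, 0.4], [3.0, 2.0, 1.0])
print(rho, np.dot([3.0, 2.0, 1.0], rho))   # the second value should be close to 1.0
```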
By default, we usually set all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s equal to $\frac{1}{m}$, which satisfies the requirement that guarantees the convexity of problem (1). However, this setting may be suboptimal. In this section, we propose the DHSe method, a variant of the DHS method, to learn the exponents from data directly. The objective function of the DHSe method is formulated as

$$\min_{\mathbf{w}, \sigma, \theta}\ \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\sigma^{(j)}_{P_i^{j-1}, P_i^j}\big)^{\theta^{(j)}_{P_i^{j-1}, P_i^j}}} + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$$
$$\text{s.t.}\ \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \forall i \in [m], \quad \sigma^{(i)}_{F_j^i, j} \geq 0\ \forall i, j,$$
$$\sum_{j=1}^m \theta^{(j)}_{P_i^{j-1}, P_i^j} = 1\ \forall i \in [d], \quad \theta^{(j)}_{P_i^{j-1}, P_i^j} \geq 0\ \forall i, j. \tag{5}$$

Different from problem (1), where all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s are constants, the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s in problem (5) are variables to be optimized. The equality and inequality constraints with respect to $\theta$ in problem (5) satisfy the requirements imposed on the constant exponents in problem (1), and they form an $(m-1)$-dimensional simplex for each feature. Different from problem (1), which is convex, problem (5) can be proved to be non-convex with respect to all variables, and hence we use the GIST algorithm (Gong et al. 2013) to solve it. Due to the page limit, we put the detailed optimization procedure in the supplementary material.

## Learning Deep Hierarchical Structure

In some applications, the hierarchical structure is not available. In this section, we propose the LDHS method to learn both the hierarchical structure and the model parameters from data directly.

We assume that the height of the hierarchical structure to be learned is given as $m$. Here we use slightly different notations to define the hierarchical structure. The weights of the edges on the path from the root node to the $i$th leaf node, which corresponds to the $i$th feature, are denoted by $\{\omega_i^{(1)}, \ldots, \omega_i^{(m)}\}$, where $\omega_i^{(j)}$ denotes the weight of the edge connecting heights $j-1$ and $j$ on that path. At the beginning, we assume that the paths from the root node to any two different leaf nodes do not share any edge. When $\omega_j^{(i)}$ equals $\omega_k^{(i)}$ for some $i$, $j$ and $k$, we can view it as a sign that the two paths from the root node to the $j$th and $k$th leaf nodes become fused at height $i$; then, in order to keep the whole structure a valid hierarchy, it is required that the subpaths of the two paths above height $i$ are also fused, implying that $\omega_j^{(i')}$ must always equal $\omega_k^{(i')}$ when $i' \leq i$. So an algorithm that learns a valid hierarchical structure should satisfy the following two requirements: (1) it should have the ability to enforce $\omega_j^{(i)}$ to be equal to $\omega_k^{(i)}$ for some $i$, $j$ and $k$; (2) it should guarantee that when $\omega_j^{(i)}$ equals $\omega_k^{(i)}$, $\omega_j^{(i')}$ equals $\omega_k^{(i')}$ for all $i' < i$. Here we present the objective function of the LDHS method, which satisfies these two requirements:

$$\min_{\mathbf{w}, \omega}\ \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\omega_i^{(j)}\big)^{\frac{1}{m}}} + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^m \eta_i \sum_{j<k} \big|\omega_j^{(i)} - \omega_k^{(i)}\big|$$
$$\text{s.t.}\ \sum_{j=1}^{d} \omega_j^{(i)} = 1\ \forall i \in [m], \quad \omega_j^{(i)} \geq 0\ \forall i, j, \quad \big|\omega_j^{(1)} - \omega_k^{(1)}\big| \leq \cdots \leq \big|\omega_j^{(m)} - \omega_k^{(m)}\big|\ \forall j, k. \tag{6}$$

When the hierarchical structure is available, or equivalently $\omega$ is given, problem (6) becomes problem (1), and hence the LDHS method is a generalization of the DHS method that additionally learns the hierarchical structure. Even though the objective function of problem (6) is convex based on Theorem 1, the whole problem is non-convex due to the sequential inequality constraint, and we also use the GIST method to solve it. We still use the variable $\phi$ to denote the concatenation of $\mathbf{w}$ and $\omega$. We define the set of constraints on $\phi$ as

$$S_\phi = \Big\{\phi \,\Big|\, \sum_{j=1}^{d} \omega_j^{(i)} = 1\ \forall i \in [m],\ \omega_j^{(i)} \geq 0\ \forall i, j,\ \big|\omega_k^{(1)} - \omega_j^{(1)}\big| \leq \cdots \leq \big|\omega_k^{(m)} - \omega_j^{(m)}\big|\ \forall j, k\Big\}.$$

We define $g(\phi) = \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^m \eta_i \sum_{j<k} \big|\omega_j^{(i)} - \omega_k^{(i)}\big|$.
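Assuming the edge weights are stored as an $m \times d$ array with `omega[i-1, j-1]` $= \omega_j^{(i)}$ (a layout of our choosing), the following sketch checks the sequential constraint and reads off a learned hierarchy by grouping features whose edge weights coincide at each height; it is an illustration of the two requirements above, not the paper's code.

```python
import numpy as np

def satisfies_sequential_constraint(omega, tol=1e-8):
    """Check |omega_j^(1) - omega_k^(1)| <= ... <= |omega_j^(m) - omega_k^(m)| for all j, k,
    i.e., the gaps between two paths never shrink as the height increases."""
    m, d = omega.shape
    for j in range(d):
        for k in range(j + 1, d):
            gaps = np.abs(omega[:, j] - omega[:, k])
            if np.any(np.diff(gaps) < -tol):
                return False
    return True

def recover_hierarchy(omega, tol=1e-8):
    """Group features whose edge weights are (numerically) equal at each height:
    features j and k are fused at height i iff omega_j^(i) == omega_k^(i).
    Returns, for each height, the list of feature groups (the nodes at that height)."""
    m, d = omega.shape
    levels = []
    for i in range(m):
        groups = []
        for j in range(d):
            for g in groups:
                if abs(omega[i, j] - omega[i, g[0]]) <= tol:
                    g.append(j)
                    break
            else:
                groups.append([j])
        levels.append(groups)
    return levels
```

Under the sequential constraint the groups become finer as the height increases, so the per-height groupings nest into a valid tree.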