# Learning (from) Deep Hierarchical Structure among Features

Yu Zhang,¹ Lei Han²
¹HKUST, ²Tencent AI Lab
yu.zhang.ust@gmail.com, leihan.cs@gmail.com

## Abstract

Data features can usually be organized in a hierarchical structure that reflects the relations among them. Most previous studies that utilize this hierarchical structure to improve the performance of supervised learning tasks can only handle structures of a limited height, such as 2. In this paper, we propose a Deep Hierarchical Structure (DHS) method that handles hierarchical structures of arbitrary height with a convex objective function. The DHS method relies on the exponents of the edge weights in the hierarchical structure, but these exponents need to be given by users or set to be identical by default, which may be suboptimal. Based on the DHS method, we propose a variant that learns the exponents from data. Moreover, we consider the case where even the hierarchical structure is not available. Based on the DHS method, we propose a Learning Deep Hierarchical Structure (LDHS) method that can learn the hierarchical structure via a generalized fused-Lasso regularizer and a proposed sequential constraint. All the optimization problems are solved by proximal methods in which each subproblem has an efficient solution. Experiments on synthetic and real-world datasets show the effectiveness of the proposed methods.

## Introduction

Most previous studies that utilize the hierarchical structure among features, including the group Lasso (Yuan and Lin 2006) and the Hierarchical Penalization (HP) method (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007), can only handle hierarchical structures with a height of at most 2. In this paper, we aim to break this assumption by utilizing an available hierarchical structure of arbitrary height to help learn an accurate model. Moreover, in cases where the hierarchical structure is unavailable, we aim to learn such a hierarchical structure among features to improve the interpretability of the resulting learner.

Specifically, given a hierarchical structure that describes the relations among features, we propose a Deep Hierarchical Structure (DHS) method to utilize it. In the DHS method, each model parameter corresponding to a data feature is penalized by the product of the edge weights along the path from the root to the leaf node for that feature. Interestingly, when the exponents of the edge weights along the path from the root to each leaf node sum to 1, the proposed objective function can be proved to be convex regardless of the height of the hierarchical structure. Moreover, when all the exponents take the same value, we show that the proposed objective function is equivalent to a problem with a hierarchical group Lasso regularization term. To optimize the objective function of the DHS method, we adopt the FISTA algorithm (Beck and Teboulle 2009), each of whose subproblems has an efficient analytical solution. Moreover, in the proposed DHS method, the exponents of the edge weights need to be set based on a priori information. When this information is not available, by default we simply set them to be identical. Usually this strategy works, but it may be suboptimal.
To alleviate this problem, we propose a variant of the DHS method that learns the exponents from data. Moreover, we consider a more general case where the hierarchical structure is not available. A hierarchical structure can give us more insight into the relations among features, but learning it from data is a difficult problem. To the best of our knowledge, there is no prior work that directly learns the hierarchical structure among features. Here we give a first attempt based on the DHS method by proposing a Learning Deep Hierarchical Structure (LDHS) method. Given the height of the hierarchical structure, the LDHS method first assumes that the paths from the root to the leaf nodes corresponding to the data features do not share any node with each other, then uses a generalized fused-Lasso regularizer to encourage nodes to fuse at each height, and finally designs a sequential constraint to make the learned structure form a valid hierarchy. For optimization, we use the GIST algorithm (Gong et al. 2013) to solve the objective function of the LDHS method. Experiments on several synthetic and real-world datasets, with comparisons against several state-of-the-art baseline methods, show the effectiveness of the proposed models.

## Related Work

The Composite Absolute Penalties (CAP) method (Zhao, Rocha, and Yu 2009) learns from the hierarchical structure via group sparsity, but its objective function differs from those of the proposed methods. Moreover, the proposed LDHS method can learn the hierarchical structure while the CAP method cannot. The tree-guided group Lasso proposed in (Kim and Xing 2010) can learn from the hierarchical structure for multi-output regression, whose setting is different from ours. There are some methods (Bondell and Reich 2007; Hallac, Leskovec, and Boyd 2015; Figueiredo and Nowak 2016) that can learn the group structure among features but fail to learn the hierarchical structure, while the proposed LDHS method can do so.

## Notations

A hierarchical structure is said to be balanced if the paths from the root to every leaf node have the same length. An unbalanced hierarchical structure can easily be converted to a balanced one by adding internal nodes, as shown in Figure 1, so in this paper the hierarchical structure is always assumed to be balanced. For a hierarchical structure of height $m$, the root is at height 0, the children of the root are at height 1, and the leaf nodes are at height $m$. The number of nodes at height $i$ is denoted by $s_i$ and the nodes at height $i$ are labeled by 1 to $s_i$ from left to right. A node denoted by $N_j^i$ is the $j$th node at height $i$. The set of children of a node $N_j^i$ is denoted by $C_j^i$ and the number of children of $N_j^i$ is denoted by $d_j^{(i)}$. For each leaf node $N_i^m$, we define $d_i^{(m)} = 1$ for ease of notation. We define $F_j^i$ as the index of the parent node of $N_j^i$; that is, $F_j^i = k$ implies that $N_k^{i-1}$ is the parent node of $N_j^i$. The edge from a node $N_j^{i-1}$ to one of its children $N_k^i$ is denoted by $E_{j,k}^{(i)}$ and the weight of this edge is denoted by $\sigma_{j,k}^{(i)}$, where the superscript denotes the height at which the child node lies and the subscripts denote the indices of the parent and child nodes. The path from the root to the $i$th leaf node $N_i^m$ is denoted by a sequence of $m+1$ integers $P_i = \{i_0, \ldots, i_m\}$, where $i_0 = 1$, $i_m = i$, and node $N_{i_j}^j$ is on the path for $j = 0, \ldots, m$; we define $P_i^j = i_j$ as the index of the node at height $j$ on the path $P_i$.
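To make these notations concrete, the following sketch (ours, not from the paper; the dictionary layout, the helper name `build_paths`, and the leaves not specified in the Figure 1 example are assumptions) stores a balanced hierarchy by listing the children of every node and recovers the parent indices $F_j^i$ and the root-to-leaf paths $P_i$.

```python
def build_paths(children, m):
    """Given children[i][j] = indices (at height i+1) of the children of node N_j^i,
    return the parent indices F[i][j] (= F_j^i) and the root-to-leaf paths P[leaf]."""
    F = {i: {} for i in range(1, m + 1)}
    for i in range(m):
        for j, kids in children.get(i, {}).items():
            for k in kids:
                F[i + 1][k] = j

    P = {}  # P[i] = (i_0, ..., i_m) with i_0 = 1 (root) and i_m = i (leaf)
    for leaf in F[m]:
        path = [leaf]
        for h in range(m, 0, -1):
            path.append(F[h][path[-1]])
        P[leaf] = tuple(reversed(path))
    return F, P


# A small balanced hierarchy of height m = 2, consistent with the example
# discussed below: the root N_1^0 has children {1, 2, 3} and N_2^1 has
# children {3, 4, 5}; the remaining leaves are assumed for illustration only.
children = {
    0: {1: [1, 2, 3]},
    1: {1: [1, 2], 2: [3, 4, 5], 3: [6]},
}
F, P = build_paths(children, m=2)
print(F[1][2], F[2][3])   # 1 2   (F_2^1 = 1, F_3^2 = 2)
print(P[5])               # (1, 2, 5), i.e., P_5 = {1, 2, 5}
```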
In the bottom figure of Figure 1, we have $C_1^0 = \{1, 2, 3\}$, $d_1^{(0)} = 3$, $C_2^1 = \{3, 4, 5\}$, $d_2^{(1)} = 3$, $F_2^1 = 1$, and $F_3^2 = 2$. The path from the root to the leaf node $N_5^2$ is $P_5 = \{1, 2, 5\}$, where $P_5^0 = 1$, $P_5^1 = 2$, and $P_5^2 = 5$.

[Figure 1: Illustration of the hierarchical structure. The top figure shows a hierarchical structure and the bottom figure shows the equivalent balanced structure.]

## Learning from Deep Hierarchical Structure

Most existing works, such as the group Lasso and the HP method (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007), can only operate on a hierarchical structure of limited height. However, in many applications the hierarchical structure is much more complex. To improve the applicability, we present the proposed DHS method in this section.

### The Objective Function

Suppose the training dataset is denoted by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the $i$th data instance and $y_i$ is its label, and the linear learning function is denoted by $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$. Suppose that the features are organized in a balanced hierarchical structure of height $m$ where $m \geq 2$. Based on a loss function $l(\cdot, \cdot, \cdot)$, the objective function of the DHS method is formulated as

$$\min_{\mathbf{w}, \sigma}\ \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\sigma^{(j)}_{P_i^{j-1}, P_i^j}\big)^{\theta^{(j)}_{P_i^{j-1}, P_i^j}}} + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$$
$$\text{s.t.}\ \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \ \forall i \in [m], \qquad \sigma^{(i)}_{F_j^i, j} \geq 0\ \ \forall i, j, \tag{1}$$

where $w_i$ is the $i$th entry of $\mathbf{w}$, an edge weight $\sigma^{(j)}_{c,d}$ is defined in the previous section, $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector, $a/b$ for two scalars $a$ and $b$ is defined by continuation at zero as $a/0 = \infty$ if $a \neq 0$ and $0/0 = 0$, $[m]$ denotes the integer set from 1 to $m$, and the exponent $\theta^{(j)}_{P_i^{j-1}, P_i^j}$ can be viewed as the importance of the edge weight $\sigma^{(j)}_{P_i^{j-1}, P_i^j}$. The summand in the second term of the objective function in problem (1) penalizes each $w_i$ based on all the weights of the edges on the path from the root node to the $i$th leaf node as well as the exponents. Hence two coefficients $w_i$ and $w_j$ tend to receive similar penalizations if they share many edges in the hierarchical structure. As we will see in Theorem 3, when the exponents are identical to each other, this term is related to the group Lasso regularizer, which enforces group sparsity. The equality constraint in problem (1) restricts the scale of $\sigma$. To preserve the convexity of problem (1), as we will see in the next section, it is required that $\theta^{(j)}_{P_i^{j-1}, P_i^j} \geq 0\ \forall i, j$ and $\sum_{j=1}^m \theta^{(j)}_{P_i^{j-1}, P_i^j} = 1\ \forall i$, which means that the nonnegative exponents of all the edges along the path from the root to each leaf node sum to 1.

We first introduce a new family of convex functions, with the proof in the supplementary material.

**Theorem 1** $f(w, \mathbf{z}) = \frac{w^2}{\prod_{i=1}^m z_i^{\theta_i}}$ is jointly convex with respect to $w \in \mathbb{R}$ and $\mathbf{z} = (z_1, \ldots, z_m)^T \in \mathbb{R}^m$, where the $z_i$'s are required to be positive, given that $\theta_i \geq 0$ for $i = 1, \ldots, m$ and $\sum_{i=1}^m \theta_i = 1$.

When $m = 1$, Theorem 1 asserts that $f(w, z) = w^2/z$ is jointly convex with respect to $w$ and $z$ when $z > 0$, which is a well-known result (p. 72, Boyd and Vandenberghe 2004). When $m = 2$ and $\theta_1 = \theta_2 = 1/2$, Theorem 1 recovers Proposition 1 in (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007). Theorem 1 is more general since $m$ can be any positive integer and different $\theta_i$'s can take different values. Based on Theorem 1, we can prove the convexity of problem (1) in the following theorem.

**Theorem 2** Given that the loss function $l(\mathbf{x}, y, \mathbf{w})$ is convex with respect to $\mathbf{w}$, problem (1) is jointly convex with respect to $\mathbf{w}$ and $\sigma$.
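To illustrate how the regularizer in problem (1) is evaluated, here is a small sketch (ours, not the paper's implementation) that computes the second term of problem (1) given the edge weights, exponents, and root-to-leaf paths; the container layout, with `w`, `sigma`, and `theta` keyed by feature indices and edges, is a hypothetical choice.

```python
def dhs_penalty(w, sigma, theta, paths):
    """Evaluate sum_i w_i^2 / prod_j (sigma^{(j)}_{P_i^{j-1}, P_i^j}) ** theta^{(j)}_{P_i^{j-1}, P_i^j}.
    Assumed layout: w[i] is the coefficient of feature i, paths[i] is the tuple
    (i_0, ..., i_m) for leaf i (e.g., from the build_paths sketch above), and
    sigma[j][(parent, child)] / theta[j][(parent, child)] hold the weight and
    exponent of the edge entering height j."""
    total = 0.0
    for i, path in paths.items():
        denom = 1.0
        for j in range(1, len(path)):
            edge = (path[j - 1], path[j])
            denom *= sigma[j][edge] ** theta[j][edge]
        if w[i] == 0.0:
            total += 0.0                 # the paper's convention 0 / 0 = 0
        elif denom == 0.0:
            total += float("inf")        # the paper's convention a / 0 = inf for a != 0
        else:
            total += w[i] ** 2 / denom
    return total
```

Two coefficients whose paths share many edges share most factors of their denominators, which is exactly how problem (1) couples related features.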
To see the effect of the regularizer in the second term of problem (1), we investigate two special cases, where $m$ equals 2 or 3. When $m = 2$, problem (1) degenerates to the HP method (Szafranski, Grandvalet, and Morizet-Mahoudeaux 2007), which shows that when all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s equal $\frac{1}{2}$, problem (1) is equivalent to the $\ell_{4/3,1}$-regularized group Lasso. When $m = 3$, we can derive an equivalent formulation of problem (1) as follows.

**Theorem 3** When $m = 3$ and all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s equal $\frac{1}{3}$, problem (1) is equivalent to the following problem:

$$\min_{\mathbf{w}}\ \lambda_1 \bigg(\sum_{i=1}^{s_1} \big(d_i^{(1)}\big)^{\frac{1}{6}} \Big(\sum_{j \in C_i^1} \big(d_j^{(2)}\big)^{\frac{1}{5}} \big\|\mathbf{w}_{C_j^2}\big\|_{\frac{3}{2}}^{\frac{6}{5}}\Big)^{\frac{5}{6}}\bigg)^2 + \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2, \tag{2}$$

where $\mathbf{w}_{C_j^2}$ denotes the subvector of $\mathbf{w}$ corresponding to the leaf nodes in $C_j^2$.

According to Theorem 3, the second term in the objective function of problem (1) can be converted into the first term of problem (2), which places the $\ell_{3/2}$ norm on the model parameters corresponding to the leaf nodes sharing the same parent node, then places the weighted $\ell_{6/5}$ norm on the internal nodes at height 2 having the same parent node, and finally computes the squared weighted sum over the internal nodes at height 1. This regularizer can be viewed as a hierarchical group Lasso where, at each height, the weights corresponding to nodes sharing the same parent node are combined via some norm. In general, for any positive integer $m$, when all the exponents of the different $\sigma^{(i)}_{j,k}$'s take the same value (i.e., $1/m$), we can always find the explicit form of the second term in the objective function of problem (1) in a way similar to the proof of Theorem 3.

### Optimization

Since problem (1) is convex, we use the FISTA method (Beck and Teboulle 2009) to solve it. We use a variable $\phi$ to denote the concatenation of $\mathbf{w}$ and $\sigma$. We define

$$f(\phi) = \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\sigma^{(j)}_{P_i^{j-1}, P_i^j}\big)^{\theta^{(j)}_{P_i^{j-1}, P_i^j}}} \quad \text{and} \quad g(\phi) = \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2,$$

and define the set of constraints on $\phi$ as

$$S_\phi = \Big\{\phi \,\Big|\, \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \forall i \in [m],\ \sigma^{(i)}_{F_j^i, j} \geq 0\ \forall i, j\Big\}.$$

The FISTA algorithm does not minimize the original composite objective function $F(\phi) = f(\phi) + g(\phi)$ directly but instead solves a surrogate problem:

$$q_r(\hat\phi) = \arg\min_{\phi \in S_\phi} Q_r(\phi, \hat\phi), \quad \text{where}\quad Q_r(\phi, \hat\phi) = g(\phi) + f(\hat\phi) + (\phi - \hat\phi)^T \nabla_\phi f(\hat\phi) + \frac{r}{2}\|\phi - \hat\phi\|_2^2$$

and $\nabla_\phi f(\hat\phi)$ denotes the derivative of $f(\phi)$ with respect to $\phi$ at $\phi = \hat\phi$. Hence, in the FISTA algorithm, we just need to minimize $Q_r(\phi, \hat\phi)$ with respect to $\phi \in S_\phi$. Specifically, we need to solve the following problem:

$$\min_{\mathbf{w}, \sigma}\ \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \frac{r}{2}\|\mathbf{w} - \tilde{\mathbf{w}}\|_2^2 + \frac{r}{2}\|\sigma - \tilde\sigma\|_2^2 \quad \text{s.t.}\ \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \forall i \in [m],\ \sigma^{(i)}_{F_j^i, j} \geq 0\ \forall i, j, \tag{3}$$

where $r$ is a step size that can be determined by the FISTA algorithm, $\tilde{\mathbf{w}} = \hat{\mathbf{w}} - \frac{1}{r}\nabla_{\mathbf{w}} f(\hat{\mathbf{w}})$, and $\tilde\sigma = \hat\sigma - \frac{1}{r}\nabla_{\sigma} f(\hat\sigma)$. It is easy to see that the solution for $\mathbf{w}$ in problem (3) is $\mathbf{w} = \frac{r}{\lambda_2 + r}\tilde{\mathbf{w}}$. For $\sigma$ in problem (3), the corresponding problem can be decomposed into independent subproblems, one for each height, with each one solving a problem with respect to $\{\sigma^{(i)}_{F_j^i, j}\}_{j=1}^{s_i}$, and these subproblems share the same formulation:

$$\min_{\rho}\ \|\rho - \hat\rho\|_2^2 \quad \text{s.t.}\ \rho \succeq \mathbf{0},\ \mathbf{a}^T\rho = 1, \tag{4}$$

where $\mathbf{0}$ denotes a zero vector of appropriate size, $\succeq$ denotes the elementwise inequality between two vectors, and $\mathbf{a}$ is a constant vector with all entries positive. Problem (4) is a quadratic programming (QP) problem and can be handled by an off-the-shelf QP solver. To accelerate the optimization, we propose a more efficient solution for problem (4) by solving its dual form; the detailed procedure is provided in the supplementary material.

### A Variant to Learn Exponents

In the DHS method, we need to manually set the exponents $\{\theta^{(j)}_{P_i^{j-1}, P_i^j}\}$.
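The paper's efficient dual procedure for problem (4) is in the supplementary material; as a simple stand-in, the sketch below (ours, not the paper's procedure) solves the same QP by bisection on the multiplier $\mu$ of the equality constraint, using the KKT stationarity condition $\rho_i = \max(0, \hat\rho_i - \mu a_i / 2)$.

```python
import numpy as np

def solve_subproblem_4(rho_hat, a, tol=1e-10):
    """Solve  min_rho ||rho - rho_hat||_2^2  s.t.  rho >= 0,  a^T rho = 1,
    with a > 0 elementwise, by bisection on the multiplier mu of the equality
    constraint; the KKT conditions give rho_i = max(0, rho_hat_i - mu * a_i / 2)."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    a = np.asarray(a, dtype=float)

    def weighted_sum(mu):
        return a @ np.maximum(0.0, rho_hat - 0.5 * mu * a)

    # a^T rho(mu) is continuous and non-increasing in mu, so bracket then bisect.
    lo, hi = -1.0, 1.0
    while weighted_sum(lo) < 1.0:
        lo *= 2.0
    while weighted_sum(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_sum(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(0.0, rho_hat - 0.5 * hi * a)

# Example: project an arbitrary point onto {rho >= 0, a^T rho = 1}.
rho = solve_subproblem_4([0.9, -0.2, 0.4], [3.0, 2.0, 1.0])
print(rho, np.dot([3.0, 2.0, 1.0], rho))   # the second value should be close to 1.0
```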
By default, we usually set all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s equal to $\frac{1}{m}$, which satisfies the requirement that guarantees the convexity of problem (1). However, this setting may be suboptimal. In this section, we propose the DHSe method, a variant of the DHS method, to learn the exponents from data directly. The objective function of the DHSe method is formulated as

$$\min_{\mathbf{w}, \sigma, \theta}\ \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\sigma^{(j)}_{P_i^{j-1}, P_i^j}\big)^{\theta^{(j)}_{P_i^{j-1}, P_i^j}}} + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$$
$$\text{s.t.}\ \sum_{j=1}^{s_i} d_j^{(i)} \sigma^{(i)}_{F_j^i, j} = 1\ \forall i \in [m], \quad \sigma^{(i)}_{F_j^i, j} \geq 0\ \forall i, j,$$
$$\sum_{j=1}^m \theta^{(j)}_{P_i^{j-1}, P_i^j} = 1\ \forall i \in [d], \quad \theta^{(j)}_{P_i^{j-1}, P_i^j} \geq 0\ \forall i, j. \tag{5}$$

Different from problem (1), where all the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s are constants, the $\theta^{(j)}_{P_i^{j-1}, P_i^j}$'s in problem (5) are variables to be optimized. The equality and inequality constraints with respect to $\theta$ in problem (5) satisfy the requirements imposed on the constant exponents in problem (1), and they form an $(m-1)$-dimensional simplex for each feature. Different from problem (1), which is convex, problem (5) can be proved to be non-convex with respect to all variables, and hence we use the GIST algorithm (Gong et al. 2013) to solve it. Due to the page limit, we put the detailed optimization procedure in the supplementary material.

## Learning Deep Hierarchical Structure

In some applications, the hierarchical structure is not available. In this section, we propose the LDHS method to learn both the hierarchical structure and the model parameters from data directly.

We assume that the height of the hierarchical structure to be learned is given as $m$. Here we use slightly different notations to define the hierarchical structure. The weights of the edges on the path from the root node to the $i$th leaf node, which corresponds to the $i$th feature, are denoted by $\{\omega_i^{(1)}, \ldots, \omega_i^{(m)}\}$, where $\omega_i^{(j)}$ denotes the weight of the edge connecting heights $j-1$ and $j$ on that path. At the beginning, we assume that the paths from the root node to any two different leaf nodes do not share any edge. When $\omega_j^{(i)}$ equals $\omega_k^{(i)}$ for some $i$, $j$ and $k$, we can view it as a sign that the two paths from the root node to the $j$th and $k$th leaf nodes become fused at height $i$; then, in order to keep the whole structure a valid hierarchy, it is required that the subpaths of the two paths above height $i$ are also fused, implying that $\omega_j^{(i')}$ must always equal $\omega_k^{(i')}$ when $i' \leq i$. So an algorithm that learns a valid hierarchical structure should satisfy the following two requirements: (1) it should have the ability to enforce $\omega_j^{(i)}$ to be equal to $\omega_k^{(i)}$ for some $i$, $j$ and $k$; (2) it should guarantee that when $\omega_j^{(i)}$ equals $\omega_k^{(i)}$, $\omega_j^{(i')}$ equals $\omega_k^{(i')}$ for all $i' < i$. Here we present the objective function of the LDHS method, which satisfies these two requirements:

$$\min_{\mathbf{w}, \omega}\ \frac{1}{n}\sum_{i=1}^n l(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda_1 \sum_{i=1}^d \frac{w_i^2}{\prod_{j=1}^m \big(\omega_i^{(j)}\big)^{\frac{1}{m}}} + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^m \eta_i \sum_{j<k} \big|\omega_j^{(i)} - \omega_k^{(i)}\big|$$
$$\text{s.t.}\ \sum_{j=1}^{d} \omega_j^{(i)} = 1\ \forall i \in [m], \quad \omega_j^{(i)} \geq 0\ \forall i, j, \quad \big|\omega_j^{(1)} - \omega_k^{(1)}\big| \leq \cdots \leq \big|\omega_j^{(m)} - \omega_k^{(m)}\big|\ \forall j, k. \tag{6}$$

When the hierarchical structure is available, or equivalently $\omega$ is given, problem (6) becomes problem (1), and hence the LDHS method is a generalization of the DHS method that additionally learns the hierarchical structure. Even though the objective function of problem (6) is convex based on Theorem 1, the whole problem is non-convex due to the sequential inequality constraint, and we also use the GIST method to solve it. We still use the variable $\phi$ to denote the concatenation of $\mathbf{w}$ and $\omega$. We define the set of constraints on $\phi$ as

$$S_\phi = \Big\{\phi \,\Big|\, \sum_{j=1}^{d} \omega_j^{(i)} = 1\ \forall i \in [m],\ \omega_j^{(i)} \geq 0\ \forall i, j,\ \big|\omega_k^{(1)} - \omega_j^{(1)}\big| \leq \cdots \leq \big|\omega_k^{(m)} - \omega_j^{(m)}\big|\ \forall j, k\Big\}.$$

We define $g(\phi) = \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^m \eta_i \sum_{j<k} \big|\omega_j^{(i)} - \omega_k^{(i)}\big|$.
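Assuming the edge weights are stored as an $m \times d$ array with `omega[i-1, j-1]` $= \omega_j^{(i)}$ (a layout of our choosing), the following sketch checks the sequential constraint and reads off a learned hierarchy by grouping features whose edge weights coincide at each height; it is an illustration of the two requirements above, not the paper's code.

```python
import numpy as np

def satisfies_sequential_constraint(omega, tol=1e-8):
    """Check |omega_j^(1) - omega_k^(1)| <= ... <= |omega_j^(m) - omega_k^(m)| for all j, k,
    i.e., the gaps between two paths never shrink as the height increases."""
    m, d = omega.shape
    for j in range(d):
        for k in range(j + 1, d):
            gaps = np.abs(omega[:, j] - omega[:, k])
            if np.any(np.diff(gaps) < -tol):
                return False
    return True

def recover_hierarchy(omega, tol=1e-8):
    """Group features whose edge weights are (numerically) equal at each height:
    features j and k are fused at height i iff omega_j^(i) == omega_k^(i).
    Returns, for each height, the list of feature groups (the nodes at that height)."""
    m, d = omega.shape
    levels = []
    for i in range(m):
        groups = []
        for j in range(d):
            for g in groups:
                if abs(omega[i, j] - omega[i, g[0]]) <= tol:
                    g.append(j)
                    break
            else:
                groups.append([j])
        levels.append(groups)
    return levels
```

Under the sequential constraint the groups become finer as the height increases, so the per-height groupings nest into a valid tree.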