# Generalized Hierarchical Kernel Learning

Journal of Machine Learning Research 16 (2015) 617-652. Submitted 11/12; Revised 9/14; Published 3/15.

Pratik Jawanpuria (pratik.j@cse.iitb.ac.in), Jagarlapudi Saketha Nath (saketh@cse.iitb.ac.in), Ganesh Ramakrishnan (ganesh@cse.iitb.ac.in)
Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India.
Editor: Francis Bach

This paper generalizes the framework of Hierarchical Kernel Learning (HKL) and illustrates its utility in the domain of rule learning. HKL involves Multiple Kernel Learning over a set of given base kernels assumed to be embedded on a directed acyclic graph. This paper proposes a two-fold generalization of HKL: the first is employing a generic ℓ1/ℓρ block-norm regularizer (ρ ∈ (1, 2]) that alleviates a key limitation of the HKL formulation. The second is a generalization to the case of multi-class, multi-label and, more generally, multi-task applications. The main technical contribution of this work is the derivation of a highly specialized partial dual of the proposed generalized HKL formulation and an efficient mirror descent based active set algorithm for solving it. Importantly, the generic regularizer enables the proposed formulation to be employed in Rule Ensemble Learning (REL), where the goal is to construct an ensemble of conjunctive propositional rules. Experiments on benchmark REL data sets illustrate the efficacy of the proposed generalizations.

Keywords: multiple kernel learning, mixed-norm regularization, multi-task learning, rule ensemble learning, active set method

## 1. Introduction

A Multiple Kernel Learning (MKL) (Lanckriet et al., 2004; Bach et al., 2004) framework for the construction of sparse linear combinations of base kernels embedded on a directed acyclic graph (DAG) was recently proposed by Bach (2008).
Since the DAG induces hierarchical relations between the base kernels, this framework is more commonly known as Hierarchical Kernel Learning (HKL). It has been established that HKL provides a powerful algorithm for task-specific non-linear feature selection. HKL employs a carefully designed ℓ1/ℓ2 block-norm regularizer: an ℓ1-norm across some predefined components associated with the DAG and an ℓ2-norm within each such component. However, the sparsity pattern of kernel (feature) selection induced by this regularizer is somewhat restricted: a kernel is selected only if the kernels associated with all its ancestors in the DAG are selected. In addition, it can be proved that the weight of the kernel associated with a (selected) node is always greater than the weights of the kernels associated with its descendants. Such a restricted selection pattern and weight bias may limit the applicability of HKL in real-world problems.

This paper proposes a two-fold generalization of HKL. The first is employing an ℓ1/ℓρ, ρ ∈ (1, 2), block-norm regularizer that mitigates the above discussed weight and selection bias among the kernels; the resulting formulation is henceforth termed gHKL. Note that for the special case of ρ = 2, gHKL recovers the HKL regularizer. Further, gHKL is generalized to the paradigm of Multi-task Learning (MTL), where multiple related tasks need to be learnt jointly. We consider the MTL setup where the given learning tasks share a common sparse feature space (Lounici et al., 2009; Jawanpuria and Nath, 2011; Obozinski et al., 2011). Our goal is to construct a shared sparse feature representation that is suitable for all the given related tasks. We pose the problem of learning this shared feature space as that of learning a shared kernel, common across all the tasks. The proposed generalization is henceforth referred to as gHKLMT.
In addition to learning a common feature representation, gHKLMT is generic enough to model additional correlations existing among the given tasks.

Though employing an ℓ1/ℓρ, ρ ∈ (1, 2), regularizer is an incremental modification to the HKL formulation, devising an algorithm for solving it is not straightforward. The projected gradient descent employed in the active set algorithm for solving HKL (Bach, 2008) can no longer be employed for solving gHKL, as projections onto ℓρ-norm balls are known to be significantly more challenging than those onto ℓ1-norm balls (Liu and Ye, 2010). Hence naive extensions of the existing HKL algorithm will not scale well. Further, the computational challenge is compounded by the generalization for learning multiple tasks jointly. The key technical contribution of this work is the derivation of a highly specialized partial dual of the gHKL/gHKLMT formulations and an efficient mirror descent (Ben-Tal and Nemirovski, 2001; Beck and Teboulle, 2003) based active set algorithm for solving it. The dual presented here is an elegant convex optimization problem with a Lipschitz continuous objective, constrained over a simplex. Moreover, the gradient of the objective can be obtained by solving a known and well-studied variant of the MKL formulation. This motivates employing the mirror descent algorithm, which is known to solve such problems efficiently. Further efficiency is brought in by employing an active set method similar in spirit to that in Bach (2008).

A significant portion of this paper focuses on the application of Rule Ensemble Learning (REL) (Dembczyński et al., 2010, 2008), where HKL has not been previously explored. Given a set of basic propositional features describing the data, the goal in REL is to construct a compact ensemble of conjunctions of the given propositional features that generalizes well for the problem at hand.
Such ensembles are expected to achieve a good trade-off between interpretability and generalization ability. REL approaches (Cohen and Singer, 1999; Friedman and Popescu, 2008; Dembczyński et al., 2010) have additionally addressed the problem of learning a compact set of rules that generalize well, in order to maintain their readability. One way to construct a compact ensemble is to consider a linear model involving all possible conjunctions of the basic propositional features and then perform an ℓ1-norm regularized empirical risk minimization (Friedman and Popescu, 2008; Dembczyński et al., 2010). Since this is a computationally infeasible problem, even for a moderate number of basic propositions, the existing methods either approximate such a regularized solution using strategies such as shrinkage (Friedman and Popescu, 2008; Dembczyński et al., 2010, 2008) or resort to post-pruning (Cohen and Singer, 1999). This work proposes to solve a variant of this regularized empirical risk minimization problem optimally using the framework of gHKL. The key idea is to define kernels representing every possible conjunction and arrange them on a DAG. The proposed gHKL regularizer is applied on this DAG of kernels, leading to a sparse combination of promising conjunctions. Note that with such a setup, the size of the gHKL optimization problem is exponential in the number of basic propositional features. However, a key result in the paper shows that the proposed gHKL algorithm is guaranteed to solve this exponentially large problem with a complexity polynomial in the final active set size¹. Simulations on benchmark binary (and multiclass) classification data sets show that gHKL (and gHKLMT) indeed constructs a compact ensemble that on several occasions outperforms state-of-the-art REL algorithms in terms of generalization ability.
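The DAG of conjunction kernels described above is simply the subset lattice over the basic propositions. A minimal sketch of how such a lattice can be enumerated for a small number of propositions (the function name and representation are ours, for illustration only):

```python
from itertools import combinations

def conjunction_lattice(n):
    """Enumerate the subset lattice over n basic propositional features.

    Each node is a frozenset of feature indices and represents the
    conjunction of those features; the children of a node are the
    conjunctions obtained by AND-ing in one more feature.
    """
    nodes = [frozenset(c) for r in range(n + 1)
             for c in combinations(range(n), r)]
    children = {v: [v | {j} for j in range(n) if j not in v] for v in nodes}
    return nodes, children

nodes, children = conjunction_lattice(3)
```

The lattice over n propositions has 2ⁿ nodes, which is why enumerating it explicitly is infeasible for moderate n, and why the active set guarantee discussed in the surrounding text matters.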
These results also illustrate the benefits of the proposed generalizations over HKL: i) the ensembles constructed with gHKL (with low ρ values) involve fewer rules than those of HKL, while the accuracies are comparable; ii) gHKLMT can learn rule ensembles on multiclass problems, whereas HKL is limited to two-class problems.

The rest of the paper² is organized as follows. Section 2 introduces the classical Multiple Kernel Learning setup, briefly reviews the HKL framework and summarizes the existing works in Multi-task Learning. In Section 3, we present the proposed gHKL and gHKLMT formulations. The key technical derivation of the specialized dual is also presented in this section. The proposed mirror descent based active set algorithm for solving the gHKL/gHKLMT formulations is discussed in Section 4. In Section 5, we propose to solve the REL problem by employing the gHKL formulation and discuss its details. In Section 6, we report empirical evaluations of the gHKL and gHKLMT formulations for REL on benchmark binary and multiclass data sets respectively. Section 7 concludes the paper.

## 2. Related Works

This section provides a brief introduction to the Multiple Kernel Learning (MKL) framework, the HKL setup and formulation (Bach, 2008, 2009), as well as the existing works in Multi-task Learning.

### 2.1 Multiple Kernel Learning Framework

We begin by discussing the regularized risk minimization framework (Vapnik, 1998), which is employed in the proposed formulations. Consider a learning problem like classification or regression and let its training data be denoted by D = {(x_i, y_i), i = 1, …, m | x_i ∈ X, y_i ∈ ℝ ∀i}, where (x_i, y_i) represents the i-th input-output pair. The aim is to learn an affine prediction function F(x) that generalizes well on unseen data. Given a positive definite kernel k that induces a feature map φ_k(·), the prediction function can be written as: F(x) = ⟨f, φ_k(x)⟩_{H_k} + b.
Here H_k is the Reproducing Kernel Hilbert Space (RKHS) (Schölkopf and Smola, 2002) associated with the kernel k, endowed with an inner product ⟨·,·⟩_{H_k}, and f ∈ H_k, b ∈ ℝ are the model parameters to be learnt. A popular framework to learn these model parameters is regularized risk minimization (Vapnik, 1998), which considers the following problem:

min_{f∈H_k, b∈ℝ}  (1/2) Ω(f)² + C Σ_{i=1}^m ℓ(y_i, F(x_i)),   (1)

where Ω(·) is a norm-based regularizer, ℓ: ℝ × ℝ → ℝ is a suitable convex loss function and C is a regularization parameter. As an example, the support vector machine (SVM) (Vapnik, 1998) employs Ω(f) = ‖f‖_{H_k}. From the representer theorem (Schölkopf and Smola, 2002), we know that the optimal f has the form f(·) = Σ_{i=1}^m α_i k(·, x_i), where α = (α_i)_{i=1}^m is a vector of coefficients to be learnt.

It can be observed from the above that the kernel definition plays a crucial role in determining the quality of the solution obtained by solving (1). Hence learning a kernel suitable to the problem at hand has been an active area of research over the past few years. One way to learn kernels is via the Multiple Kernel Learning (MKL) framework (Lanckriet et al., 2004; Bach et al., 2004). Lanckriet et al. (2004) proposed to learn the kernel k as a conic combination of the given base kernels k_1, …, k_l: k = Σ_{i=1}^l η_i k_i, η_i ≥ 0 ∀i. Here η = (η_i)_{i=1}^l is a coefficient vector to be (additionally) learnt in the optimization problem (1). In this setting, the feature map with respect to the kernel k is given by φ_k = (√η_i φ_{k_i})_{i=1}^l (see Rakotomamonjy et al., 2008, for details). It is a weighted concatenation of the feature maps induced by the individual base kernels. Hence, sparse kernel weights will result in a low dimensional φ_k.

¹ Roughly, this is the number of selected conjunctions and is potentially far less than the total number of conjunctions.
² Preliminary results of this work were reported in Jawanpuria et al. (2011).
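At the Gram-matrix level, the conic combination above is straightforward to compute. A small illustrative sketch (variable names and the toy kernels are ours):

```python
import numpy as np

def combined_gram(grams, eta):
    """Effective Gram matrix of k = sum_i eta_i * k_i, with eta_i >= 0."""
    assert all(e >= 0 for e in eta), "MKL requires non-negative kernel weights"
    return sum(e * K for e, K in zip(eta, grams))

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K1 = X @ X.T                 # linear base kernel
K2 = (1.0 + X @ X.T) ** 2    # quadratic polynomial base kernel
eta = [0.5, 0.0]             # sparse weights: the second kernel is dropped
K = combined_gram([K1, K2], eta)
```

A zero weight removes the corresponding block √η_i φ_{k_i} from the concatenated feature map, which is how sparse η yields a low-dimensional effective representation.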
Some of the additional constraints on η explored in the existing MKL works are the ℓ1-norm constraint (Bach et al., 2004; Rakotomamonjy et al., 2008), the ℓp-norm constraint (p > 1) (Kloft et al., 2011; Vishwanathan et al., 2010; Aflalo et al., 2011), etc.

### 2.2 Hierarchical Kernel Learning

Hierarchical Kernel Learning (HKL) (Bach, 2008) is a generalization of MKL that assumes a hierarchy over the given base kernels. The base kernels are embedded on a DAG, and a carefully designed ℓ1/ℓ2 block-norm regularization over the associated RKHSs is proposed to induce a specific sparsity pattern over the selected base kernels. We begin by discussing its kernel setup.

Let G(V, E) be the given DAG, with V denoting the set of vertices and E the set of edges. The DAG structure entails relationships like parent, child, ancestor and descendant (Cormen et al., 2009). Let D(v) and A(v) represent the set of descendants and ancestors, respectively, of the node v in G. It is assumed that both D(v) and A(v) include the node v itself. For any subset of nodes W ⊆ V, the hull and the sources of W are defined as:

hull(W) = ∪_{w∈W} A(w),   sources(W) = {w ∈ W | A(w) ∩ W = {w}}.

The size and the complement of W are denoted by |W| and Wᶜ respectively. Let k_v : X × X → ℝ be the positive definite kernel associated with the vertex v ∈ V. In addition, let H_{k_v} be its associated RKHS and φ_{k_v} its induced feature map. Given this, HKL employs the following prediction function:

F(x) = Σ_{v∈V} ⟨f_v, φ_{k_v}(x)⟩_{H_{k_v}} + b,

which is an affine model parameterized by f = (f_v)_{v∈V}, the tuple with entries f_v ∈ H_{k_v}, and b ∈ ℝ. Some more notation follows: for any subset of nodes W ⊆ V, f_W = (f_v)_{v∈W} and φ_W = (φ_v)_{v∈W}. In general, the entries in a vector are referred to using an appropriate subscript, i.e., the entries in u ∈ ℝ^d are denoted by u_1, …, u_d. Kernels are denoted by the lower case k and the corresponding Gram matrices by the upper case K.
HKL formulates the problem of learning the optimal prediction function F as the following regularized risk minimization problem:

min_{f_v∈H_{k_v} ∀v∈V, b∈ℝ}  (1/2) (Σ_{v∈V} d_v ‖f_{D(v)}‖₂)² + C Σ_{i=1}^m ℓ(y_i, F(x_i)),   (2)

where ‖f_{D(v)}‖₂ = (Σ_{w∈D(v)} ‖f_w‖²)^{1/2} ∀v ∈ V, ℓ(·,·) is a suitable convex loss function and (d_v)_{v∈V} are given non-negative parameters. As is clear from (2), HKL employs an ℓ1/ℓ2 block-norm regularizer, which is known to promote group sparsity (Yuan and Lin, 2006). Its implications are discussed in the following. For most v ∈ V, ‖f_{D(v)}‖₂ = 0 at optimality, due to the sparsity-inducing nature of the ℓ1-norm. Moreover, (‖f_{D(v)}‖₂ = 0) ⇒ (f_w = 0 ∀w ∈ D(v)). Thus it is expected that most f_v will be zero at optimality. This implies that the prediction function involves very few kernels. Under mild conditions on the kernels (being strictly positive), it can be shown that this hierarchical penalization induces the following sparsity pattern: (f_w ≠ 0) ⇒ (f_v ≠ 0 ∀v ∈ A(w)). In other words, if the prediction function employs a kernel k_w then it certainly employs all the kernels associated with the ancestor nodes of w.

Bach (2008) proposes to solve the following equivalent variational formulation:

min_{γ∈Δ₁}  min_{f_v∈H_{k_v} ∀v∈V, b∈ℝ}  (1/2) Σ_{w∈V} δ_w(γ)⁻¹ ‖f_w‖² + C Σ_{i=1}^m ℓ(y_i, F(x_i)),   (3)

where Δ₁ = {z ∈ ℝ^{|V|} | z ≥ 0, Σ_{v∈V} z_v ≤ 1} and δ_w(γ)⁻¹ = Σ_{v∈A(w)} d_v²/γ_v. From the representer theorem (Schölkopf and Smola, 2002), it follows that the effective kernel employed in HKL is k = Σ_{w∈V} δ_w(γ) k_w. Since the optimization problem (3) has an ℓ1-norm constraint over the γ variables, most γ_v are expected to be zero at optimality. Moreover, the kernel weight δ_w(γ) is zero whenever γ_v = 0 for any v ∈ A(w). Thus, HKL performs a sparse selection of the base kernels and can be understood as a generalization of the classical MKL framework.
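The variational weights δ_w(γ) of (3) can be computed directly from their definition; a small sketch (the toy DAG and names are ours) that also exhibits the selection behaviour just described:

```python
def hkl_weight(w, ancestors, d, gamma):
    """delta_w(gamma): inverse of sum over v in A(w) of d_v^2 / gamma_v.

    The weight vanishes as soon as gamma_v = 0 for any ancestor v of w.
    """
    if any(gamma[v] == 0.0 for v in ancestors[w]):
        return 0.0
    return 1.0 / sum(d[v] ** 2 / gamma[v] for v in ancestors[w])

# Toy chain r -> a -> b; A(v) contains v itself, as in the text.
ancestors = {"r": ["r"], "a": ["r", "a"], "b": ["r", "a", "b"]}
d = {v: 1.0 for v in ancestors}
gamma = {"r": 0.5, "a": 0.5, "b": 0.0}
```

With γ_b = 0 the kernel at b is dropped, and δ_r(γ) ≥ δ_a(γ) holds, illustrating the weight bias towards ancestors discussed next.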
However, the sparsity pattern for the kernels has the following restriction: if a kernel is not selected then none of the kernels associated with its descendants are selected, as (γ_v = 0) ⇒ (δ_w(γ) = 0 ∀w ∈ D(v)). For the case of strictly positive kernels, it follows that a kernel is selected only if all the kernels associated with its ancestors are selected. In addition, the following relationship holds among the kernel weights: δ_v(γ) ≥ δ_w(γ) ∀w ∈ D(v) (strict inequality holds if δ_w(γ) > 0). Hence, the weight of the kernel associated with a (selected) node is always greater than the weights of the kernels associated with its descendants.

Since the size of γ is the same as that of V and since the optimal γ is known to be sparse, Bach (2008) proposes an active set based algorithm (Lee et al., 2007) for solving (3). At each iteration of the active set algorithm, (3) is solved with respect to only those variables in the active set via the projected gradient descent technique (Rakotomamonjy et al., 2008).

As illustrated in Bach (2008), the key advantage of HKL is in performing non-linear feature selection. For example, consider the case where the input space is X = ℝⁿ and let I be the power set of {1, …, n}. Consider the following 2ⁿ kernels arranged on the usual subset lattice: k_i(x, x′) = Π_{j∈i} x_j x′_j ∀i ∈ I. HKL can be applied in this setup to select the promising sub-products of the input features over all possible sub-products. Please refer to Bach (2008) for more such pragmatic examples of kernels and corresponding DAGs. The most interesting result in Bach (2008) is that in all these examples, where the size of the DAG is exponentially large, the computational complexity of the active set algorithm is polynomial in the training set dimensions and the active set size. Importantly, the complexity is independent of |V|!
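For concreteness, the sub-product kernels on the subset lattice can be evaluated directly from their definition (a sketch with our own variable names):

```python
import numpy as np

def subproduct_kernel(i, x, xp):
    """k_i(x, x') = product over j in i of x_j * x'_j, for a subset i of indices."""
    out = 1.0
    for j in i:
        out *= x[j] * xp[j]
    return out

x = np.array([1.0, 2.0, 3.0])
xp = np.array([2.0, 1.0, 1.0])
k01 = subproduct_kernel({0, 1}, x, xp)     # (1*2) * (2*1) = 4
k_empty = subproduct_kernel(set(), x, xp)  # empty product = 1
```

Summing k_i over all i ∈ I recovers Π_j (1 + x_j x′_j), the full product kernel, which is one reason these 2ⁿ base kernels never need to be materialized individually.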
Though encouraging, the above discussed weight bias (in favor of the kernels towards the top of the DAG) and restricted kernel selection pattern may limit the applicability of HKL in real-world problems. For instance, in the case of the sub-product kernel example mentioned above, the following is true: a product of features is selected only if all of its sub-products are selected. This may clearly lead to the selection of many redundant sub-products (features). In Section 3, we present the proposed generalization that provides a more flexible kernel selection pattern by employing an ℓ1/ℓρ, ρ ∈ (1, 2), regularizer. A key result of this paper (refer Corollary 6) is that for all the cases discussed in Bach (2008), the proposed mirror descent based active set algorithm for solving the generalization has a computational complexity that is still polynomial in the training set dimensions and the active set size. In other words, the proposed generalization does not adversely affect the computational feasibility of the problem, and hence is an interesting result in itself.

### 2.3 Multi-task Learning

Multi-task Learning (Caruana, 1997; Baxter, 2000) focuses on learning several prediction tasks simultaneously. This is in contrast with the usual approach of learning each task separately and independently. The key underlying idea behind MTL is that an appropriate sharing of information while learning related tasks will help in obtaining better prediction models. Various definitions of task-relatedness have been explored over the past few years, like proximity of task parameters (Baxter, 2000; Evgeniou and Pontil, 2004; Xue et al., 2007; Jacob et al., 2008; Jawanpuria and Nath, 2012) or sharing a common feature space (Ando and Zhang, 2005; Ben-David and Schuller, 2008; Argyriou et al., 2008; Lounici et al., 2009; Obozinski et al., 2011).
Many learning settings, like multiclass classification, multi-label classification or learning vector-valued functions, may be viewed as special cases of multi-task learning. In this work, we consider the common setting in which the task parameters share a simultaneously sparse structure: only a small number of input features are relevant for each of the tasks, and the set of such relevant features is common across all the tasks (Turlach et al., 2005; Lounici et al., 2009). Existing works in this setting typically employ a group lasso penalty on the task parameters: the ℓ1/ℓ2 block-norm (Lounici et al., 2009; Obozinski et al., 2011) or the ℓ1/ℓ∞ block-norm (Turlach et al., 2005; Negahban and Wainwright, 2009). Thus, they propose a multi-task regularizer of the form

Ω(f_1, …, f_T) = Σ_{i=1}^d (Σ_{t=1}^T |f_{ti}|^q)^{1/q},

where the input feature space is assumed to be d-dimensional, f_t = (f_{ti})_{i=1,…,d} is the task parameter of the t-th task, and q ∈ {2, ∞}. Note that in addition to (sparse) shared feature selection, the ℓ1/ℓ∞ block-norm penalty also promotes proximity among the task parameters.

We pose the problem of learning the shared features as that of learning a shared kernel, whose induced feature space is common across all the tasks. The shared kernel is constructed as a sparse combination of the given base kernels. A hierarchical relationship exists over the given kernels (feature spaces). We employ a graph based ℓ1/ℓρ block-norm regularization over the task parameters that enables non-linear feature selection for multiple tasks simultaneously. The details of the proposed MTL formulation are discussed in the following section.

## 3. Generalized Hierarchical Kernel Learning

In this section, we present the proposed generalizations over HKL. As discussed earlier, the first generalization aims at mitigating the weight bias problem as well as the restrictions imposed on the kernel selection pattern of HKL, and is termed gHKL.
The gHKL formulation is then further generalized to the paradigm of MTL, the proposed formulation being termed gHKLMT. We begin by introducing the gHKL formulation.

### 3.1 gHKL Primal Formulation

Recall that HKL employs an ℓ1/ℓ2 block-norm regularizer. As we shall understand in more detail later, a key reason for the kernel weight bias problem and the restricted sparsity pattern in HKL is the ℓ2-norm regularization. One way to mitigate these restrictions is by employing the following generic regularizer:

Ω_S(f) = Σ_{v∈V} d_v ‖f_{D(v)}‖_ρ,   (4)

where f = (f_v)_{v∈V}, ‖f_{D(v)}‖_ρ = (Σ_{w∈D(v)} ‖f_w‖^ρ)^{1/ρ} and ρ ∈ (1, 2]. The implications of the ℓ1/ℓρ block-norm regularization are discussed in the following. Since the ℓ1-norm promotes sparsity, it follows that ‖f_{D(v)}‖_ρ = 0 (that is, f_w = 0 ∀w ∈ D(v)) for most v ∈ V. This phenomenon is the same as in HKL. But now, even in cases where ‖f_{D(v)}‖_ρ is not forced to zero by the ℓ1-norm, many components of f_{D(v)} tend to zero as the value of ρ tends to unity. (Since the ℓρ-norm with ρ > 1 is differentiable, it rarely induces exact sparsity (Szafranski et al., 2010); however, as ρ → 1, it promotes only a few leading terms, due to the high curvature of such norms (Szafranski et al., 2007). To obtain a sparse solution in such cases, thresholding is commonly employed by existing ℓp-MKL (ρ > 1) algorithms (Vishwanathan et al., 2010; Orabona et al., 2012; Jain et al., 2012; Jawanpuria et al., 2014); we employed thresholding in our experiments.) Also note that ρ = 2 recovers the HKL regularizer. To summarize, the proposed gHKL formulation is

min_{f_v∈H_{k_v} ∀v∈V, b∈ℝ}  (1/2) (Ω_S(f))² + C Σ_{i=1}^m ℓ(y_i, F(x_i)).   (5)

We next present the gHKLMT formulation, which further generalizes gHKL to the MTL paradigm.

### 3.2 gHKLMT Primal Formulation

We begin by introducing some notation for the multi-task learning setup. Let T be the number of tasks and let the training data for the t-th task be denoted by D_t = {(x_{ti}, y_{ti}), i = 1, …, m | x_{ti} ∈ X, y_{ti} ∈ ℝ ∀i}, where (x_{ti}, y_{ti}) represents the i-th input-output pair of the t-th task. For the sake of notational simplicity, it is assumed that the number of training examples is the same for all the tasks. The prediction function for the t-th task is given by: F_t(x) = Σ_{v∈V} ⟨f_{tv}, φ_{k_v}(x)⟩_{H_{k_v}} + b_t, where f_t = (f_{tv})_{v∈V} and b_t are the task parameters to be learnt. We propose the following regularized risk minimization problem for estimating these task parameters and term it gHKLMT:

min_{f_t, b_t ∀t}  (1/2) (Σ_{v∈V} d_v (Σ_{w∈D(v)} (Q_w(f_1, …, f_T))^ρ)^{1/ρ})² + C Σ_{t=1}^T Σ_{i=1}^m ℓ(y_{ti}, F_t(x_{ti})),   (6)

where the first bracketed expression is denoted Ω_T(f_1, …, f_T), ρ ∈ (1, 2], and Q_w(f_1, …, f_T) is a norm-based multi-task regularizer on the task parameters (f_{tw})_t. In the following, we discuss the effect of the above regularization. Firstly, there is an ℓ1-norm regularization over the groups of nodes (feature spaces) and an ℓρ-norm regularization within each group. This ℓ1/ℓρ block-norm regularization is the same as that of gHKL and has the same effect on the sparsity pattern of the selected feature spaces (kernels). Hence, only a few nodes (feature spaces) will be selected by the gHKLMT regularizer Ω_T(f_1, …, f_T). Secondly, the nature of the task relatedness within each (selected) feature space is governed by the Q_w(f_1, …, f_T) regularizer. For instance, consider the following definition of Q_w(f_1, …, f_T) (Lounici et al., 2009; Jawanpuria and Nath, 2011):

Q_w(f_1, …, f_T) = (Σ_{t=1}^T ‖f_{tw}‖²)^{1/2}.   (7)

The above regularizer couples the task parameters within each feature space via the ℓ2-norm. It encourages the task parameters within a feature space to be either zero or non-zero across all the tasks. Therefore, Ω_T(f_1, …, f_T) based on (7) has the following effect: i) all the tasks will simultaneously select or reject a given feature space, and ii) overall, only a few feature spaces will be selected, in the gHKL style sparsity pattern.

Several multi-task regularizations (Evgeniou and Pontil, 2004; Evgeniou et al., 2005; Jacob et al., 2008) have been proposed to encourage proximity among the task parameters within a given feature space. This correlation among the tasks may be enforced while learning a shared sparse feature space by employing the following Q_w(f_1, …, f_T):

Q_w(f_1, …, f_T) = ((Tµ/(T + µ)) ‖f̄_w‖² + Σ_{t=1}^T ‖f_{tw} − f̄_w‖²)^{1/2},  where f̄_w = (1/T) Σ_{t=1}^T f_{tw},   (8)

and µ > 0 is a given parameter. The above Q_w(f_1, …, f_T) consists of two terms: the first regularizes the mean while the second regularizes the variance of the task parameters in the feature space induced by the kernel k_w. The parameter µ controls the degree of proximity among the task parameters, with lower µ encouraging higher proximity. Note that when µ = ∞, (8) simplifies to (7). The gHKLMT regularizer Ω_T(f_1, …, f_T) based on (8) has the following effect: i) all the tasks will simultaneously select or reject a given feature space, ii) overall, only a few feature spaces will be selected, in the gHKL style sparsity pattern, and iii) within each selected feature space, the task parameters (f_{tw})_t are in proximity. Thus, the gHKLMT framework provides a mechanism to learn a shared feature space across the tasks. In addition, it can also preserve proximity among the task parameters in the learnt feature space. As we shall discuss in the next section, more generic correlations among task parameters may also be modeled within the gHKLMT framework.

It is clear that the gHKL optimization problem (5) may be viewed as a special case of the gHKLMT optimization problem (6), with the number of tasks set to unity. Hence the rest of the discussion regarding the dual derivation and optimization focuses primarily on the gHKLMT formulation.
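The regularizer Ω_T can be evaluated directly from its definition with the ℓ2 coupling of (7); a toy sketch (node names, DAG and values are ours) in which, with T = 1 and ρ = 2, it reduces to the HKL regularizer:

```python
import numpy as np

def q_w(task_params_w):
    """Q_w of (7): l2-norm coupling of the task parameters at node w."""
    return np.sqrt(sum(float(np.dot(f, f)) for f in task_params_w))

def omega_T(nodes, descendants, d, f, rho):
    """l1/l_rho block-norm: sum_v d_v * (sum over w in D(v) of Q_w^rho)^(1/rho)."""
    return sum(d[v] * sum(q_w(f[w]) ** rho for w in descendants[v]) ** (1.0 / rho)
               for v in nodes)

# Two-node chain r -> a, a single task (T = 1).
descendants = {"r": ["r", "a"], "a": ["a"]}
d = {"r": 1.0, "a": 1.0}
f = {"r": [np.array([3.0, 4.0])], "a": [np.array([0.0, 0.0])]}
val = omega_T(["r", "a"], descendants, d, f, rho=2.0)
```

Here Q_r = 5 and Q_a = 0, so val = d_r · 5 + d_a · 0 = 5, exactly the HKL penalty Σ_v d_v ‖f_{D(v)}‖₂ on this toy DAG.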
### 3.3 gHKLMT Dual Formulation

As mentioned earlier, due to the presence of the ℓρ-norm term in the gHKLMT formulation, naive extensions of the projected gradient based active set method in Bach (2008) will be rendered computationally infeasible on real-world data sets. Hence, we first re-write the gHKLMT formulation in an elegant form, which can then be solved efficiently. To this end, we note the following variational characterization of Ω_T(f_1, …, f_T).

Lemma 1  Given Ω_T(f_1, …, f_T) and Q_w(f_1, …, f_T) as defined in (6) and (8) respectively, we have:

Ω_T(f_1, …, f_T)² = min_{γ∈Δ₁}  min_{λᵛ∈Δᵛ_ρ̂ ∀v∈V}  Σ_{w∈V} δ_w(γ, λ)⁻¹ Q_w(f_1, …, f_T)²,   (9)

where ρ̂ = ρ/(2 − ρ), δ_w(γ, λ)⁻¹ = Σ_{v∈A(w)} d_v²/(γ_v λ_{vw}), Δ₁ = {z ∈ ℝ^{|V|} | z ≥ 0, Σ_{v∈V} z_v ≤ 1} and Δᵛ_r = {z ∈ ℝ^{|D(v)|} | z ≥ 0, Σ_{w∈D(v)} z_w^r ≤ 1}. Note that ρ ∈ (1, 2) ⇒ ρ̂ ∈ (1, ∞).

The proof of the above lemma is provided in Appendix A.2. In order to keep the notation simple, in the remainder of this section it is assumed that the learning tasks at hand are binary classification, i.e., y_{ti} ∈ {−1, 1} ∀t, i, and the loss function is the hinge loss. However, one can easily extend these ideas to other loss functions and learning problems. Refer to Appendix A.8 for the gHKLMT dual formulation with general convex loss functions.

Lemma 2  Consider problem (6) with the regularizer term replaced by its variational characterization (9) and the loss function as the hinge loss ℓ(y, F_t(x)) = max(0, 1 − y F_t(x)). Then the following is a partial dual of it with respect to the variables f_t, b_t ∀t = 1, …, T:

min_{γ∈Δ₁}  min_{λᵛ∈Δᵛ_ρ̂ ∀v∈V}  max_{α_t∈S(y_t,C) ∀t}  G(γ, λ, α),   (10)

G(γ, λ, α) = 1ᵀα − (1/2) αᵀ Y (Σ_{w∈V} δ_w(γ, λ) H_w) Y α,

where α = [α_1ᵀ, …, α_Tᵀ]ᵀ, S(y_t, C) = {β ∈ ℝ^m | 0 ≤ β ≤ C, Σ_{i=1}^m y_{ti}β_i = 0}, y_t = [y_{t1}, …, y_{tm}]ᵀ, Y is the diagonal matrix corresponding to the vector [y_1ᵀ, …, y_Tᵀ]ᵀ, 1 is an mT × 1 vector with entries as unity, δ_w(γ, λ)⁻¹ = Σ_{v∈A(w)} d_v²/(γ_v λ_{vw}), Δ₁ and Δᵛ_r are as defined in Lemma 1, ρ̂ = ρ/(2 − ρ), and H_w ∈ ℝ^{mT×mT} is the multi-task kernel matrix corresponding to the multi-task kernel h_w ∀w ∈ V. The kernel function h_w is defined as follows:

h_w(x_{t1,i}, x_{t2,j}) = k_w(x_{t1,i}, x_{t2,j}) B(t1, t2),   (11)

where B is a T × T matrix. B = I (the identity matrix) when the multi-task regularizer (7) is employed in (6). Alternatively, B = I + 11ᵀ/µ (here 1 is a T × 1 vector with entries as unity) when the regularizer (8) is employed. The prediction function for the task t1 is given by

F_{t1}(x_{t1,j}) = Σ_{t2=1}^T Σ_{i=1}^m ᾱ_{t2,i} y_{t2,i} Σ_{w∈V} δ_w(γ̄, λ̄) k_w(x_{t2,i}, x_{t1,j}) B(t1, t2) + b_{t1},

where (γ̄, λ̄, ᾱ) is an optimal solution of (10).

Proof  The proof follows from the representer theorem (Schölkopf and Smola, 2002). Also refer to Appendix A.3.

This lemma shows that gHKLMT essentially constructs the same prediction function as an SVM with the effective multi-task kernel h = Σ_{w∈V} δ_w(γ, λ) h_w. Similarly, in the case of gHKL, the effective kernel is k = Σ_{w∈V} δ_w(γ, λ) k_w (since the terms T and B are unity). Here, as well as in the rest of the paper, we employ the symbols h and H for the multi-task kernel and the corresponding Gram matrix respectively. The multi-task kernel (11) consists of two terms: the first term corresponds to the similarity between two instances x_{t1,i} and x_{t2,j} in the feature space induced by the kernel k_w. The second term corresponds to the correlation between the tasks t1 and t2. In the case of the regularizer (7), the matrix B simplifies to: B(t1, t2) = 1 if t1 = t2 and B(t1, t2) = 0 otherwise, thereby making the kernel matrices H_w (w ∈ V) block diagonal. Hence, the gHKLMT regularizer based on (7) promotes simultaneous sparsity in kernel selection among the tasks, without enforcing any additional correlations among the tasks.
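The multi-task Gram matrix H_w of (11) has a convenient Kronecker structure when all tasks share the same m training inputs; a sketch under that assumption (variable names and the stacking convention are ours):

```python
import numpy as np

def multitask_gram(K, B):
    """H_w with entries H_w[t1*m + i, t2*m + j] = B[t1, t2] * K[i, j],
    i.e. kron(B, K), assuming examples are stacked task-by-task and all
    tasks share the same inputs."""
    return np.kron(B, K)

T, mu = 2, 1.0
K = np.array([[1.0, 0.5], [0.5, 1.0]])       # base Gram K_w on shared inputs
B_indep = np.eye(T)                          # regularizer (7): no task coupling
B_prox = np.eye(T) + np.ones((T, T)) / mu    # regularizer (8): task proximity
H_indep = multitask_gram(K, B_indep)         # block diagonal
H_prox = multitask_gram(K, B_prox)
```

With B = I the off-diagonal blocks vanish, recovering independent per-task SVMs over the shared sparse kernel; B = I + 11ᵀ/µ additionally couples the tasks.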
In general, any T × T positive semi-definite matrix may be employed as B to model generic correlations among the tasks; the multi-task kernel given by (11) will still remain a valid kernel (Sheldon, 2008; Alvarez et al., 2012). The matrix B is sometimes referred to as the output kernel in the setting of learning vector-valued functions, and it is usually constructed from prior domain knowledge.

We now discuss the nature of the optimal solution of (10). Most of the kernel weights δ_w(γ, λ) are zero at optimality of (10): δ_w(γ, λ) = 0 whenever γ_v = 0 or λ_{vw} = 0 for any v ∈ A(w). The vector γ is sparse due to the ℓ1-norm constraint in (10). In addition, ρ → 1 ⇒ ρ̂ → 1. Hence the vectors λ_v (v ∈ V) get close to becoming sparse as ρ → 1 due to the ℓ_ρ̂-norm constraint in (10). The superposition of these two phenomena leads to a flexible⁴ sparsity pattern in kernel selection. This is explained in detail towards the end of this section. Note that ρ = 2 ⇒ λ_{vw} = 1 ∀ v ∈ A(w), w ∈ W at optimality in (10). Hence for ρ = 2, the minimization problem in (10) can be efficiently solved using a projected gradient method (Rakotomamonjy et al., 2008; Bach, 2009). However, as established in Liu and Ye (2010), projection onto the kind of feasibility set appearing in the minimization problem in (10) is computationally challenging for ρ ∈ (1, 2). Hence, we wish to re-write this problem in a relatively simpler form that can be solved efficiently. To this end, we present the following important theorem.

Theorem 3 The following is a dual of (6) considered with the hinge loss function, and the objectives of (6) (with the hinge loss), (10) and (12) are equal at optimality:

min_{η ∈ ∆_1} g(η),    (12)

where g(η) is the optimal objective value of the following convex problem:

max_{α_t ∈ S(y_t, C) ∀ t} 1^⊤α − (1/2) ( Σ_{w∈V} ζ_w(η) (α^⊤ Y H_w Y α)^{ρ̄} )^{1/ρ̄},    (13)

where ζ_w(η) = ( Σ_{v∈A(w)} d_v^{ρ̄} η_v^{1−ρ̄} )^{1/(1−ρ̄)}, α = [α_1^⊤, …, α_T^⊤]^⊤, S(y_t, C) = {β ∈ R^m | 0 ≤ β ≤ C1, Σ_{i=1}^m y_{ti} β_i = 0}, y_t = [y_{t1}, …, y_{tm}]^⊤, Y is the diagonal matrix corresponding to the vector [y_1^⊤, …, y_T^⊤]^⊤, 1 is an mT × 1 vector of ones, ρ̄ = ρ̂/(ρ̂ − 1), ρ̂ = ρ/(2 − ρ), ∆_1 = {z ∈ R^{|V|} | z ≥ 0, Σ_{v∈V} z_v ≤ 1}, and H_w ∈ R^{mT×mT} is the multi-task kernel matrix corresponding to the multi-task kernel (11).

The key idea in the proof of the above theorem is to eliminate the λ variables; the details are presented in Appendix A.5. The expression for the prediction function F, in terms of the variables η and α, is provided in Appendix A.9. This theorem provides some key insights. Firstly, (12) is essentially an ℓ1-norm regularized problem and hence it is expected that most η_v will be zero at optimality. Since (η_v = 0) ⇒ (ζ_w(η) = 0 ∀ w ∈ D(v)), it follows that most nodes in V will not contribute to the optimization problems (12) and (13). Secondly, in a single task learning setting (T = 1), the problem in (13) is equivalent to the ℓ_ρ̂-norm MKL dual problem (Kloft et al., 2011) with the base kernels (ζ_v(η))^{1/ρ̄} k_v ∀ v ∈ V : ζ_v(η) ≠ 0. The optimization problem (13) essentially learns an effective kernel of the form h = Σ_{v∈V} θ_v (ζ_v(η))^{1/ρ̄} h_v, where the θ_v are intermediate optimization variables constrained to be non-negative and to lie within an ℓ_ρ̂-norm ball. The expression for θ in terms of the variables η and α is provided in Appendix A.9. The variables θ influence the nature of the effective kernel h in two important ways: i) it follows from the expression of θ that θ_v (ζ_v(η))^{1/ρ̄} ∝ (ζ_v(η) α^⊤ Y H_v Y α)^{1/(ρ̂−1)}.

4. The HKL dual formulation (Bach, 2009) is a special case of (10) with ρ = 2, T = 1 and B = 1. When ρ = 2, ρ̂ = ∞. This implies λ_{vw} = 1 ∀ v ∈ A(w), w ∈ V at optimality, resulting in the weight bias towards kernels embedded in the ancestor nodes and the restricted sparsity pattern in kernel selection.

Algorithm 1 Active Set Algorithm - Outline
Input: Training data D, the kernels (k_v) embedded on the DAG (V), the T × T matrix B that models task correlations, and tolerance ϵ.
Initialize the active set W with sources(V)
Compute η, α by solving (14)
while optimal solution for (12) is NOT obtained do
  Add some nodes to W
  Recompute η, α by solving (14)
end while
Output: W, η, α

The above relation implies that the weight of the kernel h_v in the DAG V depends not only on the position⁵ of the node v, but also on the suitability of the kernel h_v to the problem at hand. This helps gHKL_MT mitigate the kernel-weight bias in favour of the nodes towards the top of the DAG, a bias which is present in HKL; and ii) as ρ → 1 (and hence as ρ̂ → 1), the optimal θ get close to becoming sparse (Szafranski et al., 2007; Orabona et al., 2012). This, superimposed with the sparsity of η, promotes a more flexible sparsity pattern in kernel selection than HKL, especially when ρ ≈ 1. Next, we propose to solve the problem (12) by exploiting the sparsity pattern of the η variables and the corresponding ζ(η) terms at optimality. We discuss this in detail in the following section.

4. Optimization Algorithm

Note that problem (12) remains the same whether solved with the original set of variables (η) or with only those η_v that are non-zero at optimality (refer to Appendix A.4 for details). However, the computational effort required in the latter case can be significantly lower since it involves fewer variables and kernels. This motivates us to explore an active set algorithm, similar in spirit to that in Bach (2008). An outline of the proposed active set algorithm is presented in Algorithm 1. The algorithm starts with an initial guess for a set W such that η_w ≠ 0 (∀ w ∈ W) at the optimality of (12). This set W is called the active set. Since the weight associated with the kernel h_w will be zero whenever η_v = 0 for any v ∈ A(w), the active set W must contain sources(V), else the problem has a trivial solution. Hence, the active set is initialized with sources(V).
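The active set bookkeeping rests on two small graph operations: sources(V) for initialization, and the ancestor closure hull(W) that the optimality check of Theorem 5 requires. A minimal sketch over a parent-list DAG representation (the representation and the toy DAG are our illustrative assumptions):

```python
def sources(parents):
    """Nodes with no parents in the DAG; parents maps node -> list of parents."""
    return {v for v, ps in parents.items() if not ps}

def hull(W, parents):
    """Smallest ancestor-closed superset of W: add A(w) for every w in W."""
    closed, stack = set(), list(W)
    while stack:
        v = stack.pop()
        if v not in closed:
            closed.add(v)
            stack.extend(parents[v])   # walk up towards the sources
    return closed

# Toy DAG: r -> a -> c and r -> b -> c (r is the unique source).
parents = {"r": [], "a": ["r"], "b": ["r"], "c": ["a", "b"]}
```

Initializing W = sources(V) guarantees hull(W) = W from the start; adding violating nodes from sources(W^c) later preserves this invariant.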
At each iteration of the algorithm, (12) is solved with the variables restricted to those in W:

min_{η ∈ ∆_1} max_{α_t ∈ S(y_t, C) ∀ t} 1^⊤α − (1/2) ( Σ_{w∈W} ζ_w(η) (α^⊤ Y H_w Y α)^{ρ̄} )^{1/ρ̄}.    (14)

In order to formalize the active set algorithm, we need: i) an efficient algorithm for solving problem (14), ii) a condition for verifying whether a candidate solution is optimal with respect to the optimization problem (12), and iii) a procedure for building/improving the active set after each iteration.

5. Similar to the δ_v function in HKL (3), it follows from the definition of ζ_v that ζ_v(η) ≥ ζ_w(η) ∀ w ∈ D(v) (strict inequality holds if ζ_w(η) > 0).

Algorithm 2 Mirror Descent Algorithm for solving (14)
Input: Gram matrices H_w (w ∈ W) and the regularization parameter C
Initialize η_W such that η_W ∈ ∆_1 (a warm-start may be used)
Iteration number: i = 0
while convergence criterion is not met⁶ do
  i = i + 1
  Compute ζ_w(η_W) ∀ w ∈ W (Theorem 3)
  Compute α_W from (13) using an ℓ_ρ̂-norm MKL algorithm with the kernels (ζ_w(η_W))^{1/ρ̄} H_w, w ∈ W
  Compute ∇g(η_W) as in (24)
  Compute the step size s = √(log(|W|)/i) / ‖∇g(η_W)‖₂
  Compute η_w = exp(1 + log(η_w) − s (∇g(η_W))_w) ∀ w ∈ W
  Normalize η_w = η_w / Σ_{v∈W} η_v ∀ w ∈ W
end while
Output: η_W, α_W

We begin with the first. We propose to solve the optimization problem (14) using the mirror descent algorithm (Ben-Tal and Nemirovski, 2001; Beck and Teboulle, 2003). Mirror descent is known to efficiently solve convex programs with Lipschitz continuous and differentiable objectives constrained over a convex compact set, and it achieves a near-optimal convergence rate whenever the feasibility set is a simplex (which is the case in our optimization problem (14)). Mirror descent is close in spirit to the projected gradient descent algorithm and hence assumes that an oracle for computing the gradient of the objective is available. Following the common practice of smoothing (Bach, 2009), in the rest of the paper we employ ζ_w((1 − ε)η + ε/|V|) instead⁷ of ζ_w(η) in (13), with ε > 0.
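Algorithm 2's multiplicative update is the standard entropic mirror descent step on the simplex. A self-contained sketch on a generic smooth convex objective (the quadratic test function and step schedule details are illustrative assumptions; in the paper's setting the gradient oracle would be ∇g from (24)):

```python
import numpy as np

def mirror_descent_simplex(grad, d, iters=5000):
    """Entropic mirror descent: minimize a convex f over the probability simplex.

    grad(eta) returns the gradient of f at eta; d is the dimension.
    Mirrors Algorithm 2: exponentiated-gradient step with a decaying
    step size, then renormalization back onto the simplex.
    """
    eta = np.full(d, 1.0 / d)              # uniform initialization
    for i in range(1, iters + 1):
        g = grad(eta)
        s = np.sqrt(np.log(max(d, 2)) / i) / (np.linalg.norm(g) + 1e-12)
        eta = eta * np.exp(-s * g)         # multiplicative (entropic) step
        eta = eta / eta.sum()              # normalize: stay on the simplex
    return eta

# Illustrative objective: f(eta) = ||eta - c||^2 with c on the simplex,
# so the constrained minimizer is c itself.
c = np.array([0.5, 0.3, 0.2])
eta_star = mirror_descent_simplex(lambda eta: 2 * (eta - c), 3)
```

The exponentiated step keeps every η_w strictly positive, which is one reason the smoothing ζ_w((1 − ε)η + ε/|V|) composes naturally with this solver.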
The following theorem establishes the applicability of mirror descent for solving (14):

Theorem 4 The function g(η) given by (13) is convex. Also, the expression for the ith entry of the gradient (∇g(η))_i is given in (24). If all the eigenvalues of the Gram matrices H_w are finite and non-zero, then g is Lipschitz continuous.

The proof of the above theorem is technical and is provided in Appendix A.6. Algorithm 2 summarizes the proposed mirror descent based algorithm for solving (14). One of its steps involves computing ∇g(η_W) (expression provided in (24)), which in turn requires solving (13). As noted before, (13) is similar to the ℓ_ρ̂-norm MKL problem (Kloft et al., 2011) but with a different feasibility set for the optimization variables α. Hence, (13) can be solved by employing a modified cutting planes algorithm (Kloft et al., 2011) or a modified sequential minimal optimization (SMO) algorithm (Platt, 1999; Vishwanathan et al., 2010). Empirically, we observed the SMO based algorithm to be much faster than the cutting planes algorithm for gHKL_MT (and gHKL) with SVM loss functions. In the special case of ρ = 2, T = 1 and B = 1, (13) is simply a regular SVM problem.

6. The relative objective gap between two successive iterations being less than a given tolerance ϵ is taken to be the convergence criterion. The objective here is the value of g(η_W), calculated after the ℓ_ρ̂-norm MKL step.
7. Note that this is equivalent to smoothing the regularizer Ω_T while preserving its sparsity inducing properties (Bach, 2009).

Algorithm 3 Active Set Algorithm
Input: Training data D, the kernels (k_v) embedded on the DAG (V), the T × T matrix B that models task correlations, and tolerance ϵ.
Initialize the active set W with sources(V)
Compute η, α by solving (14) using Algorithm 2
while the sufficient condition for optimality (15) is not met do
  Add those nodes to W that violate (15)
  Recompute η, α by solving (14) using Algorithm 2
end while
Output: W, η, α
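As a concrete illustration of the ℓ_ρ̂-norm MKL step invoked inside Algorithm 2, the closed-form kernel-weight update of Kloft et al. (2011) can be sketched as follows. This is a hedged sketch of that published update, not of the paper's own modified solver, and the norm values fed to it are placeholders:

```python
import numpy as np

def lp_mkl_weight_update(sq_norms, p):
    """Closed-form kernel-weight update for l_p-norm MKL (Kloft et al., 2011).

    sq_norms[m] = ||w_m||^2, the squared block norm for kernel m at the
    current SVM solution. Returns weights beta with sum(beta^p) = 1,
    i.e. beta lies on the boundary of the unit l_p ball.
    """
    a = np.asarray(sq_norms, dtype=float)
    num = a ** (1.0 / (p + 1))                    # ||w_m||^{2/(p+1)}
    den = np.sum(a ** (p / (p + 1))) ** (1.0 / p)
    return num / den

# Placeholder block norms for three kernels; p plays the role of rho-hat.
beta = lp_mkl_weight_update([4.0, 1.0, 0.25], p=1.5)
```

The alternating scheme (SVM step with the weighted kernel sum, then this update) is the workhorse behind the gradient oracle; kernels with larger block norms receive larger weights, and as p → 1 the weights concentrate on fewer kernels.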
Now we turn our attention to the second requirement of the active set algorithm: a condition to verify the optimality of a candidate solution. The following theorem provides a sufficient condition for verifying the optimality of a candidate solution.

Theorem 5 Suppose the active set W is such that W = hull(W). Let (η_W, α_W) be an ϵ_W-approximate optimal solution of (14), obtained from Algorithm 2. Then, it is an optimal solution for (12) with a duality gap less than ϵ if the following condition holds:

max_{u ∈ sources(W^c)} α_W^⊤ Y K_u Y α_W ≤ ( Σ_{w∈W} ζ_w(η_W) (α_W^⊤ Y H_w Y α_W)^{ρ̄} )^{1/ρ̄} + 2(ϵ − ϵ_W),    (15)

where K_u = Σ_{w∈D(u)} H_w / ( Σ_{v∈A(w)∩D(u)} d_v )².

The proof is provided in Appendix A.7 and closely follows that for the case of HKL (Bach, 2008). The summary of the proposed mirror descent based active set algorithm is presented in Algorithm 3. At each iteration, Algorithm 3 verifies the optimality of the current iterate by checking the condition in (15). In case the current iterate does not satisfy this condition, the nodes in sources(W^c) that violate (15) are included in the active set.⁸ This takes care of the third requirement of the active set algorithm. The algorithm terminates when the condition (15) is satisfied by the iterate.

In the following, an estimate of the computational complexity of the active set algorithm is presented. Let W denote the final active set size. The optimization problem (14) needs to be solved at most W times, assuming the worst case scenario of adding one node per active set iteration. Each run of the mirror descent algorithm requires at most O(log(W)) iterations (Ben-Tal and Nemirovski, 2001; Beck and Teboulle, 2003). A conservative time complexity estimate for computing the gradient ∇g(η_W) by solving a variant of the ℓ_ρ̂-norm MKL problem (13) is O(m³T³W²). This amounts to O(m³T³W³ log(W)) overall. As for the computational cost of the sufficient condition, let z denote the maximum out-degree of a node in G, i.e., z is an upper bound on the maximum number of children of any node in G. Then the size of sources(W^c) is upper bounded by Wz. Hence, a total of O(ωm²T²Wz) operations are required for evaluating the matrices K in (15), where ω is the complexity of computing a single entry of any K. In all the pragmatic examples of kernels and the corresponding DAGs provided by Bach (2008), ω is polynomial in the training set dimensions. Moreover, caching of K usually renders ω a constant (Bach, 2009). Further, the total cost of the quadratic computations in (15) is O(m²T²W²z). Thus the overall computational complexity is O(m³T³W³ log(W) + ωm²T²Wz + m²T²W²z). More importantly, because the sufficient condition for optimality (Theorem 5) is independent of ρ, we have the following result:

Corollary 6 In a given input setting, the HKL algorithm converges in time polynomial in the size of the active set and the training set dimensions if and only if the proposed mirror descent based active set algorithm (i.e., the gHKL_MT algorithm) converges in time polynomial in the active set and training set sizes.

The proof is provided in Appendix A.10. In the next section, we present an application of the proposed formulation that illustrates the benefits of the proposed generalizations over HKL.

8. It is easy to see that with this update scheme, W always equals hull(W), as required in Theorem 5.

5. Rule Ensemble Learning

In this section, we propose a solution to the problem of learning an ensemble of decision rules, formally known as Rule Ensemble Learning (REL) (Cohen and Singer, 1999), employing the gHKL and gHKL_MT formulations. For the sake of simplicity, we only discuss the single task REL setting in this section, i.e., REL as an application of gHKL. Similar ideas can be applied to perform REL in the multi-task learning setting by employing gHKL_MT.
In fact, we present empirical results for REL in both single and multiple task learning settings in Section 6. We begin with a brief introduction to REL. If-then decision rules (Rivest, 1987) are one of the most expressive and human readable representations for learned hypotheses. A decision rule is a simple logical pattern of the form: IF condition THEN decision. The condition consists of a conjunction of a small number of simple boolean statements (propositions) concerning the values of the individual input variables, while the decision specifies a value of the function being learned. An instance of a decision rule from Quinlan's play-tennis example (Quinlan, 1986) is: IF HUMIDITY==normal AND WIND==weak THEN PlayTennis==yes. The dominant paradigm for the induction of rule sets, in the form of decision list (DL) models for classification (Rivest, 1987; Michalski, 1983; Clark and Niblett, 1989), has been a greedy sequential covering procedure. REL is a general approach that treats decision rules as base classifiers in an ensemble. This is in contrast to the more restrictive decision list models, which are disjunctive sets of rules that use only one rule in the set for each prediction. As pointed out in Cohen and Singer (1999), boosted rule ensembles are in fact simpler and better understood formally than other state-of-the-art rule learners, and also produce comparable predictive accuracy. REL approaches like SLIPPER (Cohen and Singer, 1999), LRI (Weiss and Indurkhya, 2000), RuleFit (Friedman and Popescu, 2008) and ENDER/MLRules (Dembczyński et al., 2008, 2010) have additionally addressed the problem of learning a compact set of rules that generalizes well, in order to maintain readability. Further, a number of rule learners like RuleFit and LRI encourage shorter rules (i.e., fewer conjunctions in the condition part of the rule) or rules with a restricted number of conjunctions, again for purposes of interpretability.
We build upon this and define our REL problem as that of learning a small set of simple rules, and their weights, that leads to good generalization on new and unseen data. The next section introduces the notation and the setup in the context of REL.

5.1 Notations and Setup

Let D = {(x_1, y_1), …, (x_m, y_m)} be the training data described using p basic (boolean) propositions, i.e., x_i ∈ {0, 1}^p. In case the input features are not boolean, such propositions can be derived using logical operators such as ==, ≠, ≤ or ≥ over the input features (refer to Friedman and Popescu, 2008; Dembczyński et al., 2008, for details). Let V be an index set for all possible conjunctions of the p basic propositions and let φ_v : R^n → {0, 1} denote the vth conjunction in V. Let f_v ∈ R denote the weight for the conjunction φ_v. Then, the rule ensemble to be learnt is the weighted combination of these conjunctive rules: F(x) = Σ_{v∈V} f_v φ_v(x) − b, where possibly many weights f_v are equal to zero. One way to learn the weights is by performing an ℓ1-norm regularized risk minimization in order to select a few promising conjunctive rules (Friedman and Popescu, 2008; Dembczyński et al., 2008, 2010). However, to the best of our knowledge, rule ensemble learners that identify the need for a sparse f either approximate such a regularized solution using strategies such as shrinkage (RuleFit, ENDER/MLRules) or resort to post-pruning (SLIPPER). This is because the size of the minimization problem is exponential in the number of basic propositions, and hence the problem becomes computationally intractable even with moderately sized data sets. Secondly, conjunctive rules involving a large number of propositions might be selected; such long conjunctions adversely affect interpretability. We present an approach based on the gHKL framework that addresses these issues. We begin by noting that V is a subset lattice; hereafter it will be referred to as the conjunction lattice.
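The conjunction features, and the base kernels k_v(x_i, x_j) = φ_v(x_i)φ_v(x_j) built from them in Section 5.2, can be sketched directly. Enumerating the full lattice, as below, is only feasible for tiny p, which is exactly why the active set method matters; the representation of a conjunction as a tuple of proposition indices is our illustrative assumption:

```python
from itertools import combinations
import numpy as np

def phi(x, v):
    """Conjunction feature: 1 iff every basic proposition in v holds for x."""
    return int(all(x[j] for j in v))

def conjunction_kernel(X, v):
    """Base kernel k_v(x_i, x_j) = phi_v(x_i) * phi_v(x_j) on boolean data X."""
    f = np.array([phi(x, v) for x in X])
    return np.outer(f, f)

# Tiny example: p = 3 basic propositions, so the full lattice has 2^3 = 8 nodes.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]])
p = X.shape[1]
lattice = [v for r in range(p + 1) for v in combinations(range(p), r)]
K = {v: conjunction_kernel(X, v) for v in lattice}
```

Each Gram matrix is a rank-one 0/1 outer product, so evaluating a candidate conjunction against the data is cheap; the cost lies entirely in the number of candidates.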
In a conjunction lattice, for v1, v2 ∈ V, v1 ⪯ v2 if and only if the set of propositions in conjunction v1 is a subset of those in conjunction v2. As an example, (HUMIDITY==normal) is considered to be a subset of (HUMIDITY==normal AND WIND==weak). The top node of this lattice is the node with the empty conjunction and is also sources(V). Its children, the second-level nodes, are the p basic propositions. The third-level nodes, children of these basic propositions, are the conjunctions of length two, and so on. The bottom node, at the (p + 1)th level, is the conjunction of all the basic propositions. The number of different conjunctions of length r is (p choose r), and the total number of nodes in the conjunction lattice is 2^p. Figure 1 shows a complete conjunction lattice with p = 4. We now discuss how the proposed gHKL regularizer (5) provides an efficient and optimal solution to a regularized empirical risk minimization formulation for REL.

Figure 1: Example of a complete conjunction lattice over four basic propositions defined on the input features x1, x2, x3 and x4. The number of nodes in a conjunction lattice is exponential in the number of basic propositions; in this example it is 16 (= 2⁴).

5.2 Rule Ensemble Learning with gHKL

The key idea is to employ the gHKL formulation (5) with the conjunction lattice as the DAG and the kernels k_v(x_i, x_j) = φ_v(x_i)φ_v(x_j), for learning an ensemble of rules. Note that with such a setup, the ℓ1/ℓρ block-norm regularizer in gHKL (Ω_S(f) = Σ_{v∈V} d_v ‖f_{D(v)}‖_ρ) implies: 1) for most v ∈ V, f_v = 0, and 2) for most v ∈ V, f_w = 0 ∀ w ∈ D(v).
In the context of the REL problem, the former statement is equivalent to saying that the selection of a compact set of conjunctions is promoted, while the second reads as: the selection of conjunctive rules with a small number of propositions is encouraged. Thus, the gHKL formulation constructs a compact ensemble of simple conjunctive rules. In addition, we set d_v = a^{|S_v|} (a > 1), where S_v is the set of basic propositions involved in the conjunction φ_v. Such a choice further encourages the selection of short conjunctions and leads to the following elegant computational result:

Theorem 7 The complexity of the proposed gHKL algorithm in solving the REL problem, with the DAG, the base kernels and the parameters d_v as defined above, is polynomial in the size of the active set and the training set dimensions. In particular, if the final active set size is W, then the complexity is O(m³W³ log(W) + m²W²p).

The proof is provided in Appendix A.11. We end this section by noting the advantage of the generic regularizer in the gHKL formulation over that in the HKL formulation in the context of the REL application. Recall that the sparsity pattern allowed by HKL has the following consequence: a conjunction is selected only after selecting all the conjunctions which are subsets of it. This, particularly in the context of REL, is psycho-visually redundant, because a rule with k propositional statements, if included in the result, necessarily entails the inclusion of (2^k − 1) more general rules. This violates the important requirement of a small set (Friedman and Popescu, 2008; Dembczyński et al., 2008, 2010) of human-readable rules. The gHKL regularizer with ρ ∈ (1, 2) alleviates this restriction by promoting additional sparsity in the selection of conjunctions. We empirically evaluate the proposed gHKL based solution for the REL application in the next section.

6.
Experimental Results

In this section, we report the results of simulations in REL on several benchmark binary and multiclass classification data sets from the UCI repository (Blake and Lichman, 2013). The goal is to compare various rule ensemble learners on the basis of: (a) generalization, measured by the predictive performance on unseen test data, and (b) the ability to provide a compact set of simple rules, which facilitates readability and interpretability (Friedman and Popescu, 2008; Dembczyński et al., 2010; Cohen and Singer, 1999). The latter is judged using i) the average number of rules learnt, and ii) the average number of propositions per rule. The following REL approaches were compared.

RuleFit: The rule ensemble learning algorithm proposed by Friedman and Popescu (2008). All the parameters were set to the default values mentioned by the authors. In particular, the model was run in the mixed linear-rule mode, the average tree size was set to 4 and the maximum number of trees was kept at 500. The same configuration was also used by Dembczyński et al. (2008, 2010) in their simulations. This REL system cannot handle multi-class data sets and hence is limited to the simulations on binary classification data sets. Its code is available at www-stat.stanford.edu/~jhf/R-RuleFit.html.

SLI: The SLIPPER algorithm proposed by Cohen and Singer (1999). Following Dembczyński et al. (2008, 2010), all parameters were set to their defaults. We retained the internal cross-validation for selecting the optimal number of rules.

ENDER: The state-of-the-art rule ensemble learning algorithm of Dembczyński et al. (2010). For the classification setting, ENDER is the same as MLRules (Dembczyński et al., 2008). The parameters were set to the default values suggested by the authors. The second order heuristic was used for minimization. Its code is available at www.cs.put.poznan.pl/wkotlowski.

HKL-ℓ1-MKL: A two-stage rule ensemble learning approach.
In the first stage, HKL is employed to prune the exponentially large search space of all possible conjunctive rules and select a set of candidate rules (kernels). The rule ensemble is then learnt by employing ℓ1-MKL over the candidate set of rules. In both stages, a three-fold cross validation procedure was employed to tune the C parameter over the values {10⁻³, 10⁻², …, 10³}.

gHKL_ρ: The proposed gHKL based REL formulation for binary classification problems. We considered three different values of ρ: 2, 1.5 and 1.1. Note that for binary classification, ρ = 2 renders the HKL formulation (Bach, 2008). In each case, a three-fold cross validation procedure was employed to tune the C parameter over the values {10⁻³, 10⁻², …, 10³}. As mentioned earlier, the parameters d_v = 2^{|v|}.

gHKL_MT,ρ: The proposed gHKL_MT based REL formulation for multiclass classification problems. For each class, a one-vs-rest binary classification task is created. Since we did not have any prior knowledge about the correlation among the classes in the data sets, we employed the multi-task regularizer (7) in the gHKL_MT primal formulation (6). We considered three different values of ρ: 2, 1.5 and 1.1. Its parameters and cross validation details are the same as those of gHKL_ρ. The implementations of both gHKL_ρ and gHKL_MT,ρ are available at http://www.cse.iitb.ac.in/~pratik.j/ghkl.

Note that the above methods differ in the way they control the number of rules (M) in the ensemble. In the case of gHKL_ρ (gHKL_MT,ρ), M implicitly depends on the parameters ρ, C and d_v. SLI has a parameter for the maximum number of rules, M_max, and M is decided via an internal cross-validation such that M ≤ M_max. For the sake of fairness in the comparison with gHKL_ρ, we set M_max = max(M_1.5, M_1.1), where M_ρ is the average number of rules obtained with gHKL_ρ (gHKL_MT,ρ). ENDER has an explicit parameter for the number of rules, which is also set to max(M_1.5, M_1.1).
In the case of RuleFit, the number of rules in the ensemble is determined internally and is not changed by us.

6.1 Binary Classification in REL

This section summarizes our results on binary REL classification. Table 1 provides the details of the binary classification data sets. For every data set, we created 10 random train-test splits with 10% train data (except for the MONK-3 data set, whose train-test split of 122/432 instances was already given in the UCI repository). Since many data sets were highly unbalanced, we report the average F1-score along with the standard deviation (Table 5 in Appendix A.12 reports the average AUC). The results are presented in Table 2. The best result, in terms of the average F1-score, for each data set is highlighted.

Data set            Num   Bias   p    |V|
TIC-TAC-TOE (TIC)   958   1.89   54   ≈ 10^16
B-CANCER-W (BCW)    699   0.53   72   ≈ 10^21
DIABETES (DIA)      768   0.54   64   ≈ 10^19
HABERMAN (HAB)      306   0.36   40   ≈ 10^12
HEARTC (HTC)        296   0.85   78   ≈ 10^23
BLOOD TRANS (BLD)   748   3.20   32   ≈ 10^9
HEARTSTAT (HTS)     270   0.8    76   ≈ 10^22
MONK-3 (MK3)        554   1.08   30   ≈ 10^9
VOTE (VTE)          232   0.87   32   ≈ 10^9
B-CANCER (BCC)      277   0.41   76   ≈ 10^22
MAM. MASS (MAM)     829   0.94   46   ≈ 10^13
LIVER (LIV)         345   1.38   48   ≈ 10^14

Table 1: Data sets used for binary REL classification. Num is the number of instances in the data set, while Bias denotes the ratio of the number of positive to negative instances. The number of basic propositions is p, and |V| represents the total number of possible conjunctions. For each numerical input feature, 8 basic propositions were derived. The letters in brackets are the acronyms used for the corresponding data sets in Table 2.
TIC: RuleFit 0.517±0.092 (57.7, 2.74); SLI 0.665±0.053 (10.3, 1.96); ENDER 0.668±0.032 (187, 3.17); HKL-ℓ1-MKL 0.749±0.040 (74.8, 1.89); gHKL_2 0.889±0.093 (161.7, 1.72); gHKL_1.5 0.897±0.093 (186.6, 1.76); gHKL_1.1 0.905±0.096* (157.6, 1.72)
BCW: RuleFit 0.879±0.025 (17.5, 2.03); SLI 0.928±0.018 (4.4, 1.15); ENDER 0.900±0.041 (21, 1.56); HKL-ℓ1-MKL 0.925±0.032 (27, 1.03); gHKL_2 0.923±0.032 (30.9, 1); gHKL_1.5 0.924±0.032 (20, 1.03); gHKL_1.1 0.925±0.032 (20.4, 1.02)
DIA: RuleFit 0.428±0.052 (32.9, 2.66); SLI 0.659±0.027 (4.9, 1.42); ENDER 0.656±0.027 (74.0, 2.65); HKL-ℓ1-MKL 0.658±0.028 (47.6, 1.40); gHKL_2 0.661±0.018 (83.2, 1.31); gHKL_1.5 0.663±0.017 (73.2, 1.17); gHKL_1.1 0.661±0.023 (62.6, 1.27)
HAB: RuleFit 0.175±0.079 (7.5, 1); SLI 0.483±0.057 (2.1, 1); ENDER 0.474±0.057 (52, 3.59); HKL-ℓ1-MKL 0.506±0.048 (45.6, 1.48); gHKL_2 0.523±0.062 (112.1, 1.366); gHKL_1.5 0.521±0.060 (51.2, 1.235); gHKL_1.1 0.521±0.060 (17.1, 1.142)
HTC: RuleFit 0.581±0.047 (8.8, 1); SLI 0.727±0.05 (3.2, 1.23); ENDER 0.724±0.032 (32, 2.05); HKL-ℓ1-MKL 0.750±0.038 (32.9, 1.09); gHKL_2 0.743±0.038 (46.7, 1.06); gHKL_1.5 0.735±0.058 (23.9, 1); gHKL_1.1 0.736±0.055 (32, 1.09)
BLD: RuleFit 0.163±0.088 (40.7, 2.26); SLI 0.476±0.057 (2.0, 1); ENDER 0.433±0 (63, 1.97); HKL-ℓ1-MKL 0.572±0.029 (175.9, 2.13); gHKL_2 0.586±0.029 (229.7, 1.98); gHKL_1.5 0.587±0.028 (62.8, 1.79); gHKL_1.1 0.588±0.027 (19, 1.29)
HTS: RuleFit 0.582±0.040 (9.3, 1); SLI 0.721±0.065 (3.5, 1.07); ENDER 0.713±0.055 (25, 2.02); HKL-ℓ1-MKL 0.752±0.036 (24.6, 1.06); gHKL_2 0.747±0.031 (34.7, 1.02); gHKL_1.5 0.746±0.028 (25, 1.02); gHKL_1.1 0.747±0.028 (24.4, 1.03)
MK3: RuleFit 0.947 (52, 2.88); SLI 0.802 (1, 3); ENDER 0.972 (93, 1.96); HKL-ℓ1-MKL 0.972 (17, 1.88); gHKL_2 0.972 (200, 2.07); gHKL_1.5 0.972 (93, 1.84); gHKL_1.1 0.972 (7, 1.43)
VTE: RuleFit 0.913±0.047 (2.7, 1); SLI 0.935±0.055 (1.3, 1.15); ENDER 0.951±0.035 (9, 1.07); HKL-ℓ1-MKL 0.927±0.045 (23.5, 1.17); gHKL_2 0.93±0.042 (39, 1.11); gHKL_1.5 0.929±0.043 (8.2, 1); gHKL_1.1 0.934±0.038 (6.4, 1)
BCC: RuleFit 0.254±0.089 (8.1, 1); SLI 0.476±0.086 (1.2, 1); ENDER 0.452±0.079 (31, 2.93); HKL-ℓ1-MKL 0.588±0.057 (33.6, 1.17); gHKL_2 0.565±0.059 (39.6, 1.15); gHKL_1.5 0.563±0.061 (30.2, 1.07); gHKL_1.1 0.569±0.063 (29.4, 1.17)
MAM: RuleFit 0.668±0.032 (26.4, 2.68); SLI 0.808±0.022 (5.3, 1.43); ENDER 0.816±0.018 (48, 2.53); HKL-ℓ1-MKL 0.805±0.028 (38.7, 1.32); gHKL_2 0.796±0.026 (92.2, 1.27); gHKL_1.5 0.796±0.026 (47.6, 1.24); gHKL_1.1 0.797±0.024 (40.5, 1.25)
LIV: RuleFit 0.357±0.016 (10, 1); SLI 0.445±0.083 (1.5, 1); ENDER 0.563±0.058 (59, 2.35); HKL-ℓ1-MKL 0.585±0.071 (43.4, 1.56); gHKL_2 0.594±0.046 (242.5, 1.42); gHKL_1.5 0.595±0.048 (58.2, 1.32); gHKL_1.1 0.588±0.049 (45.7, 1.36)

Table 2: Results on binary REL classification. We report the F1-score along with the standard deviation and, in brackets, the number of learnt rules as well as the average length of the learnt rules. The proposed REL algorithm, gHKL_ρ (ρ = 1.5, 1.1), obtains better generalization performance than the state-of-the-art ENDER on most data sets, with the additional advantage of learning a smaller set of more compact rules. The * symbol denotes a statistically significant improvement. The results are averaged over ten random train-test splits.

Additionally, if the best result achieves a statistically significant improvement over its nearest competitor, it is marked with a *. The statistical significance test is performed using the paired t-test at 99% confidence. We also report the average number of rules learnt (r) and the average length of the rules (c), specified next to each F1-score as (r, c). As discussed earlier, it is desirable that REL algorithms achieve a high F1-score with a compact set of simple rules, i.e., low r and c. We can observe from Table 2 that gHKL_ρ obtains better generalization performance than the state-of-the-art ENDER on most of the data sets, with the additional advantage of having rules with a smaller number of conjunctions. In fact, when averaged over the data sets, gHKL_1.1 and gHKL_1.5 output the shortest rules among all the methods. gHKL_1.1 obtains a statistically significant improvement on the TIC-TAC-TOE data set. Though the generalization obtained by gHKL_2 (HKL), gHKL_1.5 and gHKL_1.1 is similar, the number of rules selected by gHKL_2 is always higher than that of gHKL_1.1 (by as much as 25 times in a few cases), hampering its interpretability.

6.2 Multiclass Classification in REL

This section summarizes our results on multiclass REL classification. The details of the multiclass data sets are provided in Table 3.
Within the data sets, classes with too few instances (< 3) were not considered in the simulations, since we perform a three-fold cross validation for hyper-parameter selection. The results, averaged over 10 random train-test splits with 10% train data, are presented in Table 4. Following Dembczyński et al. (2008, 2010), we report the accuracy to compare the generalization performance of the algorithms. The number of rules as well as the average length of the rules is also reported, to judge the interpretability of the output. We can observe that gHKL_MT,ρ obtains the best generalization performance in seven data sets, out of which four are statistically significant. Moreover, gHKL_MT,1.5 and gHKL_MT,1.1 usually select the shortest rules among all the methods. The number of rules as well as the average rule length of gHKL_MT,2 is generally very large compared to gHKL_MT,1.5 and gHKL_MT,1.1. This again demonstrates the suitability of the proposed ℓ1/ℓρ regularizer for obtaining a compact set of simple rules.

Data set   Num    c    p     |V|
BALANCE    625    3    32    ≈ 10^9
CAR        1728   4    42    ≈ 10^12
C.M.C.     1473   3    54    ≈ 10^16
ECOLI      332    6    42    ≈ 10^12
GLASS      214    6    72    ≈ 10^21
IRIS       150    3    50    ≈ 10^15
LYMPH      146    3    86    ≈ 10^25
T.A.E.     151    3    114   ≈ 10^34
YEAST      1484   10   54    ≈ 10^16
ZOO        101    7    42    ≈ 10^12

Table 3: Data sets used for multiclass REL classification. Num is the number of instances in the data set, while c denotes the number of classes. The number of basic propositions is p, and |V| represents the total number of possible conjunctions. For each numerical input feature, 8 basic propositions were derived.

BALANCE: SLI 0.758±0.025 (10.4, 1.7); ENDER 0.795±0.034 (112, 2.4); gHKL_MT,2 0.817±0.028 (2468.9, 2.84); gHKL_MT,1.5 0.808±0.032 (112, 1.64); gHKL_MT,1.1 0.807±0.034 (85, 1.61)
CAR: SLI 0.823±0.029 (18.3, 2.93); ENDER 0.835±0.024 (270, 3.05); gHKL_MT,2 0.864±0.020 (9571.2, 3.14); gHKL_MT,1.5 0.86±0.028 (220.3, 1.64); gHKL_MT,1.1 0.875±0.029 (269.3, 1.85)
C.M.C.: SLI 0.446±0.016 (21.1, 1.9); ENDER 0.485±0.015 (513, 4.36); gHKL_MT,2 0.472±0.014 (10299.3, 2.85); gHKL_MT,1.5 0.463±0.017 (512.9, 1.95); gHKL_MT,1.1 0.465±0.016 (396.4, 1.88)
ECOLI: SLI 0.726±0.042 (7.8, 1.34); ENDER 0.636±0.028 (35, 2.15); gHKL_MT,2 0.779±0.057 (4790.2, 2.99); gHKL_MT,1.5 0.784±0.045 (34.3, 1.05); gHKL_MT,1.1 0.778±0.054 (32.4, 1.16)
GLASS: SLI 0.43±0.061 (7.4, 1.41); ENDER 0.465±0.052 (70, 3.21); gHKL_MT,2 0.501±0.049 (5663.7, 2.40); gHKL_MT,1.5 0.525±0.043 (69.1, 1.15); gHKL_MT,1.1 0.524±0.046 (54.6, 1.04)
IRIS: SLI 0.766±0.189 (2.2, 1.02); ENDER 0.835±0.093 (10, 1.34); gHKL_MT,2 0.913±0.083 (567, 2.44); gHKL_MT,1.5 0.927±0.024 (9.8, 1); gHKL_MT,1.1 0.893±0.091 (8.6, 1)
LYMPH: SLI 0.61±0.066 (2.7, 1); ENDER 0.706±0.058 (34, 2.2); gHKL_MT,2 0.709±0.061 (4683.8, 2.30); gHKL_MT,1.5 0.724±0.078 (33.7, 1.01); gHKL_MT,1.1 0.722±0.078 (33, 1.01)
T.A.E.: SLI 0.334±0.035 (1.1, 1); ENDER 0.41±0.065 (39, 1.86); gHKL_MT,2 0.418±0.049 (5707.4, 2.25); gHKL_MT,1.5 0.399±0.049 (38.3, 1.00); gHKL_MT,1.1 0.402±0.046 (38.1, 1.05)
YEAST: SLI 0.478±0.035 (23.4, 1.63); ENDER 0.497±0.015 (218, 5.78); gHKL_MT,2 0.487±0.021 (8153.6, 2.85); gHKL_MT,1.5 0.485±0.022 (217.8, 1.80); gHKL_MT,1.1 0.486±0.021 (179.6, 1.73)
ZOO: SLI 0.556±0.062 (7.1, 1.24); ENDER 0.938±0.033 (33, 1.29); gHKL_MT,2 0.877±0.06 (3322.2, 2.70); gHKL_MT,1.5 0.928±0.037 (32.3, 1.00); gHKL_MT,1.1 0.927±0.039 (31.9, 1.01)

Table 4: Results on multiclass REL classification. We report the accuracy along with the standard deviation and, in brackets, the number of learnt rules as well as the average length of the learnt rules. The proposed REL algorithm, gHKL_MT,ρ, obtains the best generalization performance on most data sets. In addition, for ρ = 1.5 and 1.1, our algorithm learns a smaller set of more compact rules than the state-of-the-art ENDER. The * symbol denotes a statistically significant improvement. The results are averaged over ten random train-test splits.

7. Conclusion

This paper generalizes the HKL framework in two ways. First, a generic ℓ1/ℓρ block-norm regularizer, ρ ∈ (1, 2], is employed that provides a more flexible kernel selection pattern than HKL by mitigating the weight bias towards the kernels nearer to the sources of the DAG. Second, the framework is further generalized to the setup of learning a shared feature representation among multiple related tasks.
We pose the problem of learning shared features across the tasks as that of learning a shared kernel. An efficient mirror descent based active set algorithm is proposed to solve the generalized formulations (gHKL/gHKLMT). An interesting computational result is that gHKL/gHKLMT can be solved in time polynomial in the active set and training set sizes whenever the HKL formulation can be solved in polynomial time. The other important contribution of this paper is the application of the proposed gHKL/gHKLMT formulations in the setting of Rule Ensemble Learning (REL), where HKL has not been previously explored. We pose the problem of learning an ensemble of propositional rules as a kernel learning problem. Empirical results on binary as well as multiclass classification for REL demonstrate the effectiveness of the proposed generalizations.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. We acknowledge Chiranjib Bhattacharyya for initiating discussions on optimal learning of rule ensembles. Pratik Jawanpuria acknowledges support from an IBM Ph.D. fellowship.

Appendix A.

In this appendix, we provide the proofs of the theorems and lemmas referred to in the main paper.

A.1 Lemma 26 of Micchelli and Pontil (2005)

Let $a_i \geq 0,\ i = 1, \ldots, d$, $1 \leq r < \infty$ and $\Delta_{d,r} = \{ z \in \mathbb{R}^d \mid z \geq 0,\ \sum_{i=1}^d z_i^r \leq 1 \}$. Then, the following result holds:
$$\min_{z \in \Delta_{d,r}} \sum_{i=1}^d \frac{a_i}{z_i} = \Big( \sum_{i=1}^d a_i^{\frac{r}{r+1}} \Big)^{\frac{r+1}{r}}.$$
The minimum is attained at $z_i = a_i^{\frac{1}{r+1}} \big( \sum_{j=1}^d a_j^{\frac{r}{r+1}} \big)^{-\frac{1}{r}},\ i = 1, \ldots, d$. The proof follows from Hölder's inequality.

A.2 Proof of Lemma 1

Proof  Applying the above lemma (Appendix A.1) on the outermost ℓ1-norm of the regularizer $\Omega_T(f_1, \ldots, f_T)^2$ in (6), we get
$$\Omega_T(f_1, \ldots, f_T)^2 = \min_{\gamma \in \Delta_1} \sum_{v \in V} \frac{d_v^2}{\gamma_v} \Big( \sum_{w \in D(v)} \big( Q_w(f_1, \ldots, f_T) \big)^{\rho} \Big)^{\frac{2}{\rho}},$$
where $\Delta_1 = \{ z \in \mathbb{R}^{|V|} \mid z \geq 0,\ \sum_{v \in V} z_v \leq 1 \}$. Reapplying the above lemma on the individual terms of the above summation gives
$$\Big( \sum_{w \in D(v)} \big( Q_w(f_1, \ldots, f_T)^2 \big)^{\frac{\rho}{2}} \Big)^{\frac{2}{\rho}} = \min_{\lambda_v \in \Delta^v_{\hat\rho}} \sum_{w \in D(v)} \frac{Q_w(f_1, \ldots, f_T)^2}{\lambda_{vw}},$$
where $\hat\rho = \frac{\rho}{2 - \rho}$ and $\Delta^v_r = \{ z \in \mathbb{R}^{|D(v)|} \mid z \geq 0,\ \sum_{w \in D(v)} z_w^r \leq 1 \}$. Using the above two results and regrouping the terms will complete the proof.

A.3 Re-parameterization of the Multi-task Regularizer in (8)

The gHKLMT dual formulation (10) follows from the representer theorem (Schölkopf and Smola, 2002) after employing the following re-parameterization in (8). Define $f_{0w} = \frac{1}{T+\mu} \sum_{t=1}^T f_{tw}$ and $\tilde{f}_{tw} = f_{tw} - f_{0w}$. Then, $Q_w(f_1, \ldots, f_T)$ in (8) may be rewritten as:
$$Q_w(f_1, \ldots, f_T) = \Big( \mu \| f_{0w} \|^2 + \sum_{t=1}^T \| \tilde{f}_{tw} \|^2 \Big)^{\frac{1}{2}}.$$
Further, construct the following feature map (Evgeniou and Pontil, 2004)
$$\Phi_w(x, t) = \Big( \tfrac{\phi_w(x)}{\sqrt{\mu}},\ \underbrace{0, \ldots, 0}_{\text{for tasks before } t},\ \phi_w(x),\ \underbrace{0, \ldots, 0}_{\text{for tasks after } t} \Big)$$
and define $\tilde{f}_w = (\sqrt{\mu} f_{0w}, \tilde{f}_{1w}, \ldots, \tilde{f}_{Tw})$. With the above definitions, we rewrite the gHKLMT primal regularizer as well as the prediction function: $Q_w(f_1, \ldots, f_T)^2 = \| \tilde{f}_w \|^2$ and $F_t(x) = \sum_{w \in V} \langle \tilde{f}_w, \Phi_w(x, t) \rangle + b_t\ \forall t$. It follows from Lemma 1 that the gHKLMT primal problem based on (8) is equivalent to the following optimization problem:
$$\min_{\gamma \in \Delta_1}\ \min_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v \in V}\ \min_{\tilde{f}, b}\ \frac{1}{2} \sum_{w \in V} \delta_w(\gamma, \lambda)^{-1} \| \tilde{f}_w \|^2 + C \sum_{t=1}^{T} \sum_{i=1}^{m} \ell(y_{ti}, F_t(x_{ti})), \qquad (17)$$
where $\tilde{f} = (\tilde{f}_w)_{w \in V}$ and $b = [b_1, \ldots, b_T]$.

A.4 Motivation for the Active Set Algorithm

Lemma 8  The problem (12) remains the same whether solved with the original set of variables (η) or when solved with only those $\eta_v \neq 0$ at optimality.

Proof  The above follows from the following reasoning: a) the variables η owe their presence in (12) only via the ζ(η) functions; b) $(\eta_v = 0) \Rightarrow (\zeta_w(\eta) = 0\ \forall w \in D(v))$; c) let $(\eta^*, \alpha^*)$ be an optimal solution of the problem (12). If $\zeta_v(\eta^*) = 0$ and $\eta^*_v \neq 0$, then $(\bar\eta, \alpha^*)$ is also an optimal solution of the problem (12), where $\bar\eta_w = \eta^*_w\ \forall w \in V \setminus v$ and $\bar\eta_v = 0$; and d) the min-max interchange in (12) yields an equivalent formulation.
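The feature map $\Phi_w$ constructed in A.3 implies that the joint Gram matrix factorizes as $\langle \Phi_w(x, t), \Phi_w(x', t') \rangle = k_w(x, x') \big( \tfrac{1}{\mu} + \mathbb{1}[t = t'] \big)$, where $k_w(x, x') = \phi_w(x)^\top \phi_w(x')$. The following sketch verifies this identity numerically for a linear base kernel; the function names and toy data are ours, not from the paper:

```python
import numpy as np

def multitask_gram(X, tasks, mu):
    """Closed-form multi-task Gram matrix implied by the feature map in A.3:
    H[i, j] = k(x_i, x_j) * (1/mu + [t_i == t_j]), with a linear base kernel."""
    K = X @ X.T                                   # base kernel k_w(x_i, x_j)
    same_task = (tasks[:, None] == tasks[None, :])  # indicator [t_i == t_j]
    return K * (1.0 / mu + same_task)

def multitask_gram_via_features(X, tasks, mu, T):
    """Same matrix built explicitly from
    Phi(x, t) = (phi(x)/sqrt(mu), 0, ..., 0, phi(x), 0, ..., 0),
    where phi(x) occupies the shared block and the t-th task block."""
    n, d = X.shape
    Phi = np.zeros((n, d * (T + 1)))
    for i in range(n):
        Phi[i, :d] = X[i] / np.sqrt(mu)           # shared block, scaled by 1/sqrt(mu)
        t = tasks[i]
        Phi[i, d * (t + 1): d * (t + 2)] = X[i]   # task-specific block
    return Phi @ Phi.T

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
tasks = np.array([0, 0, 1, 1, 2, 2])
H1 = multitask_gram(X, tasks, mu=0.5)
H2 = multitask_gram_via_features(X, tasks, mu=0.5, T=3)
assert np.allclose(H1, H2)
```

Larger μ shrinks the shared component $f_{0w}$, decoupling the tasks; smaller μ couples them more strongly, which is the usual trade-off in the Evgeniou-Pontil construction.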
Lemma 9  The following min-max interchange is equivalent:
$$\min_{\eta \in \Delta_1}\ \max_{\alpha_t \in S(y_t, C)\ \forall t} G(\eta, \alpha) = \max_{\alpha_t \in S(y_t, C)\ \forall t}\ \min_{\eta \in \Delta_1} G(\eta, \alpha),$$
where
$$G(\eta, \alpha) = \mathbf{1}^\top \alpha - \frac{1}{2} \Big( \sum_{w \in V} \zeta_w(\eta) \big( \alpha^\top Y H_w Y \alpha \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}}, \qquad \bar\rho = \tfrac{\rho}{2(\rho - 1)}.$$

Proof  Note that G(η, α) is a convex function in η and a concave function in α. The min-max interchange follows from the Sion-Kakutani minimax theorem (Sion, 1958).

A.5 Proof of Theorem 3

Before stating the proof of Theorem 3, we first prove the results in Lemma 10, Proposition 11 and Lemma 12, which will be employed therein (also see Bach, 2009, Lemma 10 and Proposition 11).

Lemma 10  Let $a_i > 0\ \forall i = 1, \ldots, d$, $1 < r < \infty$ and $\Delta_1 = \{ z \in \mathbb{R}^d \mid z \geq 0,\ \sum_{i=1}^d z_i = 1 \}$. Then, the following holds true:
$$\min_{z \in \Delta_1} \sum_{i=1}^d a_i z_i^r = \Big( \sum_{i=1}^d a_i^{\frac{1}{1-r}} \Big)^{1-r},$$
and the minimum is attained at $z_i = a_i^{\frac{1}{1-r}} \big( \sum_{j=1}^d a_j^{\frac{1}{1-r}} \big)^{-1},\ i = 1, \ldots, d$.

Proof  Take the vectors $u_1$ and $u_2$ as those with entries $a_i^{\frac{1}{r}} z_i$ and $a_i^{-\frac{1}{r}}$, $i = 1, \ldots, d$, respectively. The result follows from Hölder's inequality: $u_1^\top u_2 \leq \| u_1 \|_r \| u_2 \|_{\frac{r}{r-1}}$. Note that if any $a_i = 0$, then the optimal value of the above optimization problem is zero.

Proposition 11  The following convex optimization problems are dual to each other and there is no duality gap:
$$\max_{\gamma \in \Delta_1}\ \sum_{w \in V} \delta_w(\gamma, \lambda) M_w, \qquad (18)$$
$$\min_{\kappa \in L}\ \max_{u \in V}\ \frac{1}{d_u^2} \sum_{w \in D(u)} \kappa_{uw}^2 \lambda_{uw} M_w, \qquad (19)$$
where $L = \{ \kappa \in \mathbb{R}^{|V| \times |V|} \mid \kappa \geq 0,\ \sum_{v \in A(w)} \kappa_{vw} = 1,\ \kappa_{vw} = 0\ \forall v \in A(w)^c,\ \forall w \in V \}$, $\Delta_1 = \{ z \in \mathbb{R}^{|V|} \mid z \geq 0,\ \sum_{v \in V} z_v \leq 1 \}$ and $M_w \geq 0\ \forall w \in V$.

Proof  The optimization problem (19) may be equivalently rewritten as:
$$\min_{\kappa \in L}\ \min_{A}\ A, \quad \text{subject to} \quad A \geq \frac{1}{d_u^2} \sum_{w \in D(u)} \kappa_{uw}^2 \lambda_{uw} M_w\ \ \forall u \in V$$
$$= \min_{\kappa \in L}\ \max_{\gamma \in \Delta_1}\ \sum_{u \in V} \frac{\gamma_u}{d_u^2} \sum_{w \in D(u)} \kappa_{uw}^2 \lambda_{uw} M_w \qquad \text{(Lagrangian dual with respect to } A)$$
$$= \max_{\gamma \in \Delta_1}\ \min_{\kappa \in L}\ \sum_{w \in V} M_w \sum_{u \in A(w)} \frac{\gamma_u \lambda_{uw}}{d_u^2} \kappa_{uw}^2 \qquad \text{(min-max interchange and rearranging terms)}$$
$$= \max_{\gamma \in \Delta_1}\ \sum_{w \in V} M_w \Big( \sum_{u \in A(w)} \frac{d_u^2}{\gamma_u \lambda_{uw}} \Big)^{-1} \qquad \text{(Lemma 10 with respect to the variables } \kappa)$$
$$= \max_{\gamma \in \Delta_1}\ \sum_{w \in V} \delta_w(\gamma, \lambda) M_w.$$

Lemma 12  The following min-max interchange is equivalent:
$$\min_{\gamma \in \Delta_1}\ \min_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v \in V}\ \max_{\alpha_t \in S(y_t, C)\ \forall t} G(\gamma, \lambda, \alpha) = \max_{\alpha_t \in S(y_t, C)\ \forall t}\ \min_{\gamma \in \Delta_1}\ \min_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v \in V} G(\gamma, \lambda, \alpha),$$
where G(γ, λ, α) is as defined in (10).

Proof  We proceed by applying a change of variables.
Note that $\gamma_v = 0$ implies that the variables $\lambda_{vw}$ ($\forall w \in D(v)$) do not influence the objective of the optimization problem (10). This follows from the definition of the δ(γ, λ) function. Hence, we define $\beta_{vw} = \gamma_v \lambda_{vw}\ \forall w \in D(v)$, as this is a one-to-one transformation whenever $\gamma_v \neq 0$ (see also Szafranski et al., 2010). The gHKL dual (10) (the L.H.S. of the proposed lemma) can be equivalently rewritten as:
$$\min_{\substack{\beta_{vw} \geq 0\ \forall w \in D(v), v \in V \\ \sum_{v \in V} \| \beta_{vD(v)} \|_{\hat\rho} \leq 1}}\ \max_{\alpha_t \in S(y_t, C)\ \forall t} G(\beta, \alpha),$$
where $\beta_{vD(v)} = (\beta_{vw})_{w \in D(v)}$, $G(\beta, \alpha) = \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top Y \big( \sum_{w \in V} \delta_w(\beta) H_w \big) Y \alpha$, and $\delta_w(\beta)^{-1} = \sum_{v \in A(w)} \frac{d_v^2}{\beta_{vw}}$. Note that $\delta_w(\beta)$ is a concave function of β (in the given feasibility set) and hence G(β, α) is a convex-concave function with convex and compact feasibility sets. Therefore, we obtain $\min_\beta \max_\alpha G(\beta, \alpha) = \max_\alpha \min_\beta G(\beta, \alpha)$ (with the constraints over β and α as stated above) by applying the Sion-Kakutani minimax theorem (Sion, 1958). Finally, we revert to the original variables (γ, λ) by substituting $\gamma_v = \big( \sum_{w \in D(v)} \beta_{vw}^{\hat\rho} \big)^{\frac{1}{\hat\rho}}\ \forall v \in V$ and $\lambda_{vw} = \frac{\beta_{vw}}{\gamma_v}\ \forall w \in D(v),\ v \in V$ s.t. $\gamma_v \neq 0$. This gives us the equivalent R.H.S.

Now we begin the proof of Theorem 3.

Proof  From Lemma 12, the gHKL dual (10) can be equivalently written as:
$$\max_{\alpha \in S(y, C)}\ \mathbf{1}^\top \alpha - \frac{1}{2} \max_{\gamma \in \Delta_1}\ \max_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v \in V}\ \sum_{w \in V} \delta_w(\gamma, \lambda)\, \alpha^\top Y H_w Y \alpha, \qquad (20)$$
where $\hat\rho = \frac{\rho}{2 - \rho}$. In the following, we equivalently rewrite the second part of the above formulation:
$$O = \max_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v}\ \max_{\gamma \in \Delta_1}\ \sum_{w \in V} \delta_w(\gamma, \lambda)\, \underbrace{\alpha^\top Y H_w Y \alpha}_{M_w}$$
$$= \max_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v}\ \min_{\kappa \in L}\ \max_{u \in V}\ \frac{1}{d_u^2} \sum_{w \in D(u)} \kappa_{uw}^2 \lambda_{uw} M_w \qquad \text{(Proposition 11)}$$
$$= \max_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v}\ \min_{\kappa \in L}\ \min_{A} A \quad \text{s.t.} \quad A \geq \frac{1}{d_u^2} \sum_{w \in D(u)} \kappa_{uw}^2 \lambda_{uw} M_w\ \forall u \in V \qquad \text{(eliminating the max over } u)$$
$$= \min_{\kappa \in L}\ \min_{A}\ \max_{\lambda_v \in \Delta^v_{\hat\rho}\ \forall v} A \quad \text{s.t. the above constraints} \qquad \text{(Sion-Kakutani theorem)}$$
$$= \min_{\kappa \in L}\ \min_{A} A \quad \text{s.t.} \quad A \geq \frac{1}{d_v^2} \Big( \sum_{w \in D(v)} \big( \kappa_{vw}^2 M_w \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}}\ \forall v \in V \qquad \Big(\text{Hölder's inequality, } \bar\rho = \tfrac{\hat\rho}{\hat\rho - 1}\Big)$$
$$= \min_{\kappa \in L}\ \max_{u \in V}\ \frac{1}{d_u^2} \Big( \sum_{w \in D(u)} \big( \kappa_{uw}^2 M_w \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} \qquad \text{(eliminating } A). \qquad (21)$$

Now consider the problem $O^{\bar\rho} = \min_{\kappa \in L} \max_{u \in V} d_u^{-2\bar\rho} \sum_{w \in D(u)} (\kappa_{uw}^2 M_w)^{\bar\rho}$. Its Lagrangian is
$$\mathcal{L}(\kappa, A, \eta) = A + \sum_{v \in V} \eta_v \Big( d_v^{-2\bar\rho} \sum_{w \in D(v)} \big( \kappa_{vw}^2 M_w \big)^{\bar\rho} - A \Big).$$
Minimization of $\mathcal{L}$ with respect to A leads to the constraint $\eta \in \Delta_1$. Hence, we have:
$$O^{\bar\rho} = \max_{\eta \in \Delta_1}\ \min_{\kappa \in L}\ \sum_{v \in V} \sum_{w \in D(v)} \eta_v\, d_v^{-2\bar\rho}\, \kappa_{vw}^{2\bar\rho}\, M_w^{\bar\rho}.$$
Using the special structure of L, the above can be rewritten as:
$$O^{\bar\rho} = \max_{\eta \in \Delta_1}\ \sum_{w \in V} M_w^{\bar\rho}\ \min_{\kappa_{\cdot w} \in \Delta^{A(w)}}\ \sum_{v \in A(w)} \eta_v\, d_v^{-2\bar\rho}\, \kappa_{vw}^{2\bar\rho},$$
where $\Delta^{A(w)} = \{ \kappa \in \mathbb{R}^{|A(w)|} \mid \kappa \geq 0,\ \sum_{v \in A(w)} \kappa_{vw} = 1 \}$. By applying Lemma 10 with respect to the variables κ, we obtain the following equivalence:
$$\min_{\kappa_{\cdot w} \in \Delta^{A(w)}}\ \sum_{v \in A(w)} \eta_v\, d_v^{-2\bar\rho}\, \kappa_{vw}^{2\bar\rho} = \zeta_w(\eta) = \Big( \sum_{v \in A(w)} d_v^{\rho}\, \eta_v^{1-\rho} \Big)^{\frac{1}{1-\rho}}. \qquad (22)$$
From the above two results, we obtain the following equivalent dual of (21):
$$O = \max_{\eta \in \Delta_1}\ \Big( \sum_{w \in V} \zeta_w(\eta) \big( \alpha^\top Y H_w Y \alpha \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}}. \qquad (23)$$
Substituting O in (20) by the above (23) and again interchanging the min-max completes the proof.

A.6 Proof of Theorem 4

Proof  We begin by noting that $\zeta_v(\eta)$ ($v \in V$) is a concave function of η for all v (this is because when ρ ∈ (1, 2], $\zeta_v$ is a weighted q-norm in η, where q ∈ [−1, 0), and hence is concave in the first quadrant). By simple observations regarding operations preserving convexity, we have that the objective in (13) is a convex function of η for a fixed value of α. Hence g(η), which is a point-wise maximum over convex functions, is itself convex. The expression for ∇g(η) is computed by employing Danskin's theorem (Bertsekas, 1999, Proposition B.25) and is as follows:
$$(\nabla g(\eta))_i = -\frac{(1-\varepsilon)}{2\bar\rho}\, d_i^{\rho} \Big( (1-\varepsilon)\eta_i + \frac{\varepsilon}{|V|} \Big)^{-\rho} \sum_{u \in D(i)} \zeta^s_u(\eta)^{\rho} \big( \bar\alpha^\top Y H_u Y \bar\alpha \big)^{\bar\rho}\ \Big( \sum_{w \in V} \zeta^s_w(\eta) \big( \bar\alpha^\top Y H_w Y \bar\alpha \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho} - 1},$$
where $\bar\rho = \frac{\rho}{2(\rho - 1)}$, $\zeta^s_w(\eta) = \zeta_w\big( (1-\varepsilon)\eta + \frac{\varepsilon}{|V|} \big)$, i.e., the smoothed $\zeta_w(\eta)$, and $\bar\alpha$ is an optimal solution of problem (13) at the η where the gradient is to be computed. Write $|(\nabla g(\eta))_i| = \frac{(1-\varepsilon)}{2\bar\rho} P_1 P_2$, where $P_1$ denotes the product of $d_i^\rho ((1-\varepsilon)\eta_i + \varepsilon/|V|)^{-\rho}$ and the summation over $u \in D(i)$, and $P_2$ denotes the last bracketed factor.

Next, we show that g is Lipschitz continuous by showing that its gradient is bounded. Firstly, ρ ∈ (1, 2] and hence $\bar\rho \in [1, \infty)$. Next, let the minimum and maximum eigenvalues over all $H_w$ ($w \in V$) be θ and σ respectively. Then we have $\theta \|\bar\alpha\|^2 \leq \bar\alpha^\top Y H_w Y \bar\alpha \leq \sigma \|\bar\alpha\|^2$. Using this, we obtain $\sum_{w \in V} \zeta^s_w(\eta) (\bar\alpha^\top Y H_w Y \bar\alpha)^{\bar\rho} \geq \theta^{\bar\rho} \|\bar\alpha\|^{2\bar\rho} \sum_{w \in V} \zeta^s_w(\eta)$. Note that $\sum_{w \in V} \zeta^s_w(\eta) \geq \zeta^s_r(\eta)$, where $r \in \text{sources}(V)$, and $\zeta^s_r(\eta) \geq d_{\max}^{\frac{\rho}{1-\rho}} \frac{\varepsilon}{|V|}$, where $d_{\max}$ is the maximum of $d_v$ ($v \in V$). Thus we obtain:
$$P_2 \leq \Big( \theta^{\bar\rho}\, \|\bar\alpha\|^{2\bar\rho}\, \frac{\varepsilon}{|V|}\, d_{\max}^{\frac{\rho}{1-\rho}} \Big)^{\frac{1}{\bar\rho} - 1}.$$
Now, it is easy to see that $\forall u \in D(i)$, $d_i^\rho \big( (1-\varepsilon)\eta_i + \frac{\varepsilon}{|V|} \big)^{-\rho} \zeta^s_u(\eta)^\rho \leq d_{\min}^{\frac{\rho}{1-\rho}}$, where $d_{\min}$ is the minimum of $d_v$ ($v \in V$). Hence $P_1 \leq |V|\, \sigma^{\bar\rho}\, \|\bar\alpha\|^{2\bar\rho}\, d_{\min}^{\frac{\rho}{1-\rho}}$. In addition, since $0 \leq \bar\alpha \leq C$, we have $\|\bar\alpha\| \leq \sqrt{mT}\, C$. Summarizing these findings, we obtain the following bound on the gradient:
$$\|\nabla g(\eta)\|_\infty \leq \frac{(1-\varepsilon)}{2\bar\rho}\, mTC^2\, \theta^{1-\bar\rho}\, \sigma^{\bar\rho}\, \varepsilon^{\frac{\rho-2}{\rho}}\, |V|^{\frac{2-\rho}{\rho}+1}\, d_{\min}^{\frac{\rho}{1-\rho}}\, d_{\max}^{\frac{2-\rho}{\rho-1}}.$$
The proof will be similar for gHKLMT formulations in other learning settings.

A.7 Proof of Theorem 5

Proof  Given a candidate solution $\bar\eta$ and $\bar\alpha = [\bar\alpha_1^\top, \ldots, \bar\alpha_T^\top]^\top$ (with associated primal variables $(\bar f = (\bar f_1, \ldots, \bar f_T), \bar b, \bar\xi)$), the duality gap (D) between the two variational formulations in Lemma 9 satisfies:
$$D = \max_{\hat\alpha_t \in S(y_t, C)\ \forall t} G(\bar\eta, \hat\alpha) - \min_{\hat\eta \in \Delta_1} G(\hat\eta, \bar\alpha) \leq \frac{1}{2}\Omega_T(\bar f)^2 + C \mathbf{1}^\top \bar\xi - \min_{\hat\eta \in \Delta_1} G(\hat\eta, \bar\alpha)$$
$$= \underbrace{\frac{1}{2}\Omega_T(\bar f)^2 + C \mathbf{1}^\top \bar\xi - \mathbf{1}^\top \bar\alpha}_{\text{gap in solving with fixed } \bar\eta} + \underbrace{\frac{1}{2} \max_{\hat\eta \in \Delta_1} \Big( \sum_{w \in V} \zeta_w(\hat\eta) \big( \bar\alpha^\top Y H_w Y \bar\alpha \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}}}_{\text{gap in solving with fixed } \bar\alpha}.$$
With this upper bound on the duality gap, it is easy to see that the following condition is sufficient for the reduced solution (with active set W) to have D ≤ ϵ:
$$\Big( \sum_{w \in V} \zeta_w(\bar\eta) \big( \bar\alpha^\top Y H_w Y \bar\alpha \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} \leq \Omega_T(\bar f_W)^2 + 2(\epsilon - \epsilon_W), \qquad (25)$$
where $\epsilon_W$ is the duality gap⁹ associated with the computation of the dual variables $\bar\alpha_W$. Here, as well as in the rest of the proof, the subscript $(\cdot)_W$ denotes the value of the variable obtained when the gHKLMT formulation is solved with V restricted to the active set W. In Appendix A.5, we proved that the L.H.S. of the above inequality is equal to the R.H.S. of (21), i.e.,
$$\Big( \sum_{w \in V} \zeta_w(\bar\eta)\, M_w^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} = \min_{\kappa \in L}\ \max_{v \in V}\ \frac{1}{d_v^2} \Big( \sum_{w \in D(v)} \big( \kappa_{vw}^2 M_w \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}}, \qquad (26)$$
where $M_w = \bar\alpha_W^\top Y H_w Y \bar\alpha_W$. Next, we obtain an upper bound on the above by substituting a feasible $\kappa \in L$ in the R.H.S. of (26). In particular, we employ the following: the value of $\kappa_{vw}\ \forall v, w \in W$ is obtained by solving the small¹⁰ problem (14). This is fine because W = hull(W). For $v \in W^c$ and $w \in W$, by the definition of L and W, we have $\kappa_{vw} = 0$. Next, $\kappa_{vw}$ is set to zero $\forall v \in W, w \in W^c$.
For the remaining $\kappa_{vw}$, i.e., for $v \in W^c$ and $w \in W^c$, we use the value of κ obtained by solving (21) with $\bar\rho = 1$, i.e., $\kappa_{vw} = d_v \big( \sum_{u \in A(v) \cap W^c} d_u \big)^{-1}$ (also see Section A.5 of Bach, 2009). Note that the above constructed value of κ is feasible in the set L. With this choice of κ substituted in the R.H.S. of (26), we have the following inequalities:
$$\Big( \sum_{w \in V} \zeta_w(\bar\eta) \big( \bar\alpha_W^\top Y H_w Y \bar\alpha_W \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} \leq \max\Bigg\{ \Omega_T(\bar f_W)^2,\ \max_{u \in W^c} \Big( \sum_{w \in D(u)} \Big( \frac{\bar\alpha_W^\top Y H_w Y \bar\alpha_W}{\big( \sum_{v \in A(w) \cap W^c} d_v \big)^2} \Big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} \Bigg\} \qquad (\text{specific choice of } \kappa)$$
$$\leq \max\Bigg\{ \Omega_T(\bar f_W)^2,\ \max_{u \in \text{sources}(W^c)} \Big( \sum_{w \in D(u)} \Big( \frac{\bar\alpha_W^\top Y H_w Y \bar\alpha_W}{\big( \sum_{v \in A(w) \cap W^c} d_v \big)^2} \Big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} \Bigg\} \qquad (\because W = \text{hull}(W))$$
$$\leq \max\Bigg\{ \Omega_T(\bar f_W)^2,\ \max_{u \in \text{sources}(W^c)} \Big( \sum_{w \in D(u)} \Big( \frac{\bar\alpha_W^\top Y H_w Y \bar\alpha_W}{\big( \sum_{v \in A(w) \cap D(u)} d_v \big)^2} \Big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}} \Bigg\} \qquad \Big(\because \sum_{v \in A(w) \cap W^c} d_v \geq \sum_{v \in A(w) \cap D(u)} d_v\Big)$$
$$\leq \max\Bigg\{ \Omega_T(\bar f_W)^2,\ \max_{u \in \text{sources}(W^c)} \sum_{w \in D(u)} \frac{\bar\alpha_W^\top Y H_w Y \bar\alpha_W}{\big( \sum_{v \in A(w) \cap D(u)} d_v \big)^2} \Bigg\} \qquad \big(\because \|\beta\|_{\bar\rho} \leq \|\beta\|_1\ \forall \bar\rho \geq 1\big).$$
Employing the above upper bound in (25) leads to the result in Theorem 5. Note that, in practice, the last upper bound is not loose for the Rule Ensemble Learning (REL) application. This is because most of the matrices, especially those near the bottom of the lattice, will be (near) zero-matrices: the larger the conjunctive rule, the fewer the examples which may satisfy it.

9. This is given by the gap associated with the ˆρ-norm MKL solver employed in the mirror descent algorithm for solving the small problem (14).
10. The value of $\kappa_{vw}$ ($\forall v, w \in W$) obtained in this manner satisfies the constraint set L restricted to W, i.e., $L_W = \{ \kappa \in \mathbb{R}^{|W| \times |W|} \mid \kappa \geq 0,\ \sum_{v \in A(w)} \kappa_{vw} = 1,\ \kappa_{vw} = 0\ \forall v \in A(w)^c \cap W,\ \forall w \in W \}$.

A.8 gHKLMT with General Convex Loss Functions

In this section, we present the extension of the proposed algorithm to other learning settings such as regression. In particular, we consider the case where the loss function ℓ(·, ·) is a general convex loss function such as the hinge loss, the square loss, the Huber loss, etc. The gHKLMT primal formulation with a general convex loss function ℓ(·, ·) was given in equation (6). The specialized gHKLMT dual formulation corresponding to (6) is as follows:
$$\min_{\eta \in \Delta_1}\ \max_{\alpha_t \in \mathbb{R}^m,\ \mathbf{1}^\top \alpha_t = 0\ \forall t}\ -C \sum_{t=1}^T \sum_{i=1}^m \phi_{ti}^*\Big( -\frac{\alpha_{ti}}{C} \Big) - \frac{1}{2} \Big( \sum_{w \in V} \zeta_w(\eta) \big( \alpha^\top H_w \alpha \big)^{\bar\rho} \Big)^{\frac{1}{\bar\rho}},$$
where $\alpha = [\alpha_1^\top, \ldots, \alpha_T^\top]^\top$, $\zeta_w(\eta) = \big( \sum_{v \in A(w)} d_v^{\rho} \eta_v^{1-\rho} \big)^{\frac{1}{1-\rho}}$ (refer to Theorem 3 for details) and $\phi_{ti}^*$ denotes the Fenchel¹¹ conjugate (Boyd and Vandenberghe, 2004) of the function $\phi_{ti} : z \mapsto \ell(y_{ti}, z)$.

A.9 Prediction Function for gHKLMT with the Hinge Loss Function

Let the final active set be W and $(\bar\eta_W, \bar\alpha_W)$ be the optimal solution of (12). Then the prediction function for an instance $x_{tj}$ belonging to the t-th task is given by
$$F_t(x_{tj}) = (\bar\alpha_W \circ y)^\top \sum_{w \in W} \theta_w\, \zeta_w(\bar\eta_W)^{\frac{1}{\bar\rho}}\, H_w(\cdot, x_{tj}),$$
where the symbol ∘ denotes the element-wise product, $H_w$ is the kernel matrix corresponding to the multi-task kernel (11), $H_w(\cdot, x_{tj}) = \big( (H_w(x_{t'i}, x_{tj}))_{i=1}^m \big)_{t'=1}^T$ and
$$\theta_w = \frac{\Big( \zeta_w(\bar\eta_W)^{\frac{1}{\bar\rho}}\, \bar\alpha_W^\top Y H_w Y \bar\alpha_W \Big)^{\bar\rho - 1}}{\Big( \sum_{v \in W} \big( \zeta_v(\bar\eta_W)^{\frac{1}{\bar\rho}}\, \bar\alpha_W^\top Y H_v Y \bar\alpha_W \big)^{\bar\rho} \Big)^{\frac{\bar\rho - 1}{\bar\rho}}}.$$

A.10 Proof of Corollary 6

Note that proving that the computation of the matrices $K_u$ ($u \in \text{sources}(W^c)$) in (15) takes time polynomial in the active set and training set sizes suffices to prove the corollary. This is because all the other steps in Algorithms 3 and 2 are of polynomial time complexity (discussed in Section 4). We begin the proof by introducing some indexing notation related to the multi-task matrices. Let the entries in $H_w$, the mT × mT multi-task kernel matrix, be arranged in the following form: the entry corresponding to the input pair $(x_{t_1 i}, x_{t_2 j})$ is in the $((t_1 - 1)m + i)$-th row and $((t_2 - 1)m + j)$-th column of $H_w$. Next, we observe that the expression for $K_u$ in Theorem 5 may be rewritten as:
$$K_u = \sum_{w \in D(u)} \frac{K_w \odot K_T}{\big( \sum_{v \in A(w) \cap D(u)} d_v \big)^2} = T_u \odot K_T, \quad \text{where}\ T_u = \sum_{w \in D(u)} \frac{K_w}{\big( \sum_{v \in A(w) \cap D(u)} d_v \big)^2},$$
and where: i) $K_w$ is an mT × mT matrix corresponding to the base kernel $k_w$ and constructed from the inputs of all the tasks, ii) $K_T$ is an mT × mT matrix such that the entry corresponding to the $((t_1 - 1)m + i)$-th row and $((t_2 - 1)m + j)$-th column (1 ≤ i, j ≤ m) of $K_T$ is $B(t_1, t_2)$, and iii) ⊙ is the symbol for the element-wise (Hadamard) product.

11. The Fenchel conjugate $\phi^*(z)$ of a convex function $\phi(u)$ is given by $\phi^*(z) = \sup_u\ z u - \phi(u)$.
As an example, for the hinge loss $\phi(u) = \ell(u, y) = \max(0, 1 - uy)$, we have $\phi^*(z) = zy$ if $zy \in [-1, 0]$, and $\infty$ otherwise.

In the above expression, $K_u$ is computable in polynomial time if and only if $T_u$ is computable in polynomial time. The proof of the corollary follows from observing the expression of the sufficient condition for optimality of HKL (Bach, 2009, Equation 21), which also involves the term $T_u$.

A.11 Proof of Theorem 7

Given an active set W of size W, proving that the computational complexity of the verification of the sufficient condition of optimality (15) is polynomial in terms of the active set and training set sizes suffices to prove Theorem 7. This is because all the other steps in Algorithms 3 and 2 are of polynomial time complexity (discussed in Section 4). In the REL setup, the DAG is the conjunction lattice and the embedded kernels $k_v\ \forall v \in V$ may be rewritten as:
$$k_v(x_i, x_j) = \phi_v(x_i)^\top \phi_v(x_j) = \Big( \bigodot_{c \in S_v} \phi_c(x_i) \Big)^\top \Big( \bigodot_{c \in S_v} \phi_c(x_j) \Big) = \prod_{c \in S_v} k_c(x_i, x_j),$$
where $S_v$ is the set of basic propositions involved in the conjunction $\phi_v$ and ⊙ is the symbol for the element-wise (Hadamard) product. The kernels corresponding to the basic propositions are in fact the base kernels embedded in the second-level nodes of the lattice V. Employing the above definition of $k_v(x_i, x_j)$, the matrices $K_u$ (in the L.H.S. of (15)) are computed as:
$$K_u = \sum_{w \in D(u)} \frac{K_w}{\big( \sum_{v \in A(w) \cap D(u)} d_v \big)^2} = \frac{1}{a^{2|S_u|}} \Big( \bigodot_{c \in S_u} K_c \Big) \odot \bigodot_{c \in B \setminus S_u} \Big( \frac{K_c}{(1+a)^2} + \mathbf{1}\mathbf{1}^\top \Big),$$
where $K_c$ is the kernel matrix corresponding to the basic proposition $\phi_c$, B is the set of all basic propositions, and the parameters $d_v$ ($v \in V$) are defined as $d_v = a^{|S_v|}$ ($a > 0$). It is obvious that a trivial computational complexity of computing $K_u$ ($u \in V$) is $O(pm^2)$. In practice, this complexity can be reduced to $O(m^2)$ by caching the matrices $K_u$. For illustration, suppose $K_{u_1}$ needs to be computed, given that $K_{u_0}$ is cached and $u_0$ is a parent of $u_1$. Let the extra basic proposition contained in $\phi_{u_1}$ (with respect to $\phi_{u_0}$) be $\phi_e$.
Then $K_{u_1}$ can be calculated as follows:
$$K_{u_1} = \frac{1}{a^2}\, K_{u_0} \odot K_e \oslash \Big( \frac{K_e}{(1+a)^2} + \mathbf{1}\mathbf{1}^\top \Big),$$
where ⊘ is the symbol for element-wise division of matrices. Hence, plugging the REL-specific values into the runtime complexity of the gHKL algorithm, ω = constant and z = p, the runtime complexity of the gHKL based REL algorithm is $O(m^3 W^3 \log(W) + m^2 W^2 p)$.

A.12 REL Binary Classification Results in AUC

Table 5 reports the REL binary classification results in AUC (area under the ROC curve). The experimental details (and results measured in F1-score) are discussed in Section 6.

       RuleFit       SLI           ENDER         HKL           gHKLρ
                                                 (ℓ1-MKL)      ρ = 2         ρ = 1.5       ρ = 1.1
TIC    0.736±0.05    0.482±0.21    0.783±0.036   0.836±0.024   0.967±0.023   0.973±0.02    0.975±0.018
BCW    0.941±0.011   0.917±0.051   0.958±0.039   0.981±0.008   0.984±0.005   0.93±0.099    0.93±0.099
DIA    0.67±0.027    0.576±0.115   0.761±0.02    0.746±0.050   0.766±0.046   0.733±0.058   0.636±0.118
HAB    0.537±0.054   0.17±0.155    0.575±0.039   0.524±0.078   0.556±0.07    0.482±0.11    0.383±0.166
HTC    0.764±0.03    0.541±0.215   0.805±0.031   0.802±0.085   0.837±0.035   0.763±0.12    0.753±0.118
BLD    0.546±0.06    0.175±0.256   0.68±0.028    0.660±0.025   0.667±0.034   0.634±0.028   0.519±0.079
HTS    0.765±0.028   0.712±0.085   0.801±0.022   0.825±0.032   0.849±0.021   0.83±0.027    0.811±0.056
MK3    0.972         0.632         0.998         0.995         1             0.998         0.957
VTE    0.955±0.022   0.919±0.048   0.965±0.014   0.977±0.009   0.972±0.016   0.948±0.015   0.945±0.016
BCC    0.578±0.05    0.469±0.078   0.622±0.043   0.627±0.063   0.637±0.055   0.576±0.089   0.513±0.124
MAM    0.818±0.02    0.763±0.08    0.887±0.006   0.866±0.028   0.882±0.023   0.85±0.032    0.839±0.03
LIV    0.607±0.017   0.093±0.168   0.619±0.038   0.619±0.074   0.623±0.038   0.583±0.11    0.565±0.109

Table 5: Results on binary REL classification. We report the average AUC along with standard deviation, over ten random train-test splits.

References

J. Aflalo, A. Ben-Tal, C. Bhattacharyya, J. Saketha Nath, and S. Raman. Variable sparsity kernel learning. Journal of Machine Learning Research, 12:565–592, 2011.
M. A. Alvarez, L. Rosasco, and N. D. Lawrence.
Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73:243–272, 2008.
F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008.
F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical report, INRIA, France, 2009.
F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning, 2004.
J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
S. Ben-David and R. Schuller. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73:273–287, 2008.
A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications. MPS/SIAM Series on Optimization, 1, 2001.
D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.
W. W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In AAAI Conference on Artificial Intelligence, 1999.
T. H.
Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
K. Dembczyński, W. Kotłowski, and R. Słowiński. Maximum likelihood rule ensembles. In Proceedings of the International Conference on Machine Learning, 2008.
K. Dembczyński, W. Kotłowski, and R. Słowiński. ENDER: a statistical framework for boosting decision rules. Data Mining and Knowledge Discovery, 21:52–90, 2010.
T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. Annals of Applied Statistics, 2:916–954, 2008.
L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, 2008.
A. Jain, S. V. N. Vishwanathan, and M. Varma. SPG-GMKL: Generalized multiple kernel learning with a million kernels. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
P. Jawanpuria and J. S. Nath. Multi-task multiple kernel learning. In SIAM International Conference on Data Mining, 2011.
P. Jawanpuria and J. S. Nath. A convex feature learning formulation for latent task structure discovery. In Proceedings of the International Conference on Machine Learning, 2012.
P. Jawanpuria, J. S. Nath, and G. Ramakrishnan. Efficient rule ensemble learning using hierarchical kernels. In Proceedings of the International Conference on Machine Learning, 2011.
P. Jawanpuria, M. Varma, and J. S. Nath. On p-norm path following in multiple kernel learning for non-linear feature selection. In Proceedings of the International Conference on Machine Learning, 2014.
M. Kloft, U. Brefeld, S.
Sonnenburg, and A. Zien. ℓp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.
G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, 2007.
J. Liu and J. Ye. Efficient ℓ1/ℓq norm regularization. Technical Report arXiv:1009.4766, 2010.
K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In Proceedings of the Annual Conference on Learning Theory, 2009.
C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
R. S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–161, 1983.
S. Negahban and M. Wainwright. Phase transitions for high-dimensional joint support recovery. In Advances in Neural Information Processing Systems, 2009.
G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39:1–17, 2011.
F. Orabona, J. Luo, and B. Caputo. Multi kernel learning with online-batch optimization. Journal of Machine Learning Research, 13:227–253, 2012.
J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, 1999.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
R. L. Rivest. Learning decision lists. Machine Learning, 2:229–246, 1987.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
D. Sheldon.
Graphical multi-task learning. Technical report, Cornell University, 2008.
M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 1958.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. In Advances in Neural Information Processing Systems, 2007.
M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. Machine Learning, 79:73–103, 2010.
B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47:349–363, 2005.
V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, and M. Varma. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems, 2010.
S. M. Weiss and N. Indurkhya. Lightweight rule induction. In Proceedings of the International Conference on Machine Learning, 2000.
Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68:49–67, 2006.