# Convex Calibrated Surrogates for Hierarchical Classification

Harish G. Ramaswamy (HARISHGURUP@CSA.IISC.ERNET.IN), Indian Institute of Science, Bangalore, India
Ambuj Tewari (TEWARIA@UMICH.EDU), University of Michigan, Ann Arbor, USA
Shivani Agarwal (SHIVANI@CSA.IISC.ERNET.IN), Indian Institute of Science, Bangalore, India

Abstract

Hierarchical classification problems are multiclass supervised learning problems with a pre-defined hierarchy over the set of class labels. In this work, we study the consistency of hierarchical classification algorithms with respect to a natural loss, namely the tree distance metric on the hierarchy tree of class labels, via the usage of calibrated surrogates. We first show that the Bayes optimal classifier for this loss classifies an instance according to the deepest node in the hierarchy such that the total conditional probability of the subtree rooted at the node is greater than 1/2. We exploit this insight to develop a new consistent algorithm for hierarchical classification that uses an algorithm known to be consistent for the multiclass classification with reject option (MCRO) problem as a subroutine. Our experiments on a number of benchmark datasets show that the resulting algorithm, which we term OvA-Cascade, gives improved performance over other state-of-the-art hierarchical classification algorithms.

1. Introduction

In many practical applications of the multiclass classification problem the class labels live in a pre-defined hierarchy. For example, in document classification the class labels are topics and they form topic hierarchies; in computational biology the class labels are protein families and they are also best organized in a hierarchy. See Figure 1 for an example hierarchy used in mood classification of speech. Such problems are commonly known in the machine learning literature as hierarchical classification.

[Figure 1. Speech-based mood classification hierarchy in the Berlin dataset (Burkhardt et al., 2005) used by Xiao et al. (2007).]

Hierarchical classification has been the subject of many studies (Wang et al., 1999; Sun & Lim, 2001; Cai & Hofmann, 2004; Dekel et al., 2004; Rousu et al., 2006; Cesa-Bianchi et al., 2006a;b; Wang et al., 2011; Gopal et al., 2012; Babbar et al., 2013; Gopal & Yang, 2013). For a detailed review and more references we refer the reader to the survey on hierarchical classification by Silla Jr. & Freitas (2011). The label hierarchy has been incorporated into the problem in various ways in different approaches. The most prevalent and technically appealing approach is to involve the hierarchy in the final evaluation metric and design an algorithm that does well on this evaluation metric. We shall work in the setting where class labels are single nodes in a tree, and use the very natural evaluation metric that penalizes predictions according to the tree distance between the prediction and the truth (Sun & Lim, 2001; Cai & Hofmann, 2004; Dekel et al., 2004). While hierarchical classification problems are actively studied, there is a gap between theory and practice: even basic statistical properties of hierarchical classification algorithms have not been examined in depth.
This paper addresses this gap, and its main contributions are summarized below:

- We show that the Bayes optimal classifier for the tree-distance loss classifies an instance according to the deepest node in the hierarchy such that the total conditional probability of the subtree rooted at the node is greater than 1/2.
- We reduce the problem of finding the Bayes optimal classifier for the tree-distance loss to the problem of finding the Bayes optimal classifier for the multiclass classification with reject option (MCRO) problem.
- We construct a convex optimization based consistent algorithm for the tree-distance loss based on the above reduction, and observe that one particular instantiation, called OvA-Cascade, can be solved using only binary SVM solvers.
- We run the OvA-Cascade algorithm on several benchmark datasets and demonstrate improved performance.

2. Preliminaries

Let the instance space be $\mathcal{X}$, and let $\mathcal{Y} = [n] = \{1, \ldots, n\}$ be a finite set of class labels. Let $H = ([n], E, W)$ be a tree over the class labels, with edge set $E$ and positive, finite edge lengths for the edges in $E$ given by $W$. Let the root node be $r \in [n]$. Let the loss function $\ell_H : [n] \times [n] \to \mathbb{R}_+$ be
$$\ell_H(y, y') = \text{shortest path length in } H \text{ between } y \text{ and } y'.$$
We call this the $H$-distance loss (or simply the tree-distance loss).

Given training examples $(X_1, Y_1), \ldots, (X_m, Y_m)$ drawn i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, the goal is to learn a prediction model $g : \mathcal{X} \to [n]$ with low expected $\ell_H$-regret, defined as
$$R^{\ell_H}_D[g] = \mathbf{E}[\ell_H(Y, g(X))] - \inf_{g' : \mathcal{X} \to [n]} \mathbf{E}[\ell_H(Y, g'(X))],$$
where expectations are over $(X, Y) \sim D$. Ideally, one wants the $\ell_H$-regret of the learned model to be close to zero. An algorithm which, when given a random training sample as above, produces a (random) model $g_m : \mathcal{X} \to [n]$ is said to be consistent w.r.t. $\ell_H$ if the $\ell_H$-regret of the learned model $g_m$ converges in probability to zero.

However, minimizing the discrete $\ell_H$-regret directly is computationally difficult; therefore one instead uses a surrogate loss function $\psi : [n] \times \mathbb{R}^d \to \mathbb{R}_+$ for some $d \in \mathbb{Z}_+$ and learns a model $f : \mathcal{X} \to \mathbb{R}^d$ by minimizing (approximately, based on the training sample) the $\psi$-error $\mathbf{E}_{(X,Y) \sim D}[\psi(Y, f(X))]$. Predictions on new instances $x \in \mathcal{X}$ are then made by applying the learned model $f$ and mapping back to predictions in the target space $[n]$ via some mapping $\Upsilon : \mathbb{R}^d \to [n]$, giving $g(x) = \Upsilon(f(x))$. Let the $\psi$-regret of a function $f : \mathcal{X} \to \mathbb{R}^d$ be
$$R^{\psi}_D[f] = \mathbf{E}[\psi(Y, f(X))] - \inf_{f' : \mathcal{X} \to \mathbb{R}^d} \mathbf{E}[\psi(Y, f'(X))].$$
Under suitable conditions, algorithms that approximately minimize the $\psi$-error based on a training sample are known to be consistent with respect to $\psi$, i.e. the $\psi$-regret of the learned model $f$ approaches zero with larger training data. [1] Also, if $\psi$ is convex in its second argument, the $\psi$-error minimization problem becomes a convex optimization problem and can be solved efficiently.

We seek a surrogate $\psi : [n] \times \mathbb{R}^d \to \mathbb{R}_+$ for some $d \in \mathbb{Z}_+$ and a predictor $\Upsilon : \mathbb{R}^d \to [n]$ such that $\psi$ is convex in its second argument and satisfies a bound of the following form, holding for all $f : \mathcal{X} \to \mathbb{R}^d$ and distributions $D$:
$$R^{\ell_H}_D[\Upsilon \circ f] \le \xi \, R^{\psi}_D[f], \qquad (1)$$
where $\xi > 0$ is a constant. A surrogate and a predictor $(\psi, \Upsilon)$ satisfying such a bound, which we call a $(\psi, \ell_H, \Upsilon)$-excess risk transform, [2] would immediately give an algorithm consistent w.r.t. $\ell_H$ from an algorithm consistent w.r.t. $\psi$. We also say that such a $(\psi, \Upsilon)$ is calibrated (Zhang, 2004; Ramaswamy & Agarwal, 2012) w.r.t. $\ell_H$.

[1] For example, an algorithm consistent w.r.t. $\psi$ can be obtained by minimizing the regularized empirical $\psi$-risk over an RKHS function class with a Gaussian kernel and a regularization parameter approaching 0 with increasing sample size.
[2] An inequality which upper bounds the $\ell_H$-regret in terms of a function $\xi$ of the $\psi$-regret, with $\xi(0) = 0$ and $\xi$ continuous at 0, would also qualify as an excess risk transform.
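Before proceeding, here is a minimal sketch (ours, not the authors' code) of the tree-distance loss $\ell_H$ defined above, for a small weighted hierarchy; the edge-list representation and the function name are illustrative assumptions.

```python
# Minimal sketch (ours): the tree-distance loss ell_H(y, y') is the length of
# the unique path between labels y and y' in the tree H = ([n], E, W).
from collections import defaultdict

def make_tree_distance(edges):
    """edges: list of (u, v, w) with positive edge lengths w.
    Returns ell(y, y_pred) = path length between the two labels."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def ell(y, y_pred):
        # Depth-first search from y; in a tree the path to y_pred is unique.
        stack, seen = [(y, 0.0)], {y}
        while stack:
            node, dist = stack.pop()
            if node == y_pred:
                return dist
            for nbr, w in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append((nbr, dist + w))
        raise ValueError("labels must belong to the same tree")

    return ell

# Example: root 1 with children 2 and 3, unit edge lengths (an unweighted tree).
ell_H = make_tree_distance([(1, 2, 1.0), (1, 3, 1.0)])
assert ell_H(2, 3) == 2.0 and ell_H(2, 2) == 0.0
```

The unweighted tree-distance loss used in the experiments of Section 7 corresponds to setting every edge length to 1.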
2.1. Conventions and Notation

$\Delta_n$ denotes the probability simplex in $\mathbb{R}^n$: $\Delta_n = \{p \in \mathbb{R}^n_+ : \sum_{i=1}^n p_i = 1\}$.

For the tree $H = ([n], E, W)$ with root $r$ we define the following objects. For every $y \in [n]$, define
- $D(y)$ = set of descendants of $y$, including $y$
- $P(y)$ = parent of $y$
- $C(y)$ = set of children of $y$
- $U(y)$ = set of ancestors of $y$, not including $y$.

For all $y \in [n]$, define the level of $y$, denoted $\mathrm{lev}(y)$, and the mapping $S_y : \Delta_n \to [0, 1]$ as
$$\mathrm{lev}(y) = |U(y)|, \qquad S_y(p) = \sum_{i \in D(y)} p_i.$$
Let the height of the tree be $h = \max_{y \in [n]} \mathrm{lev}(y)$. Define the sets $N_{=j}$, $N_{\le j}$ and scalars $\alpha_j$, $\beta_j$ for $0 \le j \le h$ as:
$$N_{=j} = \{y \in [n] : \mathrm{lev}(y) = j\}, \qquad N_{\le j} = \{y \in [n] : \mathrm{lev}(y) \le j\},$$
$$\alpha_j = \max_{y, y' \in N_{=j}} \ell_H(y, y'), \qquad \beta_j = \max_{y \in N_{=j}} \ell_H(y, P(y)).$$
By reordering the classes we ensure that $\mathrm{lev}$ is a non-decreasing function, and hence we always have that $N_{\le j} = [n_j]$ for some integers $n_j$, and $r = 1$.

For integers $0 \le j \le h$, define the function $\mathrm{anc}_j : [n] \to N_{\le j}$ and the mapping $A^j : \Delta_n \to \Delta_{n_j}$ such that for all $y \in [n]$ and $y' \in [n_j]$,
$$\mathrm{anc}_j(y) = \begin{cases} y & \text{if } \mathrm{lev}(y) \le j \\ \text{ancestor of } y \text{ at level } j & \text{otherwise} \end{cases}$$
$$A^j_{y'}(p) = \sum_{i \in [n] : \mathrm{anc}_j(i) = y'} p_i = \begin{cases} p_{y'} & \text{if } \mathrm{lev}(y') < j \\ S_{y'}(p) & \text{if } \mathrm{lev}(y') = j. \end{cases}$$
Note that in all the above definitions the only terms that depend on the edge lengths $W$ are the scalars $\alpha_j$ and $\beta_j$.

3. Bayes Optimal Classifier for the Tree-Distance Loss

In this section we characterize the Bayes optimal classifier minimizing the expected tree-distance loss. We show that such a predictor can be viewed as a "greater than 1/2 conditional probability subtree detector". We then design a scheme for computing this prediction based on this observation. The following theorem is the key result of this section; Figure 2 gives an illustration.

Theorem 1. Let $H = ([n], E, W)$ and let $\ell_H : [n] \times [n] \to \mathbb{R}_+$ be the tree-distance loss for the tree $H$. For $x \in \mathcal{X}$, let $p(x) \in \Delta_n$ be the conditional probability of the label given the instance $x$. Then there exists a $g^* : \mathcal{X} \to [n]$ such that for all $x \in \mathcal{X}$ the following holds:
(a) $S_{g^*(x)}(p(x)) \ge \frac{1}{2}$;
(b) $S_y(p(x)) \le \frac{1}{2}$ for all $y \in C(g^*(x))$.
Also, $g^*$ is a Bayes optimal classifier for the tree-distance loss, i.e. $R^{\ell_H}_D[g^*] = 0$.

[Figure 2. An example tree of height 3 with nodes $1, \ldots, 11$, together with a conditional probability vector $p(x)$ for some instance $x$ and the associated subtree probabilities $S(p(x))$: $p = \frac{1}{10}(0, 2, 0, 1, 0, 0, 2, 1, 0, 2, 2)$ and $S(p) = \frac{1}{10}(10, 3, 7, 1, 0, 1, 6, 1, 0, 2, 2)$. The Bayes optimal prediction (node 7) is shaded.]

For any instance $x$ with conditional probability $p \in \Delta_n$, Theorem 1 says that predicting the $y \in [n]$ that has the largest level among those with $S_y(p) \ge \frac{1}{2}$ is optimal. Surprisingly, this does not depend on the edge lengths $W$.
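The characterization in Theorem 1 translates directly into a simple rule once the conditional probability vector is known. The sketch below is our illustration rather than the authors' code; the parent-array encoding of the hierarchy and the 0-indexed reading of the Figure 2 tree are assumptions made for the example.

```python
# Sketch (ours): Bayes optimal prediction under the tree-distance loss, per
# Theorem 1 -- return the deepest node whose subtree probability is >= 1/2.
def bayes_optimal_prediction(parent, p):
    """parent[y] = parent of label y (None for the root); p[y] = P(Y = y | x)."""
    n = len(p)
    S = list(p)            # S[y] will accumulate the subtree probability S_y(p)
    lev = [0] * n          # lev[y] = number of ancestors of y
    for y in range(n):
        a = parent[y]
        while a is not None:
            S[a] += p[y]
            lev[y] += 1
            a = parent[a]
    # The root always has S = 1 >= 1/2, so the candidate set is non-empty.
    candidates = [y for y in range(n) if S[y] >= 0.5]
    return max(candidates, key=lambda y: lev[y])

# The tree and conditional probabilities of Figure 2, with node i stored at
# index i - 1 (so the returned index + 1 is the node label used in the text).
parent = [None, 0, 0, 1, 1, 2, 2, 5, 5, 6, 6]
p = [0, .2, 0, .1, 0, 0, .2, .1, 0, .2, .2]
print(bayes_optimal_prediction(parent, p) + 1)   # prints 7, matching Example 1
```

In practice $p(x)$ is not known, of course; reaching this rule through surrogate minimization is what the remainder of the paper develops.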
Theorem 1 suggests the following scheme to find the optimal prediction for a given instance with conditional probability $p$:

1. For each $j \in \{1, 2, \ldots, h\}$, create a multiclass problem instance with the classes being the elements of $N_{\le j} = [n_j]$, and the probability associated with each class $y \in N_{\le j}$ equal to $A^j_y(p)$, i.e. $p_y$ if $\mathrm{lev}(y) < j$ and $S_y(p)$ if $\mathrm{lev}(y) = j$.
2. For each multiclass problem $j \in \{1, 2, \ldots, h\}$, if there exists a class with probability mass at least $\frac{1}{2}$, assign it to $v^*_j$; otherwise let $v^*_j = \perp$.
3. Find the largest $j$ such that $v^*_j \ne \perp$ and return the corresponding $v^*_j$, or return the root 1 if $v^*_j = \perp$ for all $j \in [h]$.

We illustrate the above procedure for the example in Figure 2.

Example 1. From Figure 2 we have that $h = 3$. The three induced multiclass problems are given below.
1. $n_1 = 3$, and the class probabilities are $\frac{1}{10}[0, 3, 7]$. Clearly, $v^*_1 = 3$.
2. $n_2 = 7$, and the class probabilities are $\frac{1}{10}[0, 2, 0, 1, 0, 1, 6]$. Clearly, $v^*_2 = 7$.
3. $n_3 = 11$, and the class probabilities are $\frac{1}{10}[0, 2, 0, 1, 0, 0, 2, 1, 0, 2, 2]$. Clearly, $v^*_3 = \perp$.
Hence the largest $j$ such that $v^*_j \ne \perp$ is 2, and the scheme returns $v^*_2 = 7$.

The reason such a scheme is of interest to us is that its second step exactly corresponds to the Bayes optimal classifier for the abstain loss, the evaluation metric used in the MCRO problem, which we briefly explain in the next section.

4. Multiclass Classification with Reject Option

In some multiclass problems, such as medical diagnosis, it is better to abstain from predicting on instances where the learner is uncertain rather than to predict the wrong class. This feature can be incorporated via an evaluation metric called the abstain loss, and by designing algorithms that perform well on this evaluation metric instead of the standard zero-one loss. The $n$-class abstain loss $\ell_{?,n} : [n] \times ([n] \cup \{\perp\}) \to \mathbb{R}_+$ (Ramaswamy et al., 2015) is defined as
$$\ell_{?,n}(y, y') = \begin{cases} 1 & \text{if } y' \ne y \text{ and } y' \ne \perp \\ \frac{1}{2} & \text{if } y' = \perp \\ 0 & \text{if } y' = y. \end{cases}$$
It can be seen that the Bayes optimal risk for the abstain loss is attained by the function $g^* : \mathcal{X} \to ([n] \cup \{\perp\})$ given by
$$g^*(x) = \begin{cases} \mathrm{argmax}_{y \in [n]} \, p_y(x) & \text{if } \max_{y \in [n]} p_y(x) \ge \frac{1}{2} \\ \perp & \text{otherwise}, \end{cases}$$
where $p_y(x) = \mathbf{P}(Y = y \,|\, X = x)$. The $\ell_{?,n}$-regret of a function $g : \mathcal{X} \to ([n] \cup \{\perp\})$ is
$$R^{\ell_{?,n}}_D[g] = \mathbf{E}[\ell_{?,n}(Y, g(X))] - \inf_{g'} \mathbf{E}[\ell_{?,n}(Y, g'(X))].$$

Ramaswamy et al. (2015) give three different surrogates and predictors with excess risk transforms relating the surrogate regret to the $\ell_{?,n}$-regret, one of which we give below. Define the surrogate $\psi^{OvA,n} : [n] \times \mathbb{R}^n \to \mathbb{R}_+$ and predictor $\Upsilon^{OvA,n}_\tau : \mathbb{R}^n \to ([n] \cup \{\perp\})$ as
$$\psi^{OvA,n}(y, u) = \sum_{i=1}^{n} \big[ \mathbf{1}(y = i)(1 - u_i)_+ + \mathbf{1}(y \ne i)(1 + u_i)_+ \big]$$
$$\Upsilon^{OvA,n}_\tau(u) = \begin{cases} \mathrm{argmax}_{i \in [n]} \, u_i & \text{if } \max_j u_j > \tau \\ \perp & \text{otherwise}, \end{cases}$$
where $(a)_+ = \max(a, 0)$, $\tau \in (-1, 1)$ is a threshold parameter, and ties are broken arbitrarily, say in favor of the label $y$ with the smaller index. The following theorem by Ramaswamy et al. (2015) gives a $(\psi^{OvA,n}, \ell_{?,n}, \Upsilon^{OvA,n}_\tau)$-excess risk transform.

Theorem 2 (Ramaswamy et al., 2015). Let $n \in \mathbb{N}$ and $\tau \in (-1, 1)$. Let $D$ be any distribution over $\mathcal{X} \times [n]$. Then, for all $f : \mathcal{X} \to \mathbb{R}^n$,
$$R^{\ell_{?,n}}_D[\Upsilon^{OvA,n}_\tau \circ f] \le \frac{1}{2(1 - |\tau|)} \, R^{\psi^{OvA,n}}_D[f].$$

In the next section we use such surrogates, calibrated with the abstain loss, as a black box to construct calibrated surrogates for the tree-distance loss.

5. Cascade Surrogate for Hierarchical Classification

In this section we construct a template surrogate $\psi^{cas}$ and template predictor $\Upsilon^{cas}$ based on the scheme in Section 3, constituted of simpler surrogates $\psi_j$ and predictors $\Upsilon^?_j$. We then give a $(\psi^{cas}, \ell_H, \Upsilon^{cas})$-excess risk transform assuming the existence of abstain-loss excess risk transforms for the component surrogates and predictors, i.e. $(\psi_j, \ell_{?,n_j}, \Upsilon^?_j)$-excess risk transforms.

For all $j \in \{1, 2, \ldots, h\}$, let the surrogate $\psi_j : [n_j] \times \mathbb{R}^{d_j} \to \mathbb{R}_+$ and predictor $\Upsilon^?_j : \mathbb{R}^{d_j} \to ([n_j] \cup \{\perp\})$ be such that they are calibrated w.r.t. the abstain loss with $n_j$ classes, for some integers $d_j$.
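For concreteness, here is a small sketch (ours; NumPy is an assumed dependency) of the one-vs-all surrogate $\psi^{OvA,n}$ and the thresholded predictor $\Upsilon^{OvA,n}_\tau$ defined above, with `None` standing in for the abstain option $\perp$.

```python
# Sketch (ours): the OvA hinge surrogate and the thresholded abstain predictor.
import numpy as np

def psi_ova(y, u):
    """psi^{OvA,n}(y, u) for y in {0, ..., n-1} and u in R^n."""
    n = len(u)
    pos = max(1.0 - u[y], 0.0)                       # hinge for the true class
    neg = sum(max(1.0 + u[i], 0.0) for i in range(n) if i != y)
    return pos + neg

def upsilon_ova(u, tau):
    """Upsilon^{OvA,n}_tau(u): argmax_i u_i if it exceeds tau, else abstain (None)."""
    i = int(np.argmax(u))                            # ties go to the smaller index
    return i if u[i] > tau else None

u = np.array([-0.9, 0.4, -1.0])
print(psi_ova(1, u))            # 0.1 + 0.6 + 0.0 = 0.7
print(upsilon_ova(u, tau=0.0))  # 1
print(upsilon_ova(u, tau=0.5))  # None (abstain)
```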
Let $d = \sum_{j=1}^{h} d_j$, and let any $u \in \mathbb{R}^d$ be decomposed as $u = [u_1^\top, \ldots, u_h^\top]^\top$, with each $u_j \in \mathbb{R}^{d_j}$. The template surrogate, which we call the cascade surrogate $\psi^{cas} : [n] \times \mathbb{R}^d \to \mathbb{R}_+$, is defined in terms of its constituent surrogates as follows:
$$\psi^{cas}(y, u) = \sum_{j=1}^{h} \psi_j(\mathrm{anc}_j(y), u_j). \qquad (2)$$
The template predictor $\Upsilon^{cas}$ is defined via the functions $\Upsilon^{cas}_j : \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_j} \to [n_j]$, which are defined recursively as follows:
$$\Upsilon^{cas}_j(u_1, \ldots, u_j) = \begin{cases} \Upsilon^?_j(u_j) & \text{if } \Upsilon^?_j(u_j) \ne \perp \\ \Upsilon^{cas}_{j-1}(u_1, \ldots, u_{j-1}) & \text{otherwise}. \end{cases} \qquad (3)$$
The function $\Upsilon^{cas}_0$ takes no arguments and simply returns 1 (the root node). Occasionally we abuse notation by writing $\Upsilon^{cas}_j(u_1, \ldots, u_j)$ simply as $\Upsilon^{cas}_j(u)$. The template predictor $\Upsilon^{cas} : \mathbb{R}^d \to [n]$ is simply defined as
$$\Upsilon^{cas}(u) = \Upsilon^{cas}_h(u_1, \ldots, u_h).$$

The lemma below captures the essence of the reduction from the hierarchical classification problem to the MCRO problem. A proof outline is also provided.

Lemma 3. Let $H = ([n], E, W)$ be a tree with height $h$. For all $j \in [h]$, let $\alpha_j = \max_{y, y' \in N_{=j}} \ell_H(y, y')$. For any distribution $D$ over $\mathcal{X} \times [n]$, let $A^j(D)$ be the distribution over $\mathcal{X} \times [n_j]$ given by the distribution of $(X, \mathrm{anc}_j(Y))$ with $(X, Y) \sim D$. For all $j \in [h]$, let $f_j : \mathcal{X} \to \mathbb{R}^{d_j}$ be such that $f(x) = [f_1(x)^\top, \ldots, f_h(x)^\top]^\top$. Then for all distributions $D$ over $\mathcal{X} \times [n]$ and all functions $f : \mathcal{X} \to \mathbb{R}^d$,
$$R^{\ell_H}_D[\Upsilon^{cas} \circ f] \le \sum_{j=1}^{h} 2\alpha_j \, R^{\ell_{?,n_j}}_{A^j(D)}[\Upsilon^?_j \circ f_j].$$

Proof (outline). Due to linearity of expectation, it is sufficient to fix a singleton $\mathcal{X}$ and give proofs for all distributions $p \in \Delta_n$ over class labels, instead of all distributions $D$ over $\mathcal{X} \times \mathcal{Y}$. For a given conditional probability vector $p \in \Delta_n$ and vector $u \in \mathbb{R}^d$, the analysis is based on whether the abstain loss predictor at the deepest level (the level farthest from the root) abstains or not.

1. If the abstain loss predictor at the deepest level does not abstain (Case 1 in the proof), then the tree-distance regret is bounded by the maximum distance between any two nodes at the deepest level, $\alpha_h$, with a discount factor depending on the conditional probability of the predicted class. This can simply be bounded by $2\alpha_h$ times the abstain loss regret.

2. If the abstain loss predictor at the deepest level does abstain and the optimal prediction is not in the deepest level (Case 2a in the proof), then the prediction for the deepest level is correct, and hence one can show that the tree-distance regret is bounded by the tree-distance regret for the modified problem in which all the probability mass associated with the nodes in the deepest level is absorbed by their parents.

3. If the abstain loss predictor at the deepest level does abstain and the optimal prediction is in the deepest level (Case 2b in the proof), then one can bound the tree-distance regret by the sum of two terms:
(a) the abstain loss regret, weighted by twice the largest distance between any node at the deepest level and its parent, $\beta_h$; this captures the error made by choosing to predict at a shallower level than the level of the optimal prediction;
(b) the tree-distance regret on the modified problem mentioned in Case 2a; this captures the error made on shallower levels.

In all cases, the tree-distance regret can be bounded by the sum of the tree-distance regret on the modified problem and $2\alpha_h$ times the abstain loss regret. Applying this bound recursively gives the desired bound.

Lemma 3 bounds the $\ell_H$-regret on a distribution $D$ by a weighted sum of abstain loss regrets, each over a modified distribution derived from $D$.
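The cascade construction in Equations (2) and (3) composes mechanically from the per-level pieces. The sketch below is ours; the per-level surrogates `psi_j`, the abstain predictors `upsilon_j` (returning `None` for $\perp$) and the ancestor maps `anc_j` are assumed to be supplied by the caller.

```python
# Sketch (ours): the cascade surrogate psi_cas (Eq. 2) and the cascade
# predictor Upsilon_cas (Eq. 3), built from per-level components.

def psi_cascade(psi, anc, y, u):
    """psi[j](label, u_j): level-(j+1) surrogate; anc[j](y): ancestor of y at
    level j+1 (or y itself if lev(y) <= j+1); u[j]: the block u_{j+1} of u."""
    return sum(psi[j](anc[j](y), u[j]) for j in range(len(psi)))

def upsilon_cascade(upsilon, u, root=1):
    """Equation (3) unrolled: try the deepest level first; on abstain (None),
    fall back to the next shallower level, and to the root if all abstain."""
    for j in reversed(range(len(upsilon))):
        pred = upsilon[j](u[j])
        if pred is not None:
            return pred
    return root
```

With $\psi_j = \psi^{OvA,n_j}$ and $\Upsilon^?_j = \Upsilon^{OvA,n_j}_{\tau_j}$ as the components (for example, the `psi_ova` and `upsilon_ova` sketched in Section 4), this becomes the OvA-Cascade of the next section.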
Each of the components of the surrogate $\psi^{cas}$ is exactly designed to minimize the abstain loss for the corresponding modified distribution. Assuming a $(\psi_j, \ell_{?,n_j}, \Upsilon^?_j)$-excess risk transform for all $j \in [h]$, one can easily derive a $(\psi^{cas}, \ell_H, \Upsilon^{cas})$-excess risk transform as in Equation (1). This is done in the theorem below.

Theorem 4. Let $H = ([n], E, W)$ be a tree with height $h$. For all $j \in [h]$, let $\psi_j : [n_j] \times \mathbb{R}^{d_j} \to \mathbb{R}_+$ and $\Upsilon^?_j : \mathbb{R}^{d_j} \to ([n_j] \cup \{\perp\})$ be such that for all $f_j : \mathcal{X} \to \mathbb{R}^{d_j}$ and all distributions $D$ over $\mathcal{X} \times [n_j]$ we have
$$R^{\ell_{?,n_j}}_D[\Upsilon^?_j \circ f_j] \le C \, R^{\psi_j}_D[f_j]$$
for some constant $C > 0$. Then for all $f : \mathcal{X} \to \mathbb{R}^d$ and distributions $D$ over $\mathcal{X} \times [n]$,
$$R^{\ell_H}_D[\Upsilon^{cas} \circ f] \le 2C \max_{y, y' \in [n]} \ell_H(y, y') \; R^{\psi^{cas}}_D[f].$$

Hence one just needs to plug in an appropriate surrogate $\psi_j$ to get concrete consistent algorithms for hierarchical classification. The results of Ramaswamy et al. (2015) give three such surrogates, but we focus on the one-vs-all hinge surrogate here, as the resulting algorithm can be easily parallelized and gives the best empirical results.

6. OvA-Cascade Algorithm

When $\psi_j = \psi^{OvA,n_j}$ and $\Upsilon^?_j = \Upsilon^{OvA,n_j}_{\tau_j}$ for some $\tau_j \in (-1, 1)$, we call the resulting cascade surrogate $\psi^{cas}$ and predictor $\Upsilon^{cas}$ together the OvA-Cascade. In this case we have $d_j = n_j$.

In the surrogate-minimizing algorithm for OvA-Cascade, one solves $h$ one-vs-all SVM problems. Problem $j$ has $n_j$ classes: the classes corresponding to the $n_{j-1}$ nodes in the hierarchy at level less than $j$, and the $n_j - n_{j-1}$ super-nodes at level $j$, each of which also absorbs the nodes of its descendants. The resulting training and prediction algorithms can thus be simplified, and they are presented in Algorithms 1 and 2. The training phase requires an SVM optimization sub-routine, SVM-Train, which takes in a binary dataset and a regularization parameter C and returns a real-valued function over the instance space minimizing the regularized hinge loss over an appropriate function space.

Algorithm 1: OvA-Cascade Training
  Input: S = ((x_1, y_1), ..., (x_m, y_m)) in (X x [n])^m, H = ([n], E)
  Parameters: regularization parameter C > 0
  for i = 1 to n:
      t_j = 2 * 1(y_j in D(i)) - 1 for all j in [m]
      T_i = ((x_1, t_1), ..., (x_m, t_m)) in (X x {+1, -1})^m
      f_i = SVM-Train(T_i, C)
      t'_j = 2 * 1(y_j = i) - 1 for all j in [m]
      T'_i = ((x_1, t'_1), ..., (x_m, t'_m)) in (X x {+1, -1})^m
      f'_i = SVM-Train(T'_i, C)
  end for

Algorithm 2: OvA-Cascade Prediction
  Input: x in X, H = ([n], E), trained models f_i, f'_i for all i in [n]
  Parameters: scalars tau_1, ..., tau_h in (-1, 1)
  for j = h down to 1:
      construct u in R^{n_j} with u_i = f_i(x) if lev(i) = j, and u_i = f'_i(x) if lev(i) < j
      if max_i u_i > tau_j:
          return argmax_i u_i
      end if
  end for
  return 1 (the root)
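Below is a minimal runnable rendering of Algorithms 1 and 2 (ours, not the authors' implementation): scikit-learn's LinearSVC stands in for SVM-Train, and the `descendants` and `level` dictionaries describing the hierarchy are assumed to be precomputed by the caller. It also assumes every binary problem contains both signs; see Section 7.2 for the shortcut used when all training labels are leaves.

```python
# Sketch (ours): OvA-Cascade training (Algorithm 1) and prediction (Algorithm 2)
# with LinearSVC standing in for SVM-Train. descendants[i] = D(i) (including i),
# level[i] = lev(i); the root's descendant problem is all-positive and is never
# queried by Algorithm 2, so it is skipped here.
import numpy as np
from sklearn.svm import LinearSVC

def ova_cascade_train(X, y, descendants, root, C=1.0):
    f, f_exact = {}, {}
    for i in descendants:
        if i != root:
            t = np.where(np.isin(y, list(descendants[i])), 1, -1)
            f[i] = LinearSVC(C=C).fit(X, t)          # f_i: "label lies in D(i)?"
        t_exact = np.where(y == i, 1, -1)
        f_exact[i] = LinearSVC(C=C).fit(X, t_exact)  # f'_i: "label equals i?"
    return f, f_exact

def ova_cascade_predict(x, f, f_exact, level, taus, root):
    """taus: dict mapping level j (1..h) to the threshold tau_j."""
    x = np.asarray(x).reshape(1, -1)
    for j in sorted(taus, reverse=True):             # deepest level first
        u = {i: (f[i] if level[i] == j else f_exact[i]).decision_function(x)[0]
             for i in level if level[i] <= j}
        best = max(u, key=u.get)
        if u[best] > taus[j]:
            return best
    return root
```

In the experiments the regularization parameter C and the thresholds $\tau_j$ are tuned on a validation set, as described in Section 7.2.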
Theorems 2 and 4 immediately give the following corollary.

Corollary 5. Let $H = ([n], E, W)$ be a tree with height $h$. Let the component surrogates and predictors of $\psi^{cas}$ and $\Upsilon^{cas}$ be $\psi_j = \psi^{OvA,n_j}$ and $\Upsilon_j = \Upsilon^{OvA,n_j}_{\tau_j}$. Then, for all distributions $D$ and functions $f : \mathcal{X} \to \mathbb{R}^d$,
$$R^{\ell_H}_D[\Upsilon^{cas} \circ f] \le \frac{\max_{y, y' \in [n]} \ell_H(y, y')}{1 - \max_{j} |\tau_j|} \, R^{\psi^{cas}}_D[f].$$

To get the best bound from Corollary 5, one must set $\tau_j = 0$ for all $j \in [h]$. However, using a slightly more intricate version of Theorem 2 and Lemma 3, one can give a better upper bound on the $\ell_H$-regret than in Theorem 4, and this tighter upper bound is minimized for a different $\tau_j$. This observation is captured by the theorem below.

Theorem 6. Let $H = ([n], E, W)$ be a tree with height $h$. For all $j \in [h]$, let $\alpha_j = \max_{y, y' \in N_{=j}} \ell_H(y, y')$ and $\beta_j = \max_{y \in N_{=j}} \ell_H(y, P(y))$. For $j \in [h]$, let $\tau_j = \frac{\alpha_j - \beta_j}{\alpha_j + \beta_j}$. Let the component surrogates and predictors of $\psi^{cas}$ and $\Upsilon^{cas}$ be $\psi_j = \psi^{OvA,n_j}$ and $\Upsilon_j = \Upsilon^{OvA,n_j}_{\tau_j}$. Then, for all distributions $D$ and functions $f : \mathcal{X} \to \mathbb{R}^d$,
$$R^{\ell_H}_D[\Upsilon^{cas} \circ f] \le \frac{1}{2} \max_{j \in [h]} (\alpha_j + \beta_j) \, R^{\psi^{cas}}_D[f].$$

One can clearly see the effect of the improved bound given by setting $\tau_j$ as in Theorem 6 for the unweighted hierarchy, in which case $\alpha_j = 2j$ and $\beta_j = 1$.

Corollary 7. Let the hierarchy $H$ be an unweighted tree with all edges having length 1. Let the component surrogates and predictors of $\psi^{cas}$ and $\Upsilon^{cas}$ be $\psi_j = \psi^{OvA,n_j}$ and $\Upsilon_j = \Upsilon^{OvA,n_j}_{\tau_j}$.
(a) For all $j \in [h]$ let $\tau_j = 0$. Then, for all distributions $D$ and functions $f : \mathcal{X} \to \mathbb{R}^d$,
$$R^{\ell_H}_D[\Upsilon^{cas} \circ f] \le 2h \, R^{\psi^{cas}}_D[f].$$
(b) For all $j \in [h]$ let $\tau_j = \frac{2j - 1}{2j + 1}$. Then, for all distributions $D$ and functions $f : \mathcal{X} \to \mathbb{R}^d$,
$$R^{\ell_H}_D[\Upsilon^{cas} \circ f] \le \left(h + \tfrac{1}{2}\right) R^{\psi^{cas}}_D[f].$$

Thus setting $\tau_j = \frac{2j - 1}{2j + 1}$ gives almost a factor 2 improvement over setting $\tau_j = 0$. This threshold setting is also intuitively satisfying: it says to use a higher threshold and predict conservatively (abstain more often) at deeper levels, and to use a lower threshold and predict aggressively at levels nearer to the root. In practice, the optimal thresholds are distribution dependent and are best obtained via cross-validation.

7. Experiments

We run our cascade surrogate based algorithm for hierarchical classification on some standard document classification tasks with a class hierarchy and compare the results against other standard algorithms. We use the unweighted tree-distance loss as the evaluation metric. The details of the datasets and the algorithms are given below.

7.1. Datasets

We used several standard multiclass document classification datasets, all of which have one class label per example. The basic statistics of the datasets are given in Table 1.

- CLEF (Dimitrovski et al., 2011): medical X-ray images organized according to a hierarchy.
- IPC [3]: patents organized according to the International Patent Classification hierarchy.
- LSHTC-small, DMOZ-2010 and DMOZ-2012 [4]: web pages from the LSHTC (Large-Scale Hierarchical Text Classification) challenges 2010-12, organized according to a hierarchy.

We used the standard train-test splits wherever available and possible. For the DMOZ-2010 and 2012 datasets, however, we created our own train-test splits because the given test sets do not contain class labels and the oracle for evaluating submissions does not accept interior nodes as predictions.

[3] http://www.wipo.int/classifications/ipc/en/support/
[4] http://lshtc.iit.demokritos.gr/node/3

Table 1. Dataset statistics.

| Dataset     | #Train  | #Validation | #Test  | #Labels | #Leaf-Labels | Depth | #Features |
|-------------|---------|-------------|--------|---------|--------------|-------|-----------|
| CLEF        | 9,000   | 1,000       | 1,006  | 97      | 63           | 3     | 89        |
| LSHTC-small | 4,463   | 1,860       | 1,858  | 2,388   | 1,139        | 5     | 51,033    |
| IPC         | 35,000  | 11,324      | 28,926 | 553     | 451          | 3     | 541,869   |
| DMOZ-2010   | 80,000  | 13,805      | 34,905 | 17,222  | 12,294       | 5     | 381,580   |
| DMOZ-2012   | 250,000 | 50,000      | 83,408 | 13,347  | 11,947       | 5     | 348,548   |

Table 2. Average tree-distance loss on the test set. Runs that failed due to memory issues are denoted by "-".

| Dataset     | Root | OVA  | HSVM-margin | HSVM-slack | CS-Cascade | OVA-Cascade | Plug-in |
|-------------|------|------|-------------|------------|------------|-------------|---------|
| CLEF        | 3.00 | 1.10 | 0.98        | 1.00       | 0.91       | 0.95        | 0.97    |
| LSHTC-small | 4.77 | 4.12 | 3.47        | 3.54       | 3.20       | 3.19        | 3.26    |
| IPC         | 2.97 | 2.29 | -           | -          | -          | 2.06        | 2.05    |
| DMOZ-2010   | 4.65 | 3.96 | -           | -          | -          | 3.12        | 3.16    |
| DMOZ-2012   | 4.75 | 2.83 | -           | -          | -          | 2.46        | 2.48    |

Table 3. Training times (not including validation) in hours (h) or seconds (s). Runs that failed due to memory issues are denoted by "-".

| Dataset     | Root | OVA    | HSVM-margin | HSVM-slack | CS-Cascade | OVA-Cascade | Plug-in |
|-------------|------|--------|-------------|------------|------------|-------------|---------|
| CLEF        | 0 s  | 35 s   | 50 s        | 45 s       | 20 s       | 50 s        | 66 s    |
| LSHTC-small | 0 h  | 0.24 h | 2.1 h       | 1.8 h      | 1.7 h      | 0.3 h       | 0.5 h   |
| IPC         | 0 h  | 2.6 h  | -           | -          | -          | 2.9 h       | 4.2 h   |
| DMOZ-2010   | 0 h  | 36 h   | -           | -          | -          | 59 h        | 146 h   |
| DMOZ-2012   | 0 h  | 201 h  | -           | -          | -          | 220 h       | 361 h   |
7.2. Algorithms

We run a variety of algorithms on the above datasets. The details of the algorithms are given below.

Root: This is a simple baseline method in which the returned classifier always predicts the root of the hierarchy.

OVA: This is the standard one-vs-all algorithm, which completely ignores the hierarchy information and treats the problem as one of standard multiclass classification.

HSVM-margin and HSVM-slack: These are Struct-SVM-like algorithms (Tsochantaridis et al., 2005) for the tree-distance loss, as proposed by Cai & Hofmann (2004). HSVM-margin and HSVM-slack use margin and slack rescaling respectively, and are considered among the state-of-the-art algorithms for hierarchical classification.

OVA-Cascade: This is the algorithm in which we minimize the surrogate $\psi^{cas}$ with the component surrogates being $\psi_j = \psi^{OvA,n_j}$; it is detailed in Algorithms 1 and 2. All the datasets in Table 1 have the property that all instances are associated only with a leaf label (note, however, that we can still predict interior nodes), and hence the step of computing $f'_i$ in Algorithm 1 can be skipped and $f'_i$ can be set to be identically equal to negative infinity for all $i \in [n]$. Note that, in this case, the training phase is very similar to the less-inclusive policy using the local node approach (Silla Jr. & Freitas, 2011). We use LIBLINEAR (Fan et al., 2008) for the SVM-Train subroutine and use the simple linear kernel. The regularization parameter C is chosen via a separate validation set. The thresholds $\tau_j$ for $j \in [h]$ are also chosen via a coarse grid search using the validation set.

Plug-in classifier: This algorithm is based on estimating the conditional probabilities using the logistic loss. Specifically, it estimates $S_y(p)$ for all non-root nodes $y$. This is done by creating a binary dataset for each $y$, with instances whose labels are descendants of $y$ labeled positive and the rest negative, and running a logistic regression algorithm on this dataset. The final predictor is based directly on Theorem 1: it chooses the deepest node $y$ such that the estimated value of $S_y(p)$ is greater than 1/2 (a small sketch of this approach is given at the end of this subsection).

CS-Cascade: This algorithm also minimizes the cascade surrogate $\psi^{cas}$, but with the component surrogates $\psi_j$ being the Crammer-Singer surrogate (Crammer & Singer, 2001). From the results of Ramaswamy et al. (2015), one can derive excess risk transforms for the resulting cascade surrogate as well. As all instances have labels which are leaf nodes, the $h$ subproblems all turn out to be multiclass learning problems with $n_j$ classes, for each of which we use the Crammer-Singer algorithm. We optimize the Crammer-Singer surrogate over the standard multiclass linear function class using the LIBLINEAR software. Once again we use the same regularization parameter C for all $h$ problems, chosen using the validation set. We also use a threshold vector tuned on the validation set over a coarse grid.

The three algorithms OvA-Cascade, Plug-in and CS-Cascade are all motivated by our analysis and would form consistent algorithms for the tree-distance loss if used with an appropriate function class.
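As promised above, here is a minimal sketch of the plug-in approach (ours; scikit-learn's LogisticRegression is an assumed stand-in for the logistic-loss minimizer, and the `descendants` and `level` dictionaries are assumed precomputed): estimate $S_y(p(x))$ for every non-root node with a per-node binary logistic model, then apply the rule of Theorem 1.

```python
# Sketch (ours): the Plug-in classifier -- estimate subtree probabilities
# S_y(p(x)) with one logistic model per non-root node, then pick the deepest
# node whose estimate exceeds 1/2 (Theorem 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

def plugin_train(X, y, descendants, root):
    models = {}
    for node, desc in descendants.items():
        if node == root:
            continue                            # S_root = 1 always; no model needed
        t = np.isin(y, list(desc)).astype(int)  # positive iff the label is in D(node)
        models[node] = LogisticRegression(max_iter=1000).fit(X, t)
    return models

def plugin_predict(x, models, level, root):
    x = np.asarray(x).reshape(1, -1)
    best, best_level = root, 0
    for node, m in models.items():
        s_hat = m.predict_proba(x)[0, 1]        # estimated S_node(p(x))
        if s_hat > 0.5 and level[node] > best_level:
            best, best_level = node, level[node]
    return best
```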
7.3. Discussion of Results

Table 2 gives the average tree-distance loss incurred by the various algorithms on the standard datasets, and Table 3 gives the times taken for running these algorithms on a 4-core CPU. [5] Some of the algorithms, such as HSVM and CS-Cascade, could not be run on the larger datasets due to memory issues.

[5] HSVM and CS-Cascade effectively use only a single core due to lack of parallelization.

On the smaller datasets, CLEF and LSHTC-small, where all the algorithms could be run, the algorithms motivated by our analysis (OvA-Cascade, Plug-in and CS-Cascade) perform the best. On the bigger datasets, only the OvA-Cascade, Plug-in and flat OVA algorithms could be run, and both OvA-Cascade and Plug-in perform significantly better than the flat OVA. While OvA-Cascade and Plug-in give comparable error performance, OvA-Cascade takes only about half as much time as Plug-in and is hence preferable.

8. Conclusion

The reduction of the hierarchical classification problem to the problem of multiclass classification with a reject option gives an interesting and powerful family of algorithms. Extending such results to other related settings, such as the case where there is a graph over the set of class labels, or where a subset of the label set is allowed to be predicted instead of a single label, are interesting future directions.

Acknowledgements

HGR and AT gratefully acknowledge the support of NSF via grant CCF-1422157. HGR also acknowledges support from TCS via a PhD Fellowship and a grant from the Indo-US Virtual Institute for Mathematical and Statistical Sciences (VIMSS). SA acknowledges support from the Department of Science & Technology (DST) of the Indian Government under a Ramanujan Fellowship, from the Indo-US Science & Technology Forum (IUSSTF), and from Yahoo in the form of an unrestricted grant.

References

Babbar, R., Partalas, I., Gaussier, E., and Amin, M.-R. On flat versus hierarchical classification in large-scale taxonomies. In Advances in Neural Information Processing Systems 26, 2013.

Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., and Weiss, B. A database of German emotional speech. In Proceedings of the 9th European Conference on Speech Communication and Technology, 2005.

Cai, L. and Hofmann, T. Hierarchical document categorization with support vector machines. In International Conference on Information and Knowledge Management, 2004.

Cesa-Bianchi, N., Gentile, C., and Zaniboni, L. Hierarchical classification: combining Bayes with SVM. In International Conference on Machine Learning, 2006a.

Cesa-Bianchi, N., Gentile, C., and Zaniboni, L. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7:31-54, 2006b.

Crammer, K. and Singer, Y. On the learnability and design of output codes for multiclass problems. Machine Learning, 2001.

Dekel, O., Keshet, J., and Singer, Y. Large margin hierarchical classification. In International Conference on Machine Learning, 2004.

Dimitrovski, I., Kocev, D., Suzana, L., and Dzeroski, S. Hierarchical annotation of medical images. Pattern Recognition, 2011.

Fan, R., Chang, K., Hsieh, C., Wang, X., and Lin, C. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

Gopal, S. and Yang, Y. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In International Conference on Knowledge Discovery and Data Mining, 2013.

Gopal, S., Bai, B., Yang, Y., and Niculescu-Mizil, A. Bayesian models for large-scale hierarchical classification. In Advances in Neural Information Processing Systems 25, 2012.

Ramaswamy, H. G. and Agarwal, S. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems 25, 2012.

Ramaswamy, H. G., Tewari, A., and Agarwal, S. Consistent algorithms for multiclass classification with a reject option. arXiv:1505.04137, 2015.
Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7:1601-1626, 2006.

Silla Jr., C. N. and Freitas, A. A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 2011.

Sun, A. and Lim, E.-P. Hierarchical text classification and evaluation. In International Conference on Data Mining, 2001.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.

Wang, H., Shen, X., and Pan, W. Large margin hierarchical classification with mutually exclusive class membership. Journal of Machine Learning Research, 12:2721-2748, 2011.

Wang, K., Zhou, S., and Liew, S. C. Building hierarchical classifiers using class proximity. In International Conference on Very Large Data Bases, 1999.

Xiao, Z., Dellandréa, E., Dou, W., and Chen, L. Hierarchical classification of emotional speech. IEEE Transactions on Multimedia, 2007.

Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.