# Connecting Sphere Manifolds Hierarchically for Regularization

Damien Scieur*1, Youngsung Kim*2

*Equal contribution. 1Samsung SAIT AI Lab, Montreal. 2Samsung Advanced Institute of Technology (SAIT). Correspondence to: Youngsung Kim, Damien Scieur.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its super-class. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR100, CUB200, Stanford dogs, Stanford cars, and Tiny-ImageNet).

1. Introduction

Applying inductive biases or prior knowledge to inference models is a popular strategy to improve their generalization performance (Battaglia et al., 2018). For example, a hierarchical structure can be found based on the similarity or shared characteristics between samples, and thus becomes a basic criterion to categorize particular objects. Known hierarchical structures provided by the datasets (e.g., the bird taxonomy (Welinder et al., 2010) and the WordNet hierarchy (Deng et al., 2009)) can help the network identify the similarity between the given samples.

In classification tasks, the last layer of a neural network maps embedding vectors to a discrete space (classes). This mapping with a fully-connected layer has no mechanism forcing similar classes to be distributed close to each other in the embedding. Therefore, we may observe classes to be uniformly distributed after training, as this simplifies the categorization by the last fully-connected layer. This behavior is a consequence of seeing the label structure as flat, i.e., it does not consider the hierarchical relationships between classes (Bilal et al., 2017).

To alleviate this problem, in this study, we force similar classes to be closer in the embedding by forcing their hyperplanes to follow a given hierarchy. One way to realize that is by making child nodes dependent on parent nodes and constraining their distance through a regularization term. However, the norm itself does not give relevant information on the closeness between hyperplanes (classifiers). Indeed, two classifiers are close if they classify two similar points in the same (super-)class. This means similar classifiers (hyperplanes) have to indicate a similar direction. Therefore, we have to focus on the angle between classifiers, which can be achieved through spherical constraints.

Contributions. In this paper, we propose a simple strategy to incorporate hierarchical information in deep neural network architectures with minimal changes to the training procedure, by modifying only the last layer. Given a hierarchical structure of the labels in the form of a tree, we explicitly force the individual classifiers to belong to a sphere, whose center is the classifier of their super-class, recursively until reaching the root (see Figure 2).
We introduce the spherical fully-connected layer and the hierarchically connected layer, whose combination implements our proposed technique. We investigate the impact of Riemannian optimization instead of simple norm normalization. Finally, we empirically show that our proposed method makes embedding vectors of similar classes distribute closely (clearly grouped), with superior classification performance. By its nature, our technique is quite versatile because it only affects the structure of the last fully-connected layer of the neural network. Thus, it can be combined with many other strategies (like the spherical CNN from Xie et al. (2017), or other deep neural network architectures).

Related works. Hierarchical structures are well studied, and their properties can be effectively learned using manifold embeddings. The design of the optimal embedding to learn a latent hierarchy is a complex task, and was extensively studied in the past decade. For example, Word2Vec (Mikolov et al., 2013b;a) and Poincaré embeddings (Nickel & Kiela, 2017) showed remarkable performance in hierarchical representation learning. Du et al. (2018) forced the representation of sub-classes to orbit around the representation of their super-class to find similarity-based embeddings. Recently, works using elliptical manifold embeddings (Batmanghelich et al., 2016), hyperbolic manifolds (Nickel & Kiela, 2017; De Sa et al., 2018; Tifrea et al., 2018), and a combination of the two (Gu et al., 2019; Bachmann et al., 2019) showed that the latent structure of many datasets is non-Euclidean (Zhu et al., 2016; Bronstein et al., 2017; Skopek et al., 2019).

Mixing hierarchical information and structured prediction is not new, especially in text analysis (Koller & Sahami, 1997; McCallum et al., 1998; Weigend et al., 1999; Wang et al., 1999; Dumais & Chen, 2000). The partial order structure of the visual-semantic hierarchy is exploited using simple order pairs with a max-margin loss function in (Vendrov et al., 2016). The results of previous studies indicate that exploiting hierarchical information during training gives better and more resilient classifiers, in particular when the number of classes is large (Cai & Hofmann, 2004). For a given hierarchy, it is possible to design structured models incorporating this information to improve the efficiency of the classifier. For instance, for support vector machines (SVMs), the techniques reported in (Cai & Hofmann, 2004; 2007; Gopal et al., 2012; Sela et al., 2011) use hierarchical regularization, forcing the classifier of a super-class to be close to the classifiers of its sub-classes. However, the intuition is very different in this case, because SVMs do not learn the embedding.

In this study, we consider that the hierarchy of the class labels is known. Moreover, we do not change the prior layers of the deep neural network, and only work on the last layer, which directly builds the hyperplanes used for classification. Our work is thus orthogonal to those works on embedding learning, but not incompatible.

Comparison with hyperbolic networks. Hyperbolic networks are a recent technique that showed impressive results for hierarchical representation learning. Poincaré networks (Nickel & Kiela, 2017) were originally designed to learn the latent hierarchy of data using low-dimensional embeddings.
To alleviate their drawback of being transductive, which prevents inference on unseen graphs, hyperbolic neural networks equipped with set aggregation operations have been proposed (Chami et al., 2019; Liu et al., 2019). These methods focus on learning embeddings using a hyperbolic activation function for hierarchical representation. Our technique is orthogonal to these works: First, we assume that the hierarchical structure is not learned but already known. Second, our model focuses on generating individual hyperplanes for the embedding vectors produced by the network architecture. Therefore, our technique and hyperbolic networks are not mutually exclusive. Finally, even if this study focuses on spheres embedded in R^d, it is straightforward to consider spheres embedded in hyperbolic spaces.

Figure 1. To reference the node at the bottom, we use the notation n_p with p = {1, 3, 2}. We use curly brackets {} to write a path, and angle brackets ⟨⟩ for the concatenation of paths.

2. Hierarchical Regularization

2.1. Definition and Notations

We assume we have samples with hierarchically ordered classes. For instance, apple, banana, and orange are classes that may belong to the super-class fruits. Such hierarchical relationships can be represented with trees, as depicted in Figure 1.

We identify nodes in the graph through their path. To represent the leaf highlighted in blue in Figure 1, we use the notation n_{1,3,2}. This means it is the second child of the super-class n_{1,3}, and so on recursively until we reach the root. More formally, we identify nodes as n_p, where p is the path to the node. A path uniquely identifies a node, since only one path leads to each node. Using concatenation between a path p and a child index i, a new path p′ can be defined as

$p' = \langle p, i \rangle.$  (1)

We denote by P the set of all paths in the tree starting from the root, with cardinality |P|. Notice that |P| is also the number of nodes in the tree (i.e., the number of classes and super-classes). We distinguish the set P from the set L, the set of paths associated to nodes whose label appears in the dataset. Although L may equal P, this is not the case in our experiments. We show an example in Appendix A.

2.2. Similarity Between Objects and Their Representation

Let X be the network input (e.g., an image), and φ_θ(X) be its representation, i.e., the features of X extracted by a deep neural network parameterized by θ. We start with the following observation:

Given a representation, super-class hyperplanes should be similar to the hyperplanes of their sub-classes.

This assumption implies the following direct consequence:

All objects whose labels belong to the same super-class have a similar representation.

That is a natural property that we may expect from a good representation. For instance, two dogs from different breeds should share more common features than a dog shares with an apple. Therefore, the parameters of the classifiers that identify a dog's breed should also be similar. Their difference lies in the parameters associated to the specific features that differentiate breeds of dogs. Although this is not necessarily satisfied with arbitrary hierarchical classification, we observe it in many existing datasets. For instance, Caltech-UCSD Birds 200 and Stanford dogs are datasets that classify, respectively, birds and dogs in terms of their breeds. A possible example where this assumption may not be satisfied is a dataset whose super-classes group labels by their first letter.
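To make the path notation of Section 2.1 concrete, the following is a minimal sketch (ours, not the paper's code) of how a hierarchy tree and the sets P and L can be represented. The toy hierarchy and all helper names here are illustrative assumptions.

```python
# A minimal sketch of the path notation of Section 2.1, using a toy hierarchy.
from typing import Dict, List, Tuple

Path = Tuple[int, ...]  # a node is identified by its path from the root, e.g. (1, 3, 2)

# children[p] lists the child indices of the node with path p; () is the root.
children: Dict[Path, List[int]] = {
    (): [1, 2],          # root with two super-classes
    (1,): [1, 2, 3],     # super-class "fruits": apple, banana, orange
    (2,): [1, 2],        # another super-class with two sub-classes
}

def all_paths(p: Path = ()) -> List[Path]:
    """Enumerate P: every node of the tree, identified by its path from the root."""
    paths = [p]
    for i in children.get(p, []):
        paths += all_paths(p + (i,))   # concatenation <p, i> of eq. (1)
    return paths

P = all_paths()                                  # all nodes (classes and super-classes)
L = [p for p in P if p not in children and p]    # here: the leaves

print(len(P), len(L))   # |P| and |L|, e.g. "8 5" for this toy tree
```

Here L is simply taken as the set of leaves; as noted above, in general L only needs to contain the paths whose labels appear in the dataset.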
2.3. Hierarchical Regularization

Starting from the simple observation in the previous section, we propose a regularization method that forces the network to have similar representations for classes along a path p, which implies having similar representations between similar objects. More formally, if we have an optimal classifier w_p for the super-class p and a classifier w_{⟨p,i⟩} for the class ⟨p, i⟩, we expect that

$\| w_p - w_{\langle p,i \rangle} \|$ is small.  (2)

If this is satisfied, hyperplanes for objects in the same super-class are also similar, because

$\| w_{\langle p,i \rangle} - w_{\langle p,j \rangle} \| = \| (w_{\langle p,i \rangle} - w_p) - (w_{\langle p,j \rangle} - w_p) \| \le \underbrace{\| w_p - w_{\langle p,i \rangle} \|}_{\text{small}} + \underbrace{\| w_p - w_{\langle p,j \rangle} \|}_{\text{small}}.$  (3)

However, the optimal hyperplane (classifier) for an arbitrary representation φ_θ(X) may not satisfy (2). The naive and direct way to ensure (2) is through hierarchical regularization, which forces hyperplanes (classifiers) in the same path to be close to each other.

2.4. Hierarchical Layer and Hierarchically Connected Layer

In the previous section, we described the hierarchical regularization method given a hierarchical structure in the classes. In this section, we show how to conveniently parametrize (2). We first express the hyperplane (classifier) as a sum of vectors δ defined recursively as follows:

$w_{\langle p,i \rangle} = w_p + \delta_{\langle p,i \rangle}, \qquad \delta_{\{\}} = 0,$  (4)

where {} is the root. It is possible to consider δ_{} ≠ 0, which shifts all separating hyperplanes; we do not consider this case in this paper. Given (4), condition (2) states that the norm of δ_{⟨p,i⟩} is small. Finally, it suffices to penalize the norm of δ_{⟨p,i⟩} during the optimization. Notice that, by construction, the number of δ's is equal to the number of nodes in the hierarchical tree.

Next, consider the output of a CNN for classification,

$\phi_\theta(\cdot)^T W,$  (5)

where θ denotes the parameters of the hidden layers, W = [w_1, . . . , w_{|L|}] denotes the last fully-connected layer, and w_i denotes the classifier for the class i. For simplicity, we omit potential additional nonlinear functions, such as a softmax, on top of the prediction. We have parametrized w_i following the recursive formula (4). To define the matrix formulation of (4), we first introduce the hierarchical layer H, which plays an important role. This hierarchical layer can be identified with the adjacency matrix of the hierarchical graph.

Definition 1 (Hierarchical layer). Consider orderings over the sets P and L, i.e., for i = 1, . . . , |P| and j = 1, . . . , |L|, P = {p_1, . . . , p_i, . . . , p_{|P|}} and L = {p_1, . . . , p_j, . . . , p_{|L|}}. In other words, we associate an index to every node. Then, the hierarchical layer H is defined as

$H \in \{0,1\}^{|P| \times |L|}, \qquad H_{i,j} = \begin{cases} 1 & \text{if } n_{p_i} \preceq n_{p_j}, \\ 0 & \text{otherwise}, \end{cases}$  (6)

where n_{p_i} ⪯ n_{p_j} means that n_{p_i} is an ancestor of n_{p_j} (or p_i = p_j), i.e., n_{p_i} lies on the path from the root to n_{p_j}. We illustrate an example of H in Appendix A.

The next proposition shows that (5) can be written using a simple matrix-matrix multiplication involving the hierarchical layer.

Proposition 1. Consider a representation φ_θ(·), where φ_θ(·) ∈ R^d. Let W be the matrix of hyperplanes

$W = [w_{p_1}, \ldots, w_{p_{|L|}}], \quad p_i \in L,$  (7)

where the hyperplanes are parametrized as in (4). Let Δ be defined as

$\Delta \in \mathbb{R}^{d \times |P|}, \qquad \Delta = [\delta_{p_1}, \ldots, \delta_{p_{|P|}}],$  (8)

where P and L are defined in Section 2.1. Consider the hierarchical layer defined in Definition 1. Then, the matrix of hyperplanes W can be expressed as

$W = \Delta H.$  (9)

We can see W = ΔH as an augmented fully-connected layer Δ combined with the hierarchical layer H that selects the corresponding columns of Δ, hence the term hierarchically connected layer. The ℓ2 regularization of the δ's can then be carried out by the weight-decay parameter, which is widely used in the training of neural networks.
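The parametrization of Definition 1 and Proposition 1 can be sketched in PyTorch as follows. This is a minimal illustration under our own naming (build_hierarchical_layer, HierarchicallyConnected), not the authors' implementation.

```python
# A minimal PyTorch sketch of the hierarchically connected layer of Section 2.4:
# H is the fixed 0/1 hierarchical layer of Definition 1 and `delta` the learnable
# matrix Delta, so that W = Delta @ H (Proposition 1).
import torch
import torch.nn as nn

def build_hierarchical_layer(P, L):
    """H[i, j] = 1 iff node P[i] lies on the path from the root to the label node L[j]."""
    H = torch.zeros(len(P), len(L))
    index = {p: i for i, p in enumerate(P)}
    for j, leaf in enumerate(L):
        # every non-empty prefix of the path (the root row is left at 0, since delta_root = 0)
        for k in range(1, len(leaf) + 1):
            H[index[leaf[:k]], j] = 1.0
    return H

class HierarchicallyConnected(nn.Module):
    def __init__(self, dim, H):
        super().__init__()
        self.register_buffer("H", H)                                    # fixed, not trained
        self.delta = nn.Parameter(torch.randn(dim, H.shape[0]) * 0.01)  # Delta in R^{d x |P|}

    def forward(self, features):           # features: (batch, d)
        W = self.delta @ self.H            # hyperplanes, R^{d x |L|}
        return features @ W                # logits over the |L| labels
```

Applying standard weight decay to `delta` then implements the hierarchical ℓ2 regularization of Section 2.3; the spherical constraints of Section 3 are added on top of this layer.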
Figure 2. (Left) Example of hyperplanes w_p formed through the sum of the δ_p's. The hyperplane w_{⟨1,3,2,1⟩} associated to the class n_{⟨1,3,2,1⟩} is in green, the construction with the δ's in blue, and all intermediate w in red. (Right) Riemannian versus projected gradient descent. Riemannian optimization approximately follows geodesics, while projected gradient steps can jump very far from δ_p^t.

The hierarchical layer H is fixed, while Δ is learnable. This does not affect the complexity of the back-propagation significantly, as H is a simple linear form. The size of the last layer slightly increases, from |L| × d to |P| × d, where d is the dimension of the representation φ_θ(·). For instance, in the case of Tiny-ImageNet, the number of parameters of the last layer only increases by roughly 36%; nevertheless, the increased number of parameters of the last layer is usually still negligible in comparison with the total number of parameters of classical network architectures.

3. Hierarchical Spheres

The hierarchical (ℓ2) regularization introduced in the previous section induces separated hyperplanes along a path to be close to each other. However, this approach has a significant drawback. We revisit (2), which models the similarity of two hyperplanes w_p and w_{⟨p,i⟩}. The similarity between (individual) hyperplanes should indicate that they point in roughly the same direction, i.e.,

$\left\| \frac{w_p}{\|w_p\|} - \frac{w_{\langle p,i \rangle}}{\|w_{\langle p,i \rangle}\|} \right\|$ is small.  (10)

However, this property is not necessarily captured by (2). For instance, assume that w_p = −w_{⟨p,i⟩}, i.e., the hyperplanes point in two opposite directions (and are thus completely different). Then, (2) can be arbitrarily small as a function of ‖w_p‖, but not (10):

$\| w_p - w_{\langle p,i \rangle} \| = 2\|w_p\|; \qquad \left\| \frac{w_p}{\|w_p\|} - \frac{w_{\langle p,i \rangle}}{\|w_{\langle p,i \rangle}\|} \right\| = 2.$  (11)

This can be avoided, for example, by tuning the regularization parameter (or weight decay) independently for each δ_p. However, this is costly in terms of hyper-parameter estimation. In order to enforce the closeness of embedding vectors whose paths are similar, we penalize large norms of δ. We also want to bound this norm away from zero, to avoid the problem illustrated in (11), where hyperplanes pointing in different directions can still be close in norm. This naturally leads to a spherical constraint. Indeed, we transform the ℓ2 regularization over δ_p by fixing its norm in advance, i.e.,

$\| \delta_p \| = R_p > 0.$  (12)

In other words, we define δ_p on a sphere of radius R_p. The fully-connected layer is then constrained on spheres, hence the name spherical fully-connected layer. Hence, w_{⟨p,i⟩} is constrained on a sphere centered at w_p. This constraint prevents the direction of w_{⟨p,i⟩} from being too different from that of w_p, while bounding the distance away from zero. This does not add hyperparameters: instead of the weight decay, we have the radius R_p of the sphere.

3.1. Radius Decay w.r.t. Path Length

We allow the radius of the spheres, R_p, to be defined as a function of the path. In this study, we use a simple strategy called radius decay, where R_p decreases w.r.t. the path length:

$R_p = R_0 \gamma^{|p|},$  (13)

where R_0 is the initial radius, γ is the radius decay parameter, and |p| is the length of the path. The optimal radius decay can easily be found using cross-validation. The radius decay is applied prior to learning, and the radii then remain fixed during the optimization.
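A short sketch of the radius decay (13) and of the spherical constraint (12), assuming Δ is stored column-wise as above; the function names are ours, not the paper's.

```python
# Radius decay of eq. (13) and projection of the columns of Delta onto their
# spheres, eq. (12). A zero column (the root, delta_root = 0) stays zero.
import torch

def radii(P, R0=1.0, gamma=0.5):
    """R_p = R0 * gamma^{|p|} for every node p in P (|p| = path length)."""
    return torch.tensor([R0 * gamma ** len(p) for p in P])

def project_to_spheres(delta, R, eps=1e-12):
    """Rescale each column delta_p of Delta (shape (d, |P|)) so that ||delta_p|| = R_p."""
    norms = delta.norm(dim=0, keepdim=True).clamp_min(eps)
    return delta * (R.unsqueeze(0) / norms)
```

Equivalently, as described next, the radii can be carried by a diagonal matrix D while the columns of Δ are kept on unit spheres.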
This is in contrast to weight decay, where the weights are multiplied by a constant smaller than one after each iteration; the radius decay here depends only on the path length, and the radius remains fixed during the optimization process. The simplest way to apply the radius decay is by using the following predefined diagonal matrix D,

$D_{i,i} = R_0 \gamma^{|p_i|} \ \text{for } p_i \in P, \qquad D_{i,j} = 0 \ \text{otherwise},$  (14)

where p_i follows the ordering from Definition 1. Finally, the last layer of the neural network reads

$\underbrace{\phi_\theta(\cdot)}_{\text{Network}} \ \underbrace{\Delta D H}_{\text{Last layer}},$  (15)

where D carries the radii, so that the columns of Δ are constrained to unit spheres. The only learnable parameter in the last layer is Δ.

3.2. Optimization

There are several ways to optimize the network in the presence of the spherical fully-connected layer: by introducing the constraint in the model, naively by performing normalization after each step, or by using Riemannian optimization algorithms. For simplicity, we consider the minimization problem

$\min_{\theta, \Delta} f(\theta, \Delta),$  (16)

where θ are the parameters of the hidden layers, Δ the spherical fully-connected layer from (8), and f the empirical expectation of the loss of the neural network. For clarity, we use noiseless gradients, but all results also apply to stochastic ones. The superscript t denotes the t-th iteration.

Integration of the constraint in the model. We present the simplest way to force the columns of Δ to lie on a sphere, as this does not require a dedicated optimization algorithm. It is sufficient to normalize the columns of Δ by their norm in the model. By introducing a dummy variable Δ̄, which is the normalized version of Δ, the last layer of the neural network (15) reads

$\bar\Delta = \left[ \ldots, \frac{\delta_p}{\|\delta_p\|}, \ldots \right], \qquad \phi_\theta(\cdot)\, \bar\Delta D H.$  (17)

Then, any standard optimization algorithm can be used for the learning phase. Technically, Δ is not constrained on a sphere, but the model acts as if Δ followed such a constraint.

Optimization over spheres: Riemannian (stochastic) gradient descent. The most direct way to optimize over a sphere is to normalize the columns of Δ by their norm after each iteration. However, this method has no convergence guarantee, and requires a modification of the optimization algorithm. Instead, we perform Riemannian gradient descent, which we explain only briefly in this manuscript. We provide the derivation of the Riemannian gradient for spheres in Appendix B.

Riemannian gradient descent involves two steps: first, a projection onto the tangent space, and then a retraction onto the manifold. The projection step computes the gradient of the function on the manifold (as opposed to the ambient space R^d), such that this gradient is tangent to the sphere. Then, the retraction simply maps the new iterate back to the sphere. With this two-step procedure, all directions pointing outside the manifold (i.e., orthogonal to the manifold, thus irrelevant) are discarded by the projection. These two steps are summarized below:

$s^t = \big( (\delta_p^t)^T \nabla_{\delta_p} f(\theta^t, \Delta^t) \big)\, \delta_p^t - \nabla_{\delta_p} f(\theta^t, \Delta^t),$  (18)

$\delta_p^{t+1} = \frac{\delta_p^t + h_t s^t}{\| \delta_p^t + h_t s^t \|},$  (19)

where s^t is the projection of the descent direction onto the tangent space, and δ_p^{t+1} is the retraction of the gradient descent step with stepsize h_t. In our experiments, we used the Geoopt optimizer (Kochurov et al., 2020), which implements Riemannian gradient descent on spheres.
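For illustration, one Riemannian step of (18)-(19) for a single unit-norm column δ_p can be written as below. This manual sketch (with our naming) only unpacks what the Geoopt optimizer handles internally; it is not the paper's code.

```python
# One Riemannian gradient step on the unit sphere, following eqs. (18)-(19):
# project the descent direction onto the tangent space, step, then retract.
import torch

def riemannian_sphere_step(delta_p, grad, lr):
    """delta_p: unit-norm 1-D tensor; grad: Euclidean gradient of the loss w.r.t. delta_p."""
    # (18): descent direction tangent to the sphere at delta_p
    s = (delta_p @ grad) * delta_p - grad
    # (19): gradient step followed by retraction (re-normalization) onto the sphere
    new = delta_p + lr * s
    return new / new.norm()
```

Applying this step column-wise to Δ (and standard SGD to θ) reproduces the procedure described above.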
4. Experiments

We evaluated the proposed method on five publicly available datasets, namely CIFAR100 (Krizhevsky, 2009), Caltech-UCSD Birds 200 (CUB200) (Welinder et al., 2010), Stanford Cars (Cars) (Krause et al., 2013), Stanford Dogs (Dogs) (Khosla et al., 2011), and Tiny-ImageNet (Deng et al., 2009). CUB200, Cars, and Dogs are used for fine-grained categorization (recognizing bird or dog breeds, or car models), while CIFAR100 and Tiny-ImageNet are used for the classification of objects and animals. Unlike the datasets for object classification, the fine-grained categorization datasets show low inter-class variance. See Appendix C.2 and C.3 for more details about the datasets and their hierarchies, respectively.

4.1. Deep Neural Network Models and Training Settings

We used the ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) deep neural networks. The input size for CUB200, Cars, Dogs, and Tiny-ImageNet is 224×224 pixels (for Tiny-ImageNet, resized from 64×64), and 32×32 pixels for CIFAR100. Since the input size of CIFAR100 does not fit the original ResNet and DenseNet, we used a smaller kernel size (3 instead of 7) in the first convolutional layer and a smaller stride (1 instead of 2) in the first block.

Remark: we do not use pretrained models. All networks used in the experiments are trained from scratch, i.e., we did not employ fine-tuning or transfer learning strategies using pretrained models. This is because most publicly available pretrained models were trained on ImageNet, while images in CUB200, Dogs, and Tiny-ImageNet overlap with those of ImageNet.

We used stochastic gradient descent (SGD) over 300 epochs, with a mini-batch size of 64 and a momentum parameter of 0.9 for training. The learning rate schedule is the same for all experiments, starting at 0.1, then decaying by a factor of 10 after 150 and then 225 epochs. All tests are conducted on an NVIDIA Tesla V100 GPU with the same random seed. More detailed settings are provided in the supplementary material. We emphasize that we used the same parameters and learning rate schedule for all scenarios. Those parameters and schedule were optimized for SGD on plain networks, but are probably sub-optimal for our proposed methods.
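As an illustration of this setup, the sketch below wires the optimizers and the shared learning-rate schedule. How the parameters are split between a Euclidean optimizer and a Riemannian one is our assumption, and `model.base` / `model.last.delta` are hypothetical attribute names, not the paper's code.

```python
# Hedged sketch of the training setup of Section 4.1: SGD with momentum 0.9,
# batch size 64, 300 epochs, lr 0.1 divided by 10 after epochs 150 and 225.
import torch
import geoopt

def make_optimizers(model):
    # Euclidean parameters (backbone and anything not constrained to a sphere)
    sgd = torch.optim.SGD(model.base.parameters(), lr=0.1, momentum=0.9)
    # Sphere-constrained Delta, optimized with Riemannian SGD (Geoopt); this assumes
    # model.last.delta was created as a geoopt.ManifoldParameter on a geoopt.Sphere().
    rsgd = geoopt.optim.RiemannianSGD([model.last.delta], lr=0.1, momentum=0.9)
    scheds = [torch.optim.lr_scheduler.MultiStepLR(o, milestones=[150, 225], gamma=0.1)
              for o in (sgd, rsgd)]
    return (sgd, rsgd), scheds
```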
Table 1. Test accuracy (%) for fine-grained classification. Radius decay is fixed at 0.5. Plain and Multitask are baselines; Hierarchy, +Manifold, and +Riemann use the proposed parametrization.

Dataset | Architecture | Plain | Multitask | Hierarchy | +Manifold | +Riemann
CUB200 | ResNet-18 | 54.88 | 53.99 | 58.28 | 60.42 | 60.98
CUB200 | ResNet-50 | 54.09 | 52.17 | 57.59 | 59.00 | 60.01
CUB200 | DenseNet-121 | 50.55 | 56.61 | 61.10 | 60.22 | 61.98
CUB200 | DenseNet-161 | 50.91 | 60.67 | 60.67 | 62.73 | 63.55
Cars | ResNet-18 | 79.83 | 82.85 | 84.96 | 84.74 | 84.16
Cars | ResNet-50 | 82.86 | 82.86 | 83.34 | 84.51 | 84.65
Cars | DenseNet-121 | 79.78 | 85.39 | 85.97 | 86.00 | 85.54
Cars | DenseNet-161 | 79.85 | 85.79 | 86.23 | 86.90 | 85.76
Dogs | ResNet-18 | 59.17 | 59.88 | 60.30 | 61.83 | 61.36
Dogs | ResNet-50 | 57.44 | 58.97 | 59.31 | 59.81 | 63.70
Dogs | DenseNet-121 | 56.00 | 64.39 | 62.19 | 64.95 | 65.89
Dogs | DenseNet-161 | 55.49 | 64.23 | 65.28 | 65.68 | 65.90

Table 2. Test accuracy (%) for object classification. Radius decay is fixed at 0.5 and 0.9 for CIFAR100 and Tiny-ImageNet, respectively. Plain and Multitask are baselines; Hierarchy, +Manifold, and +Riemann use the proposed parametrization.

Dataset | Architecture | Plain | Multitask | Hierarchy | +Manifold | +Riemann
CIFAR100 | ResNet-18 | 69.47 | 69.37 | 70.89 | 70.06 | 71.89
CIFAR100 | ResNet-50 | 71.04 | 71.74 | 73.75 | 73.76 | 73.97
CIFAR100 | DenseNet-121 | 74.50 | 75.62 | 76.38 | 76.52 | 76.28
CIFAR100 | DenseNet-161 | 75.30 | 76.57 | 77.01 | 77.01 | 76.64
Tiny-ImageNet | ResNet-18 | 64.70 | 64.81 | 64.33 | 64.74 | 65.13
Tiny-ImageNet | ResNet-50 | 66.43 | 66.39 | 66.52 | 66.67 | 65.69
Tiny-ImageNet | DenseNet-121 | 64.27 | 67.15 | 67.19 | 67.86 | 67.45
Tiny-ImageNet | DenseNet-161 | 67.22 | 67.62 | 67.63 | 68.95 | 67.82

Table 3. Comparison with hyperbolic networks (Khrulkov et al., 2020): test accuracy (%) for fine-grained (top) and object (bottom) classification, using ResNet-18. The number in parentheses is the standard deviation over 6 runs with different random seeds; lr denotes the initial learning rate.

Dataset | Hyperbolic (lr: 0.01) | Hyperbolic (lr: 0.001) | Hyperbolic (lr: 0.0001) | Ours (+Riemann)
CUB200 | 7.49 (±0.68) | 40.14 (±0.49) | 26.93 (±0.50) | 61.62 (±0.35)
Cars | 15.79 (±16.54) | 62.41 (±0.86) | 42.40 (±1.23) | 83.41 (±0.54)
Dogs | 18.70 (±12.95) | 48.85 (±0.59) | 40.19 (±0.23) | 61.35 (±0.37)
CIFAR100 | 42.30 (±7.66) | 56.37 (±10.45) | 59.67 (±0.37) | 71.82 (±0.18)
Tiny-ImageNet | 14.53 (±11.17) | 48.83 (±13.16) | 53.17 (±0.29) | 64.93 (±0.18)

4.2. Results

Tables 1 and 2 compare the results obtained with several baseline methods and our methods. The first method, "Plain", is a plain network for sub-class classification without hierarchical information. The second one, "Multitask", is the plain network in a multitask setting (sub-class and super-class classification) that uses the hierarchical information. The third one, "Hierarchy", uses our parametrization W = ΔH with the hierarchical layer H, but the columns of Δ are not constrained on spheres. Then, "+Manifold" means that Δ is restricted to spheres using the normalization technique from Section 3.2 (integration of the constraint in the model). Finally, "+Riemann" means we use the Riemannian optimization from Section 3.2 (optimization over spheres). We show the experimental results on fine-grained classification (Table 1) and general object classification (Table 2). Note that the multitask strategy in our experiments (contrary to our regularization method) requires an additional hyper-parameter that combines the two losses, because we train classifiers for super-classes and sub-classes simultaneously.

Fine-grained categorization. As shown in Table 1, our proposed parametrization significantly improves the test accuracy over the baseline networks (ResNet-18/50 and DenseNet-121/161). Even the simple hierarchical setting, which uses the hierarchical layer only (without spheres), shows superior performance compared to the baseline networks. Integrating the manifolds with Riemannian SGD further improves the generalization performance. Surprisingly, the plain network with deeper layers shows degraded performance. This can be attributed to overfitting, which does not occur with our regularization method: with our approach, larger networks show better performance, indicating the efficiency of the regularization.

Object classification. We show the test accuracy (%) of our proposed methods with different network models on CIFAR-100 and Tiny-ImageNet in Table 2. As shown in the table, our proposed method achieves better classification accuracy than the baseline methods. Compared to the fine-grained classification datasets, the general object classification datasets have less similar classes within the same super-class. On these datasets, our method achieved relatively small gains; the higher inter-class variance may explain the lower improvement compared to fine-grained categorization. For Tiny-ImageNet, ResNet-18 (11.28M parameters) with our parametrization achieves better classification performance than plain ResNet-50 (23.91M parameters). The same applies to DenseNet-121 and DenseNet-161. These results indicate that our regularization method, which does not introduce new parameters in the embedding layer, can achieve a classification performance similar to that of more complex models.

Comparison with hyperbolic networks. Hyperbolic networks using Poincaré-ball-based multidimensional logistic regression (MLR) (Khrulkov et al., 2020; Shimizu et al., 2021) can be used for the classification task.
Khrulkov et al. (2020) showed improved performance on shallow networks with four convolutional layers for image classification with few-shot learning, but not on deep networks such as ResNet-12. Shimizu et al. (2021) showed improved performance on subtree classification (binary classification for each node) and on language translation tasks when using low-dimensional embeddings (10 and 64 dimensions, respectively).

Table 3 compares our method with hyperbolic networks for object classification. We report the test accuracy averaged over six runs of the hyperbolic model (Poincaré-ball-based MLR), where MLR layers are added to ResNet-18 (512-dimensional embedding), based on the code from Khrulkov et al. (2020). We compare those results with our method (Hierarchy+Manifold+Riemann). For these experiments, we followed the training settings described in Section 4.1, except for the initial learning rate (lr). Without an extensive hyperparameter search, the hyperbolic method easily diverges, as it is very sensitive to the radius and to the initial learning rate, as described in (Khrulkov et al., 2020). This hyperbolic method, applied to deeper networks and a higher-dimensional embedding space, showed unstable and degraded performance for visual object classification compared to our proposed method.

4.3. Riemannian vs. Projected SGD

Overall, Riemannian SGD showed slightly superior performance compared to projected SGD on the fine-grained datasets, although, in most cases, the performance was similar. For instance, with the Dogs dataset on ResNet-50, Riemannian SGD shows a performance 4% higher than projected SGD. For object classification, Riemannian SGD improves the classification performance a bit less than projected SGD. Given the different radius decay parameters (0.5 in Table 1 and 0.9 in Table 2), optimizing the learning rate of Riemannian SGD (e.g., using a larger value) may improve its performance further.

4.4. Ablation Study

In this section, we conduct an ablation study from different perspectives to answer the following questions: 1) can a learnable sphere radius be more effective than the fixed radius decay, and 2) is hierarchical modeling really effective?

Table 4. Test accuracy (%) with a learnable radius, using ResNet-18. Using a learnable parameter degrades the performance considerably compared to using the predefined radius decay (shown in parentheses, taken from Table 1).

Dataset | Hierarchy | +Manifold | +Riemann
CUB200 | 53.40 (58.28) | 58.35 (60.42) | 58.24 (60.98)
Cars | 81.51 (84.96) | 82.54 (84.74) | 82.40 (84.16)

Table 5. Test accuracy (%) with a random hierarchy tree, using ResNet-18. Radius decay is fixed at 0.5. Using a random hierarchy degrades the performance considerably compared to the original hierarchical tree (results shown in parentheses, taken from Table 1). This validates the importance of proper hierarchy information.

Dataset | Multitask | Hierarchy | +Manifold | +Riemann
CUB200 | 47.55 (53.99) | 50.28 (58.28) | 56.96 (60.42) | 56.43 (60.98)
Cars | 79.98 (82.85) | 81.07 (84.96) | 82.02 (84.74) | 81.84 (84.16)
Table 6. Test accuracy (%) for super-class classification, using ResNet-18. Radius decay is fixed at 0.5. For super-class classification, without any modification of the proposed layers, we trained the model on the dataset and then computed the classification accuracy using the δ's corresponding to the parent classes.

Dataset | Multitask | Hierarchy | +Manifold | +Riemann
CUB200 | 53.68 | 58.87 | 61.17 | 62.22
Cars | 86.88 | 87.91 | 91.23 | 90.97

Figure 3. Visualization of a high-dimensional embedding vector using t-SNE on a 2D plane (using ResNet-18 and the Cars dataset): (a) Plain, (b) Multitask, and the proposed parametrizations ((c) Hierarchy, (d) +Manifold, and (e) +Riemann). For clarity, we show a zoomed-in view (highlighted with a black solid rectangle) cropping the sample distribution of similar super-classes. Overall, our proposed methods ((c), (d), (e)) show a clearer separation among similar super-classes, with smaller deviation. All figures are drawn using embedding vectors with similar classification accuracy (approximately 80%, the maximum performance of the baseline) for fairness.

Figure 4. Visualization of a 2-dimensional embedding vector (using ResNet-18 and the Cars dataset): (a) Plain, (b) Multitask, and the proposed parametrizations ((c) Hierarchy, (d) +Manifold, and (e) +Riemann). With our proposed methods, samples from the 196 sub-classes are grouped sharply along the 9 super-classes compared to the baseline methods, i.e., the distribution of points becomes more concentrated and separable. We show the distributions at low accuracy (approximately 6%). Two-dimensional vectors seem too small to classify images of 196 highly non-linearly distributed classes, unlike the ten-class, gray-level MNIST dataset shown in (Liu et al., 2016).

Learnable sphere radius. Instead of the predefined radius decay equation (13), represented by the diagonal matrix D in (14), we use a learnable diagonal matrix of parameters, trained by back-propagation. As shown in Table 4, this learnable radius is not effective in terms of classification performance compared to the predefined depth-wise radius decay.

Importance of hierarchy. To further validate the hierarchical (manually annotated) label structure used in the experiments and the proposed hierarchical sphere modeling, we examine their performance from three different perspectives: randomized hierarchy, super-class categorization, and embedding distribution.

Randomized hierarchical structure (Table 5). We examine hierarchical modeling using random hierarchical labels to validate the importance of the hierarchy labels given by the dataset. One could propose to use labels found in an unsupervised way. However, popular (and simple) unsupervised clustering or grouping methods, such as k-means, provide randomly assigned super-class labels. Hence, we simply examine a random super-class assignment. As shown in Table 5, the methods with a randomly generated hierarchy show degraded performance compared to those using the given hierarchical information. This validates the effectiveness of the hierarchy structure used in the above experiments.

Super-class categorization accuracy (Table 6). As shown in Table 6, our proposed methods (Hierarchy, +Manifold, and +Riemann) outperform the baseline hierarchical (multitask) method in terms of test accuracy. Note that, in the multitask classification, a loss function for classification using super-classes is used in addition. This result validates that our proposed methods work in a hierarchical way. Modeling on spheres (+Manifold and +Riemann) further improves hierarchical inference.
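Our reading of this super-class evaluation, as a hedged sketch with our own naming (not the authors' code): the super-class hyperplanes are obtained from the learned Δ exactly like the leaf hyperplanes, but through a selection matrix whose columns are super-class nodes.

```python
# Super-class scores from the trained Delta: each super-class gets the hyperplane
# built from the delta's on its own path (an H-like matrix restricted to super-classes).
import torch

def superclass_logits(features, delta, P, superclasses):
    """features: (batch, d); delta: (d, |P|); superclasses: list of paths in P."""
    index = {p: i for i, p in enumerate(P)}
    H_sup = torch.zeros(len(P), len(superclasses))
    for j, sup in enumerate(superclasses):
        for k in range(1, len(sup) + 1):      # ancestors of the super-class, itself included
            H_sup[index[sup[:k]], j] = 1.0
    return features @ (delta @ H_sup)         # one score per super-class
```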
Visualization of embedding vectors (Figures 3 and 4). We show the distribution of embedding vectors to observe how our proposed modeling groups the input samples. We visualize both the original (high-dimensional) embedding used in the above experiments and a low-dimensional embedding, using the Cars dataset (its number of super-classes is small enough to show the distributions clearly). We use a popular visualization technique, t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten & Hinton, 2008), to visualize the embedding vectors (in R^512) in a two-dimensional space. As shown in Figure 3, the embedding vectors of our proposed methods (Figures 3c, 3d, and 3e) are clearly grouped within a small distance for each sub-class and its super-class (the same color indicates the same super-class), compared to those of the baseline methods (Figures 3a and 3b). Since t-SNE returns stochastic results that depend on hyperparameters such as the perplexity, but are independent of the networks, we also visualize two-dimensional vectors (in R^2) directly learned by the networks, following (Liu et al., 2016). To obtain this embedding vector, we added an additional (dimensionality-reduction) layer (i.e., a mapping R^512 → R^2) prior to the last fully-connected layer of ResNet-18. As shown in Figure 4, the embedding vectors of our proposed methods are distributed more closely (clustered) with regard to their super-classes (Figures 4c, 4d, and 4e).

5. Conclusion and Future Work

We presented a simple regularization method for neural networks that uses a given hierarchical structure of the classes. We reformulated the fully-connected last layer of the neural network using the hierarchical layer. By further applying spherical constraints to this layer, we formulated a spherical fully-connected layer. The reformulation using the hierarchical layer H and the spherical constraint had a considerable impact on the generalization accuracy of the network. Finally, we compared several optimization strategies for the spherical layer: Riemannian optimization sometimes showed significant improvement over, and sometimes performance similar to, its projected counterpart.

In the future, it would be interesting to use our proposed regularization method with other architectures (e.g., Inception and SqueezeNet), other embeddings (e.g., Poincaré), and other applications (e.g., natural language processing). Moreover, in this paper, we used a given hierarchy mostly based on taxonomies designed by experts. This hierarchical structure, which is convenient for humans, may not be the most convenient for classification algorithms. Alternatively, a self-supervised algorithm that learns the hierarchy and the classifier simultaneously may be effective, because it does not require access to a given hierarchy and may lead to better results (because the structure would be more adapted to the classification task).

Acknowledgements

We would like to thank our colleagues Simon Lacoste-Julien and Jae-Joon Han for their insightful discussions and relevant remarks. We also thank the anonymous reviewers for their constructive comments and suggestions.

References

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA, 2007. ISBN 0691132984, 9780691132983.

Bachmann, G., Bécigneul, G., and Ganea, O.-E. Constant curvature graph convolutional networks. arXiv preprint arXiv:1911.05076, 2019.
Batmanghelich, K., Saeedi, A., Narasimhan, K., and Gershman, S. Nonparametric spherical topic modeling with word embeddings. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, volume 2016, pp. 537. NIH Public Access, 2016.

Battaglia, P., Hamrick, J. B. C., Bapst, V., Sanchez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K., Nash, C., Langston, V. J., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. arXiv, 2018.

Bécigneul, G. and Ganea, O.-E. Riemannian adaptive optimization methods. arXiv preprint arXiv:1810.00760, 2018.

Bilal, A., Jourabloo, A., Ye, M., Liu, X., and Ren, L. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24(1):152–162, 2017.

Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, Sep. 2013.

Boumal, N. An introduction to optimization on smooth manifolds. Available online, May 2020. URL http://www.nicolasboumal.net/book.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Cai, L. and Hofmann, T. Hierarchical document categorization with support vector machines. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 78–87, 2004.

Cai, L. and Hofmann, T. Exploiting known taxonomies in learning overlapping concepts. In IJCAI, volume 7, pp. 708–713, 2007.

Chami, I., Ying, Z., Ré, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems 32, pp. 4868–4879, 2019.

De Sa, C., Gu, A., Ré, C., and Sala, F. Representation tradeoffs for hyperbolic embeddings. Proceedings of Machine Learning Research, 80:4460, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Du, L., Lu, Z., Wang, Y., Song, G., Wang, Y., and Chen, W. Galaxy network embedding: A hierarchical community structure preserving approach. In IJCAI, pp. 2079–2085, 2018.

Dumais, S. and Chen, H. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 256–263, 2000.

Gopal, S., Yang, Y., and Niculescu-Mizil, A. Regularization framework for large scale hierarchical classification. Proceedings of the European Conference on Machine Learning, 2012.

Gu, A., Sala, F., Gunel, B., and Ré, C. Learning mixed-curvature representations in product spaces. In International Conference on Learning Representations, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, pp. 2261–2269. IEEE Computer Society, 2017.

Khosla, A., Jayadevaprakash, N., Yao, B., and Fei-Fei, L. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
Khrulkov, V., Mirvakhabova, L., Ustinova, E., Oseledets, I., and Lempitsky, V. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Kochurov, M., Karimov, R., and Kozlukov, S. Geoopt: Riemannian optimization in PyTorch, 2020.

Koller, D. and Sahami, M. Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab, 1997.

Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

Krizhevsky, A. Learning multiple layers of features from tiny images, 2009.

Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems 32, pp. 8230–8241, 2019.

Liu, W., Wen, Y., Yu, Z., and Yang, M. Large-margin softmax loss for convolutional neural networks. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pp. 507–516. JMLR.org, 2016.

McCallum, A., Rosenfeld, R., Mitchell, T. M., and Ng, A. Y. Improving text classification by shrinkage in a hierarchy of classes. In ICML, volume 98, pp. 359–367, 1998.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013b.

Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347, 2017.

Sela, Y., Freiman, M., Dery, E., Edrei, Y., Safadi, R., Pappo, O., Joskowicz, L., and Abramovitch, R. fMRI-based hierarchical SVM model for the classification and grading of liver fibrosis. IEEE Transactions on Biomedical Engineering, 58(9):2574–2581, 2011.

Shimizu, R., Mukuta, Y., and Harada, T. Hyperbolic neural networks++. In International Conference on Learning Representations, 2021.

Skopek, O., Ganea, O.-E., and Bécigneul, G. Mixed-curvature variational autoencoders. arXiv preprint arXiv:1911.08411, 2019.

Tifrea, A., Bécigneul, G., and Ganea, O.-E. Poincaré GloVe: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546, 2018.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.

Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R. Order-embeddings of images and language. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

Wang, K., Zhou, S., and Liew, S. C. Building hierarchical classifiers using class proximity. In VLDB, volume 99, pp. 363–374. Citeseer, 1999.

Weigend, A. S., Wiener, E. D., and Pedersen, J. O. Exploiting hierarchy in text categorization. Information Retrieval, 1(3):193–216, 1999.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
Xie, P., Deng, Y., Zhou, Y., Kumar, A., Yu, Y., Zou, J., and Xing, E. P. Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3799–3810. JMLR.org, 2017.

Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Springer, 2016.