# Deep Hyperspherical Learning

Weiyang Liu1, Yan-Ming Zhang2, Xingguo Li3,1, Zhiding Yu4, Bo Dai1, Tuo Zhao1, Le Song1
1Georgia Institute of Technology  2Institute of Automation, Chinese Academy of Sciences  3University of Minnesota  4Carnegie Mellon University
{wyliu,tourzhao}@gatech.edu, ymzhang@nlpr.ia.ac.cn, lsong@cc.gatech.edu

Abstract

Convolution as inner product has been the founding basis of convolutional neural networks (CNNs) and the key to end-to-end visual representation learning. Benefiting from deeper architectures, recent CNNs have demonstrated increasingly strong representation abilities. Despite such improvement, the increased depth and larger parameter space have also led to challenges in properly training a network. In light of such challenges, we propose hyperspherical convolution (SphereConv), a novel learning framework that gives angular representations on hyperspheres. We introduce SphereNet, a deep hyperspherical convolutional network that is distinct from conventional inner product based convolutional networks. In particular, SphereNet adopts SphereConv as its basic convolution operator and is supervised by the generalized angular softmax loss, a natural loss formulation under SphereConv. We show that SphereNet can effectively encode discriminative representations and alleviate training difficulty, leading to easier optimization, faster convergence and comparable (even better) classification accuracy over convolutional counterparts. We also provide some theoretical insights into the advantages of learning on hyperspheres. In addition, we introduce the learnable SphereConv, a natural improvement over the prefixed SphereConv, and SphereNorm, i.e., hyperspherical learning as a normalization method. Experiments have verified our conclusions.

1 Introduction

Recently, deep convolutional neural networks have led to significant breakthroughs on many vision problems such as image classification [9, 18, 19, 6], segmentation [3, 13, 1], object detection [3, 16], etc. While showing stronger representation power over many conventional hand-crafted features, CNNs often require a large amount of training data and face certain training difficulties such as overfitting, vanishing/exploding gradients, covariate shift, etc. The increasing depth of recently proposed CNN architectures has further aggravated these problems. To address the challenges, regularization techniques such as dropout [9] and orthogonality parameter constraints [21] have been proposed. Batch normalization [8] can also be viewed as an implicit regularization of the network, as it normalizes each layer's output distribution. Recently, deep residual learning [6] emerged as a promising way to overcome vanishing gradients in deep networks. However, [20] pointed out that residual networks (ResNets) are essentially exponential ensembles of shallow networks: they avoid the vanishing/exploding gradient problem rather than providing a direct solution. As a result, training an ultra-deep network still remains an open problem. Besides vanishing/exploding gradients, network optimization is also very sensitive to initialization, and finding better initializations is thus widely studied [5, 14, 4]. In general, a large parameter space is double-edged, considering the benefit of representation power and the associated training difficulties. Therefore, proposing better learning frameworks to overcome such challenges remains important.
In this paper, we introduce a novel convolutional learning framework that can effectively alleviate training difficulties, while giving better performance over dot product based convolution. Our idea is to project parameter learning onto unit hyperspheres, where layer activations depend only on the geodesic distance between kernels and input signals1 instead of their inner products. To this end, we propose the SphereConv operator as the basic module for our network layers. We also propose softmax losses accordingly under such a representation framework. Specifically, the proposed softmax losses supervise network learning by also taking the SphereConv activations from the last layer instead of inner products. Note that the geodesic distance on a unit hypersphere is the angle between inputs and kernels. Therefore, the learning objective is essentially a function of the input angles, and we call it the generalized angular softmax loss in this paper. The resulting architecture is the hyperspherical convolutional network (SphereNet), shown in Fig. 1.

[Figure 1: Deep hyperspherical convolutional network architecture: stacked SphereConv layers, each computing $g(\theta_{(w,x)})$ between kernels $w$ and inputs $x$, followed by the generalized angular softmax loss with cross-entropy.]

Our key motivation for proposing SphereNet is that angular information matters in convolutional representation learning. We argue this motivation from several aspects: training stability, training efficiency, and generalization power. SphereNet can also be viewed as an implicit regularization of the network by normalizing the activation distributions. The weight norm is no longer important since the entire network operates only on angles, and as a result, the $\ell_2$ weight decay is also no longer needed in SphereNet. SphereConv to some extent also alleviates the covariate shift problem [8]: the output of a SphereConv operator is bounded in $[-1, 1]$ ($[0, 1]$ if considering ReLU), which makes the variance of each output also bounded.

Our second intuition is that angles preserve the most abundant discriminative information in convolutional learning. We gain such intuition from the 2D Fourier transform, where an image is decomposed into a combination of a set of templates with magnitude and phase information in the 2D frequency domain. If one reconstructs an image with the original magnitudes and random phases, the resulting images are generally not recognizable. However, if one reconstructs the image with random magnitudes and the original phases, the resulting images are still recognizable. This shows that the most important structural information in an image for visual recognition is encoded by the phases, which inspires us to project the network learning into angular space. In terms of low-level information, SphereConv is able to preserve shape, edge, texture and relative color. SphereConv can learn to selectively drop the color depth but preserve the RGB ratio, so the semantic information of an image is preserved.
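The phase/magnitude intuition above can be reproduced in a few lines of NumPy. The snippet below is a small illustrative sketch (not part of the paper's experiments) that reconstructs an image from its own magnitude with random phase, and from a "random" magnitude (borrowed from another image) with its own phase; only the second reconstruction tends to remain recognizable.

```python
import numpy as np

def swap_fourier_components(img_a, img_b):
    """Reconstruct img_a from (a) its own magnitude + random phase and
    (b) a borrowed magnitude + its own phase."""
    A, B = np.fft.fft2(img_a), np.fft.fft2(img_b)
    mag_a, phase_a = np.abs(A), np.angle(A)
    mag_b = np.abs(B)                                   # stands in for "random" magnitudes
    rand_phase = np.random.uniform(-np.pi, np.pi, size=A.shape)

    own_mag_rand_phase = np.fft.ifft2(mag_a * np.exp(1j * rand_phase)).real
    rand_mag_own_phase = np.fft.ifft2(mag_b * np.exp(1j * phase_a)).real
    return own_mag_rand_phase, rand_mag_own_phase       # the second one keeps the image structure

# toy usage with random arrays; replace with real grayscale images to see the effect visually
a, b = np.random.rand(64, 64), np.random.rand(64, 64)
recon_mag, recon_phase = swap_fourier_components(a, b)
print(recon_mag.shape, recon_phase.shape)
```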
SphereNet can also be viewed as a non-trivial generalization of [12, 11]. By proposing a loss that discriminatively supervises the network on a hypersphere, [11] achieves state-of-the-art performance on face recognition; however, the rest of the network remains a conventional convolutional network. In contrast, SphereNet not only generalizes the hyperspherical constraint to every layer, but also to different nonlinear functions of the input angles. Specifically, we propose three instances of the SphereConv operator: linear, cosine and sigmoid. The sigmoid SphereConv is the most flexible one, with a parameter controlling the shape of the angular function. As a simple extension of the sigmoid SphereConv, we also present a learnable SphereConv operator. Moreover, the proposed generalized angular softmax (GA-Softmax) loss naturally generalizes the angular supervision in [11] using the SphereConv operators. Additionally, SphereConv can serve as a normalization method that is comparable to batch normalization, leading to an extension to spherical normalization (SphereNorm).

SphereNet can be easily applied to other network architectures such as GoogLeNet [19], VGG [18] and ResNet [6]: one simply replaces the convolutional operators and the loss functions with the proposed SphereConv operators and hyperspherical loss functions. In summary, SphereConv can be viewed as an alternative to the original convolution operator and serves as a new measure of correlation. SphereNet may open up an interesting direction for exploring neural networks. We ask whether the inner product based convolution operator is an optimal correlation measure for all tasks; our answer is likely to be "no".

1Without loss of generality, we study CNNs here, but our method is generalizable to other neural networks.

2 Hyperspherical Convolutional Operator

2.1 Definition

The convolutional operator in CNNs is simply a linear matrix multiplication, written as $F(w, x) = w^\top x + b_F$, where $w$ is a convolutional filter, $x$ denotes a local patch from the bottom feature map and $b_F$ is the bias. The matrix multiplication here essentially computes the similarity between the local patch and the filter; thus the standard convolution layer can be viewed as patch-wise matrix multiplication. Different from the standard convolutional operator, the hyperspherical convolutional (SphereConv) operator computes the similarity on a hypersphere and is defined as:

$$F_s(w, x) = g(\theta_{(w,x)}) + b_{F_s}, \qquad (1)$$

where $\theta_{(w,x)}$ is the angle between the kernel parameter $w$ and the local patch $x$, $g(\theta_{(w,x)})$ is a function of $\theta_{(w,x)}$ (usually a monotonically decreasing function), and $b_{F_s}$ is the bias. To simplify analysis and discussion, the bias terms are usually left out. The angle $\theta_{(w,x)}$ can be interpreted as the geodesic distance (arc length) between $w$ and $x$ on a unit hypersphere. In contrast to the convolutional operator, which works in the entire space, SphereConv only focuses on the angles between local patches and filters, and therefore operates in the hypersphere space. In this paper, we present three specific instances of the SphereConv operator. To facilitate computation, we constrain the output of SphereConv operators to $[-1, 1]$ (although this is not a necessary requirement).

Linear SphereConv. In the linear SphereConv operator, $g$ is a linear function of $\theta_{(w,x)}$:

$$g(\theta_{(w,x)}) = a\theta_{(w,x)} + b, \qquad (2)$$

where $a$ and $b$ are parameters of the linear SphereConv operator. In order to constrain the output range to $[-1, 1]$ while $\theta_{(w,x)} \in [0, \pi]$, we use $a = -\frac{2}{\pi}$ and $b = 1$ (not necessarily the optimal design).

[Figure 2: SphereConv operators, plotting $g(\theta_{(w,x)})$ over $\theta_{(w,x)} \in [0, \pi]$ for the cosine, linear and sigmoid ($k$ = 0.1, 0.3, 0.7) variants.]
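To make Eq. (1) concrete, the following is a minimal PyTorch sketch of a SphereConv layer (our illustrative code, not the authors' released implementation; the class name `SphereConv2d` and the all-ones-kernel trick for computing patch norms are our own choices). It computes the angle between each filter and each local patch and applies the linear $g$ of Eq. (2) by default; any other $g(\theta)$ can be passed in.

```python
import math
import torch
import torch.nn.functional as F

class SphereConv2d(torch.nn.Module):
    """Sketch of Eq. (1): output = g(theta(w, x)), biases omitted."""
    def __init__(self, in_channels, out_channels, kernel_size=3, g=None):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.1)
        # default: linear SphereConv, g(theta) = -2/pi * theta + 1  (Eq. 2)
        self.g = g if g is not None else (lambda t: 1.0 - (2.0 / math.pi) * t)

    def forward(self, x):
        eps = 1e-7
        wx = F.conv2d(x, self.weight)                                    # w^T x for every patch
        w_norm = self.weight.flatten(1).norm(dim=1).view(1, -1, 1, 1)    # ||w||_2 per filter
        ones = torch.ones_like(self.weight[:1])                          # all-ones kernel
        x_norm = torch.sqrt(F.conv2d(x * x, ones) + eps)                 # ||x||_2 per patch
        cos = (wx / (w_norm * x_norm)).clamp(-1 + eps, 1 - eps)
        theta = torch.acos(cos)                         # geodesic distance on the unit hypersphere
        return self.g(theta)

layer = SphereConv2d(3, 16)                             # linear SphereConv by default
out = layer(torch.randn(2, 3, 8, 8))
print(out.shape, float(out.min()), float(out.max()))    # outputs stay within [-1, 1]
```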
Cosine SphereConv. The cosine SphereConv operator is a nonlinear function of $\theta_{(w,x)}$, with $g$ taking the form

$$g(\theta_{(w,x)}) = \cos(\theta_{(w,x)}), \qquad (3)$$

which can be reformulated as $\frac{w^\top x}{\|w\|_2 \|x\|_2}$. Therefore, it can be viewed as a doubly normalized convolutional operator, which bridges the SphereConv operator and the convolutional operator.

Sigmoid SphereConv. The sigmoid SphereConv operator is derived from the sigmoid function and its $g$ can be written as

$$g(\theta_{(w,x)}) = \frac{1 + \exp(-\frac{\pi}{2k})}{1 - \exp(-\frac{\pi}{2k})} \cdot \frac{1 - \exp\!\big(\frac{\theta_{(w,x)}}{k} - \frac{\pi}{2k}\big)}{1 + \exp\!\big(\frac{\theta_{(w,x)}}{k} - \frac{\pi}{2k}\big)}, \qquad (4)$$

where $k > 0$ is a parameter that controls the curvature of the function. When $k$ is close to 0, $g(\theta_{(w,x)})$ approximates the step function; when $k$ becomes larger, $g(\theta_{(w,x)})$ behaves more like a linear function, i.e., the linear SphereConv operator. The sigmoid SphereConv is one instance of the parametric SphereConv family. With more parameters introduced, the parametric SphereConv can have richer representation power. To increase the flexibility of the parametric SphereConv, we will later discuss the case where these parameters are jointly learned via back-propagation.

2.2 Optimization

The optimization of the SphereConv operators is nearly the same as that of the convolutional operator and also follows standard back-propagation. Using the chain rule, we have the gradients of the SphereConv with respect to the weights and the feature input:

$$\frac{\partial g(\theta_{(w,x)})}{\partial w} = \frac{\partial g(\theta_{(w,x)})}{\partial \theta_{(w,x)}} \cdot \frac{\partial \theta_{(w,x)}}{\partial w}, \qquad \frac{\partial g(\theta_{(w,x)})}{\partial x} = \frac{\partial g(\theta_{(w,x)})}{\partial \theta_{(w,x)}} \cdot \frac{\partial \theta_{(w,x)}}{\partial x}.$$

For different SphereConv operators, both $\frac{\partial \theta_{(w,x)}}{\partial w}$ and $\frac{\partial \theta_{(w,x)}}{\partial x}$ are the same, so the only difference lies in the $\frac{\partial g(\theta_{(w,x)})}{\partial \theta_{(w,x)}}$ part. For the angle gradients, we have

$$\frac{\partial \theta_{(w,x)}}{\partial w} = \frac{\partial \arccos\!\big(\frac{w^\top x}{\|w\|_2 \|x\|_2}\big)}{\partial w}, \qquad \frac{\partial \theta_{(w,x)}}{\partial x} = \frac{\partial \arccos\!\big(\frac{w^\top x}{\|w\|_2 \|x\|_2}\big)}{\partial x},$$

which are straightforward to compute and therefore omitted here. Because $\frac{\partial g(\theta_{(w,x)})}{\partial \theta_{(w,x)}}$ for the linear, cosine and sigmoid SphereConv is $a$, $-\sin(\theta_{(w,x)})$ and $-\frac{1+\exp(-\frac{\pi}{2k})}{1-\exp(-\frac{\pi}{2k})} \cdot \frac{2\exp(\frac{\theta_{(w,x)}}{k} - \frac{\pi}{2k})}{k\big(1+\exp(\frac{\theta_{(w,x)}}{k} - \frac{\pi}{2k})\big)^2}$, respectively, all these partial gradients can be easily computed.

2.3 Theoretical Insights

We provide a basic analysis of the cosine SphereConv operator in the case of a linear neural network to justify that the SphereConv operator can improve the conditioning of the problem. Specifically, we consider one layer of a linear neural network, where the observation is $F = UV^\top$ (ignoring the bias), $U \in \mathbb{R}^{n\times k}$ is the weight, and $V \in \mathbb{R}^{m\times k}$ is the input that embeds weights from previous layers. Without loss of generality, we assume $\|U_{i,:}\|_2 = \|V_{j,:}\|_2 = 1$ for all $i = 1, \dots, n$ and $j = 1, \dots, m$, and consider

$$\min_{U \in \mathbb{R}^{n\times k},\, V \in \mathbb{R}^{m\times k}} G(U, V) = \tfrac{1}{2}\|F - UV^\top\|_F^2. \qquad (7)$$

This is closely related to matrix factorization, and (7) can also be viewed as the expected version of the matrix sensing problem [10]. The following lemma demonstrates a critical scaling issue of (7) for $U$ and $V$ that significantly deteriorates the conditioning without changing the objective of (7).

Lemma 1. Consider a pair of global optimal points $U, V$ satisfying $F = UV^\top$ and $\mathrm{Tr}(V^\top V \otimes I_n) \ge \mathrm{Tr}(U^\top U \otimes I_m)$. For any real $c > 1$, let $\widetilde{U} = cU$ and $\widetilde{V} = V/c$; then we have $\kappa(\nabla^2 G(\widetilde{U}, \widetilde{V})) = \Omega\big(c^2 \kappa(\nabla^2 G(U, V))\big)$, where $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$ is the restricted condition number, with $\lambda_{\max}$ being the largest eigenvalue and $\lambda_{\min}$ the smallest nonzero eigenvalue.

Lemma 1 implies that the conditioning of problem (7) at an unbalanced global optimum scaled by a constant $c$ is $\Omega(c^2)$ times larger than the conditioning at a balanced global optimum. Note that $\lambda_{\min} = 0$ may happen, which is why we consider the restricted condition number. Similar results hold beyond global optima. This is an undesired geometric structure, which further leads to slow and unstable optimization procedures, e.g., when using stochastic gradient descent (SGD). This motivates us to consider the SphereConv operator discussed above, which is equivalent to projecting data onto the hypersphere and leads to a better conditioned problem.
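As a sanity check of the scaling phenomenon in Lemma 1, the self-contained sketch below (our own numerical illustration, not part of the paper) builds a balanced global optimum of (7) with unit-norm rows, rescales it to $(cU, V/c)$, and reports the restricted condition number of the Hessian; the objective stays at its optimum while the conditioning degrades quickly as $c$ grows.

```python
import torch

torch.manual_seed(0)
n, m, k = 4, 5, 3
# a balanced global optimum with unit-norm rows, as assumed in Section 2.3
U0 = torch.nn.functional.normalize(torch.randn(n, k, dtype=torch.float64), dim=1)
V0 = torch.nn.functional.normalize(torch.randn(m, k, dtype=torch.float64), dim=1)
F_obs = U0 @ V0.t()                                   # observation F, so G(U0, V0) = 0

def G(z):
    """Objective (7) with U and V packed into a single vector z."""
    U, V = z[:n * k].view(n, k), z[n * k:].view(m, k)
    return 0.5 * (F_obs - U @ V.t()).pow(2).sum()

def restricted_condition_number(z):
    H = torch.autograd.functional.hessian(G, z)
    eig = torch.linalg.eigvalsh(H)
    nonzero = eig[eig > 1e-8 * eig.max()]             # "restricted": ignore the zero directions
    return (eig.max() / nonzero.min()).item()

for c in [1.0, 2.0, 4.0, 8.0]:
    z = torch.cat([(c * U0).reshape(-1), (V0 / c).reshape(-1)])
    print(f"c = {c:3.0f}  objective = {G(z).item():.1e}  "
          f"restricted condition number = {restricted_condition_number(z):.1f}")
```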
Next, we consider the proposed cosine SphereConv operator for one layer of the linear neural network. Based on our previous discussion of SphereConv, we consider an equivalent problem:

$$\min_{U \in \mathbb{R}^{n\times k},\, V \in \mathbb{R}^{m\times k}} G_S(U, V) = \tfrac{1}{2}\|F - D_U U V^\top D_V\|_F^2, \qquad (8)$$

where $D_U = \mathrm{diag}\big(\tfrac{1}{\|U_{1,:}\|_2}, \dots, \tfrac{1}{\|U_{n,:}\|_2}\big) \in \mathbb{R}^{n\times n}$ and $D_V = \mathrm{diag}\big(\tfrac{1}{\|V_{1,:}\|_2}, \dots, \tfrac{1}{\|V_{m,:}\|_2}\big) \in \mathbb{R}^{m\times m}$ are diagonal matrices. We provide an analogous result to Lemma 1 for (8).

Lemma 2. For any real $c > 1$, let $\widetilde{U} = cU$ and $\widetilde{V} = V/c$; then we have $\lambda_i(\nabla^2 G_S(\widetilde{U}, \widetilde{V})) = \lambda_i(\nabla^2 G_S(U, V))$ for all $i \in [(n+m)k] = \{1, 2, \dots, (n+m)k\}$ and $\kappa(\nabla^2 G_S(\widetilde{U}, \widetilde{V})) = \kappa(\nabla^2 G_S(U, V))$, where $\kappa$ is defined as in Lemma 1.

We see from Lemma 2 that the increase in condition number caused by scaling is eliminated by the SphereConv operator over the entire parameter space. This improves the geometric structure over (7), which further results in improved convergence of optimization procedures. If we extend the result from one layer to multiple layers, the scaling issue propagates: roughly speaking, when we train $N$ layers, in the worst case the conditioning of the problem can be $c^N$ times worse with a scaling factor $c > 1$. The analysis is similar to the one-layer case, but the computation of the Hessian matrix and the associated eigenvalues is much more complicated. Though our analysis is elementary, it provides an important insight and a straightforward illustration of the advantage of using the SphereConv operator. The extension to more general cases, e.g., with nonlinear activation functions (e.g., ReLU), requires a much more sophisticated analysis to bound the eigenvalues of the Hessian, which is deferred to future investigation.

2.4 Discussion

Comparison to convolutional operators. Convolutional operators compute the inner product between the kernels and the local patches, while SphereConv operators compute a function of the angle between the kernels and the local patches. If we normalize the convolutional operator in terms of both $w$ and $x$, the normalized convolutional operator is equivalent to the cosine SphereConv operator. Essentially, they use different metric spaces. Interestingly, SphereConv operators can also be interpreted as functions of the geodesic distance on a unit hypersphere.

Extension to fully connected layers. Because a fully connected layer can be viewed as a special convolution layer whose kernel size equals the input feature map, the SphereConv operators can be easily generalized to fully connected layers. This also indicates that SphereConv operators can be applied not only to deep CNNs, but also to linear models such as logistic regression, SVMs, etc.

Network regularization. Because the norm of the weights is no longer crucial, we stop using $\ell_2$ weight decay to regularize the network. SphereNets are learned on hyperspheres, so we regularize the network based on angles instead of norms. To avoid redundant kernels, we would like the kernels to be uniformly spaced around the hypersphere, but such constraints are difficult to formulate. As a tradeoff, we encourage orthogonality: given a set of kernels $W$ where the $i$-th column $W_i$ is the weights of the $i$-th kernel, the network also minimizes $\|W^\top W - I\|_F^2$, where $I$ is the identity matrix.
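A minimal sketch of this orthogonality penalty for one layer's kernels is given below (our illustrative code, not the authors' implementation; the function name and the 1e-4 weighting are assumptions). Each convolutional kernel is flattened into one column of $W$ before forming $\|W^\top W - I\|_F^2$.

```python
import torch

def orthogonality_penalty(weight: torch.Tensor) -> torch.Tensor:
    """||W^T W - I||_F^2, where column i of W is the flattened i-th kernel."""
    W = weight.flatten(1).t()                        # (in_ch * kH * kW, out_channels)
    I = torch.eye(W.shape[1], device=W.device, dtype=W.dtype)
    return ((W.t() @ W - I) ** 2).sum()

# usage: add the penalty of every SphereConv layer to the task loss
conv = torch.nn.Conv2d(16, 32, 3)
task_loss = torch.tensor(0.0)                        # placeholder for the task loss
total_loss = task_loss + 1e-4 * orthogonality_penalty(conv.weight)   # 1e-4 is an assumed weight
print(total_loss.item())
```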
Determining the optimal SphereConv. In practice, we can treat the type of SphereConv as a hyperparameter and use cross-validation to determine which SphereConv is the most suitable one. For the sigmoid SphereConv, we can also use cross-validation to determine its hyperparameter $k$. In general, we need to specify a SphereConv operator before using it, but prefixing a SphereConv may not be an optimal choice (even with cross-validation). What if we treat the hyperparameter $k$ in the sigmoid SphereConv as a learnable parameter and use back-propagation to learn it? Following this idea, we extend the sigmoid SphereConv to a learnable SphereConv in the next subsection.

SphereConv as normalization. Because SphereConv can partially address covariate shift, it can also serve as a normalization method similar to batch normalization. The difference is that SphereConv normalizes the network in terms of the feature map and the kernel weights, while batch normalization operates on mini-batches. Thus they do not contradict each other and can be used simultaneously.

2.5 Extension: Learnable SphereConv and SphereNorm

Learnable SphereConv. It is a natural idea to replace the current prefixed SphereConv with a learnable one. There are plenty of parametrization choices for a learnable SphereConv; we present a very simple learnable SphereConv operator based on the sigmoid SphereConv. Because the sigmoid SphereConv has a hyperparameter $k$, we can treat it as a learnable parameter that is updated by back-propagation. In back-propagation, $k$ is updated by $k^{(t+1)} = k^{(t)} - \eta \frac{\partial L}{\partial k^{(t)}}$, where $t$ denotes the current iteration index and $\frac{\partial L}{\partial k}$ is easily computed by the chain rule. Usually, we also require $k$ to be positive. The learning of $k$ is in fact similar to the parameter learning in PReLU [5].

SphereNorm: hyperspherical learning as a normalization method. Similar to batch normalization (BatchNorm), hyperspherical learning can also be viewed as a way of normalization, because SphereConv constrains the output to $[-1, 1]$ ($[0, 1]$ after ReLU). Different from BatchNorm, SphereNorm normalizes the network based on spatial information and the weights, so it does not depend on mini-batch statistics. Because SphereNorm normalizes both the input and the weights, it can avoid covariate shift caused by large weights and large inputs, while BatchNorm can only prevent covariate shift caused by the inputs. In this sense, it should work better than BatchNorm when the batch size is small. Besides, SphereConv is more flexible in terms of design choices (e.g., linear, cosine and sigmoid), and each may lead to different advantages. Similar to BatchNorm, we can use a rescaling strategy for SphereNorm: specifically, we rescale the output of SphereConv via $\beta F_s(w, x) + \gamma$, where $\beta$ and $\gamma$ are learned by back-propagation (as in BatchNorm, the rescaling parameters can be either learned or prefixed). In fact, SphereNorm does not conflict with BatchNorm at all and can be used simultaneously with it. Interestingly, we find that using both is empirically better than using either one alone.
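The learnable sigmoid SphereConv and the SphereNorm-style rescaling can be sketched as follows (our illustrative PyTorch module, not the authors' implementation; the per-filter $k$, the softplus reparametrization that keeps $k$ positive, and the module name are our own choices). The module expects the angles $\theta$ already computed by a SphereConv layer and applies $g_k(\theta)$ from Eq. (4), optionally followed by $\beta g + \gamma$.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSigmoidG(nn.Module):
    """g_k(theta) from Eq. (4) with a learnable k per filter, plus optional
    SphereNorm-style affine rescaling beta * g + gamma."""
    def __init__(self, num_filters, k_init=0.5, rescale=True):
        super().__init__()
        # parametrize k through softplus so that k stays positive (one possible choice)
        self.k_raw = nn.Parameter(torch.full((num_filters,), math.log(math.exp(k_init) - 1.0)))
        self.rescale = rescale
        if rescale:
            self.beta = nn.Parameter(torch.ones(num_filters))
            self.gamma = nn.Parameter(torch.zeros(num_filters))

    def forward(self, theta):                 # theta: (N, num_filters, H, W), values in [0, pi]
        k = F.softplus(self.k_raw).view(1, -1, 1, 1)
        u = theta / k - math.pi / (2 * k)
        g = (1 + torch.exp(-math.pi / (2 * k))) / (1 - torch.exp(-math.pi / (2 * k))) \
            * (1 - torch.exp(u)) / (1 + torch.exp(u))
        if self.rescale:
            g = self.beta.view(1, -1, 1, 1) * g + self.gamma.view(1, -1, 1, 1)
        return g

theta = torch.rand(2, 8, 5, 5) * math.pi      # stand-in angles from a SphereConv layer
g = LearnableSigmoidG(8)
print(g(theta).shape)                         # k, beta and gamma are all trained by back-prop
```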
3 Learning Objective on Hyperspheres

For learning on hyperspheres, we can either use a conventional loss function such as the softmax loss, or use loss functions tailored to the SphereConv operators. We present some possible choices for these tailored loss functions.

Weight-normalized softmax loss. The input feature and its label are denoted as $x_i$ and $y_i$, respectively. The original softmax loss can be written as $L = \frac{1}{N}\sum_i L_i = \frac{1}{N}\sum_i -\log\big(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\big)$, where $N$ is the number of training samples and $f_j$ is the score of the $j$-th class ($j \in [1, K]$, with $K$ the number of classes). The class score vector $f$ is usually the output of a fully connected layer $W$, so we have $f_j = W_j^\top x_i + b_j$ and $f_{y_i} = W_{y_i}^\top x_i + b_{y_i}$, in which $x_i$, $W_j$ and $W_{y_i}$ are the $i$-th training sample and the $j$-th and $y_i$-th columns of $W$, respectively. We can rewrite $L_i$ as

$$L_i = -\log\bigg(\frac{e^{W_{y_i}^\top x_i + b_{y_i}}}{\sum_j e^{W_j^\top x_i + b_j}}\bigg) = -\log\bigg(\frac{e^{\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i,i}) + b_{y_i}}}{\sum_j e^{\|W_j\|\|x_i\|\cos(\theta_{j,i}) + b_j}}\bigg),$$

where $\theta_{j,i}$ ($0 \le \theta_{j,i} \le \pi$) is the angle between the vectors $W_j$ and $x_i$. The decision boundary of the original softmax loss is determined by the vector $f$; specifically, in the binary-class case, the decision boundary of the softmax loss is $W_1^\top x + b_1 = W_2^\top x + b_2$. Considering the intuition of the SphereConv operators, we want the decision boundary to depend only on the angles. To this end, we normalize the weights ($\|W_j\| = 1$) and zero out the biases ($b_j = 0$), following the intuition in [11] (we could sometimes keep the biases when the data is imbalanced). The decision boundary becomes $\|x\|\cos(\theta_1) = \|x\|\cos(\theta_2)$. Similar to SphereConv, we can generalize the decision boundary to $\|x\| g(\theta_1) = \|x\| g(\theta_2)$, so the weight-normalized softmax (W-Softmax) loss can be written as

$$L_i = -\log\bigg(\frac{e^{\|x_i\| g(\theta_{y_i,i})}}{\sum_j e^{\|x_i\| g(\theta_{j,i})}}\bigg),$$

where $g(\cdot)$ can take the form of the linear, cosine or sigmoid SphereConv. We thus term these three weight-normalized loss functions the linear W-Softmax loss, cosine W-Softmax loss and sigmoid W-Softmax loss, respectively.

Generalized angular softmax loss. Inspired by [11], we use a multiplicative parameter $m$ to impose margins on hyperspheres. We propose a generalized angular softmax (GA-Softmax) loss which extends the W-Softmax loss to a loss function that favors a large angular margin in the feature distribution. In general, the GA-Softmax loss is formulated as

$$L_i = -\log\bigg(\frac{e^{\|x_i\| g(m\theta_{y_i,i})}}{e^{\|x_i\| g(m\theta_{y_i,i})} + \sum_{j\neq y_i} e^{\|x_i\| g(\theta_{j,i})}}\bigg),$$

where $g(\cdot)$ can again have the linear, cosine or sigmoid form, similar to the W-Softmax loss. We can see that the A-Softmax loss [11] is exactly the cosine GA-Softmax loss, and the W-Softmax loss is the special case ($m = 1$) of the GA-Softmax loss. Note that we usually require $\theta_{j,i} \in [0, \frac{\pi}{m}]$, because $\cos(\theta_{j,i})$ is only monotonically decreasing on $[0, \pi]$. To address this, [12, 11] construct a monotonically decreasing function recursively using the $[0, \frac{\pi}{m}]$ part of $\cos(m\theta_{j,i})$. Although this indeed partially addresses the issue, it may introduce a number of saddle points (w.r.t. $W$) in the loss surface. Originally, $\frac{\partial g(\theta)}{\partial \theta}$ is close to 0 only when $\theta$ is close to 0 or $\pi$. However, in L-Softmax [12] or A-Softmax (cosine GA-Softmax), this is no longer the case: $\frac{\partial g(\theta)}{\partial \theta}$ is 0 whenever $\theta = \frac{k\pi}{m},\ k = 0, \dots, m$, which may cause instability in training. The sigmoid GA-Softmax loss has similar issues. However, if we use the linear GA-Softmax loss, this problem is automatically avoided and training can become more stable in practice. There are many possible choices of $g(\cdot)$ for designing a specific GA-Softmax loss, and each one has different optimization dynamics. The optimal choice may depend on the task itself (e.g., the cosine GA-Softmax has been shown to be effective in deep face recognition [11]).
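A compact PyTorch sketch of these two losses is given below (illustrative only; the function names and the clamping constant are our own). The cosine W-Softmax corresponds to $g = \cos$, and the GA-Softmax variant shown uses the linear $g$, for which multiplying the target angle by $m$ needs no extra monotonicity fix.

```python
import math
import torch
import torch.nn.functional as F

def angular_logits(feats, weight, g):
    """logits[i, j] = ||x_i|| * g(theta_{j,i}), with class weights on the unit hypersphere."""
    cos = F.normalize(feats, dim=1) @ F.normalize(weight, dim=1).t()
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    return feats.norm(dim=1, keepdim=True) * g(theta), theta

def w_softmax_loss(feats, labels, weight, g=torch.cos):
    logits, _ = angular_logits(feats, weight, g)              # cosine W-Softmax by default
    return F.cross_entropy(logits, labels)

def ga_softmax_loss(feats, labels, weight, m=4, g=lambda t: 1 - (2 / math.pi) * t):
    _, theta = angular_logits(feats, weight, g)
    idx = labels.view(-1, 1)
    theta = theta.scatter(1, idx, m * theta.gather(1, idx))   # enlarge only the target-class angle
    logits = feats.norm(dim=1, keepdim=True) * g(theta)
    return F.cross_entropy(logits, labels)

# toy usage: 10 classes, 64-dim features from the last layer
feats, labels = torch.randn(8, 64), torch.randint(0, 10, (8,))
weight = torch.randn(10, 64, requires_grad=True)              # one class weight per row
print(w_softmax_loss(feats, labels, weight).item(),
      ga_softmax_loss(feats, labels, weight).item())
```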
Discussion of the sphere-normalized softmax loss. We have also considered the sphere-normalized softmax loss (S-Softmax), which simultaneously normalizes the weights ($W_j$) and the feature $x$. It seems a more natural choice than W-Softmax for the proposed SphereConv and makes the entire framework more unified. In fact, we have tried this, and the empirical results are not that good, because the optimization becomes very difficult. If we use the S-Softmax loss to train a network from scratch, we cannot get reasonable results without extra tricks, which is why we do not use it in this paper. For completeness, we give some discussion here. Normally, it is very difficult to make the S-Softmax loss value small enough, because we normalize the features to the unit hypersphere. To make this loss work, we need to either normalize the features to a value much larger than 1 (a hypersphere with a large radius) and then tune the learning rate, or first train the network with the softmax loss from scratch and then use the S-Softmax loss for fine-tuning.

4 Experiments and Results

4.1 Experimental Settings

We first perform a comprehensive ablation study and exploratory experiments for the proposed SphereNets, and then evaluate SphereNets on image classification. For the image classification task, we perform experiments on the CIFAR-10 (only with random left-right flipping), CIFAR-10+ (with full data augmentation), CIFAR-100 and large-scale ImageNet-2012 [17] datasets.

General settings. For CIFAR-10, CIFAR-10+ and CIFAR-100, we follow the same settings as [7, 12]. For the ImageNet-2012 dataset, we mostly follow the settings in [9]. More details are given in Appendix B. For fairness, batch normalization and ReLU are used in all methods if not specified, and the compared CNNs have the same architectures as the SphereNets.

Training. Appendix A gives the network details. For CIFAR-10 and CIFAR-100, we use ADAM, starting with a learning rate of 0.001. The batch size is 128 if not specified. The learning rate is divided by 10 at 34K and 54K iterations, and training stops at 64K iterations. For both the A-Softmax and GA-Softmax losses, we use $m = 4$. For ImageNet-2012, we use SGD with momentum 0.9. The learning rate starts at 0.1 and is divided by 10 at 200K and 375K iterations; training stops at 550K iterations.

4.2 Ablation Study and Exploratory Experiments

We perform a comprehensive ablation and exploratory study of SphereNet and evaluate every component individually in order to analyze its advantages. We use the 9-layer CNN by default (if not specified) and perform image classification on CIFAR-10 without any data augmentation.

| SphereConv Operator / Loss | Original Softmax | Sigmoid (0.1) W-Softmax | Sigmoid (0.3) W-Softmax | Sigmoid (0.7) W-Softmax | Linear W-Softmax | Cosine W-Softmax | A-Softmax (m=4) | GA-Softmax (m=4) |
|---|---|---|---|---|---|---|---|---|
| Sigmoid (0.1) | 90.97 | 90.91 | 90.89 | 90.88 | 91.07 | 91.13 | 91.87 | 91.99 |
| Sigmoid (0.3) | 91.08 | 91.44 | 91.37 | 91.21 | 91.34 | 91.28 | 92.13 | 92.38 |
| Sigmoid (0.7) | 91.05 | 91.16 | 91.47 | 91.07 | 90.99 | 91.18 | 92.22 | 92.36 |
| Linear | 91.10 | 90.93 | 91.42 | 90.96 | 90.95 | 91.24 | 92.21 | 92.32 |
| Cosine | 90.89 | 90.88 | 91.08 | 91.22 | 91.17 | 90.99 | 91.94 | 92.19 |
| Original Conv | 90.58 | 90.58 | 90.73 | 90.78 | 91.08 | 90.68 | 91.78 | 91.80 |

Table 1: Classification accuracy (%) with different loss functions.

Comparison of different loss functions. We first evaluate all the SphereConv operators with different loss functions. All the compared SphereConv operators use the 9-layer CNN architecture in this experiment.
From the results in Table 1, one can observe that the SphereConv operators consistently outperform the original convolutional operator. For the compared loss functions other than A-Softmax and GA-Softmax, the effect on accuracy is less pronounced than that of the SphereConv operators, but sigmoid W-Softmax is more flexible and thus works slightly better than the others. The sigmoid SphereConv operators with a suitably chosen parameter also work better than the others. Note that the W-Softmax loss is in fact comparable to the original softmax loss, because our SphereNet optimizes angles and W-Softmax is derived from the original softmax loss; it is therefore fair to compare SphereNet with W-Softmax against CNN with softmax loss. From Table 1, we can see that SphereConv operators are consistently better than the convolutional operators. When a large-margin loss function such as A-Softmax [11] or the proposed GA-Softmax is used, the accuracy can be further boosted. One may notice that A-Softmax is in fact the cosine GA-Softmax. The superior performance of A-Softmax with SphereNet shows that our architecture is more suitable for learning with angular losses. Moreover, our proposed large-margin loss (linear GA-Softmax) performs the best among all the compared loss functions.

Comparison of different network architectures. We are also interested in how our SphereConv operators work in different architectures. We evaluate all the proposed SphereConv operators with the same architecture at different depths and with a totally different architecture (ResNet). Our baseline CNN architecture follows the design of the VGG network [18], only with different numbers of convolutional layers. For fair comparison, we use cosine W-Softmax for all SphereConv operators and the original softmax for the original convolution operators. From the results in Table 2, one can see that SphereNets greatly outperform the CNN baselines, usually with more than 1% improvement. When applied to ResNet, our SphereConv operators also work better than the baseline. Note that we use a ResNet architecture similar to the CIFAR-10 experiment in [6]; we do not use data augmentation for CIFAR-10 in this experiment, so the ResNet accuracy is much lower than that reported in [6]. Our results on different network architectures show consistent and significant improvement over CNNs.

| SphereConv Operator | CNN-3 | CNN-9 | CNN-18 | CNN-45 | CNN-60 | ResNet-32 |
|---|---|---|---|---|---|---|
| Sigmoid (0.1) | 82.08 | 91.13 | 91.43 | 89.34 | 87.67 | 90.94 |
| Sigmoid (0.3) | 81.92 | 91.28 | 91.55 | 89.73 | 87.85 | 91.7 |
| Sigmoid (0.7) | 82.4 | 91.18 | 91.69 | 89.85 | 88.42 | 91.19 |
| Linear | 82.31 | 91.15 | 91.24 | 90.15 | 89.91 | 91.25 |
| Cosine | 82.23 | 90.99 | 91.23 | 90.05 | 89.28 | 91.38 |
| Original Conv | 81.19 | 90.68 | 90.62 | 88.23 | 88.15 | 90.40 |

Table 2: Classification accuracy (%) with different network architectures.

| SphereConv Operator | Acc. (%) |
|---|---|
| Sigmoid (0.1) | 86.29 |
| Sigmoid (0.3) | 85.67 |
| Sigmoid (0.7) | 85.51 |
| Linear | 85.34 |
| Cosine | 85.25 |
| CNN w/o ReLU | 80.73 |

Table 3: Accuracy (%) without ReLU.

Comparison of different widths (numbers of filters). We evaluate SphereNet with different numbers of filters. Fig. 3(c) shows the convergence of SphereNets of different widths; 16/32/48 means conv1.x, conv2.x and conv3.x have 16, 32 and 48 filters, respectively. One can observe that when the number of filters is small, SphereNet performs similarly to (slightly worse than) the CNN. However, as we increase the number of filters, the final accuracy surpasses the CNN baseline, with even faster and more stable convergence.
With large width, we find that SphereNets perform consistently better than the CNN baselines, showing that SphereNets can make better use of the width.

Learning without ReLU. We note that SphereConv operators are no longer plain matrix multiplications, so they are essentially non-linear functions. Because the SphereConv operators already introduce a certain non-linearity to the network, we evaluate how much gain such non-linearity brings. To this end, we remove the ReLU activation and compare our SphereNet with CNNs without ReLU. The results are given in Table 3; all the compared methods use 18-layer CNNs (with BatchNorm). Although removing ReLU greatly reduces the classification accuracy, our SphereNet still outperforms the CNN without ReLU by a significant margin, showing its rich non-linearity and representation power.

[Figure 3: Testing accuracy over iterations. (a) ResNet vs. SphereResNet on CIFAR-10/10+. (b) Plain CNN vs. plain SphereNet (cosine/linear/sigmoid, with and without orthogonality constraints) on CIFAR-10. (c) Different widths of SphereNet (16/32/48 up to 256/384/512 filters) on CIFAR-10. (d) Ultra-deep (69-layer) plain CNN vs. ultra-deep plain SphereNet on CIFAR-10.]

Convergence. One of the most significant advantages of SphereNet is its training stability and convergence speed. We evaluate the convergence with two different architectures: CNN-9 and ResNet-32. For fair comparison, we use the original softmax loss for all compared methods (including SphereNets). ADAM is used for the stochastic optimization and the learning rate is the same for all networks. From Fig. 3(a), SphereResNet converges significantly faster than the original ResNet baseline on both CIFAR-10 and CIFAR-10+, and the final accuracy is also higher than the baselines. In Fig. 3(b), we evaluate SphereNet with and without orthogonality constraints on the kernel weights. With the same network architecture, SphereNet also converges much faster and performs better than the baselines, and the orthogonality constraints can bring additional performance gains in some cases. Generally, from Fig. 3 one can observe that SphereNet converges quickly and very stably in every case, while the CNN baseline fluctuates within a relatively wide range.

Optimizing ultra-deep networks. Partially because of the alleviation of the covariate shift problem and the improvement in conditioning, our SphereNet is able to optimize ultra-deep neural networks without using residual units or any form of shortcuts. For SphereNets, we use the cosine SphereConv operator with the cosine W-Softmax loss. We directly optimize a very deep plain network with 69 stacked convolutional layers. From Fig. 3(d), one can see that SphereNet converges much more easily than the CNN baseline and is able to achieve nearly 90% final accuracy.
4.3 Preliminary Study towards the Learnable SphereConv

Although the learnable SphereConv is not a main theme of this paper, we still run some preliminary evaluations of it. For the proposed learnable sigmoid SphereConv, we learn the parameter $k$ independently for each filter; it is also trivial to learn it in a layer-shared or network-shared fashion. With the same 9-layer architecture used in Section 4.2, the learnable SphereConv (with cosine W-Softmax loss) achieves 91.64% on CIFAR-10 (without full data augmentation), while the best sigmoid SphereConv (with cosine W-Softmax loss) achieves 91.22%. In Fig. 4, we plot the frequency histogram of $k$ in conv1.1 (64 filters), conv2.1 (96 filters) and conv3.1 (128 filters) of the final learned SphereNet. We observe that each layer learns a different distribution of $k$: the first convolutional layer (conv1.1) tends to spread $k$ roughly uniformly over a large range of values from 0 to 1, potentially extracting information from all levels of angular similarity; the fourth convolutional layer (conv2.1) tends to learn a more concentrated distribution of $k$ than conv1.1; and the seventh convolutional layer (conv3.1) learns a highly concentrated distribution of $k$ centered around 0.8. Note that we initialize all $k$ to the constant 0.5 and learn them with back-propagation.

[Figure 4: Frequency histogram of the learned $k$ (values in [0, 1]) for conv1.1, conv2.1 and conv3.1.]

4.4 Evaluation of SphereNorm

From Section 4.2, we can clearly see the convergence advantage of SphereNets. In general, we can view SphereConv as a normalization method (comparable to batch normalization) that can be applied to all kinds of networks. This section evaluates the challenging scenario where the mini-batch size is small (results with batch size 128 can be found in Section 4.2); we use the same 9-layer CNN as in Section 4.2. For simplicity, we use the cosine SphereConv as SphereNorm. The softmax loss is used in both CNNs and SphereNets. From Fig. 5, we observe that SphereNorm achieves a final accuracy similar to BatchNorm, but SphereNorm converges faster and more stably. SphereNorm plus the orthogonality constraint helps convergence a little, while the rescaled SphereNorm does not seem to work well. When BatchNorm and SphereNorm are used together, we obtain the fastest convergence and the highest final accuracy, showing the excellent compatibility of SphereNorm.

[Figure 5: Convergence under different mini-batch sizes (4, 8, 16, 32) on CIFAR-10 (same setting as Section 4.2), comparing BatchNorm, SphereNorm, rescaled SphereNorm, SphereNorm w/ orthogonality, and SphereNorm+BatchNorm.]

4.5 Image Classification on CIFAR-10+ and CIFAR-100

| Method | CIFAR-10+ | CIFAR-100 |
|---|---|---|
| ELU [2] | 94.16 | 72.34 |
| FitResNet (LSUV) [14] | 93.45 | 65.72 |
| ResNet-1001 [7] | 95.38 | 77.29 |
| Baseline ResNet-32 (softmax) | 93.26 | 72.85 |
| SphereResNet-32 (S-SW) | 94.47 | 76.02 |
| SphereResNet-32 (L-LW) | 94.33 | 75.62 |
| SphereResNet-32 (C-CW) | 94.64 | 74.92 |
| SphereResNet-32 (S-G) | 95.01 | 76.39 |

Table 4: Accuracy (%) on CIFAR-10+ and CIFAR-100.
We first evaluate SphereNet in a classic image classification task. We use the CIFAR-10+ and CIFAR-100 datasets and perform random flipping (both horizontal and vertical) and random cropping as data augmentation (CIFAR-10 with full data augmentation is denoted CIFAR-10+). We use ResNet-32 as the baseline architecture. For the SphereNet of the same architecture, we evaluate the sigmoid SphereConv operator ($k = 0.3$) with the sigmoid W-Softmax ($k = 0.3$) loss (S-SW), the linear SphereConv operator with the linear W-Softmax loss (L-LW), the cosine SphereConv operator with the cosine W-Softmax loss (C-CW), and the sigmoid SphereConv operator ($k = 0.3$) with the GA-Softmax loss (S-G). In Table 4, we can see that SphereNet outperforms a number of current state-of-the-art methods and is even comparable to ResNet-1001, which is far deeper than ours. This experiment further validates our idea that learning on hyperspheres constrains the parameter space to a more semantic and label-related one.

4.6 Large-scale Image Classification on ImageNet-2012

We evaluate SphereNets on the large-scale ImageNet-2012 dataset. We only use a minimal data augmentation strategy in this experiment (details are in Appendix B). For the ResNet-18 baseline and SphereResNet-18, we use the same filter numbers in each layer. We develop two variants of SphereResNet-18, termed v1 and v2. In SphereResNet-18-v2, we do not use SphereConv in the 1×1 shortcut convolutions that match the number of channels, while in SphereResNet-18-v1 we do. Fig. 6 shows the single-crop validation error over iterations. One can observe that both SphereResNets converge much faster than the ResNet baseline; SphereResNet-18-v1 converges the fastest but yields a slightly worse yet comparable accuracy, while SphereResNet-18-v2 not only converges faster than ResNet-18 but also shows slightly better accuracy.

[Figure 6: Top-1 and top-5 single-crop validation error (%) on ImageNet over iterations for ResNet-18, SphereResNet-18-v1 and SphereResNet-18-v2.]

5 Limitations and Future Work

Our work still has some limitations: (1) SphereNets show a large performance gain when the network is wide enough; if the network is not wide enough, SphereNets still converge much faster but yield slightly worse (yet comparable) recognition accuracy. (2) The computational complexity of each neuron is slightly higher than in CNNs. (3) SphereConvs are still mostly prefixed. Possible future work includes designing or learning better SphereConvs, computing the angles more efficiently to reduce computational complexity, applications to tasks that require fast convergence (e.g., reinforcement learning and recurrent neural networks), better angular regularization to replace orthogonality, etc.

Acknowledgements

We thank Zhen Liu (Georgia Tech) for helping with the experiments and providing suggestions. This project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA and Amazon AWS. Xingguo Li is supported by a doctoral dissertation fellowship from the University of Minnesota. Yan-Ming Zhang is supported by the National Natural Science Foundation of China under Grant 61773376.

References
[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289, 2015.
[3] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv:1603.05027, 2016.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[10] Xingguo Li, Zhaoran Wang, Junwei Lu, Raman Arora, Jarvis Haupt, Han Liu, and Tuo Zhao. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv:1612.09296, 2016.
[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[12] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[14] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv:1511.06422, 2015.
[15] Yuji Nakatsukasa. Eigenvalue perturbation bounds for Hermitian block tridiagonal matrices. Applied Numerical Mathematics, 62(1):67–78, 2012.
[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, pages 1–42, 2014.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[20] Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
[21] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv:1703.01827, 2017.