Learning towards Minimum Hyperspherical Energy

Weiyang Liu1,*, Rongmei Lin2,*, Zhen Liu1,*, Lixin Liu3, Zhiding Yu4, Bo Dai1,5, Le Song1,6
1Georgia Institute of Technology 2Emory University 3South China University of Technology 4NVIDIA 5Google Brain 6Ant Financial

Abstract

Neural networks are a powerful class of nonlinear functions that can be trained end-to-end on various applications. While the over-parametrized nature of many neural networks gives them the capacity to fit complex functions and strong representation power for handling challenging tasks, it also leads to highly correlated neurons that can hurt generalization and incur unnecessary computation cost. As a result, how to regularize the network to avoid undesired representation redundancy becomes an important issue. To this end, we draw inspiration from a well-known problem in physics, the Thomson problem, where one seeks a state that distributes N electrons on a unit sphere as evenly as possible with minimum potential energy. In light of this intuition, we reduce the redundancy regularization problem to generic energy minimization, and propose a minimum hyperspherical energy (MHE) objective as generic regularization for neural networks. We also propose a few novel variants of MHE, and provide some insights from a theoretical point of view. Finally, we apply neural networks with MHE regularization to several challenging tasks. Extensive experiments demonstrate the effectiveness of our intuition by showing superior performance with MHE regularization.

1 Introduction

The recent success of deep neural networks has led to their wide application in a variety of tasks. With their over-parametrized nature and deep layered architectures, current deep networks [14, 46, 42] are able to achieve impressive performance on large-scale problems.
Despite such success, the redundant and highly correlated neurons (e.g., weights of kernels/filters in convolutional neural networks (CNNs)) caused by over-parametrization present an issue [37, 41], which has motivated a series of influential works on network compression [10, 1] and parameter-efficient network architectures [16, 19, 62]. These works either compress the network by pruning redundant neurons or directly modify the network architecture, aiming to achieve comparable performance with fewer parameters. Yet it remains an open problem to find a unified and principled theory that guides network compression in the context of optimal generalization ability. Another stream of works seeks to further release the networks' generalization power by alleviating redundancy through diversification [57, 56, 5, 36], as rigorously analyzed by [59]. Most of these works address the redundancy problem by enforcing relatively large diversity between pairwise projection bases via regularization. Our work broadly falls into this category and shares a similar high-level target, but the spirit and motivation behind our proposed models are distinct. In particular, there is a recent trend of studies that feature the significance of angular learning at both the loss and convolution levels [29, 28, 30, 27], based on the observation that the angles in deep embeddings learned by CNNs tend to encode semantic difference. The key intuition is that angles preserve the most abundant and discriminative information for visual recognition. As a result, hyperspherical geodesic distances between neurons naturally play a key role in this context, and it is therefore intuitively desirable to impose discrimination by keeping the neurons' projections on the hypersphere as far away from each other as possible.

* indicates equal contributions. Correspondence to: Weiyang Liu.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
While the concept of imposing large angular diversity was also considered in [59, 57, 56, 36], these works do not consider diversity in terms of a global equidistribution of embeddings on the hypersphere, and fail to achieve state-of-the-art performance. Given the above motivation, we draw inspiration from a well-known physics problem called the Thomson problem [48, 43]. The goal of the Thomson problem is to determine the minimum electrostatic potential energy configuration of N mutually-repelling electrons on the surface of a unit sphere. We identify an intrinsic resemblance between the Thomson problem and our target, in the sense that diversifying neurons can be seen as searching for an optimal configuration of electron locations. Accordingly, we characterize the diversity of a group of neurons by defining a generic hyperspherical potential energy using their pairwise relationships: higher energy implies higher redundancy, while lower energy indicates that the neurons are more diverse and more uniformly spaced. To reduce the redundancy of neurons and improve neural networks, we propose a novel minimum hyperspherical energy (MHE) regularization framework, where the diversity of neurons is promoted by minimizing the hyperspherical energy in each layer. As verified by comprehensive experiments on multiple tasks, MHE is able to consistently improve the generalization power of neural networks.

Figure 1: Orthonormal, MHE and half-space MHE regularization (panels, left to right: orthonormal, MHE, half-space MHE). The red dots denote the neurons optimized by the gradient of the corresponding regularization. The rightmost pink dots denote the virtual negative neurons. We randomly initialize the weights of 10 neurons on a 3D sphere and optimize them with SGD.

MHE faces different situations when it is applied to hidden layers and output layers.
For hidden layers, applying MHE straightforwardly may still encourage some degree of redundancy, since it can produce colinear bases pointing in opposite directions (see Fig. 1, middle). To avoid such redundancy, we propose the half-space MHE, which constructs a group of virtual neurons and minimizes the hyperspherical energy of the existing and virtual neurons together. For output layers, MHE aims to distribute the classifier neurons¹ as uniformly as possible to improve inter-class feature separability. Different from MHE in hidden layers, classifier neurons should be distributed in the full space for the best classification performance [29, 28]. An intuitive comparison among the widely used orthonormal regularization, the proposed MHE, and half-space MHE is provided in Fig. 1. One can observe that both MHE and half-space MHE are able to uniformly distribute the neurons over the hypersphere and the half-space hypersphere, respectively. In contrast, conventional orthonormal regularization tends to group neurons closer together, especially when the number of neurons is greater than the dimension. MHE is originally defined on Euclidean distance, as in the Thomson problem. However, we further consider minimizing a hyperspherical energy defined with respect to angular distance, which we refer to as angular MHE (A-MHE) in the rest of the paper. In addition, we give some theoretical insights into MHE regularization by discussing its asymptotic behavior and generalization error. Last, we apply MHE regularization to multiple vision tasks, including generic object recognition, class-imbalance learning, and face recognition. In the experiments, we show that MHE is architecture-agnostic and can considerably improve generalization ability.

2 Related Works

Diversity regularization has been shown useful in sparse coding [32, 35], ensemble learning [26, 24], self-paced learning [21], metric learning [58], etc.
Early studies in sparse coding [32, 35] show that the generalization ability of a codebook can be improved via diversity regularization, where diversity is often modeled using the (empirical) covariance matrix. More recently, a series of studies have featured diversity regularization in neural networks [59, 57, 56, 5, 36, 55], where regularization is mostly achieved by promoting large angles/orthogonality, or by reducing covariance between bases. Our work differs from these studies by formulating the diversity of neurons on the entire hypersphere, therefore promoting diversity from a more global, top-down perspective. Methods other than diversity-promoting regularization have been widely proposed to improve CNNs [44, 20, 33, 30] and generative adversarial nets (GANs) [4, 34]. MHE can be regarded as a complement that can be applied on top of these methods.

¹Classifier neurons are the projection bases of the last layer (i.e., the output layer) before the softmax.

3 Learning Neurons towards Minimum Hyperspherical Energy

3.1 Formulation of Minimum Hyperspherical Energy

Minimum hyperspherical energy defines an equilibrium state of the configuration of the neurons' directions. We argue that the representation power of each layer can be characterized by the hyperspherical energy of its neurons, and therefore a minimal-energy configuration of neurons can induce better generalization. Before delving into details, we first define the hyperspherical energy functional for N neurons (i.e., kernels) of dimension d+1, W_N = \{w_1, \dots, w_N \in \mathbb{R}^{d+1}\}, as

E_{s,d}(\hat{w}_i|_{i=1}^{N}) = \sum_{i \neq j} f_s(\|\hat{w}_i - \hat{w}_j\|) = \begin{cases} \sum_{i \neq j} \|\hat{w}_i - \hat{w}_j\|^{-s}, & s > 0 \\ \sum_{i \neq j} \log(\|\hat{w}_i - \hat{w}_j\|^{-1}), & s = 0 \end{cases} \quad (1)

where \|\cdot\| denotes the Euclidean norm, f_s(\cdot) is a decreasing real-valued function, and \hat{w}_i = w_i / \|w_i\| is the i-th neuron weight projected onto the unit hypersphere S^d = \{w \in \mathbb{R}^{d+1} : \|w\| = 1\}. We also denote \hat{W}_N = \{\hat{w}_1, \dots, \hat{w}_N \in S^d\}, and write E_s = E_{s,d}(\hat{w}_i|_{i=1}^{N}) for short.
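To make Eq. (1) concrete, here is a minimal numpy sketch of the hyperspherical s-energy. The function name and vectorized layout are our own (not from the paper's released code); neurons are stored as rows and projected onto the hypersphere before pairwise distances are taken.

```python
import numpy as np

def hyperspherical_energy(W, s=1.0):
    """Hyperspherical s-energy of Eq. (1), summed over ordered pairs i != j.

    W : (N, d+1) array, one neuron per row (unnormalized is fine; rows are
        projected onto the unit hypersphere S^d first).
    s : s > 0 uses the Riesz kernel f_s(z) = z**(-s); s == 0 uses the
        logarithmic kernel f_0(z) = log(1/z).
    """
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)              # project onto S^d
    dist = np.linalg.norm(W_hat[:, None, :] - W_hat[None, :, :], axis=-1)
    off_diag = dist[~np.eye(len(W), dtype=bool)]                      # drop i == j terms
    if s > 0:
        return float(np.sum(off_diag ** (-s)))
    return float(np.sum(np.log(1.0 / off_diag)))
```

For two antipodal unit vectors the only pairwise distance is 2, so E_1 = 2 · (1/2) = 1, while an orthogonal pair gives E_1 = 2/√2 ≈ 1.414: in the full space, the antipodal pair is correctly ranked as the lower-energy configuration.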
There are plenty of choices for f_s(\cdot), but in this paper we use f_s(z) = z^{-s}, s > 0, known as the Riesz s-kernels. In particular, as s \to 0, z^{-s} \approx s \log(z^{-1}) + 1, which is an affine transformation of \log(z^{-1}). It follows that optimizing the logarithmic hyperspherical energy E_0 = \sum_{i \neq j} \log(\|\hat{w}_i - \hat{w}_j\|^{-1}) is essentially the limiting case of optimizing the hyperspherical energy E_s. We therefore define f_0(z) = \log(z^{-1}) for convenience. The goal of the MHE criterion is to minimize the energy in Eq. (1) by varying the orientations of the neuron weights w_1, \dots, w_N. To be precise, we solve the optimization problem \min_{W_N} E_s with s \geq 0. In particular, when s = 0, we solve the logarithmic energy minimization problem:

\arg\min_{W_N} E_0 = \arg\min_{W_N} \exp(E_0) = \arg\max_{W_N} \prod_{i \neq j} \|\hat{w}_i - \hat{w}_j\|, \quad (2)

in which we essentially maximize the product of Euclidean distances. E_0, E_1 and E_2 have interesting yet profound connections. Note that the Thomson problem corresponds to minimizing E_1, which is an NP-hard problem, so in practice one can only compute approximate solutions by heuristics. In neural networks, such a differentiable objective can be directly optimized via gradient descent.

3.2 Logarithmic Hyperspherical Energy E_0 as a Relaxation

Optimizing the original energy in Eq. (1) is equivalent to optimizing its logarithmic form \log E_s. To efficiently solve this difficult optimization problem, we can instead optimize a lower bound of \log E_s as a surrogate energy. Applying Jensen's inequality (\log is concave) gives

\log E_s = \log \sum_{i \neq j} f_s(\|\hat{w}_i - \hat{w}_j\|) \geq \frac{1}{N(N-1)} \sum_{i \neq j} \log f_s(\|\hat{w}_i - \hat{w}_j\|) + \log N(N-1). \quad (3)

With f_s(z) = z^{-s}, s > 0, the data-dependent term E_{\log} := \sum_{i \neq j} \log f_s(\|\hat{w}_i - \hat{w}_j\|) becomes s \, E_0 = s \sum_{i \neq j} \log(\|\hat{w}_i - \hat{w}_j\|^{-1}), which is identical to the logarithmic hyperspherical energy E_0 up to a multiplicative factor s. Therefore, minimizing E_0 can also be viewed as a relaxation of minimizing E_s for s > 0.

3.3 MHE as Regularization for Neural Networks

Now that we have introduced the formulation of MHE, we propose MHE regularization for neural networks.
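As noted in Sec. 3.1, the energies above are differentiable and can be minimized directly by gradient descent. The following self-contained sketch (our own illustration, not the paper's code) spreads N = 4 neurons on the circle S^1 by descending E_0 and renormalizing after every step, i.e., projected gradient descent on the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 2                                # 4 neurons on the circle S^1
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

lr = 0.05
for _ in range(5000):
    diff = W[:, None, :] - W[None, :, :]   # pairwise differences w_i - w_j
    sq = np.sum(diff ** 2, axis=-1)        # squared pairwise distances
    np.fill_diagonal(sq, np.inf)           # exclude the i == j terms
    # gradient of E_0 = sum_{i != j} log(1 / ||w_i - w_j||) w.r.t. each w_i
    grad = -2.0 * np.sum(diff / sq[:, :, None], axis=1)
    W -= lr * grad                         # Euclidean descent step ...
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # ... project back onto S^1

# At the minimum the 4 points are equally spaced (a square), whose smallest
# pairwise distance is sqrt(2).
pair = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
np.fill_diagonal(pair, np.inf)
min_dist = pair.min()
```

In a network, the same gradient simply flows through the regularization term alongside the task loss, rather than being applied with an explicit renormalization step.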
In supervised neural network learning, the entire objective function is:

\min \; \underbrace{\frac{1}{m}\sum_{j=1}^{m} \ell\big(\{\langle w^{\text{out}}_i, x_j\rangle\}_{i=1}^{c}, y_j\big)}_{\text{training data fitting}} + \lambda_h \underbrace{\sum_{j=1}^{L-1} \frac{1}{N_j(N_j-1)}\{E_s\}_j}_{T_h:\ \text{hyperspherical energy for hidden layers}} + \lambda_o \underbrace{\frac{1}{N_L(N_L-1)} E_s(\hat{w}^{\text{out}}_i|_{i=1}^{c})}_{T_o:\ \text{hyperspherical energy for the output layer}} \quad (4)

where x_j is the feature of the j-th training sample entering the output layer, w^{out}_i is the classifier neuron for the i-th class in the output fully-connected layer, and \hat{w}^{out}_i denotes its normalized version. \{E_s\}_i denotes the hyperspherical energy of the neurons in the i-th layer. c is the number of classes, m is the batch size, L is the number of layers of the neural network, and N_i is the number of neurons in the i-th layer. E_s(\hat{w}^{out}_i|_{i=1}^{c}) denotes the hyperspherical energy of the neurons \{\hat{w}^{out}_1, \dots, \hat{w}^{out}_c\}. The ℓ2 weight decay term is omitted here for simplicity, but we use it in practice. An alternative interpretation of MHE regularization from a decoupled view is given in Section 3.7 and Appendix C.

MHE has different effects and interpretations when regularizing hidden layers and output layers.

MHE for hidden layers. To make the neurons in hidden layers more discriminative and less redundant, we propose MHE as a form of regularization. MHE encourages the normalized neurons to be uniformly distributed on the unit hypersphere, partially inspired by the observation in [30] that angular differences between neurons preserve semantic (label-related) information. To some extent, MHE maximizes the average angular difference between neurons by minimizing the hyperspherical energy of the neurons in every hidden layer. For instance, in CNNs we minimize the hyperspherical energy of the kernels in convolutional and fully-connected layers except the output layer.

MHE for output layers. For the output layer, we propose to enhance inter-class feature separability with MHE, in order to learn discriminative and well-separated features.
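The two MHE terms of Eq. (4) can be assembled per layer as a plain function of the weight matrices. The sketch below is our own (hypothetical names `layer_energy` and `mhe_regularizer`, default log-kernel s = 0); the data-fitting loss ℓ and weight decay are omitted.

```python
import numpy as np

def layer_energy(W, s=0):
    """E_s over ordered pairs of the (row) neurons in W, after normalization."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    d = np.linalg.norm(W_hat[:, None, :] - W_hat[None, :, :], axis=-1)
    d = d[~np.eye(len(W), dtype=bool)]
    return float(np.sum(d ** (-s))) if s > 0 else float(np.sum(np.log(1.0 / d)))

def mhe_regularizer(hidden_weights, W_out, lam_h=1.0, lam_o=1.0, s=0):
    """MHE terms of Eq. (4): each layer's energy is divided by N_j (N_j - 1),
    the number of ordered neuron pairs in that layer."""
    reg = sum(lam_h * layer_energy(W, s) / (len(W) * (len(W) - 1))
              for W in hidden_weights)
    c = len(W_out)  # classifier neurons of the output layer
    return reg + lam_o * layer_energy(W_out, s) / (c * (c - 1))
```

In a training loop this scalar would simply be added to the task loss before backpropagation.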
For classification tasks, MHE regularization is complementary to the softmax cross-entropy loss in CNNs: the softmax loss focuses more on intra-class compactness, while MHE encourages inter-class separability. Therefore, MHE on the output layer can induce features with better generalization power.

3.4 MHE in Half Space

Figure 2: Half-space MHE (panels: original MHE vs. half-space MHE).

Directly applying the MHE formulation may still permit some redundancy. An example in Fig. 2, with two neurons in a 2-dimensional space, illustrates this potential issue: directly imposing the original MHE regularization leads to a solution in which the two neurons are colinear with opposite directions. To avoid such redundancy, we propose the half-space MHE regularization, which constructs virtual neurons and minimizes the hyperspherical energy of the original and virtual neurons together. Specifically, half-space MHE constructs a colinear virtual neuron with opposite direction for every existing neuron. We therefore end up minimizing the hyperspherical energy of 2N_i neurons in the i-th layer, i.e., minimizing E_s(\{\hat{w}_k, -\hat{w}_k\}_{k=1}^{N_i}). This half-space variant encourages the neurons to be less correlated and less redundant, as illustrated in Fig. 2. Note that half-space MHE can only be used in hidden layers, because colinear neurons do not constitute redundancy in output layers, as shown in [29]. Nevertheless, colinearity is unlikely to occur in high-dimensional spaces, especially when the neurons are optimized to fit training data; this may be why the original MHE regularization still consistently improves the baselines.

3.5 MHE beyond Euclidean Distance

The hyperspherical energy is originally defined based on the Euclidean distance between points on a hypersphere, which can be viewed as an angular measure.
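The half-space construction of Sec. 3.4 can be checked numerically. In the sketch below (our own construction and function names), half-space MHE scores the stacked set {w_k} ∪ {−w_k}; a colinear opposite pair, which plain MHE favors in 2D, then becomes maximally redundant.

```python
import numpy as np

def log_energy(W):
    """Logarithmic hyperspherical energy E_0 over ordered pairs i != j."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    d = np.linalg.norm(W_hat[:, None, :] - W_hat[None, :, :], axis=-1)
    d = np.maximum(d, 1e-12)  # keep coincident real/virtual neurons finite
    return float(np.sum(np.log(1.0 / d[~np.eye(len(W), dtype=bool)])))

def half_space_log_energy(W):
    """Half-space MHE: add the virtual neuron -w for every w, score all 2N."""
    return log_energy(np.vstack([W, -W]))

opposite = np.array([[1.0, 0.0], [-1.0, 0.0]])   # colinear, opposite directions
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0]])

# Plain MHE ranks the redundant opposite pair as lower-energy in 2D ...
plain_prefers_opposite = log_energy(opposite) < log_energy(orthogonal)
# ... while half-space MHE ranks the orthogonal pair as lower-energy.
half_prefers_orthogonal = (half_space_log_energy(orthogonal)
                           < half_space_log_energy(opposite))
```

The distance clamp is an implementation detail we add so that the opposite pair (whose virtual copies coincide with the real neurons) yields a large finite penalty instead of infinity.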
In addition to Euclidean distance, we further consider the geodesic distance on the unit hypersphere as a distance measure for neurons, which is exactly the angle between them. Specifically, we replace \|\hat{w}_i - \hat{w}_j\| with \arccos(\hat{w}_i^\top \hat{w}_j) in the hyperspherical energy. Following this idea, we propose angular MHE (A-MHE), where the hyperspherical energy is rewritten as

E^{a}_{s,d}(\hat{w}_i|_{i=1}^{N}) = \sum_{i \neq j} f_s(\arccos(\hat{w}_i^\top \hat{w}_j)) = \begin{cases} \sum_{i \neq j} \arccos(\hat{w}_i^\top \hat{w}_j)^{-s}, & s > 0 \\ \sum_{i \neq j} \log(\arccos(\hat{w}_i^\top \hat{w}_j)^{-1}), & s = 0 \end{cases} \quad (5)

which can be viewed as redefining MHE based on the geodesic distance on the hypersphere (i.e., the angle), and can be used as an alternative to the original hyperspherical energy E_s in Eq. (4). Note that A-MHE can also be learned in full space or half space, leading to variants similar to those of the original MHE. The key difference between MHE and A-MHE lies in the optimization dynamics, because their gradients w.r.t. the neuron weights are quite different. A-MHE is also more computationally expensive than MHE.

3.6 Mini-batch Approximation for MHE

With a large number of neurons in one layer, calculating MHE can be computationally expensive, as it requires computing the pairwise distances between all neurons. To address this issue, we propose a mini-batch version of MHE to approximate the MHE (either original or half-space) objective.

Mini-batch approximation for MHE on hidden layers. For hidden layers, the mini-batch approximation iteratively takes a random batch of neurons and minimizes their hyperspherical energy as an approximation to the full MHE. Note that the gradient of the mini-batch objective is an unbiased estimate of the original MHE gradient.

Data-dependent mini-batch approximation for output layers. For the output layer, the data-dependent mini-batch approximation iteratively takes the classifier neurons corresponding to the classes present in each mini-batch.
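This data-dependent scheme can be sketched as follows (the function name is our own); it computes the normalized sum that the text formalizes next, pairing each batch label's classifier neuron against all others.

```python
import numpy as np

def output_mhe_minibatch(W_out, labels, s=1.0):
    """Data-dependent mini-batch MHE for the output layer.

    For every label y_i in the batch, sum f_s over the distances from the
    classifier neuron w_{y_i} to all other classifier neurons, then
    normalize by m (N - 1).
    """
    W_hat = W_out / np.linalg.norm(W_out, axis=1, keepdims=True)
    N, m = len(W_out), len(labels)
    total = 0.0
    for y in labels:
        d = np.delete(np.linalg.norm(W_hat[y] - W_hat, axis=1), y)  # j != y_i
        total += float(np.sum(d ** (-s))) if s > 0 \
            else float(np.sum(np.log(1.0 / d)))
    return total / (m * (N - 1))
```

Only the neurons indexed by the batch labels receive a repulsive gradient in each iteration, which is what makes the approximation cheap when the number of classes is large.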
It minimizes \frac{1}{m(N-1)} \sum_{i=1}^{m} \sum_{j=1, j \neq y_i}^{N} f_s(\|\hat{w}_{y_i} - \hat{w}_j\|) in each iteration, where y_i denotes the class label of the i-th sample in the mini-batch, m is the mini-batch size, and N is the number of neurons (in one particular layer).

3.7 Discussions

Connections to scientific problems. Hyperspherical energy minimization has close relationships with classic scientific problems. When s = 1, Eq. (1) reduces to the Thomson problem [48, 43] (in physics), where one needs to determine the minimum electrostatic potential energy configuration of N mutually-repelling electrons on a unit sphere. When s = ∞, Eq. (1) becomes the Tammes problem [47] (in geometry), where the goal is to pack a given number of circles on the surface of a sphere such that the minimum distance between circles is maximized. When s = 0, Eq. (1) becomes Whyte's problem, where the goal is to maximize the product of Euclidean distances, as shown in Eq. (2). Our work aims to make use of important insights from these scientific problems to improve neural networks.

Understanding MHE from a decoupled view. Inspired by decoupled networks [27], we can view the original convolution as the multiplication of an angular function g(θ) = cos(θ) and a magnitude function h(\|w\|, \|x\|) = \|w\| \cdot \|x\|:

f(w, x) = h(\|w\|, \|x\|) \cdot g(θ)

where θ is the angle between the kernel w and the input x. From this equation, we can see that the norm of the kernel and the direction (i.e., angle) of the kernel affect the inner-product similarity differently. Typically, weight decay regularizes the kernel by minimizing its ℓ2 norm, while there is no regularization on the direction of the kernel. MHE completes this missing piece by promoting angular diversity.
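Returning to Eq. (5), the A-MHE energy differs from E_s only in the distance used, so it admits an equally direct transcription. The function name below is our own, and the clipping constant `eps` is an implementation detail we add to keep arccos (and its gradient) numerically stable near cos θ = ±1, not part of the paper's formulation.

```python
import numpy as np

def angular_energy(W, s=1.0, eps=1e-7):
    """A-MHE of Eq. (5): Riesz energy on the geodesic distance
    arccos(w_i^T w_j) instead of the Euclidean distance ||w_i - w_j||."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = np.clip(W_hat @ W_hat.T, -1.0 + eps, 1.0 - eps)  # stabilize arccos
    theta = np.arccos(cos)                                 # geodesic distances
    off_diag = theta[~np.eye(len(W), dtype=bool)]          # drop i == j terms
    if s > 0:
        return float(np.sum(off_diag ** (-s)))
    return float(np.sum(np.log(1.0 / off_diag)))
```

For an orthogonal pair the geodesic distance is π/2, so the s = 1 energy over the two ordered pairs is 2/(π/2) = 4/π.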
By combining MHE with a standard neural network, the entire regularization term becomes

L_{reg} = \lambda_w \underbrace{\frac{1}{\sum_{j=1}^{L} N_j} \sum_{j=1}^{L} \sum_{i=1}^{N_j} \|w^{(j)}_i\|^2}_{\text{weight decay: regularizing the magnitude of kernels}} + \underbrace{\lambda_h \sum_{j=1}^{L-1} \frac{1}{N_j(N_j-1)} \{E_s\}_j + \lambda_o \frac{1}{N_L(N_L-1)} E_s(\hat{w}^{\text{out}}_i|_{i=1}^{c})}_{\text{MHE: regularizing the direction of kernels}}

where λ_w, λ_h and λ_o are weighting hyperparameters for the three regularization terms and w^{(j)}_i denotes the i-th neuron in the j-th layer. From the decoupled view, MHE makes sense as a regularizer for neural networks, since it plays a role complementary and orthogonal to weight decay. More discussions are in Appendix C.

Comparison to orthogonality/angle-promoting regularizations. Promoting orthogonality or large angles between bases has been a popular way to encourage diversity. Probably the most related and widely used method is the orthonormal regularization [30], which minimizes \|W^\top W - I\|_F, where W denotes the weights of a group of neurons with each column being one neuron and I is an identity matrix. A similar regularization is the orthogonality regularization [36], which minimizes the sum of the cosine values between all pairs of kernel weights. These methods encourage kernels to be orthogonal to each other, while MHE does not: MHE encourages hyperspherical diversity among the kernels, which are not necessarily orthogonal to each other. [56] proposes an angular constraint on the angles between different kernels of the neural network but, quite differently from MHE, imposes this angular regularization as a hard constraint. Moreover, these methods model diversity regularization at a more local level, whereas MHE regularization models the problem in a more top-down manner.

Normalized neurons in MHE. From Eq. (1), one can see that the normalized neurons are used to compute MHE, because we aim to encourage diversity on the hypersphere. However, a natural question arises: what if we use the original (i.e., unnormalized) neurons to compute MHE?
First, combining the norms of the kernels (i.e., neurons) into MHE may lead to a trivial gradient descent direction: simply increasing the norms of all kernels. Suppose all kernel directions stay unchanged; increasing the norms of all kernels by a common factor can effectively decrease the MHE objective value. Second, coupling the kernel norms into MHE may conflict with weight decay, which aims to decrease them. Moreover, normalized neurons imply that all neurons are equally important, which matches the intuition in [28, 30, 27]. If we desire different importance for different neurons, we can also manually assign a fixed weight to each neuron. This may be useful when we already know that certain neurons are more important and we want them to stay relatively fixed: a neuron with a large weight tends to be updated less. We discuss this further in Appendix D.

4 Theoretical Insights

This section leverages a number of rigorous theoretical results from [38, 23, 12, 25, 11, 8, 54] and provides theoretical yet intuitive understandings of MHE.

4.1 Asymptotic Behavior

This subsection shows how the hyperspherical energy behaves asymptotically. Specifically, as N → ∞, the solution \hat{W}_N tends to be uniformly distributed on the hypersphere S^d when the hyperspherical energy defined in Eq. (1) attains its minimum.

Definition 1 (minimal hyperspherical s-energy). We define the minimal s-energy for N points on the unit hypersphere S^d = \{w \in \mathbb{R}^{d+1} : \|w\| = 1\} as

\varepsilon_{s,d}(N) := \inf_{\hat{W}_N \subset S^d} E_{s,d}(\hat{w}_i|_{i=1}^{N}) \quad (6)

where the infimum is taken over all possible configurations \hat{W}_N on S^d. Any configuration \hat{W}_N attaining the infimum is called an s-extremal configuration. Usually \varepsilon_{s,d}(N) is finite for every N, and \varepsilon_{s,d}(N) = 0 if N = 0, 1. We discuss the asymptotic behavior (N → ∞) in three cases: 0 < s < d, s = d, and s > d. We first write the energy integral as I_s(\mu) = \iint_{S^d \times S^d} \|u - v\|^{-s} \, d\mu(u)\, d\mu(v), which is taken over all probability measures \mu supported on S^d. With 0 < s < d, the energy integral is finite and is uniquely minimized by the uniform measure on S^d; with s \geq d, it diverges for every \mu, and the minimal energy must be normalized by p(N) = N^{1+s/d}. Particularly if 0