# Sigsoftmax: Reanalysis of the Softmax Bottleneck

Sekitoshi Kanai (NTT Software Innovation Center, Keio Univ.) kanai.sekitoshi@lab.ntt.co.jp
Yasuhiro Fujiwara (NTT Software Innovation Center) fujiwara.yasuhiro@lab.ntt.co.jp
Yuki Yamanaka (NTT Secure Platform Laboratories) yamanaka.yuki@lab.ntt.co.jp
Shuichi Adachi (Keio Univ.) adachi.shuichi@appi.keio.ac.jp

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

## Abstract

Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of the representational capacity of neural networks in language modeling (the softmax bottleneck). In this paper, we propose an output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the perspective of the output set of log-softmax and identify the cause of the softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which is composed of a multiplication of an exponential function and a sigmoid function. Sigsoftmax can break the softmax bottleneck. Experiments on language modeling demonstrate that sigsoftmax and mixture of sigsoftmax outperform softmax and mixture of softmax, respectively.

## 1 Introduction

Deep neural networks are used in many recent applications such as image recognition [17, 13], speech recognition [12], and natural language processing [24, 32, 7]. The high representational capacity and generalization performance of deep neural networks are achieved by many layers, activation functions, and regularization methods [26, 13, 31, 14, 10]. Although various model architectures are used in these applications, softmax is commonly used as the output activation function for modeling categorical probability distributions [4, 10, 13, 24, 32, 7, 12]. For example, in language modeling, softmax is employed to represent the probability of the next word over the vocabulary in a sentence. When using softmax, we train the model by minimizing the negative log-likelihood with a gradient-based optimization method. The gradient of the negative log-likelihood with softmax is easy to calculate and numerically stable [3, 4].

Even though softmax is widely used, few studies have attempted to improve its modeling performance [6, 8]. This is because deep neural networks with softmax are believed to have a universal approximation property. However, Yang et al. [34] recently revealed that softmax can be a bottleneck of representational capacity in language modeling. They showed that the representational capacity of a softmax-based model is restricted by the dimension of the hidden vector in the output layer. In language modeling, the dimension of the hidden vector is much smaller than the vocabulary size. As a result, the softmax-based model cannot completely learn the true probability distribution; this is called the softmax bottleneck. To break the softmax bottleneck, Yang et al. [34] proposed mixture of softmax (MoS), which mixes multiple softmax outputs. However, their analysis of softmax does not explicitly show why softmax can be a bottleneck. Furthermore, MoS is an additional layer or mixture model rather than an alternative activation function to softmax: MoS has learnable parameters and hyper-parameters.
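For intuition, the following NumPy snippet is a minimal sketch of the mixing operation behind MoS described above: several softmax distributions are combined with learned mixture weights. The per-component projection `Wk` and the mixture-weight logits `pi_logits` are illustrative assumptions, not the exact parameterization of [34].

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    # Standard softmax with a max shift for numerical stability.
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def mos(h, Wk, W, pi_logits):
    """Mixture of softmax: mix K softmax distributions with weights pi.

    h:         context vector, shape (d,)
    Wk:        K per-component projections, shape (K, d, d)   -- hypothetical
    W:         shared output weight matrix, shape (M, d)
    pi_logits: logits of the mixture weights, shape (K,)      -- hypothetical
    """
    pi = softmax(pi_logits)                # mixture weights, non-negative, sum to 1
    hk = np.tanh(Wk @ h)                   # component hidden vectors, shape (K, d)
    comps = softmax(hk @ W.T, axis=-1)     # K softmax distributions, shape (K, M)
    return pi @ comps                      # convex combination, shape (M,)

# Toy usage with arbitrary sizes
d, M, K = 8, 20, 3
p = mos(rng.standard_normal(d),
        rng.standard_normal((K, d, d)),
        rng.standard_normal((M, d)),
        rng.standard_normal(K))
print(p.shape, p.sum())  # (20,) and ~1.0
```

Because the output is a convex combination of valid distributions, it is itself a valid distribution, but the extra components and mixture weights are exactly the learnable parameters and hyper-parameters that the proposed sigsoftmax avoids.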
In this paper, we propose a novel output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the point of view of the output set (range) of a function and show why softmax can be a bottleneck. This paper reveals that (i) the softmax bottleneck occurs because softmax uses only exponential functions for nonlinearity and (ii) the range of log-softmax is a subset of a vector space whose dimension depends on the dimension of the input space. As alternative activation functions to softmax, we explore output functions composed of rectified linear unit (ReLU) and sigmoid functions. In addition, we propose sigsoftmax, which is composed of a multiplication of an exponential function and a sigmoid function. Sigsoftmax has desirable properties for output activation functions, e.g., the calculation of its gradient is numerically stable. More importantly, sigsoftmax can break the softmax bottleneck, and the range of softmax can be a subset of that of sigsoftmax. Experiments on language modeling demonstrate that sigsoftmax can break the softmax bottleneck and outperform softmax. In addition, mixture of sigsoftmax outperforms MoS.

## 2 Preliminaries

### 2.1 Softmax

Deep neural networks use softmax to learn categorical distributions. For example, in classification, a neural network uses softmax to learn the probability distribution over $M$ classes $y \in \mathbb{R}^M$ conditioned on the input $x$ as $P_\theta(y|x)$, where $\theta$ denotes the parameters. Let $h(x) \in \mathbb{R}^d$ be the hidden vector and $W \in \mathbb{R}^{M \times d}$ be the weight matrix in the output layer. The output of softmax $f_s(\cdot)$ represents the conditional probability of the $i$-th class as follows:

$$P_\theta(y_i|x) = [f_s(Wh(x))]_i = \frac{\exp([Wh(x)]_i)}{\sum_{m=1}^{M} \exp([Wh(x)]_m)}, \tag{1}$$

where $[f_s]_i$ represents the $i$-th element of $f_s$. Each element of $f_s$ is bounded between zero and one, since the exponential functions in eq. (1) are non-negative, and the summation of all elements of $f_s$ is obviously one. From these properties, we can regard the output of softmax trained by minimizing the negative log-likelihood as a probability [4, 21]. If we only need the most likely label, we can find it by comparing the elements of $Wh(x)$ without computing softmax $f_s(Wh(x))$ once the softmax-based model has been trained, because the exponential functions in softmax are monotonically increasing.

To train softmax-based models, the negative log-likelihood (cross entropy) is used as the loss function. Since the loss function is minimized by stochastic gradient descent (SGD), the properties of the gradients of functions are very important [26, 28, 9, 15]. One advantage of softmax is that the gradient of log-softmax is easily calculated as follows [3, 4, 1, 8]:

$$\frac{\partial [\log f_s(z)]_i}{\partial z_j} = \begin{cases} 1 - [f_s(z)]_j & \text{if } j = i,\\ -[f_s(z)]_j & \text{if } j \neq i, \end{cases} \tag{2}$$

where $z = Wh(x)$. Whereas the derivative of the logarithm can cause a division by zero, since $\frac{d \log(z)}{dz} = \frac{1}{z}$, the derivative of log-softmax cannot. As a result, softmax is numerically stable.

### 2.2 Softmax bottleneck

In recurrent neural network (RNN) language modeling, given a corpus of tokens $Y = (Y_1, \ldots, Y_T)$, the joint probability $P(Y)$ is factorized as $P(Y) = \prod_t P(Y_t \mid Y_{<t})$.
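The following NumPy sketch ties the definitions above together: it implements eq. (1), the log-softmax gradient of eq. (2), and a small numerical check consistent with the bottleneck argument, namely that the log-softmax outputs of many contexts lie in a low-dimensional subspace determined by the hidden dimension $d$. The sizes `M`, `d`, and `N` are arbitrary toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Eq. (1): exp(z_i) / sum_m exp(z_m); the max shift does not change the result.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_softmax_grad(z, i):
    # Eq. (2): d[log f_s(z)]_i / dz_j = 1 - [f_s(z)]_j if j == i, else -[f_s(z)]_j.
    g = -softmax(z)
    g[i] += 1.0
    return g  # always finite: no logarithm is differentiated directly

# Toy illustration of the softmax bottleneck:
# log f_s(W h) = W h - logsumexp(W h), so the log-probabilities of N contexts,
# stacked into an N x M matrix, have rank at most d + 1, no matter how large M is.
M, d, N = 1000, 16, 200
W = rng.standard_normal((M, d))           # output weight matrix
H = rng.standard_normal((N, d))           # one hidden vector h(x) per context
Z = H @ W.T                               # logits, shape (N, M)
Zs = Z - Z.max(axis=1, keepdims=True)
L = Zs - np.log(np.exp(Zs).sum(axis=1, keepdims=True))  # row-wise log-softmax
print(np.linalg.matrix_rank(L))           # <= d + 1 = 17, far below M = 1000
```

The rank printed in the last line stays at $d+1$ however many contexts are sampled, which is the sense in which the dimension of the hidden vector, rather than the vocabulary size, limits what a softmax output layer can express.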