# Sigsoftmax: Reanalysis of the Softmax Bottleneck

Sekitoshi Kanai (NTT Software Innovation Center, Keio Univ.) kanai.sekitoshi@lab.ntt.co.jp
Yasuhiro Fujiwara (NTT Software Innovation Center) fujiwara.yasuhiro@lab.ntt.co.jp
Yuki Yamanaka (NTT Secure Platform Laboratories) yamanaka.yuki@lab.ntt.co.jp
Shuichi Adachi (Keio Univ.) adachi.shuichi@appi.keio.ac.jp

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

## Abstract

Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of the representational capacity of neural networks in language modeling (the softmax bottleneck). In this paper, we propose an output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the perspective of the output set of log-softmax and identify the cause of the softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which is composed of a multiplication of an exponential function and a sigmoid function. Sigsoftmax can break the softmax bottleneck. Experiments on language modeling demonstrate that sigsoftmax and mixture of sigsoftmax outperform softmax and mixture of softmax, respectively.

## 1 Introduction

Deep neural networks are used in many recent applications such as image recognition [17, 13], speech recognition [12], and natural language processing [24, 32, 7]. The high representational capacity and generalization performance of deep neural networks are achieved by many layers, activation functions, and regularization methods [26, 13, 31, 14, 10]. Although various model architectures are used in these applications, softmax is commonly used as the output activation function for modeling categorical probability distributions [4, 10, 13, 24, 32, 7, 12]. For example, in language modeling, softmax is employed to represent the probability of the next word over the vocabulary in a sentence. When using softmax, we train the model by minimizing the negative log-likelihood with a gradient-based optimization method. The gradient of the negative log-likelihood with softmax is easy to calculate and numerically stable [3, 4].

Even though softmax is widely used, few studies have attempted to improve its modeling performance [6, 8]. This is because deep neural networks with softmax are believed to have a universal approximation property. However, Yang et al. [34] recently revealed that softmax can be a bottleneck of representational capacity in language modeling. They showed that the representational capacity of a softmax-based model is restricted by the dimension of the hidden vector in the output layer. In language modeling, the dimension of the hidden vector is much smaller than the vocabulary size. As a result, the softmax-based model cannot completely learn the true probability distribution; this is called the softmax bottleneck. To break the softmax bottleneck, Yang et al. [34] proposed mixture of softmax (MoS), which mixes multiple softmax outputs. However, their analysis of softmax does not explicitly show why softmax can be a bottleneck. Furthermore, MoS is an additional layer or mixture model rather than an alternative activation function to softmax: MoS has learnable parameters and hyper-parameters.
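For intuition, the following NumPy snippet is a minimal sketch of the mixing operation behind MoS described above: several softmax distributions are combined with learned mixture weights. The per-component projection `Wk` and the mixture-weight logits `pi_logits` are illustrative assumptions, not the exact parameterization of [34].

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    # Standard softmax with a max shift for numerical stability.
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def mos(h, Wk, W, pi_logits):
    """Mixture of softmax: mix K softmax distributions with weights pi.

    h:         context vector, shape (d,)
    Wk:        K per-component projections, shape (K, d, d)   -- hypothetical
    W:         shared output weight matrix, shape (M, d)
    pi_logits: logits of the mixture weights, shape (K,)      -- hypothetical
    """
    pi = softmax(pi_logits)                # mixture weights, non-negative, sum to 1
    hk = np.tanh(Wk @ h)                   # component hidden vectors, shape (K, d)
    comps = softmax(hk @ W.T, axis=-1)     # K softmax distributions, shape (K, M)
    return pi @ comps                      # convex combination, shape (M,)

# Toy usage with arbitrary sizes
d, M, K = 8, 20, 3
p = mos(rng.standard_normal(d),
        rng.standard_normal((K, d, d)),
        rng.standard_normal((M, d)),
        rng.standard_normal(K))
print(p.shape, p.sum())  # (20,) and ~1.0
```

Because the output is a convex combination of valid distributions, it is itself a valid distribution, but the extra components and mixture weights are exactly the learnable parameters and hyper-parameters that the proposed sigsoftmax avoids.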
In this paper, we propose a novel output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the point of view of the output set (range) of a function and show why softmax can be a bottleneck. This paper reveals that (i) the softmax bottleneck occurs because softmax uses only exponential functions for nonlinearity and (ii) the range of log-softmax is a subset of a vector space whose dimension depends on the dimension of the input space. As alternative activation functions to softmax, we explore output functions composed of rectified linear unit (ReLU) and sigmoid functions. In addition, we propose sigsoftmax, which is composed of a multiplication of an exponential function and a sigmoid function. Sigsoftmax has desirable properties for output activation functions, e.g., the calculation of its gradient is numerically stable. More importantly, sigsoftmax can break the softmax bottleneck, and the range of softmax can be a subset of that of sigsoftmax. Experiments on language modeling demonstrate that sigsoftmax can break the softmax bottleneck and outperform softmax. In addition, mixture of sigsoftmax outperforms MoS.

## 2 Preliminaries

### 2.1 Softmax

Deep neural networks use softmax to learn categorical distributions. For example, in classification, a neural network uses softmax to learn the probability distribution over $M$ classes $y \in \mathbb{R}^M$ conditioned on the input $x$ as $P_\theta(y|x)$, where $\theta$ denotes the parameters. Let $h(x) \in \mathbb{R}^d$ be the hidden vector and $W \in \mathbb{R}^{M \times d}$ be the weight matrix in the output layer. The output of softmax $f_s(\cdot)$ represents the conditional probability of the $i$-th class as follows:

$$P_\theta(y_i|x) = [f_s(Wh(x))]_i = \frac{\exp([Wh(x)]_i)}{\sum_{m=1}^{M} \exp([Wh(x)]_m)}, \tag{1}$$

where $[f_s]_i$ represents the $i$-th element of $f_s$. Each element of $f_s$ is bounded between zero and one, since the exponential functions in eq. (1) are non-negative, and the summation of all elements of $f_s$ is obviously one. From these properties, we can regard the output of softmax trained by minimizing the negative log-likelihood as a probability [4, 21]. If we only need the most likely label, we can find it by comparing the elements of $Wh(x)$ without computing softmax $f_s(Wh(x))$ once the softmax-based model has been trained, because the exponential functions in softmax are monotonically increasing.

To train softmax-based models, the negative log-likelihood (cross entropy) is used as the loss function. Since the loss function is minimized by stochastic gradient descent (SGD), the properties of the gradients of functions are very important [26, 28, 9, 15]. One advantage of softmax is that the gradient of log-softmax is easily calculated as follows [3, 4, 1, 8]:

$$\frac{\partial [\log f_s(z)]_i}{\partial z_j} = \begin{cases} 1 - [f_s(z)]_j & \text{if } j = i,\\ -[f_s(z)]_j & \text{if } j \neq i, \end{cases} \tag{2}$$

where $z = Wh(x)$. Whereas the derivative of the logarithm can cause a division by zero, since $\frac{d \log(z)}{dz} = \frac{1}{z}$, the derivative of log-softmax cannot. As a result, softmax is numerically stable.

### 2.2 Softmax bottleneck

In recurrent neural network (RNN) language modeling, given a corpus of tokens $Y = (Y_1, \ldots, Y_T)$, the joint probability $P(Y)$ is factorized as $P(Y) = \prod_t P(Y_t \mid Y_{<t})$.
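The following NumPy sketch ties the definitions above together: it implements eq. (1), the log-softmax gradient of eq. (2), and a small numerical check consistent with the bottleneck argument, namely that the log-softmax outputs of many contexts lie in a low-dimensional subspace determined by the hidden dimension $d$. The sizes `M`, `d`, and `N` are arbitrary toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Eq. (1): exp(z_i) / sum_m exp(z_m); the max shift does not change the result.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_softmax_grad(z, i):
    # Eq. (2): d[log f_s(z)]_i / dz_j = 1 - [f_s(z)]_j if j == i, else -[f_s(z)]_j.
    g = -softmax(z)
    g[i] += 1.0
    return g  # always finite: no logarithm is differentiated directly

# Toy illustration of the softmax bottleneck:
# log f_s(W h) = W h - logsumexp(W h), so the log-probabilities of N contexts,
# stacked into an N x M matrix, have rank at most d + 1, no matter how large M is.
M, d, N = 1000, 16, 200
W = rng.standard_normal((M, d))           # output weight matrix
H = rng.standard_normal((N, d))           # one hidden vector h(x) per context
Z = H @ W.T                               # logits, shape (N, M)
Zs = Z - Z.max(axis=1, keepdims=True)
L = Zs - np.log(np.exp(Zs).sum(axis=1, keepdims=True))  # row-wise log-softmax
print(np.linalg.matrix_rank(L))           # <= d + 1 = 17, far below M = 1000
```

The rank printed in the last line stays at $d+1$ however many contexts are sampled, which is the sense in which the dimension of the hidden vector, rather than the vocabulary size, limits what a softmax output layer can express.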