# Adaptive Convolutional ReLUs

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Hongyang Gao,¹ Lei Cai,² Shuiwang Ji¹
¹Texas A&M University, College Station, TX, USA
²Washington State University, Pullman, WA, USA
{hongyang.gao, sji}@tamu.edu, lei.cai@wsu.edu

## Abstract

Rectified linear units (ReLUs) are currently the most popular activation function used in neural networks. Although ReLUs can solve the gradient vanishing problem and accelerate training convergence, they suffer from the dying ReLU problem, in which some neurons are never activated if the weights are not updated properly. In this work, we propose a novel activation function, known as the adaptive convolutional ReLU (ConvReLU), that can better mimic brain neuron activation behaviors and overcome the dying ReLU problem. With our novel parameter sharing scheme, ConvReLUs can be applied to convolution layers, allowing each input neuron to be activated by different trainable thresholds without involving a large number of extra parameters. We employ a zero initialization scheme in ConvReLU to encourage the trainable thresholds to be close to zero. Finally, we develop a partial replacement strategy that only replaces the ReLUs in the early layers of the network. This resolves the dying ReLU problem and retains sparse representations for linear classifiers. Experimental results demonstrate that our proposed ConvReLU consistently outperforms ReLU, Leaky ReLU, and PReLU. In addition, the partial replacement strategy is shown to be effective not only for our ConvReLU but also for Leaky ReLU and PReLU.

## Introduction

Convolutional neural networks (CNNs) (LeCun et al. 1998b) have shown great capability in various fields such as computer vision (Ren et al. 2015; Laina et al. 2016) and natural language processing (Johnson and Zhang 2017). In CNNs, activation functions play an important role in introducing nonlinearity into networks. Among various activation functions such as tanh(·) and sigmoid(·), ReLU is the most popular one. ReLU computes the identity for positive arguments and outputs zero for negative ones. It was initially proposed for Boltzmann machines (Nair and Hinton 2010) and has been successfully applied to neural networks for its non-saturating property, which alleviates the gradient vanishing problem and accelerates convergence.

Though effective and efficient, ReLUs suffer from the dying ReLU problem, in which some neurons in the network are never activated again (Chen et al. 2017). Many attempts have been made to resolve or alleviate this problem, such as Leaky ReLUs (Maas, Hannun, and Ng 2013) and PReLUs (He et al. 2015). These works tackle the problem by using a small constant or trainable slope for negative arguments. However, the behaviors of such activation functions differ from how biological neurons are activated, as biological neurons actually employ different thresholds within a certain range (Islam and Islam 2016). In this work, we propose a simple but effective method known as adaptive ReLUs, which activates the arguments based on trainable thresholds. Based on the adaptive ReLUs, we develop a novel parameter sharing scheme that is especially applicable to convolution layers. This leads to adaptive convolutional ReLUs (ConvReLUs).
For the parameters involved in ConvReLUs, we employ zero initialization with L2-regularization, as it encourages the trainable thresholds to stay close to zero. Finally, we propose a partial replacement strategy for replacing ReLUs in the network, which relieves the dying ReLU problem and retains sparse representations for the linear classifiers in neural networks.

## Related Work

Activation functions have been a popular research topic due to their importance in deep neural networks. Initially, tanh(·) and sigmoid(·) were applied in neural networks, with tanh(·) being preferred for its zero-centering property (LeCun et al. 1998a). To solve the gradient vanishing problem suffered by saturating activation functions like tanh(·) and sigmoid(·), the non-saturating function ReLU (Nair and Hinton 2010) was proposed and successfully applied in various state-of-the-art deep neural networks (He et al. 2016; Huang et al. 2017; Gao et al. 2019). ReLUs compute the function

$$f(x) = \max(0, x), \tag{1}$$

which solves the gradient vanishing problem and accelerates convergence in training (Krizhevsky, Sutskever, and Hinton 2012). However, ReLU suffers from a problem known as the dying ReLU (Chen et al. 2017). In this scenario, a large weight update may cause a ReLU neuron to never be activated again, and thus the gradients that flow through those neurons will be zero.

Figure 1: Illustrations of ReLU, Leaky ReLU, PReLU, and the proposed adaptive ReLU (AdaReLU). ReLU computes the function $f(x_i) = \max(0, x_i)$. Leaky ReLU computes $f(x_i) = \max(\alpha x_i, x_i)$, where $\alpha$ is a small constant. PReLU computes the same function as Leaky ReLU with a trainable slope $\beta$. Our AdaReLU computes the function $f(x_i) = \max(\theta_i, x_i)$, where $\theta_i$ is a trainable parameter corresponding to $x_i$. AdaReLU can be naturally used in convolutional layers, leading to ConvReLU.

Leaky ReLU (Maas, Hannun, and Ng 2013) attempts to address the dying ReLU problem by employing a small slope instead of zero when $x < 0$. It computes the function

$$f(x) = \max(\alpha x, x), \tag{2}$$

where $\alpha$ is a small constant such as 0.01. The Parametric Rectified Linear Unit (PReLU) (He et al. 2015) generalizes this function by making $\alpha$ a trainable parameter. The ReLU, Leaky ReLU, and PReLU functions are illustrated in Figure 1. Other types of activation functions (Xu et al. 2015) have been proposed with different functional forms. Exponential Linear Units (ELUs) (Clevert, Unterthiner, and Hochreiter 2015) compute the function

$$f(x_i) = \begin{cases} x_i, & \text{if } x_i > 0 \\ \alpha(\exp(x_i) - 1), & \text{if } x_i \le 0. \end{cases} \tag{3}$$

Based on ELUs, Scaled Exponential Linear Units (SELUs) introduce self-normalizing properties (Klambauer et al. 2017). However, both ELUs and SELUs were designed for fully-connected layers, and their use in convolution layers is not clear.
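For reference, the sketch below implements the element-wise forms of Eqs. (1)-(3) together with the adaptive ReLU shown in Figure 1. It is a minimal PyTorch illustration written for this text rather than code from any of the cited works; the function names are ours, and making `alpha` trainable in `leaky_relu` corresponds to PReLU.

```python
import torch

def relu(x):
    # Eq. (1): identity for positive inputs, zero otherwise.
    return torch.clamp(x, min=0.0)

def leaky_relu(x, alpha=0.01):
    # Eq. (2): small constant slope alpha for negative inputs (a trainable alpha gives PReLU).
    return torch.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Eq. (3): smooth exponential saturation for negative inputs.
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1.0))

def ada_relu(x, theta):
    # Adaptive ReLU from Figure 1: every input has its own (trainable) threshold theta_i.
    return torch.maximum(x, theta)

x = torch.linspace(-3.0, 3.0, 7)
theta = torch.full_like(x, -0.5)  # example thresholds
print(relu(x), leaky_relu(x), elu(x), ada_relu(x, theta), sep="\n")
```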
## Adaptive Convolutional Rectified Linear Units

In this section, we first present the adaptive ReLU (AdaReLU). We then propose a parameter sharing mechanism designed especially for convolution layers, leading to convolutional ReLUs (ConvReLUs). In addition, we discuss parameter initialization and activation function replacement in neural networks for ConvReLUs.

### Background and Motivations

It is well known that neural networks mimic computational activities in biological brains. Each neuron, the basic computational unit in the brain, receives signals from its dendrites and outputs a signal when its strength is above a certain threshold, known as the threshold potential (Chen et al. 2006). Figure 2(a) illustrates a simplified version of this process, which is also called a feed-forward network with ReLUs. In the brain, the thresholds above which neurons fire are not the same but largely lie in the range between -50 and -55 mV (Seifter, Sloane, and Ratner 2005). This is different from the popular ReLU activation function, in which the same zero threshold is applied to every neuron, as illustrated in Figure 2(b). Supported by these biological observations (Seifter, Sloane, and Ratner 2005), we propose that each input neuron should have a different activation threshold, which can better mimic brain functions.

Figure 2: Illustrations of one-layer (a) and multi-layer (b) feed-forward networks. In the one-layer network (a), the output neuron $x_i$ is activated by ReLU after the element-wise multiplication and summation. But in the view of the multi-layer network (b), all input neurons $x_1$ and $x_2$ for the second layer are activated by the same activation function or threshold.

### Adaptive ReLU

To enable different thresholds for different input neurons, we propose the adaptive ReLU (AdaReLU) activation function, defined as

$$f(x_i) = \begin{cases} x_i, & \text{if } x_i > \theta_i \\ \theta_i, & \text{if } x_i \le \theta_i, \end{cases} \tag{4}$$

where $x_i$ is the argument of the activation function $f(\cdot)$ and $\theta_i$ is the corresponding threshold for input $x_i$. Note that $\theta_i$ is learned automatically from data. Figure 1 illustrates the adaptive ReLU. The subscript $i$ in $\theta_i$ indicates that the nonlinear activation function can vary for different inputs in terms of thresholds. This provides an extra degree of flexibility for its applications in neural networks. Each input neuron can have a different threshold, which is learned from data. In addition, the input neurons of the same channel or even the same layer can share the same threshold. The adaptive ReLU reduces to ReLU when $\theta_i = 0$ for all $i$.

AdaReLUs can be trained by back-propagation (LeCun et al. 1989) along with the other layers simultaneously. The update equation for the trainable parameters $\{\theta_i\}$ can be derived from the chain rule. For one layer, the gradient of $\theta_i$ can be expressed as

$$\frac{\partial E}{\partial \theta_i} = \sum_{x_i} \frac{\partial E}{\partial f(x_i)} \frac{\partial f(x_i)}{\partial \theta_i}, \tag{5}$$

where $E$ is the objective function and $\partial E / \partial f(x_i)$ represents the gradient back-propagated from the deeper layer. The summation runs over all positions on the feature map that employ $\theta_i$ as the activation threshold. The gradient of the activation function with respect to the threshold $\theta_i$ can be written as

$$\frac{\partial f(x_i)}{\partial \theta_i} = \begin{cases} 0, & \text{if } x_i > \theta_i \\ 1, & \text{if } x_i \le \theta_i. \end{cases} \tag{6}$$
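The following PyTorch sketch (ours, for illustration; not the authors' implementation) realizes Eq. (4) as a module with zero-initialized trainable thresholds and uses automatic differentiation to check the gradient in Eqs. (5)-(6):

```python
import torch
import torch.nn as nn

class AdaReLU(nn.Module):
    """Adaptive ReLU of Eq. (4): f(x_i) = max(theta_i, x_i) with trainable thresholds."""
    def __init__(self, num_thresholds):
        super().__init__()
        # One trainable threshold per position that shares it (zero-initialized).
        self.theta = nn.Parameter(torch.zeros(num_thresholds))

    def forward(self, x):
        return torch.maximum(x, self.theta)

# With an upstream gradient of ones, Eqs. (5)-(6) give dE/dtheta_i = 1 where
# x_i <= theta_i and 0 where x_i > theta_i; autograd reproduces this.
act = AdaReLU(num_thresholds=4)
x = torch.tensor([-2.0, -0.5, 0.3, 1.7])
act(x).sum().backward()
print(act.theta.grad)  # tensor([1., 1., 0., 0.])
```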
### Adaptive Convolutional ReLUs

By employing different thresholds for different input neurons, AdaReLUs may potentially incur a large number of extra parameters compared to ReLUs. For example, we could simply use a different threshold for each input neuron, thereby requiring a large number of extra parameters and increasing the risk of over-fitting. On the other hand, we can use parameter sharing schemes to reduce the number of extra parameters. For example, we can require the input neurons of the same layer to share one trainable threshold, as in PReLUs; the number of extra parameters involved in such a sharing scheme is negligible. In this work, we propose a new parameter sharing scheme that works especially well for convolutional layers. In our scheme, the input units sharing the same outgoing weight in convolution are also required to share the same threshold. We term this version of adaptive ReLUs convolutional ReLUs (ConvReLUs).

In convolution layers, each input unit is connected to multiple outgoing weights, depending on the sizes of the convolution kernels. For instance, in a 1D convolution layer with a kernel of size $k \times 1$, an input unit (not on the boundary) is used $k$ times, with $k$ different outgoing weights $w_i$ acting on it. In ConvReLUs, we have $k$ corresponding thresholds $\{\theta_1, \ldots, \theta_k\}$ for each input unit. When $w_i$ is acting on the input unit, a unit $x$ is activated by computing

$$f(x) = \max(\theta_i, x), \tag{7}$$

where $w_i$ and $\theta_i$ share the same index. Note that the trainable thresholds $\theta$ in ConvReLUs are shared across output channels but not input channels. Suppose we have $n_{in}$ input and $n_{out}$ output channels in a convolution layer; the number of trainable thresholds in ConvReLUs is then $n_{in} \times k$, while the number of weights is $n_{in} \times k \times n_{out}$. In this way, we set different thresholds for input neurons without inducing excessive extra parameters.

Figure 3 illustrates how ConvReLUs share parameters in 1D convolution layers. The same parameter sharing scheme can be extended to 2D and 3D convolution layers. The unit $x^{l}_{i,j,d}$ in output feature map $d$ is calculated as

$$x^{l}_{i,j,d} = \sum_{c=0}^{n_{in}-1} \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w_{a,b,c,d} \, \max\!\left(x^{l-1}_{i+a,\,j+b,\,c},\; \theta_{a,b,c}\right), \tag{8}$$

where a kernel of size $k \times k$ is used in this 2D convolution and $n_{in}$ is the number of input channels. $x^{l-1}$ and $x^{l}$ are the input and output of layer $l$, and $\theta$ and $w$ are the trainable thresholds and weights in our ConvReLU layer, respectively. In our ConvReLU, $\theta$ is shared across output feature maps. As a concrete 1D example, suppose we have an input $[2, 3, 4]$, weights $[1, -1]$, and thresholds $[0, 3.5]$. Applying a 1D ConvReLU without padding, the output is $[-1.5, -1]$.

Figure 3: Illustrations of parameter sharing of ConvReLUs in 1D convolution layers. The left figure (a) shows the sharing scheme in which each input unit has a different trainable threshold. Apparently, this scheme incurs a large number of extra parameters. When $\theta_i = 0$, ConvReLUs reduce to ReLUs. The right figure (b) illustrates our proposed parameter sharing scheme, in which the input units sharing the same outgoing weights also share the same threshold. The network in (b) is a decomposed version of the network in (a). Nodes with the same symbol are replicated versions of the same node for illustration purposes. In this example, we have three thresholds $\theta_i$ ($i = 1, 2, 3$) corresponding to the three weights $w_i$ ($i = 1, 2, 3$) used in this convolution layer. When computing $y_1$, the weight $w_2$ acts on the input unit $x_2$, so we apply threshold $\theta_2$ to $x_2$. When computing $y_2$, the threshold $\theta_1$ is applied to the unit $x_2$ since the weight $w_1$ is used.
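To make the sharing scheme concrete, the sketch below (our PyTorch illustration, not the authors' code; the class and parameter names are ours) implements a 1D ConvReLU layer following Eq. (7) and reproduces the worked example above. The 2D case of Eq. (8) is analogous, with $n_{in} \times k \times k$ thresholds applied to the $k \times k$ patches before the weighted sum.

```python
import torch
import torch.nn as nn

class ConvReLU1d(nn.Module):
    """1D convolution with the ConvReLU activation of Eq. (7): an input unit is
    thresholded by theta_i whenever weight w_i acts on it. Thresholds are shared
    across output channels, so there are only n_in * k of them."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels, kernel_size) * 0.05)
        self.theta = nn.Parameter(torch.zeros(in_channels, kernel_size))  # zero-initialized thresholds
        self.k = kernel_size

    def forward(self, x):                                   # x: (N, C_in, L)
        n, c, length = x.shape
        # Sliding windows of size k (stride 1, no padding): (N, C_in, L_out, k).
        windows = x.unfold(dimension=2, size=self.k, step=1)
        # Threshold every window element with the theta of its kernel position.
        activated = torch.maximum(windows, self.theta.view(1, c, 1, self.k))
        # Weighted sum over input channels and kernel positions for each output channel.
        return torch.einsum('nclk,dck->ndl', activated, self.weight)

# Reproduce the worked example: input [2, 3, 4], weights [1, -1], thresholds [0, 3.5].
layer = ConvReLU1d(in_channels=1, out_channels=1, kernel_size=2)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[[1.0, -1.0]]]))
    layer.theta.copy_(torch.tensor([[0.0, 3.5]]))
x = torch.tensor([[[2.0, 3.0, 4.0]]])
print(layer(x))  # tensor([[[-1.5000, -1.0000]]])
```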
### Parameter Initialization

We observe that parameter initialization has an impact on model performance. In this section, we discuss a few methods for initializing the parameters involved in our proposed ConvReLUs.

Random initialization: In deep neural networks, the weights are commonly initialized randomly from Gaussian or uniform distributions with zero mean (Krizhevsky, Sutskever, and Hinton 2012). Compared to zero initialization, random initialization serves as a symmetry-breaking tool that makes every neuron perform a different computation.

Glorot initialization: Glorot and Bengio (Glorot and Bengio 2010) used a scaled uniform distribution for parameter initialization, which is also known as the Xavier initialization. In this scheme, the weights are drawn from a distribution with zero mean and a specific variance. The variance recommended in that work is $\text{var}(w) = 2/(n_{in} + n_{out})$, where $n_{in}$ and $n_{out}$ are the numbers of input and output channels, respectively. However, this recommendation is based on the assumption that the activation functions are linear, which is not the case for ReLUs and their variants (He et al. 2015).

Zero initialization: Compared to the previous two initialization methods, zero initialization is rarely used in neural networks, since neurons initialized to zero perform the same computation. However, the parameters in our proposed ConvReLUs correspond to activation thresholds, and we expect the thresholds to be close to, but not necessarily exactly equal to, zero. From this point of view, initializing these trainable thresholds to zero is a valid strategy. We observe from our experiments that zero initialization with L2-regularization performs the best.

### Activation Functions in Deep Multi-Layer Networks

In previous studies on improving ReLUs (Maas, Hannun, and Ng 2013; He et al. 2015; Klambauer et al. 2017), the ReLUs in all layers are completely replaced by the proposed new activation functions. Despite the possibility of resulting in dying neurons in the network, ReLUs have the advantage of inducing higher sparsity in the computed features compared to other activation functions, which may consequently reduce the risk of over-fitting (Glorot, Bordes, and Bengio 2011). Although these other activation functions can overcome the dying ReLU problem, the sparsity of the outputs is significantly reduced, especially for the final linear classifier layer. This may increase the risk of over-fitting.

To alleviate the dying ReLU problem and also preserve sparsity patterns in the computed features, we propose a partial replacement strategy, in which only the first several ReLU layers in the network are replaced by our proposed ConvReLUs. The use of ConvReLUs in early layers ensures that the neurons are activated by some non-zero values. The remaining ReLU layers in the top part of the network provide sparse feature representations for the final linear classifiers, thus avoiding the over-fitting problem. In this work, we observe that our proposed partial replacement strategy yields better performance not only for our proposed ConvReLUs, but also for Leaky ReLUs and PReLUs. We provide detailed experiments and discussions in the experimental studies.
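As a schematic illustration of the partial replacement strategy (our sketch, not the authors' code; a single scalar threshold per activation is used here for brevity, whereas the paper ties thresholds to kernel positions as in the ConvReLU layer sketched earlier), one can swap only the ReLUs of an early block for trainable-threshold activations and leave the later ReLUs untouched:

```python
import torch
import torch.nn as nn

class ThresholdReLU(nn.Module):
    """Simplified trainable-threshold activation: max(x, theta) with a scalar,
    zero-initialized theta (L2-regularized through the optimizer's weight decay)."""
    def __init__(self):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return torch.maximum(x, self.theta)

def replace_relus(module):
    """Recursively swap every nn.ReLU inside `module` for ThresholdReLU.
    Applying this only to the first block keeps later layers as plain ReLUs,
    so the features fed to the final linear classifier stay sparse."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, ThresholdReLU())
        else:
            replace_relus(child)

# Hypothetical usage: `densenet.block1` stands for the first dense block.
# replace_relus(densenet.block1)
```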
## Experimental Studies

In this section, we evaluate our proposed ConvReLU activation function on both image classification and text classification tasks. We conduct experiments to compare ConvReLUs with popular rectified activation functions, including ReLUs, Leaky ReLUs, and PReLUs. In addition, performance studies are used to compare three parameter initialization strategies and the replacement strategies.

Table 1: Comparison between ConvReLU and other activation functions in terms of top-1 accuracy on image classification datasets, including Cifar10, Cifar100, and Tiny ImageNet.

| Function | Cifar10 | Cifar100 | Tiny ImageNet |
|---|---|---|---|
| ReLU | 94.29% | 75.87% | 56.31% |
| Leaky ReLU | 94.45% | 76.06% | 56.38% |
| Leaky ReLU (A) | 94.32% | 75.29% | 55.97% |
| PReLU | 94.58% | 75.78% | 56.09% |
| PReLU (A) | 94.03% | 74.87% | 54.38% |
| ConvReLU | 94.72% | 76.41% | 56.47% |

Table 2: Comparison between ConvReLU and other activation functions in terms of top-1 accuracy on text classification datasets, including the MR, AG's News, and Yelp Full datasets.

| Function | MR | AG's News | Yelp Full |
|---|---|---|---|
| ReLU | 78.51% | 88.64% | 62.69% |
| Leaky ReLU | 77.48% | 88.98% | 62.93% |
| Leaky ReLU (A) | 78.33% | 88.55% | 62.69% |
| PReLU | 77.19% | 88.77% | 63.03% |
| PReLU (A) | 79.83% | 88.52% | 62.94% |
| ConvReLU | 80.39% | 89.13% | 63.19% |

We evaluate our methods on six datasets: three for image classification tasks and three for text classification tasks.

Image classification datasets: For image classification tasks, we use three image datasets: Cifar10, Cifar100 (Krizhevsky, Hinton, and others 2009), and Tiny ImageNet (Yao and Miller 2015). Cifar10 and Cifar100 contain natural images with 32×32 pixels. Cifar10 consists of images from 10 classes, while the images in Cifar100 are drawn from 100 classes. Both datasets contain 50,000 training and 10,000 testing images. The Tiny ImageNet dataset is a small version of the ImageNet dataset (Deng et al. 2009). It has 200 classes, each of which contains 500 training, 50 validation, and 50 testing images.

Text classification datasets: For text classification tasks, we choose three datasets: MR, AG's News, and Yelp Full. MR is a movie review dataset (Pang and Lee 2005), which includes positive and negative reviews for sentiment classification. Each sample in MR is a sentence with a positive or negative sentiment label. AG's News is a topic classification dataset with four topics: World, Sports, Business, and Sci/Tech (Zhang, Zhao, and LeCun 2015). Yelp Full is formed based on the Yelp Dataset Challenge 2015 (Zhang, Zhao, and LeCun 2015).

### Experimental Setup

For the image and text classification tasks, we use different settings in terms of the model architecture.

Image classification settings: For image tasks, we mainly use the DenseNet architecture (Huang et al. 2017), which achieves state-of-the-art performance on various image classification tasks, including the ILSVRC 2012 challenge (Deng et al. 2009). On all three image datasets, we use DenseNet-40 as illustrated in Figure 4, with minor adjustments to accommodate the different datasets. The network includes three dense blocks, each with a depth of 8 and a growth rate of 12. During training, the standard data augmentation scheme widely used in (Huang et al. 2017; Simonyan and Zisserman 2015; He et al. 2016) is applied on these image datasets for fair comparison.

Figure 4: The DenseNet-40 used for image classification tasks. There are three dense blocks, each of which has 8 layers with a growth rate of 12.

Text classification settings: For text classification tasks, we employ the state-of-the-art VGG-like architecture of (Zhang, Zhao, and LeCun 2015) without using any unsupervised learning method. This VGG-like network has 6 convolution layers, 3 pooling layers, and 3 fully-connected layers. Note that more recent models like ResNet and DenseNet have not achieved better performance on these tasks.

The following setup is shared by both experimental settings. In training, the SGD optimizer (LeCun, Bengio, and Hinton 2015) is used with a learning rate that starts at 0.1 and decays by a factor of 0.1 at the 150th and 250th epochs. The batch size is 128. These hyper-parameters are tuned on the Cifar10 and AG's News datasets and then applied to the other datasets.
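The shared training setup above maps onto a standard SGD schedule; the sketch below is our illustration only. The momentum and weight-decay values are assumptions not stated in the text (the L2-regularization on the thresholds is applied through weight decay here).

```python
import torch

def make_optimizer(model):
    # SGD with initial learning rate 0.1, decayed by 0.1 at epochs 150 and 250,
    # as described in the text; momentum and weight decay are assumed values.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[150, 250], gamma=0.1)
    return optimizer, scheduler

# A batch size of 128 is used when building the DataLoader, and scheduler.step()
# is called once per epoch.
```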
### Comparison of ConvReLU with Other Activation Functions

Based on the DenseNet-40 and VGG networks, we compare our proposed ConvReLUs with other activation functions on both image and text classification tasks. The results are summarized in Tables 1 and 2 for the image and text datasets, respectively. In the experiments using ConvReLUs, we replace the ReLUs in the first dense block by ConvReLUs for image tasks, and replace the ReLUs in the first two convolution layers by ConvReLUs for text tasks. For both Leaky ReLUs and PReLUs, we evaluate two replacement strategies: one using the same replacement strategy as ConvReLU, and the other using Leaky ReLUs or PReLUs in all layers. For the Leaky ReLU and PReLU experiments, we add (A) to indicate experiments using Leaky ReLU or PReLU in all layers.

We can observe from both Tables 1 and 2 that our proposed ConvReLUs achieve consistently better performance than ReLUs, Leaky ReLUs, and PReLUs. The baseline values listed for the text classification tasks are the state-of-the-art without using unsupervised learning methods. Note that some studies (Zhang, Zhao, and LeCun 2015; Johnson and Zhang 2017) reported better results on these datasets by employing unsupervised learning methods. Since our method is orthogonal to these methods, we use the results without unsupervised learning for simplicity. These results demonstrate the effectiveness of our methods in both the computer vision and text analysis fields. In contrast, Leaky ReLUs and PReLUs perform better than ReLUs only on some, not all, of the datasets. Given that we do not change the model architectures and only add several thousand training parameters, the performance improvements over other activation functions are significant.

Figure 5: Comparison of top-1 accuracy curves on the testing dataset of Cifar100 for ReLU, Leaky ReLU, PReLU, and ConvReLU. The symbol (A) in the legend indicates experiments with all layers using the corresponding activation function.

Table 3: Comparison of ConvReLU with ReLU on popular networks in terms of top-1 accuracy on the image classification datasets Cifar10 and Cifar100.

| Network | Cifar10 (ReLU) | Cifar10 (ConvReLU) | Cifar100 (ReLU) | Cifar100 (ConvReLU) |
|---|---|---|---|---|
| VGG-16 | 93.71% | 94.01% | 73.31% | 74.04% |
| ResNet-18 | 94.37% | 94.48% | 75.17% | 75.97% |
| DenseNet-40 | 94.29% | 94.72% | 75.87% | 76.41% |

Notably, both Leaky ReLUs and PReLUs with the partial replacement strategy achieve better performance than their full replacement counterparts on most datasets. This shows that the proposed partial replacement strategy can not only overcome the dying ReLU problem but also retain sparse representations in the network, and that it is effective for Leaky ReLUs and PReLUs. In the following studies, we show that this partial replacement strategy also benefits our ConvReLUs.

Figure 5 shows the test accuracy curves of the different activation functions on the Cifar100 dataset. We can easily observe from the figure that the performance of our proposed ConvReLU is consistently better than the other activation functions by a margin of about 0.5%. Both Leaky ReLU and PReLU with the partial replacement strategy outperform their full replacement versions by a margin of about 1%. This again demonstrates the effectiveness of the partial replacement strategy.

### Performance Study on Popular Networks

Our previous experiments on image classification tasks are mainly based on DenseNet-40. It could be argued that our proposed activation function is only effective on this model architecture.
In this section, we investigate the performance of ConvReLUs on other popular networks such as VGG (Simonyan and Zisserman 2015) and ResNet (He et al. 2016). We compare the performance of ConvReLUs with ReLUs on the Cifar10 and Cifar100 datasets, and the results are summarized in Table 3. We can observe from these results that our proposed ConvReLUs are consistently better than ReLUs on both datasets using three popular deep networks. These deep networks employ different architecture designs and have different advantages. These results demonstrate the superiority of our ConvReLUs over ReLUs on multiple networks and datasets.

### Performance Study of Different Initialization Methods

Based on DenseNet-40, we compare the three initialization methods discussed in the Parameter Initialization section on the Cifar10 and Cifar100 datasets. For all three initialization methods, we employ L2-regularization to encourage the learned thresholds to be small. The results are summarized in Table 4. We can observe from the results that the zero initialization method outperforms the random and Glorot initialization methods on both datasets.

Table 4: Comparison of three initialization methods in terms of top-1 accuracy on the image classification datasets Cifar10 and Cifar100.

| Dataset | Zero Init | Random Init | Glorot Init |
|---|---|---|---|
| Cifar10 | 94.72% | 94.55% | 94.45% |
| Cifar100 | 76.41% | 75.81% | 75.44% |

Figure 6 shows the distribution of the learned threshold parameters in ConvReLUs. The histogram in the figure demonstrates that the learned thresholds are very close to, but not equal to, zero. This is consistent with our expectation in using zero initialization. This simple parameter initialization method is thus very suitable for our proposed ConvReLUs.

Figure 6: The distribution of threshold values learned in ConvReLUs based on DenseNet-40.

### Performance Study of Replacement Strategies

We investigate the performance of ConvReLU with different replacement strategies. We develop three replacement strategies: R1, R2, and R3. For strategy R1, we replace ReLUs with ConvReLUs in the convolution layers of the first block of DenseNet-40. For strategy R2, we use ConvReLUs instead of ReLUs in the last block. For strategy R3, we replace all ReLUs by our proposed ConvReLUs. We test these replacement strategies with DenseNet-40 on the Cifar10 dataset. The results are summarized in Table 5.

Table 5: Comparison of the three replacement strategies in terms of top-1 accuracy on the image classification dataset Cifar10.

| Dataset | R1 (First) | R2 (Last) | R3 (All) |
|---|---|---|---|
| Cifar10 | 94.72% | 94.34% | 94.55% |

We can see from the results that the first replacement strategy performs the best among the three strategies. The observations in Figure 5 demonstrate that the first replacement strategy also benefits Leaky ReLUs and PReLUs. From these results, replacing the ReLUs in the early part of the network can overcome the dying ReLU problem and retain sparse representations for the linear classifiers at the same time.

### Parameter Number Study

Since our proposed ConvReLU activation function involves more parameters than ReLUs, Leaky ReLUs, and PReLUs, we study the number of additional parameters for ConvReLUs based on DenseNet-40. The results are given in Table 6.
Table 6: Comparison of ConvReLU with other activation functions in terms of parameter numbers, based on DenseNet-40 on the Cifar10 dataset.

| Function | #Params | Ratio |
|---|---|---|
| ReLU | 1,035,268 | 0.00% |
| Leaky ReLU | 1,035,268 | 0.00% |
| Leaky ReLU (A) | 1,035,268 | 0.00% |
| PReLU | 1,035,280 | 0.00% |
| PReLU (A) | 1,035,387 | 0.01% |
| ConvReLU | 1,038,184 | 0.28% |

We can observe from the results that ConvReLUs need only 0.28% or fewer additional parameters compared to ReLUs and the other activation functions. We believe this marginal increase in the number of parameters is negligible and will not cause over-fitting. This is a result of the parameter sharing scheme used in ConvReLUs, which allows each input neuron to be activated by different thresholds while avoiding a large number of extra parameters.

## Conclusion

In this work, we propose adaptive ReLUs and their variant, ConvReLUs, to solve the dying ReLU problem suffered by ReLUs. The dying ReLU problem is mostly caused by the zero activation for negative arguments. By making the thresholds trainable instead of fixed at zero, adaptive ReLUs allow each input neuron to be activated by different trainable thresholds. In contrast to common parameter sharing methods such as the layer-wise sharing used in PReLUs, we propose a novel parameter sharing scheme in which the trainable threshold parameters are shared based on the weights in convolution layers, thereby leading to our ConvReLUs. When computing with a specific weight in the convolution layer, the input neuron is activated by its corresponding threshold. In this way, each input neuron is acted on by different thresholds without involving excessive extra parameters in the neural network. The experimental results on image and text classification tasks demonstrate consistent performance improvements of ConvReLUs compared to ReLUs, Leaky ReLUs, and PReLUs. For the extra parameters involved in ConvReLUs, we propose to use the zero initialization method with L2-regularization such that the trainable thresholds are close to, but not equal to, zero. Both the quantitative results and the histogram of threshold values confirm our intuitions and expectations regarding the zero initialization method. Finally, we propose the partial replacement strategy, which helps to solve the dying ReLU problem and retain sparse representations for linear classifiers in neural networks. Our results indicate that the partial replacement strategy can help not only our ConvReLUs but also Leaky ReLUs and PReLUs, which demonstrates its broad applicability.

## Acknowledgements

This work was supported in part by National Science Foundation grants IIS-1908166 and DBI-1661289.

## References

Chen, N.; Chen, X.; Yu, J.; and Wang, J. 2006. Afterhyperpolarization improves spike programming through lowering threshold potentials and refractory periods mediated by voltage-gated sodium channels. Biochemical and Biophysical Research Communications 346(3):938-945.

Chen, J.; Sathe, S.; Aggarwal, C.; and Turaga, D. 2017. Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM International Conference on Data Mining, 90-98. SIAM.

Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Gao, H.; Yuan, H.; Wang, Z.; and Ji, S. 2019. Pixel transposed convolutional networks. IEEE Transactions on Pattern Analysis & Machine Intelligence 1(1):1-1.
Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249-256.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315-323.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026-1034.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700-4708.

Islam, M. M., and Islam, N. 2016. Measuring threshold potentials of neuron cells using Hodgkin-Huxley model by applying different types of input signals. Dhaka University Journal of Science 64(1):15-20.

Johnson, R., and Zhang, T. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 562-570.

Klambauer, G.; Unterthiner, T.; Mayr, A.; and Hochreiter, S. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, 972-981.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.

Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; and Navab, N. 2016. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision, 239-248. IEEE.

LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.

LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541-551.

LeCun, Y.; Bottou, L.; Orr, G.; and Müller, K. 1998a. Efficient backprop. In Orr, G., and Müller, K., eds., Neural Networks: Tricks of the Trade. Springer.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998b. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer.

Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807-814.

Pang, B., and Lee, L. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 115-124. Association for Computational Linguistics.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91-99.

Seifter, J.; Sloane, D.; and Ratner, A. 2005. Concepts in Medical Physiology. Lippincott Williams & Wilkins.
Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations.

Xu, B.; Wang, N.; Chen, T.; and Li, M. 2015. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.

Yao, L., and Miller, J. 2015. Tiny ImageNet classification with convolutional neural networks. CS 231N.

Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 649-657.