Published as a conference paper at ICLR 2020

ENHANCING ADVERSARIAL DEFENSE BY k-WINNERS-TAKE-ALL

Chang Xiao, Peilin Zhong, Changxi Zheng
Columbia University
{chang, peilin, cxz}@cs.columbia.edu

ABSTRACT

We propose a simple change to existing neural network structures for better defending against gradient-based adversarial attacks. Instead of using popular activation functions (such as ReLU), we advocate the use of k-Winners-Take-All (k-WTA) activation, a C0 discontinuous function that purposely invalidates the neural network model's gradient at densely distributed input data points. The proposed k-WTA activation can be readily used in nearly all existing networks and training methods with no significant overhead. Our proposal is theoretically rationalized. We analyze why the discontinuities in k-WTA networks can largely prevent gradient-based search of adversarial examples and why they at the same time remain innocuous to the network training. This understanding is also empirically backed. We test k-WTA activation on various network structures optimized by a training method, be it adversarial training or not. In all cases, the robustness of k-WTA networks outperforms that of traditional networks under white-box attacks.

1 INTRODUCTION

In the tremendous success of deep learning techniques, there is a grain of salt. It has become well-known that deep neural networks can be easily fooled by adversarial examples (Szegedy et al., 2014). Those deliberately crafted input samples can mislead the networks to produce an output drastically different from what we expect. In many important applications, from face recognition authorization to autonomous cars, this vulnerability gives rise to serious security concerns (Barreno et al., 2010; 2006; Sharif et al., 2016; Thys et al., 2019).

Attacking the network is straightforward. Provided a labeled data item (x, y), the attacker finds a perturbed input x' perceptually similar to x but misleading enough to cause the network to output a label different from y. By far, the most effective way of finding such a perturbation (or adversarial example) is by exploiting the gradient information of the network with respect to its input: the gradient indicates how to perturb x to trigger the maximal change of y.

The defense, however, is challenging. Recent studies showed that adversarial examples always exist if one tends to pursue a high classification accuracy; adversarial robustness seems at odds with the accuracy (Tsipras et al., 2018; Shafahi et al., 2019a; Su et al., 2018a; Weng et al., 2018; Zhang et al., 2019). This intrinsic difficulty of eliminating adversarial examples suggests an alternative path: can we design a network whose adversarial examples are evasive rather than eliminated? Indeed, along with this thought is a series of works using obfuscated gradients as a defense mechanism (Athalye et al., 2018). Those methods hide the network's gradient information by artificially discretizing the input (Buckman et al., 2018; Lin et al., 2019) or introducing certain randomness to the input (Xie et al., 2018a; Guo et al., 2018) or the network structure (Dhillon et al., 2018; Cohen et al., 2019) (see more discussion in Sec. 1.1). Yet, the hidden gradient in those methods can still be approximated, and as recently pointed out by Athalye et al. (2018), those methods remain vulnerable.

Technical contribution I) Rather than obfuscating the network's gradient, we make the gradient undefined.
This is achieved by a simple change to the standard neural network structure: we advocate the use of a C0 discontinuous activation function, namely the k-Winners-Take-All (k-WTA) activation, to replace the popular activation functions such as rectified linear units (ReLU). This is the only change we propose to a deep neural network. All other components (such as BatchNorm, convolution, and pooling) as well as the training methods remain unaltered. With no significant overhead, k-WTA activation can be readily used in nearly all existing networks and training methods.

k-WTA activation takes as input the entire output of a layer, retains its k largest values and deactivates all others to zero. As we will show in this paper, even an infinitesimal perturbation to the input may cause a complete change to the network neurons' activation pattern, thereby resulting in a large jump in the network's output. This means that, mathematically, if we use f(x; w) to denote a k-WTA network taking an input x and parameterized by weights w, then the gradient ∇x f(x; w) at certain x is undefined: f(x; w) is C0 discontinuous.

Figure 1: 1D illustration. Fit a 1D function (green dotted curve) using a k-WTA model provided with a set of points (red). The resulting model is piecewise continuous (blue curve), and the discontinuities can be dense.

Technical contribution II) More intriguing than the mere replacement of the activation function is why k-WTA helps improve the adversarial robustness. We offer our theoretical reasoning of its behavior from two perspectives. On the one hand, we show that the discontinuities of f(x; w) are densely distributed in the space of x. Dense enough that a tiny perturbation from any x almost always comes across some discontinuities, where the gradients are undefined and thus the attacker's search of adversarial examples becomes blinded (see Figure 1). On the other hand, a paradox seemingly exists. The discontinuities in the activation function also render f(x; w) discontinuous with respect to the network weights w (at certain w values). But training the network relies on the presumption that the gradient with respect to the weights is almost always available. Interestingly, we show that, under k-WTA activation, the discontinuities of f(x; w) are rather sparse in the space of w, intuitively because the dimension of w (in parameter space) is much larger than the dimension of x (in data space). Thus, the network can be trained successfully.

Summary of results. We conducted extensive experiments on multiple datasets under different network architectures, including ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and Wide ResNet (Zagoruyko & Komodakis, 2016), that are optimized by regular training as well as various adversarial training methods (Madry et al., 2017; Zhang et al., 2019; Shafahi et al., 2019b). In all these setups, we compare the robustness performance of using the proposed k-WTA activation with commonly used ReLU activation under several white-box attacks, including PGD (Kurakin et al., 2016), Deepfool (Moosavi-Dezfooli et al., 2016), C&W (Carlini & Wagner, 2017), MIM (Dong et al., 2018), and others. In all tests, k-WTA networks outperform ReLU networks. The use of k-WTA activation is motivated by the goal of defending against gradient-based adversarial attacks.
Our experiments suggest that the robustness improvement gained by simply switching to k-WTA activation is universal, not tied to specific network architectures or training methods. To promote reproducible research, we will release our implementation of k-WTA networks, along with our experiment code, configuration files and pre-trained models.¹

1.1 RELATED WORK: OBFUSCATED GRADIENTS AS A DEFENSE MECHANISM

Before delving into k-WTA details, we review prior adversarial defense methods that share the same philosophy as our method and highlight our advantages. For a review of other attack and defense methods, we refer to Appendix A.

Methods aiming to conceal the gradient information from the attacker have been termed obfuscated gradients (Athalye et al., 2018) or gradient masking (Papernot et al., 2017; Tramèr et al., 2017) techniques. One type of such methods exploits randomness, either randomly transforming the input before feeding it to the network (Xie et al., 2018a; Guo et al., 2018) or introducing stochastic layers in the network (Dhillon et al., 2018). However, the gradient information in these methods can be estimated by taking the average over multiple trials (Athalye et al., 2018; 2017). As a result, these methods are vulnerable.

Another type of obfuscated gradient methods relies on the so-called shattered gradient (Athalye et al., 2018), which aims to make the network gradients nonexistent or incorrect to the attacker, by purposely discretizing the input (Buckman et al., 2018; Ma et al., 2018) or artificially raising numerical instability for gradient evaluation (Song et al., 2018; Samangouei et al., 2018). Unfortunately, these methods are also vulnerable. As shown by Athalye et al. (2018), they can be compromised by backward pass differentiable approximation (BPDA). Suppose fi(x) is a non-differentiable component of a network expressed by f = f1 ∘ f2 ∘ · · · ∘ fn. The gradient ∇x f can be estimated as long as one can find a smooth delegate g that approximates fi well (i.e., g(x) ≈ fi(x)).

In stark contrast to all those methods, a slight change of the k-WTA activation pattern in an earlier layer of a network can cause a radical reorganization of activation patterns in later layers (as shown in Sec. 3). Thereby, k-WTA activation not just obfuscates the network's gradients but destroys them at certain input samples, introducing discontinuities densely distributed in the input data space. We are not aware of any possible smooth approximation of a k-WTA network to launch BPDA attacks.

¹ https://github.com/a554b554/kWTA-Activation

Figure 2: Different activation functions. ReLU: all neurons with negative activation values will be set to zero. Max-pooling: only the largest activation in each group is transmitted to the next layer, and this effectively downsamples the output. LWTA: the largest activation in each group retains its value when entering the next layer, others are set to zero. k-WTA: the k largest activations in the entire layer retain their values when entering the next layer, others are set to zero (k = 3 in this example). Note that the output is not downsampled through ReLU, LWTA and k-WTA.

2 k-WINNERS-TAKE-ALL ACTIVATION

The debut of the Winner-Takes-All (WTA) activation on the stage of neural networks dates back to the 1980s, when Grossberg (1982) introduced shunting short-term memory equations in on-center off-surround networks and showed the ability to identify the largest of N real numbers.
Later, Majani et al. (1989) generalized the WTA network to identify the K largest of N real numbers, and they termed the network the K-Winners-Take-All (KWTA) network. These early WTA-type activation functions output only boolean values, mainly motivated by the properties of biological neural circuits. In particular, Maass (2000a;b) has proved that any boolean function can be computed by a single KWTA unit. Yet, the boolean nature of these activation functions differs starkly from the modern activation functions, including the one we use.

2.1 DEEP NEURAL NETWORKS ACTIVATED BY k-WINNERS-TAKE-ALL

We propose to use k-Winners-Take-All (k-WTA) activation, a natural generalization of the boolean KWTA² (Majani et al., 1989). k-WTA retains the k largest values of an N × 1 input vector and sets all others to be zero before feeding the vector to the next network layer, namely,

    φk(y)j = yj,  if yj is one of the k largest elements of y,
             0,   otherwise.                                        (1)

Here φk : R^N → R^N is the k-WTA function (parameterized by an integer k), y ∈ R^N is the input to the activation, and φk(y)j denotes the j-th element of the output φk(y) (see the rightmost subfigure of Figure 2). Note that if y has multiple elements that are equally the k-th largest, we break the tie by retaining the elements with smaller indices until the k slots are taken.

When using k-WTA activation, we need to choose k. Yet it makes no sense to fix k throughout all layers of the neural network, because these layers often have different output dimensions; a small k for one layer's dimension can be relatively large for another. Instead of specifying k, we introduce a parameter γ ∈ (0, 1) called the sparsity ratio. If a layer has an output dimension N, then its k-WTA activation has k = ⌊γ · N⌋. Even though the sparsity ratio can be set differently for different layers, in practice we found no clear gain from introducing such a variation. Therefore, we use a fixed γ, the only additional hyperparameter needed for the neural network.

² In this paper, we use k-WTA to refer to our activation function, while using KWTA to refer to the original boolean version by Majani et al. (1989).

In convolutional neural networks (CNN), the output of a layer is a C × H × W tensor. C denotes the number of output channels; H and W indicate the feature resolution. While there are multiple choices of applying k-WTA on the tensor (for example, one can apply k-WTA individually to each channel), empirically we found that the most effective (and conceptually the simplest) way is to treat the tensor as a long C·H·W × 1 vector input to the k-WTA activation. Using k-WTA in this way is also backed by our theoretical understanding (see Sec. 3).

The runtime cost of computing a k-WTA activation is asymptotically O(N), because finding the k largest values in a list is asymptotically equivalent to finding the k-th largest value, which has an O(N) complexity (Cormen et al., 2009). This cost is comparable to ReLU's O(N) cost on an N-length vector. Thus, replacing ReLU with k-WTA introduces no significant overhead.

Remark: other WTA-type activations. Relevant to k-WTA is the local Winner-Take-All (LWTA) activation (Srivastava et al., 2013; 2014), which divides each layer's output values into local groups and applies WTA to each group individually. LWTA is similar to max-pooling (Riesenhuber & Poggio, 1999) in dividing the layer output into groups and choosing the group maximums. But unlike ReLU and max-pooling, which are C0 continuous, LWTA and our k-WTA are both discontinuous with respect to the input.
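To make the definition in Eq. (1), the sparsity ratio γ, and the flattening of a C × H × W feature tensor concrete, below is a minimal PyTorch-style sketch of the k-WTA activation, with a group-wise LWTA variant included for contrast. The class names and the use of torch.topk are our own illustration under the stated assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class KWTA(nn.Module):
    """k-Winners-Take-All (Eq. 1): keep the k = floor(gamma * N) largest entries
    of the whole flattened layer output and set every other entry to zero."""
    def __init__(self, gamma=0.1):
        super().__init__()
        self.gamma = gamma  # sparsity ratio, the only extra hyperparameter

    def forward(self, x):
        # Treat each sample's C x H x W output as one long vector, as in Sec. 2.1.
        flat = x.reshape(x.size(0), -1)
        k = max(1, int(self.gamma * flat.size(1)))
        # The k-th largest value per sample acts as a threshold; entries below it
        # are zeroed. (The paper breaks exact ties by index; this sketch may keep
        # tied entries, which is immaterial for continuous-valued activations.)
        kth = flat.topk(k, dim=1).values[:, -1:]
        return (flat * (flat >= kth).float()).reshape(x.shape)

class LWTA(nn.Module):
    """Local WTA, for contrast: the winner is chosen per group of `group_size`
    entries rather than over the entire layer (assumes the flattened size is
    divisible by group_size)."""
    def __init__(self, group_size=4):
        super().__init__()
        self.group_size = group_size

    def forward(self, x):
        groups = x.reshape(x.size(0), -1, self.group_size)
        winners = groups.argmax(dim=2, keepdim=True)
        mask = torch.zeros_like(groups).scatter_(2, winners, 1.0)
        return (groups * mask).reshape(x.shape)
```

Swapping every nn.ReLU module of an existing architecture (e.g., a ResNet block) for KWTA(gamma) is, per the description above, the only architectural change required; convolution, BatchNorm, pooling, and the training procedure stay untouched.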
The differences among ReLU, max-pooling, LWTA, and k-WTA are illustrated in Figure 2. LWTA is motivated toward preventing catastrophic forgetting (McCloskey & Cohen, 1989), whereas our use of k-WTA is for defending against adversarial threats. Both are discontinuous. But it remains unclear what LWTA's discontinuity properties are and how its discontinuities affect the network training. Our theoretical analysis (Sec. 3), in contrast, sheds some light on these fundamental questions about k-WTA, rationalizing its ability to improve adversarial robustness. Indeed, our experiments confirm that k-WTA outperforms LWTA in terms of robustness (see Appendix D.3).

WTA-type activation, albeit originating decades ago and widely studied in computational neuroscience (Douglas et al., 1989; Douglas & Martin, 2004), remains elusive in modern neural networks. Perhaps this is because it has not demonstrated a considerable improvement to the network's standard test accuracy, though it can offer an accuracy comparable to ReLU (Srivastava et al., 2013). Our analysis and proposed use of k-WTA and its enabled improvement on adversarial defense may suggest a renaissance of studying WTA.

2.2 TRAINING k-WTA NETWORKS

k-WTA networks require no special treatment in training. Any optimization algorithm (such as stochastic gradient descent) for training ReLU networks can be directly used to train k-WTA networks.

Our experiments have found that when the sparsity ratio γ is relatively small (< 0.2), the network training converges slowly. This is not a surprise. A smaller γ activates fewer neurons, effectively reducing more of the layer width and in turn the network size, and the stripped subnetwork is much less expressive (Srivastava et al., 2013). Since different training examples activate different subnetworks, collectively they make the training harder.

Nevertheless, we prefer a smaller γ. As we will discuss in the next section, a smaller γ usually leads to better robustness against finding adversarial examples. Therefore, to ease the training (when γ is small), we propose to use an iterative fine-tuning approach. Suppose the target sparsity ratio is γ1. We first train the network with a larger sparsity ratio γ0 using the standard training process. Then, we iteratively fine-tune the network. In each iteration, we reduce its sparsity ratio by a small δ and train the network for two epochs. The iteration repeats until γ0 is reduced to γ1. This incremental process introduces little training overhead, because the cost of each fine-tuning is negligible in comparison to training from scratch toward γ0. We also note that this process is optional. In practice we use it only when γ < 0.2. We show more experiments on the efficacy of the incremental training in Appendix D.2.
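The incremental schedule just described can be summarized in a short sketch. The helper names (set_sparsity, train_epochs) and the concrete numbers for the initial training length and the step δ are our own illustrative assumptions; only the "lower γ by δ, fine-tune for two epochs, repeat" structure is taken from the text above, and the sketch assumes KWTA modules like the one shown in Sec. 2.1.

```python
def set_sparsity(model, gamma):
    """Set the sparsity ratio of every k-WTA module in the model."""
    for module in model.modules():
        if isinstance(module, KWTA):
            module.gamma = gamma

def incremental_finetune(model, train_epochs, gamma0=0.2, gamma1=0.05, delta=0.005):
    """Train at a larger ratio gamma0 first, then repeatedly lower the ratio by
    delta and fine-tune for two epochs until the target ratio gamma1 is reached.
    `train_epochs(model, epochs)` is a user-supplied training routine; it can be
    regular or adversarial training."""
    set_sparsity(model, gamma0)
    train_epochs(model, epochs=80)     # full training at the easier ratio (80 is illustrative)
    gamma = gamma0
    while gamma > gamma1:
        gamma = max(gamma1, gamma - delta)
        set_sparsity(model, gamma)
        train_epochs(model, epochs=2)  # short fine-tuning step, as described above
    return model
```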
3 UNDERSTANDING k-WTA DISCONTINUITY

We now present our theoretical understanding of k-WTA's discontinuity behavior in the context of deep neural networks, revealing some implications for the network's adversarial robustness.

Figure 3: (a, b) We plot the change of 10 logit values when conducting an untargeted PGD attack with 100 iterations. The X-axis indicates the perturbation size ϵ and the Y-axis indicates the 10 color-coded logit values. (a) When we apply the PGD attack on a k-WTA ResNet18, the strong discontinuities w.r.t. the input invalidate gradient estimation, effectively defending well against the attack. (b) In contrast, for a ReLU ResNet18, the PGD attack can easily find adversarial examples due to the model's smooth change w.r.t. the input. (c) In the process of training a k-WTA ResNet18, the loss change w.r.t. model weights is largely smooth. Thus, the training is not harmed by k-WTA's discontinuities.

Activation pattern. To understand k-WTA's discontinuity, consider one layer outputting values x, passed through a k-WTA activation, and followed by the next layer whose linear weight matrix is W (see adjacent figure). Then, the value fed into the next activation can be expressed as Wφk(x), where φk(·) is the k-WTA function defined in (1). Suppose the vector x has a length l. We define the k-WTA's activation pattern under the input x as

    A(x) := {i ∈ [l] | xi is one of the k largest values in x} ⊆ [l].        (2)

Here (and throughout this paper), we use [l] to denote the integer set {1, 2, ..., l}.

Discontinuity. The activation pattern A(x) is a key notion for analyzing k-WTA's discontinuity behavior. Even an infinitesimal perturbation of x may change A(x): some element i is removed from A(x) while another element j is added in. Corresponding to this change, in the evaluation of Wφk(x), the contribution of W's column vector Wi vanishes while another column Wj suddenly takes effect. It is this abrupt change that renders the result of Wφk(x) C0 discontinuous. Such a discontinuity jump can be arbitrarily large, because the column vectors Wi and Wj can be of any difference. Once W is determined, the discontinuity jump then depends on the values of xi and xj. As explained in Appendix B, when the discontinuity occurs, xi and xj have about the same value, which depends on the choice of the sparsity ratio γ (recall Sec. 2.1): the smaller γ is, the larger the jump will be. This relationship suggests that a smaller γ will make the search of adversarial examples harder. Indeed, this is confirmed through our experiments (see Appendix D.6).

Piecewise linearity. Now, consider an n-layer k-WTA network, which can be expressed as f(x) = W(1) φk(W(2) φk(· · · φk(W(n) x + b(n)) · · ·) + b(2)) + b(1), where W(i) and b(i) are the i-th layer's weight matrix and bias vector, respectively. If the activation patterns of all layers are fixed, then f(x) is a linear function. When the activation pattern changes, f(x) switches from one linear function to another linear function. Over the entire space of x, f(x) is piecewise linear. The specific activation patterns of all layers define a specific linear piece of the function, or a linear region (following the notion introduced by Montufar et al. (2014)). Conventional ReLU (or hard tanh) networks also represent piecewise linear functions and their linear regions are joined together at their boundaries, whereas in k-WTA networks the linear regions are disconnected (see Figure 1).

Linear region density. Next, we gain some insight on the distribution of those linear regions. This is of our interest because if the linear regions are densely distributed, a small perturbation from any data point x will likely cross the boundary of the linear region where x is located. Whenever boundary crossing occurs, the gradient becomes undefined (see Figure 3-a). For the purpose of analysis, consider an input x passing through a layer followed by a k-WTA activation (see adjacent figure). The output from the activation is φk(Wx + b). We would like to understand, when x is changed into x', how likely the activation pattern of φk will change.
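Before the formal statement, this question can be probed numerically: draw W as in the upcoming theorem, nudge x by a small relative amount in a random direction, and count how often the top-k set A(Wx) changes. The experiment below is our own illustration (not part of the paper's code); consistent with Theorem 1, the flip rate increases with the layer width l.

```python
import torch

def pattern(W, x, k):
    """Activation pattern A(Wx): indices of the k largest entries of Wx."""
    return set(torch.topk(W @ x, k).indices.tolist())

def flip_rate(l, m=64, gamma=0.1, eps=1e-3, trials=200):
    """Fraction of random trials in which a relative perturbation of size ~eps
    changes the k-WTA activation pattern of a width-l layer with W ~ N(0, 1/l)."""
    k = max(1, int(gamma * l))
    flips = 0
    for _ in range(trials):
        W = torch.randn(l, m) / l ** 0.5        # entries ~ N(0, 1/l), zero bias
        x = torch.randn(m)
        # Random (not exactly perpendicular) perturbation of relative size eps.
        d = torch.randn(m)
        x_prime = x + eps * x.norm() * d / d.norm()
        flips += pattern(W, x, k) != pattern(W, x_prime, k)
    return flips / trials

for l in (64, 256, 1024, 4096):
    print(l, flip_rate(l))   # wider layers flip the activation pattern more often
```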
First, notice that if x and x' satisfy x' = c·x for some c > 0, the activation pattern remains unchanged. Therefore, we introduce a notation d(x, x') that measures the perpendicular distance between x and x', one that satisfies x' = c (x + d(x, x') x̄) for some scalar c, where x̄ is a unit vector perpendicular to x and on the plane spanned by x and x'. With this notion, and if the elements in W are initialized by sampling from N(0, 1/l) and b is initialized as zero, we find the following property:

Theorem 1 (Dense discontinuities). Given any input x ∈ R^m, some β > 0, and any x' ∈ R^m such that d²(x, x')/‖x‖₂² ≥ β, if the layer width l is larger than a threshold determined by β, γ, and m (the precise condition is given in Appendix C), then with a probability at least 1 − 2^(−m), we have A(Wx + b) ≠ A(Wx' + b). Here l is the width of the layer, and γ is again the sparsity ratio in k-WTA.

This theorem informs us that the larger the layer width l is, the smaller β and thus the smaller perpendicular perturbation distance d(x, x') is needed to trigger a change of the activation pattern; that is, as the layer width increases, the piecewise linear regions become finer (see Appendix C for the proof and more discussion). This property also echoes a similar trend in ReLU networks, as pointed out by Raghu et al. (2017).

Why is the k-WTA network trainable? While k-WTA networks are highly discontinuous as revealed by Theorem 1 and our experiments (Figure 3-a), in practice we experience no difficulty in training these networks. Our next theorem sheds some light on the reason behind the training success.

Theorem 2. Consider N data points x1, x2, ..., xN ∈ R^m. Suppose for any i ≠ j, xi/‖xi‖₂ ≠ xj/‖xj‖₂. If l is sufficiently large, then with a high probability, we have for all i ≠ j, A(Wxi + b) ∩ A(Wxj + b) = ∅.

This theorem is more formally stated in Theorem 10 in Appendix C together with a proof there. Intuitively speaking, it states that if the network is sufficiently wide, then for any i ≠ j, the activation pattern of input data xi is almost separated from that of xj. Thus, the weights for predicting xi's and xj's labels can be optimized almost independently, without changing their individual activation patterns. In practice, the activation patterns of xi and xj are not fully separated but weakly correlated. During the optimization, the activation pattern of a data point xi may change, but the chance is relatively low; a similar behavior has also been found in ReLU networks (Li & Liang, 2018; Du et al., 2018; Allen-Zhu et al., 2019a;b; Song & Yang, 2019). Further, notice that the training loss takes a summation over all training data points. This means a weight update would change only a small set of activation patterns (since the chance of having the pattern changed is low); the discontinuous change on the loss value, after taking the summation, will be negligible (see Figure 3-c). Thus, the discontinuities in k-WTA are not harmful to network training.

4 EXPERIMENTAL RESULTS

We evaluate the robustness of k-WTA networks under adversarial attacks. Our evaluation considers multiple training methods on different network architectures (see details below). When reporting statistics, we use Arob to indicate the worst-case robustness accuracy on test data under all adversarial attacks we evaluated, and use Astd to indicate the accuracy on the clean test data. We use k-WTA-γ to represent k-WTA activation with sparsity ratio γ.

4.1 ROBUSTNESS UNDER WHITE-BOX ATTACKS

The rationale behind k-WTA activation is to destroy the network gradient information needed in white-box attacks.
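For clarity on how the summary numbers below are formed: following the caption of Table 1, Arob is the empirical worst case, i.e., the minimum robust accuracy over the evaluated attacks for a given model. A short sketch using values copied from one row of Table 1 for illustration:

```python
# Per-attack robust accuracies for one model, here the AT + k-WTA-0.1 row of
# Table 1 on CIFAR-10 (values copied from the table for illustration).
per_attack_acc = {"PGD": 0.514, "C&W": 0.644, "Deepfool": 0.704, "MIM": 0.507, "BB": 0.734}

a_rob = min(per_attack_acc.values())     # empirical worst case across attacks
print(f"A_rob = {a_rob:.1%}")            # -> A_rob = 50.7%, matching the table
```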
We therefore evaluate k-WTA networks under multiple recently proposed white-box attack methods, including Projected Gradient Descent (PGD) (Madry et al., 2017), Deepfool (Moosavi-Dezfooli et al., 2016), C&W attack (Carlini & Wagner, 2017), and Momentum Iterative Method (MIM) (Dong et al., 2018). Since k-WTA activation can be used in almost any training method, be it adversarial training or not, we also consider multiple training methods, including natural (non-adversarial) training, adversarial training (AT) (Madry et al., 2017), TRADES (Zhang et al., 2019) and free adversarial training (FAT) (Shafahi et al., 2019b).

In addition, we evaluate the robustness under transfer-based black-box (BB) attacks (Papernot et al., 2017). The black-box threat model requires no knowledge about network architecture and parameters. Thus, we use a pre-trained VGG19 network (Simonyan & Zisserman, 2014) as the source model to generate adversarial examples using PGD. As demonstrated by Su et al. (2018b), VGG networks have the strongest transferability among different architectures.

Table 1: Adversarial robustness on CIFAR-10 and SVHN datasets. Arob in the last column denotes the empirical worst-case robustness among different attacks (columns) for each network optimized by different training methods (rows). The bold numbers indicate the best Arob robustness achieved on ReLU and k-WTA networks by each training method.

CIFAR-10
Training  Activation  Astd   PGD    C&W    Deepfool  MIM    BB     Arob
Natural   ReLU        92.9%  0.0%   0.0%   1.5%      0.0%   18.9%  0.0%
          k-WTA-0.1   89.3%  13.3%  27.9%  55.6%     13.1%  62.6%  13.1%
          k-WTA-0.2   91.7%  4.2%   6.2%   47.8%     3.9%   66.8%  4.2%
AT        ReLU        83.5%  46.3%  43.6%  46.8%     45.9%  71.0%  43.6%
          k-WTA-0.1   78.9%  51.4%  64.4%  70.4%     50.7%  73.4%  50.7%
          k-WTA-0.2   81.4%  48.4%  52.7%  66.1%     47.4%  73.5%  47.4%
TRADES    ReLU        79.7%  49.8%  52.3%  57.6%     49.9%  70.6%  49.8%
          k-WTA-0.1   76.6%  55.0%  62.2%  66.0%     57.5%  72.3%  55.0%
          k-WTA-0.2   80.4%  51.5%  57.7%  63.9%     53.4%  74.7%  51.5%
FAT       ReLU        82.6%  42.7%  44.4%  49.7%     41.6%  73.4%  41.6%
          k-WTA-0.1   78.4%  51.7%  66.3%  72.4%     49.1%  72.3%  49.1%
          k-WTA-0.2   82.8%  48.4%  60.5%  67.2%     46.7%  76.8%  46.7%

SVHN
Training  Activation  Astd   PGD    C&W    Deepfool  MIM    BB     Arob
Natural   ReLU        95.1%  0.0%   0.0%   2.5%      0.0%   14.7%  0.0%
          k-WTA-0.1   92.6%  10.2%  19.5%  88.7%     11.6%  51.4%  10.2%
          k-WTA-0.2   93.8%  4.3%   8.0%   86.8%     8.3%   56.7%  4.3%
AT        ReLU        84.2%  44.5%  42.7%  70.3%     48.4%  77.7%  42.7%
          k-WTA-0.1   79.9%  62.2%  65.7%  71.5%     56.9%  76.1%  56.9%
          k-WTA-0.2   82.4%  53.2%  63.6%  77.4%     52.3%  74.2%  52.3%
TRADES    ReLU        84.7%  47.4%  51.6%  76.9%     49.6%  76.5%  47.4%
          k-WTA-0.1   81.6%  61.3%  77.4%  79.4%     58.3%  78.1%  58.3%
          k-WTA-0.2   85.4%  56.7%  59.2%  71.6%     54.5%  79.3%  54.5%
FAT       ReLU        85.9%  40.8%  46.2%  76.1%     39.9%  76.9%  40.8%
          k-WTA-0.1   85.5%  57.7%  70.0%  77.0%     62.8%  75.6%  57.7%
          k-WTA-0.2   86.8%  54.3%  64.3%  74.7%     55.2%  74.4%  54.3%

In each setup, we compare the robust accuracy of k-WTA networks with standard ReLU networks on three datasets, CIFAR-10, SVHN, and MNIST. Results on the former two are reported in Table 1, while the latter is reported in Appendix D.4. We use ResNet-18 for CIFAR-10 and SVHN. The perturbation range is 0.031 (CIFAR-10) and 0.047 (SVHN) for pixels ranging in [0, 1]. More detailed training and attacking settings are reported in Appendix D.1.

The main takeaway from these experiments (in Table 1) is that k-WTA is able to universally improve the white-box robustness, regardless of the training methods. The k-WTA robustness under black-box attacks is not always significantly better than that of ReLU networks.
But black-box attacks, due to the lack of network information, are generally much harder than white-box attacks. In this sense, white-box attacks make the networks more vulnerable, and k-WTA is able to improve a network's worst-case robustness. This improvement is not tied to any specific training method, achieved with no significant overhead, just by a simple replacement of ReLU with k-WTA.

Athalye et al. (2018) showed that gradient-based defenses may render the network more vulnerable under black-box attacks than under gradient-based white-box attacks. However, we have not observed this behavior in k-WTA networks. Even under the strongest black-box attack, i.e., by generating adversarial examples from an independently trained copy of the target network, gradient-based attacks are still stronger (with higher success rates) than black-box attacks (see Appendix D.3).

Additional experiments include: 1) tests under transfer attacks across two independently trained k-WTA networks and across k-WTA and ReLU networks, 2) evaluation of k-WTA performance on different network architectures, and 3) comparison of k-WTA performance with the LWTA (recall Sec. 2.1) performance. See Appendix D.3 for details.

Figure 4: Robustness changing w.r.t. γ on CIFAR. When γ decreases, the standard test accuracy (left) starts to drop after a certain point. The robust accuracy (right) first increases then decreases.

Figure 5: Gradient-based attack's loss landscapes in k-WTA (a, b) and conventional ReLU models (c, d). Panels: (a) ResNet18-0.1, (b) ResNet18-0.1+adv, (c) ResNet18 (vanilla), (d) ResNet18+adv. (a, b) k-WTA models have much more non-convex and non-smooth landscapes. Also, the model optimized by adversarial training (b) has a lower absolute value of loss.

4.2 VARYING SPARSITY RATIO γ AND MODEL ARCHITECTURE

We further evaluate our method on various network architectures with different sparsity ratios γ. Figure 4 shows the standard test accuracies and robust accuracies against PGD attacks while γ decreases. To test on different network architectures, we apply k-WTA to ResNet18, DenseNet121 and Wide ResNet (WRN-22-12). In each case, starting from γ = 0.2, we decrease γ using incremental fine-tuning. We then evaluate the robust accuracy on the CIFAR dataset, taking 20-iteration PGD attacks with a perturbation range ϵ = 0.031 for pixels ranging in [0, 1].

We find that when γ is larger than 0.1, reducing γ has little effect on the standard accuracy, but increases the robust accuracy. When γ is smaller than 0.1, reducing γ drastically lowers both the standard and robust accuracies. The peaks in the Arob curves (Figure 4-right) are consistent with our theoretical understanding: Theorem 1 suggests that when l is fixed, a smaller γ tends to sparsify the linear region boundaries, exposing more gradients to the attacker. Meanwhile, as also discussed in Sec. 3, a smaller γ leads to a larger discontinuity jump and thus tends to improve the robustness.

4.3 LOSS LANDSCAPE IN GRADIENT-BASED ATTACKS

We now empirically unveil why k-WTA is able to improve the network's robustness (in addition to our theoretical analysis in Sec. 3). Here we visualize the attacker's loss landscape in gradient-based attacks in order to reveal the landscape change caused by k-WTA. Similar to the analysis in Tramèr et al.
(2017), we plot the attack loss of a model with respect to its input on points x = x+ϵ1g1+ϵ2g2, where x is a test sample from CIFAR test set, g1 is the direction of the loss gradient with respect to the input, g2 is another random direction, ϵ1 and ϵ2 sweep in the range of [ 0.04, 0.04], with 50 samples each. This results in a 3D landscape plot with 2500 data points (Figure 5). As shown in Figure 5, k-WTA models (with γ = 0.1) have a highly non-convex and non-smooth loss landscape. Thus, the estimated gradient is hardly useful for adversarial searches. This explains why k-WTA models can effectively resist gradient-based attacks. In contrast, Re LU models have a much smoother loss surface, from which adversarial examples can be easily found using gradient descent. Inspecting the range of loss values in Figure 5, we find that adversarial training tends to compress the loss landscape s dynamic range in both the gradient direction and the other random direction, making the dynamic range smaller than that of the models without adversarial training. This phenomenon has been observed in Re LU networks (Madry et al., 2017; Tramèr et al., 2017). Interestingly, k-WTA models manifest a similar behavior (Figure 5-a,b). Moreover, we find that in k-WTA models a larger γ leads to a smoother loss surface than a smaller γ (see Appendix D.6 for more details). Published as a conference paper at ICLR 2020 5 CONCLUSION This paper proposes to replace widely used activation functions with the k-WTA activation for improving the neural network s robustness against adversarial attacks. This is the only change we advocate. The underlying idea is to embrace the discontinuities introduced by k-WTA functions to make the search for adversarial examples more challenging. Our method comes almost for free, harmless to network training, and readily useful in the current paradigm of neural networks. Acknowledgments. This work was supported in part by the National Science Foundation (CAREER-1453101, 1816041, 1910839, 1703925, 1421161, 1714818, 1617955, 1740833), Simons Foundation (#491119 to Alexandr Andoni), Google Research Award, a Google Ph D Fellowship, a Snap Research Fellowship, a Columbia SEAS CKGSB Fellowship, and Soft Bank Group. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In Neur IPS. https://arxiv.org/pdf/1810.12065, 2019a. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In ICML. https://arxiv.org/pdf/1811.03962, 2019b. Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. ar Xiv preprint ar Xiv:1707.07397, 2017. Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, July 2018. URL https://arxiv.org/abs/ 1802.00420. Marco Barreno, Blaine Nelson, Russell Sears, Anthony D Joseph, and J Doug Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pp. 16 25. ACM, 2006. Marco Barreno, Blaine Nelson, Anthony D Joseph, and J Doug Tygar. The security of machine learning. Machine Learning, 81(2):121 148, 2010. Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ar Xiv preprint ar Xiv:1712.04248, 2017. 
Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. 2018. URL https://openreview.net/pdf?id= S18Su--CW. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39 57. IEEE, 2017. Jianbo Chen and Michael I Jordan. Boundary attack++: Query-efficient decision-based adversarial attack. ar Xiv preprint ar Xiv:1904.02144, 2019. Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. ar Xiv preprint ar Xiv:1902.02918, 2019. Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. Guneet S. Dhillon, Kamyar Azizzadenesheli, Jeremy D. Bernstein, Jean Kossaifi, Aran Khanna, Zachary C. Lipton, and Animashree Anandkumar. Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1u R4GZRZ. Published as a conference paper at ICLR 2020 Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185 9193, 2018. Rodney J Douglas and Kevan AC Martin. Neuronal circuits of the neocortex. Annu. Rev. Neurosci., 27:419 451, 2004. Rodney J Douglas, Kevan AC Martin, and David Whitteridge. A canonical microcircuit for neocortex. Neural computation, 1(4):480 488, 1989. Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. ar Xiv preprint ar Xiv:1810.02054, 2018. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Stephen Grossberg. Contour enhancement, short term memory, and constancies in reverberating neural networks. In Studies of mind and brain, pp. 332 378. Springer, 1982. Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sy J7Cl WCb. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJz6ti Cq Ym. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700 4708, 2017. Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. http://arxiv.org/abs/1511.03034, 11 2015. Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. ar Xiv preprint ar Xiv:1611.01236, 2016. Yuanzhi Li and Yingyu Liang. 
Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157 8166, 2018. Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations, 2019. URL https://openreview. net/forum?id=ryet Z20ct X. Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Michael E. Houle, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1g J1L2a W. Wolfgang Maass. On the computational power of winner-take-all. Neural computation, 12(11): 2519 2535, 2000a. Wolfgang Maass. Neural computation with winner-take-all as the only nonlinear operation. In Advances in neural information processing systems, pp. 293 299, 2000b. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Published as a conference paper at ICLR 2020 E Majani, Ruth Erlanson, and Yaser S Abu-Mostafa. On the k-winners-take-all network. In Advances in neural information processing systems, pp. 634 642, 1989. Michael Mc Closkey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109 165. Elsevier, 1989. Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924 2932, 2014. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574 2582, 2016. Nina Narodytska and Shiva Prasad Kasiviswanathan. Simple black-box adversarial perturbations for deep networks. ar Xiv preprint ar Xiv:1612.06299, 2016. Nicolas Papernot, Patrick Mc Daniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506 519. ACM, 2017. Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2847 2854. JMLR. org, 2017. Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A python toolbox to benchmark the robustness of machine learning models. ar Xiv preprint ar Xiv:1707.04131, 2017. URL http://arxiv.org/abs/1707.04131. Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019, 1999. Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. Co RR, abs/1711.09404, 2017. URL http://arxiv.org/abs/1711.09404. Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Bk J3ibb0-. Ali Shafahi, W. 
Ronny Huang, Christoph Studer, Soheil Feizi, and Tom Goldstein. Are adversarial examples inevitable? In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=r1l WUo A9FQ. Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! ar Xiv preprint ar Xiv:1904.12843, 2019b. Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS 16, pp. 1528 1540, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4139-4. doi: 10.1145/2976749.2978392. URL http://doi.acm.org/10.1145/2976749.2978392. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014. Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/ forum?id=r JUYGxb CW. Published as a conference paper at ICLR 2020 Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. ar Xiv preprint ar Xiv:1906.03593, 2019. Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems 26, pp. 2310 2318. 2013. URL http://papers.nips.cc/paper/5059-compete-to-compute. pdf. Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. Understanding locally competitive networks. ar Xiv preprint ar Xiv:1410.1165, 2014. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? a comprehensive study on the robustness of 18 deep image classification models. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.), Computer Vision ECCV 2018, pp. 644 661, Cham, 2018a. Springer International Publishing. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? a comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 631 648, 2018b. Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199. Simen Thys, Wiebe Van Ranst, and Toon Goedemé. Fooling automated surveillance cameras: adversarial patches to attack person detection. ar Xiv preprint ar Xiv:1904.08653, 2019. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick Mc Daniel. Ensemble adversarial training: Attacks and defenses. ar Xiv preprint ar Xiv:1705.07204, 2017. Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. stat, 1050:11, 2018. 
Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. 2018.

Przemyslaw Wojtaszczyk. Banach spaces for analysts, volume 25. Cambridge University Press, 1996.

Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018a. URL https://openreview.net/forum?id=Sk9yuql0Z.

Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018b.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Supplementary Document: Enhancing Adversarial Defense by k-Winners-Take-All

A OTHER RELATED WORK

In this section, we briefly review the key ideas of attacking neural network models and existing defense methods based on adversarial training.

Attack methods. Recent years have seen adversarial attacks studied extensively. The proposed attack methods fall under two general categories, white-box and black-box attacks.

The white-box threat model assumes that the attacker knows the model's structure and parameters fully. This means that the attacker can exploit the model's gradient (with respect to the input) to find adversarial examples. A baseline of such attacks is the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014), which constructs the adversarial example x' of a given labeled data (x, y) using a gradient-based rule:

    x' = x + ϵ · sign(∇x L(f(x), y)),                                      (3)

where f(x) denotes the neural network model's output, L(·) is the loss function given f(x) and the input label y, and ϵ is the perturbation range for the allowed adversarial example.

Extending FGSM, Projected Gradient Descent (PGD) (Kurakin et al., 2016) utilizes the local first-order gradient of the network in a multi-step fashion, and is considered the strongest first-order adversary (Madry et al., 2017). In each step of PGD, the adversarial example is updated by an FGSM rule, namely,

    x'_{n+1} = Π_{x,ϵ}( x'_n + ϵ · sign(∇x L(f(x'_n), y)) ),               (4)

where x'_n is the adversarial example after n steps and Π_{x,ϵ}(·) projects it back into the allowed perturbation range ϵ (such as an ϵ-ball around x under a certain distance measure). Other attacks include Deepfool (Moosavi-Dezfooli et al., 2016), C&W (Carlini & Wagner, 2017) and the momentum-based attack (Dong et al., 2018). Those methods all use first-order gradient information to construct adversarial samples.
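As a concrete reference for Eqs. (3)-(4), here is a minimal PyTorch-style sketch of an ℓ∞ PGD attack. The random start and the step size alpha are common choices and are our assumptions; the 20-step count and ϵ = 0.031 mirror the CIFAR-10 setting reported in Sec. 4, but this is an illustrative sketch rather than the paper's exact attack configuration.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.031, alpha=0.007, steps=20):
    """L_inf PGD: repeat an FGSM-style step (Eq. 3) and project back into the
    eps-ball around x after every step (Eq. 4). Pixels are assumed in [0, 1]."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        # For a k-WTA network this gradient is only a (sub)gradient of the current
        # linear piece; near the dense discontinuities it is often uninformative,
        # which is exactly what the proposed defense exploits.
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()         # gradient-sign ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)    # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                   # keep a valid pixel range
    return x_adv.detach()
```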
The black-box threat model is a strict subset of the white-box threat model. It assumes that the attacker has no information about the model's architecture or parameters. Some black-box attack models allow the attacker to query the victim neural network to gather (or reverse-engineer) information. By far the most successful black-box attack is the transfer attack (Papernot et al., 2017; Tramèr et al., 2017). The idea is to first construct adversarial examples on an adversarially trained network and then attack the black-box network model using these samples. There also exist some gradient-free black-box attack methods, such as the boundary attack (Brendel et al., 2017; Chen & Jordan, 2019), the one-pixel attack (Su et al., 2019) and the local search attack (Narodytska & Kasiviswanathan, 2016). Those methods rely on repeatedly evaluating the model and are not as effective as gradient-based white-box attacks.

Adversarial training. Adversarial training (Goodfellow et al., 2014; Madry et al., 2017; Kurakin et al., 2016; Huang et al., 2015) is by far the most successful method against adversarial attacks. It trains the network model with adversarial images generated at training time. Madry et al. (2017) showed that adversarial training in essence solves the following min-max optimization problem:

    min_f E{ max_{‖x' − x‖ ≤ ϵ} L(f(x'), y) },                              (5)

where ϵ specifies the allowed perturbation range of the training samples, and y denotes the true label of each training sample. Recent works that achieve state-of-the-art adversarial robustness rely on adversarial training (Zhang et al., 2019; Xie et al., 2018b). However, adversarial training is notoriously slow because it requires finding adversarial samples on-the-fly at each training epoch. Its prohibitive cost makes adversarial training difficult to scale to large datasets such as ImageNet (Deng et al., 2009) unless enormous computation resources are available. Recently, Shafahi et al. (2019b) revised the adversarial training algorithm so that its training time is similar to that of regular training, while keeping the standard and robust accuracy comparable to standard adversarial training.

Regularization. Another type of defense is based on regularizing the neural network, and many works of this type are combined with adversarial training. For example, feature denoising (Xie et al., 2018b) adds several denoise blocks to the network structure and trains the network with adversarial training. Zhang et al. (2019) explicitly added a regularization term to balance the trade-off between standard accuracy and robustness, obtaining state-of-the-art robust accuracy on CIFAR. Some other regularization-based methods require no adversarial training. For example, Ross & Doshi-Velez (2017) proposed to regularize the gradient of the model with respect to its input; Zheng et al. (2016) generated adversarial samples by adding random Gaussian noise to input data. However, these methods are shown to be brittle under stronger iterative gradient-based attacks such as PGD (Zhang et al., 2019). In contrast, as demonstrated in our experiments, our method without using adversarial training is able to greatly improve robustness under PGD and other attacks.

B DISCONTINUITY JUMP OF Wφk(x)

Consider a gradual and smooth change of the vector x. For ease of illustration, let us assume all the values in x are distinct. Because every element in x changes smoothly, when the activation pattern A(x) changes, the k-th largest and the (k+1)-th largest values in x must swap: the previously k-th largest value is removed from the activation pattern, while the previously (k+1)-th largest value is added in the activation pattern. Let i and j denote the indices of the two values, that is, xi is previously the k-th largest and xj is previously the (k+1)-th largest.
When this swap happens, xi and xj must be infinitesimally close to each other, and we use x* to indicate their common value. This swap affects the computation of Wφk(x). Before the swap happens, xi is in the activation pattern but xj is not, therefore Wi takes effect but Wj does not. After the swap, Wj takes effect while Wi is suppressed. Therefore, the discontinuity jump due to this swap is (Wj − Wi)·x*.

When W is determined, the magnitude of the jump depends on x*. Recall that x* is the k-th largest value in x when the swap happens. Thus, it depends on k and in turn the sparsity ratio γ: the smaller γ is, the smaller the effective k (for a fixed vector length). As a result, the k-th largest value becomes larger; when k = 1, the largest value of x is used as x*.

C THEORETICAL PROOFS

In this section, we will prove Theorem 1 and Theorem 2. The formal versions of the two theorems are Theorem 9 and Theorem 10, respectively.

Notation. We use [n] to denote the set {1, 2, ..., n}. We use 1(E) to indicate an indicator variable. If the event E happens, the value of 1(E) is 1. Otherwise the value of 1(E) is 0. For a weight matrix W, we use Wi to denote the i-th row of W. For a bias vector b, we use bi to denote the i-th entry of b.

In this section, we show some behaviors of the k-WTA activation function. Recall that an n-layer neural network f(x) with the k-WTA activation function can be seen as the following:

    f(x) = W(1) φk(W(2) φk(· · · φk(W(n) x + b(n)) · · ·) + b(2)) + b(1),

where W(i) is the weight matrix, b(i) is the bias vector of the i-th layer, and φk(·) is the k-WTA activation function, i.e., for an arbitrary vector y, φk(y) is defined as the following:

    φk(y)j = yj,  if yj is one of the top-k largest values,
             0,   otherwise.

For simplicity of the notation, if k is clear in the context, we will just use φ(y) for short. Notice that if there is a tie in the above definition, we assume the entry with the smaller index has a larger value. For a vector y ∈ R^l, we define the activation pattern A(y) ⊆ [l] as

    A(y) = {i ∈ [l] | yi is one of the top-k largest values}.

Notice that if the activation pattern A(y) is different from A(y'), then Wφ(y) and Wφ(y') will be in different linear regions. Actually, Wφ(y) may even represent a discontinuous function. In the next section, we will show that when the network is much wider, the function may be more discontinuous with respect to the input.

C.1 DISCONTINUITY WITH RESPECT TO THE INPUT

We only consider the activation pattern of the output of one layer. We consider the behavior of the network after the initialization of the weight matrix and the bias vector. At initialization, the entries of the weight matrix W are i.i.d. random Gaussian variables, and the bias vector is zero. We can show that if the weight matrix is sufficiently wide, then for any vector x, with high probability, for all vectors x' whose "perpendicular" distance from x is larger than a small threshold, the activation patterns of Wx and Wx' are different. Notice that the scaling of W does not change the activation pattern of Wx for any x; we can thus assume that each entry of W is a random variable with standard Gaussian distribution N(0, 1). Before we prove Theorem 9, let us prove several useful lemmas. The following lemmas do not depend on the randomness of the weight matrix.

Lemma 1 (Inputs with the same activation pattern form a convex set).
Given an arbitrary weight matrix W Rl m and an arbitrary bias vector b Rl, for any x Rm, the set of all the vectors x Rm satisfying A(Wx + b) = A(Wx + b) is convex, i.e., the set {x Rm | A(Wx + b) = A(Wx + b)} is convex. Proof. If A(Wx + b) = A(Wx + b), then x should satisfy: i A(Wx + b), j [l] \ A(Wx + b), Wix + bi (or >) Wjx + bj. Notice that the inequality Wix + bi (or >) Wjx + bj denotes a half hyperplane (Wi Wj)x + (bi bj) (or >) 0. Thus, the set {x Rm | A(Wx + b) = A(Wx + b)} is convex since it is an intersection of half hyperplanes. Lemma 2 (Different patterns of input points with small angle imply different patterns of input points with large angle). Let α (0, 1). Given an arbitrary weight matrix W Rl m, a bias vector b = 0, and a vector x Rm with x 2 = 1, if every vector x Rm with x 2 = 1 and x, x = α satisfies A(Wx + b) = A(Wx + b), then for any x Rm with x 2 = 1 and x, x < α, it satisfies A(Wx + b) = A(Wx + b). Proof. We draw a line between x and x . There must be a point x Rm on the line and x, x = α, where x = x / x 2. Since b = 0, we have A(Wx +b) = A(Wx +b) = A(Wx+b). Since x is on the line between x and x , we have A(Wx +b) = A(Wx+b) by convexity (see Lemma 1). Lemma 3 (A sufficient condition for different patterns). Consider two vectors y Rl and y Rl. If i A(y), j [l] \ A(y) such that y i < y j, then A(y) = A(y ). Proof. Suppose A(y) = A(y ). We have i A(y ). It means that y i is one of the top-k largest values among all entries of y . Thus y j is also one of the top-k largest values, and j should be in A(y ) which leads to a contradiction. In the remaining parts, we will assume that each entry of the weight matrix W Rl m is a standard random Gaussian variable. Lemma 4 (Upper bound of the entires of W). Consider a matrix W Rl m where each entry is a random variable with standard Gaussian distribution N(0, 1). With probability at least 0.99, i [l], Wi 2 10 Proof. Consider a fixed i [l]. We have E[ Wi 2 2] = m. By Markov s inequality, we have Pr[ Wi 2 2 > 100ml] 0.01/l. By taking union bound over all i [l], with probability at least 0.99, we have i [l], Wi 2 10 Lemma 5 (Two vectors may have different activation patterns with a good probability). Consider a matrix W Rl m where each entry is a random variable with standard Gaussian distribution N(0, 1). Let γ (0, 0.48) be the sparsity ratio of the activation, i.e., γ = k/l. For any two vectors x, x Rm with x 2 = x 2 = 1 and x, x = α for some arbitrary α (0.5, 1), with probability at least 1 2 Θ((1/α2 1)γl), A(Wx) = A(Wx ) and i A(Wx), j [l] \ A(Wx) such that Published as a conference paper at ICLR 2020 Proof. Consider arbitrary two vectors x, x Rm with x 2 = x 2 = 1 and x, x = α. We can find an orthogonal matrix Q Rm m such that x := Qx = (1, 0, 0, , 0) Rm and x := Qx = (α, 1 α2, 0, 0, , 0) Rm. Let W = WQ . Then we have W x = Wx and W x = Wx . Thus, we only need to analyze the activation patterns of W x and W x . Since Q is an orthogonal matrix and each entry of W is an i.i.d. random variable with standard Gaussian distribution N(0, 1), W = WQ is also a random matrix where each entry is an i.i.d. random variable with standard Gaussian distribution N(0, 1). Let the entries in the first column of W be X1, X2, , Xl and let the entries in the second column of W be Y1, Y2, , Yl. Then we have , Wx = W x = 1 α2Y1 αX2 + 1 α2Y2 αXl + 1 α2/(96α) and define R 1 < R1 < R2 < R 2 as follows: Pr X N(0,1)[X R 2] = (1 2ε)γ, (7) Pr X N(0,1)[X R2] = (1 ε)γ, (8) Pr X N(0,1)[X R1] = (1 + ε)γ, (9) Pr X N(0,1)[X R 1] = (1 + 2ε)γ. 
(10) Since γ < 0.48 and ε 0.02, we have (1 + 2ε)γ < 0.5. It implies 0 < R 1 < R1 < R2 < R 2. Proof. By Equation (7) and Equation (10), Pr X N(0,1)[R 1 X R 2] = 4εγ. Due to the density function of standard Gaussian distribution, we have R 1 e t2/2dt = Pr X N(0,1)[R 1 X R 2] = 4εγ. Since R 2 R 1 0, we have t [R 1, R 2], e t2/2 e R 2 2 /2. Thus, 2π e R 2 2 /2(R 2 R 1) = 1 2π e R 2 2 /2 Z R 2 R 1 e t2/2dt = 4εγ. By the tail bound of Gaussian distribution, we have Pr X N(0,1)[X R 2] e R 2 2 /2. By combining with Equation (7), we have 2π (R 2 R 1) = Pr X N(0,1)[X R 2] 1 2π (R 2 R 1) e R 2 2 /2 1 2π (R 2 R 1) which implies R 2 R 1 4ε 1 2ε where the last inequality follows from 1 2ε 0.5. Published as a conference paper at ICLR 2020 Pr X1,X2, ,Xl i=1 1(Xi R2) (1 ε/2)γl e ε2γl/24 (11) Pr X1,X2, ,Xl i=1 1(Xi R1) (1 + ε/2)γl e ε2γl/18 (12) Pr X1,X2, ,Xl i=1 1(R 2 Xi R2) εγl/2 e εγl/8 (13) Pr X1,X2, ,Xl i=1 1(R1 Xi R 1) εγl/2 e εγl/8 (14) Proof. For i [l], we have E[1(Xi R2)] = Pr[Xi R2] = (1 ε)γ by Equation (8). By Chernoff bound, we have i=1 1(Xi R2) (1 + ε/2) (1 ε)γl e (ε/2)2(1 2ε)γl/3. Since ε 0.02, i=1 1(Xi R2) (1 ε/2)γl We have E[1(Xi R1)] = Pr[Xi R1] = (1 + ε)γ by Equation (9). By Chernoff bound, we have i=1 1(Xi R1) (1 ε/3) (1 + ε)γl e (ε/3)2(1+ε)γl/2. i=1 1(Xi Ri) (1 + ε/2)γl We have E [1(R 2 Xi R2)] = Pr[R 2 Xi R2] = εγ by Equation (7) and Equation (8). By Chernoff bound, we have i=1 1(R 2 Xi R2) 1/2 εγl Similarly, we have E[1(R1 Xi R 1)] = Pr[R1 Xi R 1] = εγ by Equation (9) and Equation (10). By chernoff bound, we have Pr X1,X2, ,Xl i=1 1(R1 Xi R 1) 1/2 εγl Equation (11) says that, with high probability, i [l] with Xi R2, it has i A(Wx). Equation (12) says that, with high probability, i [l] with Xi R1, it has i A(Wx). Equation (14) (Equation (13)) says that, with high probability, there are many i [l] such that Wix [R 1, R1] (Wix [R2, R 2]). Let E = E1 E2 E3 E4, where E1: Pl i=1 1(Xi R2) (1 ε/2)γl, E2: Pl i=1 1(Xi R1) (1 + ε/2)γl, Published as a conference paper at ICLR 2020 E3: Pl i=1 1(R1 Xi R 1) εγl/2, E4: Pl i=1 1(R 2 Xi R2) εγl/2. According to Equation (11), Equation (12), Equation (13) and Equation (14), the probability that E happens is at least 1 4e ε2γl/24 (15) by union bound over E1, E2, E3, E4. Claim 5. Condition on E, the probability that i [l] with Xi [R2, R 2] such that Yi < α/ 2π is at least Proof. For a fixed i [l], Pr h Yi α/ p 2π e t2/2dt + 1 Thus, according to event E4, we have Pr h i with Xi [R2, R 2], Yi α/ p 2π | E i 16ε α Claim 6. Condition on E, the probability that i [l] with Xi [R 1, R1] such that Yi 0 is at least 1 (1/2)εγl/2 . Proof. For a fixed i [l], Pr[Yi 0] = 1/2. Thus, according to event E3, we have Pr [ i with Xi [R 1, R1], Yi 0 | E] (1/2)εγl/2. Condition on that E happens. Because of E1, if Xi R2, Xi must be one of the top-k largest values. Due to Equation (6), we have Xi = Wix. Thus, if Xi R2, i A(Wx). By Claim 5, with probability at least εγl/2 , (16) there is i A(Wx) such that Wix = αXi + p where the first step follows from Equation (6), the second step follows from Yi α/ 2π, and the last step follows from Xi [R2, R 2]. Published as a conference paper at ICLR 2020 Because of E2 if Xj R1, Xj should not be one of the top-k largest values. Due to Equation (6), we have Xj = Wjx. Thus, if Xj R1, j A(Wx). By Claim C.1, with probability at least 1 (1/2)εγl/2 , (18) there is j A(Wx) such that Wjx = αXj + p 1 α2Yj αXj αR 1, (19) where the first step follows from Equation (6), the second step follows from Yj 0, and the last step follows from Xj [R 1, R1]. 
By Equation (19) and Equation (17), i A(Wx), j [l] \ A(Wx), Wix α(R 2 16ε 2π) α(R 1 8ε 2π) Wjx 8αε where the second step follows from Claim 3, the forth step follows from α 0.5, and the last step follows from ε = 1 α2/(96α). By Lemma 3, we can conclude A(Wx) = A(Wx ). By Equation (15), Equation (16), Equation (18), and union bound, the overall probability is at least 4e ε2γl/24 + 16ε α 4e ε2γl/24 + 2 where the first and the last step follows from ε = Next, we will use a tool called ε-net. Definition 7 (ε-Net). For a given set S, if there is a set N S such that x S there exists a vector y N such that x y 2 ε, then N is an ε-net of S. There is a standard upper bound of the size of an ε-net of a unit norm ball. Lemma 6 (Wojtaszczyk (1996) II.E, 10). Given a matrix U Rm d, let S = {Uy | Uy 2 = 1}. For ε (0, 1), there is an ε-net N of S with |N| (1 + 1/ε)d. Now we can extend above lemma to the following. Lemma 7 (ε-Net for the set of points with a certain angle). Given a vector x Rm with x 2 = 1 and a parameter α ( 1, 1), let S = {x Rm | x 2 = 1, x, x = α}. For ε (0, 1), there is an ε-net N of S with |N| (1 + 1/ε)m 1. Proof. Let U Rm (m 1) have orthonormal columns and Ux = 0. Then S can be represented as S = {α x + p 1 α2 Uy | y Rm 1, Uy 2 = 1}. Let S = {Uy | y Rm 1, Uy 2 = 1}. According to Lemma 6, there is an ε-net N of S with size |N | (1 + 1/ε)m 1. We construct N as following: N = {α x + p 1 α2 z | z N }. It is obvious that |N| = |N | (1 + 1/ε)m 1. Next, we will show that N is indeed an ε-net of S. Let x be an arbitrary vector from S. Let x = α x + 1 α2 z for some z S . There is a vector (α x + 1 α2 z ) N such that z N and z z 2 ε. Thus, we have 1 α2 z ) 2 = p 1 α2 z z 2 ε. Published as a conference paper at ICLR 2020 Theorem 8 (Rotating a vector a little bit may change the activation pattern). Consider a weight matrix W Rl m where each entry is an i.i.d. sample drawn from the Gaussian distribution N(0, 1/l). Let γ (0, 0.48) be the sparsity ratio of the activation function, i.e., γ = k/l. With probability at least 0.99, it has i [l], Wi 2 10 m. Condition on that i [l], Wi 2 10 m happens, then, for any x Rm and α (0.5, 1), if l C m + log(1/δ) log m + log(1/δ) for a sufficiently large constant C, with probability at least 1 δ 2 m, x Rm with x,x x 2 x 2 α, A(Wx) = A(Wx ). Proof. Notice that the scale of W does not affect the activation pattern of Wx for any x Rm. Thus, we assume that each entry of W is a standard Gaussian random variable in the remaining proof, and we will instead condition on i [l], Wi 2 10 ml. The scale of x or x will not affect x,x x 2 x 2 . It will not affect the activation pattern either. Thus, we assume x 2 = x 2 = 1. By Lemma 4, with probability at least 0.99, we have i [l], Wi 2 10 S = {y Rm | y 2 = 1, x, y = α}. By Lemma 7, there is an ε-net N of S such that By Lemma 5, for any y N, with probability at least 1 2 Θ((1/α2 1)γl), i A(Wx), j [l] \ A(Wx) such that By taking union bound over all y N, with probability at least 1 |N| 2 Θ((1/α2 1)γl) α2 1)γ m+log(1/δ) 1 α2 log ml 1 α2 // C is a sufficiently large constant 2 C (m+log(1/δ)) log ml 1 α2 the following event E happens: y N, i A(Wx), j [l] \ A(Wx) such that Published as a conference paper at ICLR 2020 In the remaining of the proof, we will condition on the event E . Consider y S. 
Since N is an ε-net of S, we can always find a y N such that Since event E happens, we can find i A(Wx) and j [l] \ A(Wx) such that Then, we have Wiy = Wiy + Wi(y y) 2π + Wi 2 y y 2 = Wjy + Wj(y y ) Wjy + Wj 2 y y 2 where the second step follows from Wiy < Wjy 2π and Wi(y y) Wi 2 y y 2, the third step follows from Wi 2 10 ml and y y 2 p 2π(1 α2)/(720α ml), the sixth step follows from Wj(y y ) Wj 2 y y 2, and the seventh step follows from Wi 2 10 ml and y y 2 p 2π(1 α2)/(720α ml). By Lemma 3, we know that A(Wx) = A(Wy ). Thus, y Rm with y 2 = 1 and x, y = α, we have A(Wx) = A(Wy ) conditioned on E . By Lemma 2, we can conclude that x Rm with x 2 = 1 and x, x α, we have A(Wx) = A(Wx ) conditioned on E . Theorem 9 (A formal version of Theorem 1). Consider a weight matrix W Rl m where each entry is an i.i.d. sample drawn from the Gaussian distribution N(0, 1/l). Let γ (0, 0.48) be the sparsity ratio of the activation function, i.e., γ = k/l. With probability at least 0.99, it has i [l], Wi 2 10 m. Condition on that i [l], Wi 2 10 m happens, then, for any x Rm, if l C m + log(1/δ) log m + log(1/δ) for some β (0, 1) and a sufficiently large constant C, with probability at least 1 δ 2 m, x Rm with x 2 2/ x 2 2 β, A(Wx) = A(Wx ), where x = c (x + x) for some scaler c, and x is perpendicular to x. Proof. If x, x 0, then the statement follows from Theorem 8 directly. In the following, we consider the case x, x > 0. If x 2/ x 2 2 β, x, x 2 x 2 2 x 2 2 = c2 x 4 2 x 2 2(c2( x 2 2 + x 2 2)) = x 2 2 x 2 2 + x 2 2 x 2 2 x 2 2 + β x 2 2 1 1 + β . Published as a conference paper at ICLR 2020 Thus, we have the bounds: x 2 2 x 2 2 1 By Theorem 8, we conclude the proof. Example 1. Suppose that the training data contains N points x1, x2, , x N Rm (m Ω(log N)), where each entry of xi for i [N] is an i.i.d. Bernoulli random variable, i.e., each entry is 1 with some probability p (100 log(N)/m, 0.5) and 0 otherwise. Consider a weight matrix W Rl m where each entry is an i.i.d. sample drawn from the Gaussian distribution N(0, 1/l). Let γ (0, 0.48) be the sparsity ratio of the activation function, i.e., γ = k/l. If l Ω(m/γ log(m/γ)), then with probability at least 0.9, i, j [N], the activation pattern of Wxi and Wxj are different, i.e., A(Wxi) = A(Wxj). Proof. Firstly, let us bound xi 2. We have E[ xi 2 2] = E [Pm t=1 xi,t] = pm. By Bernstein inequality, we have t=1 xi,t pm 2e (pm/10)2/2 10 pm 0.01/N. Thus, by taking union bound over all i [N], with probability at least 0.99, i [N], 0.9pm xi 2 1.1pm. Next we consider xi, xj . Notice that E[ xi, xj ] = E [Pm t=1 xi,txj,t] = p2m. There are two cases. Case 1 (p2m > 20 log N). By Bernstein inequality, we have Pr xi, xj p2m > 1 2p2m 2e (p2m/2)2/2 p2m+ 1 3 1 2 p2m = 2e 3 28 p2m 0.01/N 2. By taking union bound over all pairs of i, j, with probability at least 0.99, i = j, xi, xj 3 2p2m. Since xi 2, xj 2 0.9pm, we have xi, xj xi 2 xj 2 3p2m/2 Case 2 (p2m 20 log N). By Bernstein inequality, we have Pr xi, xj p2m > 10 log N 2e (10 log N)2/2 p2m+ 1 3 10 log N 0.01/N 2. By taking union bound over all pairs of i, j, with probability at least 0.99, i = j, xi, xj 10 log N, Since xi 2, xj 2 0.9pm 90 log N, we have xi, xj xi 2 xj 2 10 log N 90 log N = 1 Thus, with probability at least 0.98, we have i = j, xi, xj /( xi 2 xj 2) 5/6. By Theorem 8, with probability at least 0.99, q [l], Wq 2 10 m. Condition on this event, and since i = j we have xi, xj /( xi 2 xj 2) 5/6, by Theorem 8 again and union bound over all i [N], with probability at least 0.99, i = j, A(Wxi) = A(Wxj). 
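As a quick empirical sanity check of Example 1 (and of Theorem 8 more broadly), the sketch below samples Bernoulli data points and a random Gaussian weight matrix and counts how many pairs of points receive identical activation patterns; with a reasonably wide layer the count should be zero. The specific sizes (N, m, l, γ, p) are illustrative choices made for this sketch and do not exactly match the constants required by the statements above.

```python
import torch

torch.manual_seed(0)
N, m, p = 50, 1000, 0.4     # number of points, input dimension, Bernoulli parameter
gamma = 0.1                 # sparsity ratio gamma = k / l
l = 20000                   # layer width (illustrative; the example asks for l = Omega(m/gamma * log(m/gamma)))
k = int(gamma * l)

X = (torch.rand(N, m) < p).float()    # N Bernoulli(p) data points
W = torch.randn(l, m) / l ** 0.5      # entries ~ N(0, 1/l)

# A(W x_i): index set of the top-k entries of W x_i
patterns = [frozenset(torch.topk(W @ x, k).indices.tolist()) for x in X]
num_equal = sum(patterns[i] == patterns[j] for i in range(N) for j in range(i + 1, N))
print("pairs with identical activation patterns:", num_equal)   # expected: 0
```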
C.2 DISJOINTNESS OF ACTIVATION PATTERNS OF DIFFERENT INPUT POINTS Let X1, X2, , Xm be i.i.d. random variables drawn from the standard Gaussian distribution N(0, 1). Let Z = Pm i=1 X2 i . We use the notation χ2 m to denote the distribution of Z. If m is clear in the context, we just use χ2 for short. Lemma 8 (A property of χ2 distribution). Let Z be a random variable with χ2 m m (m 2) distribution. Given arbitrary ε, η (0, 1), if R is sufficiently large then Pr[Z (1 + ε)R]/ Pr[(1 + ε)R Z R] η. Published as a conference paper at ICLR 2020 Proof. Let R be a sufficiently large number such that: eεR/8 Rm/2 1. Let ξ = ε/4. By the density function of χ2 distribution, we have Pr[R Z (1 + ε)R] = 1 2m/2Γ(m/2) R tm/2 1e t/2dt, Pr[Z (1 + ε)R] = 1 2m/2Γ(m/2) (1+ε)R tm/2 1e t/2dt, where Γ( ) is the Gamma function, and for integer m/2, Γ(m/2) = (m/2 1)(m/2 2) 2 1 = (m/2 1)!. By our choice of R, we have Pr[R Z (1 + ε)R] 1 2m/2Γ(m/2) = 1 2m/2Γ(m/2) 2 e R/2 e (1+ε)R/2 1 2m/2Γ(m/2) 2(1 ξ) e R/2, where the first step follows from t R, tm/2 1 1, and the third step follows from e R/2 = e εR/2 ξ. We also have: Pr[Z (1 + ε)R] 1 2m/2Γ(m/2) (1+ε)R e (1 ξ)t/2dt = 1 2m/2Γ(m/2) 2 1 ξ e (1 ξ)(1+ε)R/2 1 2m/2Γ(m/2) 2 1 ξ e (1+ε/2)R/2, where the first step follos from t R, tm/2 1 eξt/2, and the third step follows from (1 ξ)(1 + ε) (1 + ε/2). Thus, we have Pr[Z (1 + ε)R] Pr[(1 + ε)R Z R] 1 (1 ξ)2 e εR/4 16 9 e εR/4 η. Lemma 9. Consider x, y, z Rm. If x,y x 2 y 2 α, x,z x 2 z 2 β for some α, β 0, then y,z y 2 z 2 α + p 1 β2. Furthermore, if β = 2+α+ 2 α2 4 , then y,z y 2 z 2 (1 εα)β, where εα (0, 1) only depends on α. Proof. Without loss of generality, we suppose x 2 = y 2 = z 2 = 1. We can decompose y as ax + y where y is perpendicular to x. We can decompose z as b1x + b2y / y 2 + z where z is perpendicular to both x and y . Then we have: y, z = ab1 + b2 y 2 α + p Published as a conference paper at ICLR 2020 where the last inequality follows from 0 b1 1, a α, and b2 p 1 β2, 0 y 2 1. By solving β α + p 1 β2, we can get β α+ 2 α2 2 . Thus, if we set β should be strictly larger than α + p 1 β2, and the gap only depends on α. Lemma 10. Give x Rm, let y Rm be a random vector, where each entry of y is an i.i.d. sample drawn from the standard Gaussian distribution N(0, 1). Given β (0.5, 1), Pr[ x, y /( x 2 y 2) β] 1/(1 + 1/ p Proof. Without loss of generality, we can assume x 2 = 1. Let y = y/ y 2. Since each entry of y is an i.i.d. Gaussian variable, y is a random vector drawn uniformly from a unit sphere. Notice that if x, y β, then x y 2 p 2(1 β). Let C = {z Rm | z 2 = 1, z x 2 p 2(1 β)} be a cap, and let S = {z Rm | z 2 = 1} be the unit sphere. Then we have Pr[ x, y β] = area(C)/area(S). According to Lemma 6, there is an p 2(1 β)-net N with |N| (1 + 1/ p 2(1 β))m. If we put a cap centered at each point in N, then the whole unit sphere will be covered. Thus, we can conclude Pr[ x, y β] 1/(1 + 1/ p Theorem 10 (A formal version of Theorem 2). Consider N data points x1, x2, , x N Rm and a weight matrix W Rl m where each entry of W is an i.i.d. sample drawn from the Gaussian distribution N(0, 1/l). Suppose i = j [N], xi, xj /( xi 2 xj 2) α for some α (0.5, 1). Fix k 1 and δ (0, 1), if l is sufficiently large, then with probability at least 1 δ, i, j [N], A(Wxi) A(Wxj) = . Proof. Notice that the scale of W and x1, x2, , x N do not affect either xi, xj /( xi 2 xj 2) or the activation pattern. Thus, we can assume x1 2 = x2 2 = = x N 2 = 1 and each entry of W is an i.i.d. standard Gaussian random variable. 
Let β = 2+α+ 2 α2 4 and εα be the same as mentioned in Lemma 9. Set ε and β as 2 , β = (1 + ε)β. 100k log(N/δ) (1 + 2/ p 2(1 β ))m , and let R satisfies Pr Z χ2m [Z (1 + ε)2R2] = δ/100 According to Lemma 8, if l is sufficiently large, then R is sufficiently large such that Pr Z χ2m [Z (1 + ε)2R2]/ Pr Z χ2 m [(1 + ε)2R2 Z R2] η. Notice that for t [l], Wt 2 2 is a random variable with χ2 m distribution. Thus, Pr[ Wt 2 (1 + ε)R] = δ/100 l . By taking union bound over all t [l], with probability at least 1 δ/100, t [l], Wt 2 (1 + ε)R. In the remaining of the proof, we will condition on that t [l], Wt 2 (1 + ε)R. Consider i, j [N], t [l], if Wtxi > β R, then we have Wtxi Wt 2 > β R (1 + ε)R β /(1 + ε) = β. Published as a conference paper at ICLR 2020 Due to Lemma 9, we have Wtxj Wt 2 < (1 εα)β. Wtxj < (1 εα)β Wt 2 (1 εα)β(1 + ε)R (1 εα)β R. (20) Notice that for i [N], t [l], we have Pr[Wtxi > β R] Pr[ Wt 2 R] Pr Wtxi l 100k log(N/δ). By Chernoff bound, with probability at least 1 δ/(100N), t=1 1(Wtxi > β R) k. By taking union bound over i [N], with probability at least 1 δ/100, i [N], t=1 1(Wtxi > β R) k. This implies that i [N], if t A(Wxi), then Wtxi > β R. Due to Equation (20), j [N], we have Wtxj < β R which implies that t A(Wxj). Thus, with probability at least 1 δ/50 1 δ probability, i = j, A(Wxi) A(Wxj) = . Remark 1. Consider any x1, x2, , x N Rm with x1 2 = x2 2 = = x N 2 = 1. If i = j [N], xi, xj α for some α (0.5, 1), then |N| (1 + 2/ p Proof. Since xi, xj α, xi xj 2 2 = xi 2 2 + xj 2 2 2 xi, xj 2 2α. Let S be the unit sphere, i.e., S = {x Rm | x 2 = 1}. Due to Lemma 6, there is a ( p 2(1 α)/2)-net N of S with size at most |N| (1 + 2/ p 2(1 α))m. Consider xi, xj, and y N. By triangle inequality, if xi y 2 < p 2(1 α)/2, then xj y 2 > p 2(1 α)/2 due to xi xj 2 p 2(1 α). Since N is a net of S, for each xi, we can find a y N such that xi y 2 < p 2(1 α)/2. Thus, we can conclude N |N| (1 + 2/ p Theorem 11. Consider N data points x1, x2, , x N Rm with their corresponding labels z1, z2, , z N R and a weight matrix W Rl m where each entry of W is an i.i.d. sample drawn from the Gaussian distribution N(0, 1/l). Suppose i = j [N], xi, xj /( xi 2 xj 2) α for some α (0.5, 1). Fix k 1 and δ (0, 1), if l is sufficiently large, then with probability at least 1 δ, there exists a vector v Rl such that i [N], v, φk(Wxi) = zi. Proof. Due to Theorem 10, with probability at least 1 δ, i = j, A(Wxi) A(Wxj) = . Let t1, t2, , t N [l] such that ti A(Wxi). Then ti A(Wxj) for j = i. For each entry vt, if t = ti for some i [N], then set vt = zi/(Wtxi). Then for i [N], we have v, φk(Wxi) = X t A(W xi) vt Wtxi = zi/(Wtixi) Wtixi = zi. Published as a conference paper at ICLR 2020 D ADDITIONAL EXPERIMENTAL RESULTS This section presents details of our experiment settings and additional results for evaluating and empirically understanding the robustness of k-WTA networks. D.1 EXPERIMENT SETTINGS First, we describe the details of setting up the experiments described in Sec. 4. To compare k-WTA networks with their Re LU counterparts, we replace all Re LU activations in a network with k-WTA activation, while retaining all other modules (such as Batch Norm, Convolution, and pooling). To test on different network architectures, including Res Net18, Dense Net121, and Wide Res Net, we use the standard implementations that are publicly available3. All experiments are conducted using Py Torch framework. Training setups. We follow the same training procedure on CIFAR-10 and SVHN datasets. 
All the ReLU networks are trained with stochastic gradient descent (SGD) with momentum 0.9. We use a learning rate of 0.1 from the first to the 50th epoch and 0.01 from the 50th to the 80th epoch. To compare with the ReLU networks, the k-WTA networks are trained in the same way. All networks are trained with a batch size of 256. For k-WTA networks with a sparsity ratio γ = 0.1, when adversarial training is not used, we train them incrementally (recall Sec. 2.2): we start with γ = 0.2 for 50 epochs of SGD (learning rate 0.1, momentum 0.9) and then decrease γ by 0.005 every 2 epochs until γ reaches 0.1. When adversarial training is enabled, we use untargeted PGD attacks with 8 iterations to construct adversarial examples. To train networks with TRADES (Zhang et al., 2019), we use the implementation of its original paper4 with the parameter 1/λ = 6, a value that reportedly leads to the best robustness according to that paper. To train networks with the free adversarial training method (Shafahi et al., 2019b), we implement the training algorithm following the original paper and set the parameter m = 8 as suggested there.

Attack setups. All attacks are evaluated under the ℓ∞ metric, with perturbation size ϵ = 0.031 (CIFAR-10) and 0.047 (SVHN) for pixels ranging in [0, 1]. We use Foolbox (Rauber et al., 2017), a third-party toolbox for evaluating adversarial robustness. We use the following setups for generating adversarial examples with the various attack methods. For the PGD attack, we use 40 iterations with a random start and a step size of 0.003. For the C&W attack, we set the number of binary search steps to 5, the maximum number of iterations to 20, the learning rate to 0.01, and the initial constant to 0.01. For Deepfool, we use 20 steps and 10 sub-samples in its configuration. For the momentum attack (MIM), we set the step size to 0.003 and the number of iterations to 20. All other parameters are set to Foolbox's default values.

3 https://github.com/kuangliu/pytorch-cifar
4 https://github.com/yaodongyu/TRADES
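For concreteness, the following is a minimal PyTorch sketch of an untargeted ℓ∞ PGD attack with the settings listed above (40 iterations, step size 0.003, random start). It is a generic re-implementation for illustration, not the Foolbox configuration actually used in the experiments, and the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=0.031, step=0.003, iters=40, random_start=True):
    """Untargeted L-infinity PGD (sketch). x: images in [0, 1]; y: ground-truth labels."""
    x_adv = x.clone().detach()
    if random_start:
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                 # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)           # project back to the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # keep a valid pixel range
    return x_adv.detach()
```

For SVHN, one would pass eps=0.047 instead, matching the perturbation size stated above.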
D.2 EFFICACY OF INCREMENTAL TRAINING

We now report additional experiments to demonstrate the efficacy of the incremental fine-tuning method (described in Sec. 2.2). As shown in Figure 6 and described in its caption, models trained with incremental fine-tuning (denoted as "w/ FT" in the plot legends) perform better in terms of both standard accuracy (denoted as "std") and robust accuracy (denoted as "Rob") when the k-WTA sparsity γ < 0.2, suggesting that fine-tuning is worthwhile when γ is small.

Figure 6: Efficacy of incremental training. We sweep through a range of sparsity ratios and evaluate the standard and robust accuracies of two network structures (left: ResNet18; right: WideResNet). We compare the performance differences between regular training (i.e., training without incremental fine-tuning) and training with incremental fine-tuning.

D.3 ADDITIONAL RESULTS ON CIFAR-10

Tests on different network architectures. We evaluate the robustness of k-WTA on different network architectures, including ResNet-18, DenseNet-121, and WideResNet-22-10. The results are reported in Table 2, where, similar to the notation used in Table 1 of the main text, Arob is the worst-case robust accuracy, i.e., under the most effective attack among PGD, C&W, Deepfool, and MIM. The training and attack settings are the same as in the other experiments described in Sec. D.1. As shown in Table 2, while the standard and robust accuracies, Astd and Arob, vary across network architectures, k-WTA networks consistently improve the worst-case robustness Arob over ReLU networks, regardless of the network architecture and training method used.

Table 2: Additional CIFAR-10 results.
Training  Model             Activation   Astd    Arob
Natural   ResNet-18         ReLU         92.9%   0.0%
                            LWTA-0.1     82.8%   3.7%
                            LWTA-0.2     84.6%   0.9%
                            k-WTA-0.1    89.3%   13.1%
                            k-WTA-0.2    91.7%   4.2%
AT        ResNet-18         ReLU         83.5%   43.6%
                            LWTA-0.1     71.4%   46.6%
                            LWTA-0.2     78.7%   43.1%
                            k-WTA-0.1    78.9%   50.7%
                            k-WTA-0.2    81.4%   47.4%
Natural   DenseNet-121      ReLU         93.6%   0.0%
                            LWTA-0.1     86.1%   4.6%
                            LWTA-0.2     88.5%   1.4%
                            k-WTA-0.1    90.5%   12.3%
                            k-WTA-0.2    93.3%   6.2%
AT        DenseNet-121      ReLU         84.2%   46.3%
                            LWTA-0.1     74.0%   49.1%
                            LWTA-0.2     80.2%   44.9%
                            k-WTA-0.1    81.6%   52.4%
                            k-WTA-0.2    83.4%   49.6%
Natural   WideResNet-22-10  ReLU         93.4%   0.0%
                            LWTA-0.1     83.7%   4.2%
                            LWTA-0.2     86.1%   2.8%
                            k-WTA-0.1    88.6%   18.3%
                            k-WTA-0.2    92.7%   7.4%
AT        WideResNet-22-10  ReLU         83.3%   43.1%
                            LWTA-0.1     74.2%   47.5%
                            LWTA-0.2     79.8%   44.7%
                            k-WTA-0.1    78.9%   50.4%
                            k-WTA-0.2    82.4%   47.1%

Comparison with LWTA. In addition, we compare k-WTA to the LWTA activation (Srivastava et al., 2013; 2014). For a fair comparison, we use the same sparsity ratio γ in both k-WTA and LWTA. As shown in Table 2, on all network architectures and training methods we tested, k-WTA networks consistently perform better than LWTA networks (in terms of both Astd and Arob). These results suggest that k-WTA is more suitable than LWTA for defending against adversarial attacks.

Transfer attack. Since a k-WTA network is architecturally similar to its ReLU counterpart, with the only difference being the activation, we evaluate their robustness under (black-box) transfer attacks across k-WTA and ReLU networks. To this end, we build a ReLU and a k-WTA-0.1 network on ResNet-18, and train both networks with natural (non-adversarial) training as well as adversarial training. This gives us four different models, denoted (in Table 3) as ReLU, k-WTA-0.1, ReLU (AT), and k-WTA-0.1 (AT). We then launch transfer attacks across each pair of models. We also consider the by-far strongest black-box attack (according to Papernot et al. (2017)): for the same model, for example a k-WTA-0.1 network optimized by adversarial training, we train two independent versions, each with a different random initialization, and apply transfer attacks across the two versions. The results are reported in Table 3, where each row corresponds to a target (attacked) model and each column corresponds to a source model from which the adversarial examples are generated. On the diagonal of Table 3, each entry corresponds to the robustness under the aforementioned transfer attacks across the two versions of the same model.

Table 3: Transferability test on CIFAR-10. Rows: target (attacked) models; columns: source models.
Target \ Source    ReLU     k-WTA-0.1   ReLU (AT)   k-WTA-0.1 (AT)
ReLU               4.8%     75.5%       59.4%       84.7%
k-WTA-0.1          61.2%    71.2%       67.8%       86.4%
ReLU (AT)          62.7%    80.9%       61.6%       78.6%
k-WTA-0.1 (AT)     79.2%    78.6%       69.2%       67.2%
The results suggest that 1) it is more difficult to transfer-attack k-WTA networks than ReLU networks using adversarial examples generated on other models, and 2) it is also more difficult to use adversarial examples of a k-WTA network to attack other models. In a sense, the adversarial examples of a k-WTA network tend to be disjoint from the adversarial examples of a ReLU network, despite their architectural similarity. Inspecting the diagonal entries of Table 3, we also find that k-WTA networks are more robust than their ReLU counterparts under the strongest black-box attack (Papernot et al., 2017), i.e., transfer attacks across two different versions of the same model.

D.4 MNIST RESULTS

On the MNIST dataset, we conduct experiments with an adversarial perturbation size ϵ = 0.3 for pixels ranging in [0, 1]. We use stochastic gradient descent (SGD) with a learning rate of 0.01 and momentum 0.9 to train a 3-layer CNN. The training takes 20 epochs for all the methods we evaluate. The robust accuracy is evaluated under PGD attacks with 20 iterations, random initialization, and a step size of 0.03. The results are summarized in Table 4. Again, k-WTA activation consistently improves robustness under all the different training methods. Even with natural (non-adversarial) training, the resulting k-WTA network still has 62.2% robust accuracy, significantly outperforming the ReLU network.

Table 4: White-box attack results on the MNIST dataset.
Activation   Training   Astd    Arob
ReLU         Natural    99.4%   0.0%
ReLU         AT         99.2%   95.0%
ReLU         TRADES     99.2%   96.0%
ReLU         FAT        98.2%   94.7%
k-WTA        Natural    99.3%   62.2%
k-WTA        AT         99.2%   96.4%
k-WTA        TRADES     99.0%   96.9%
k-WTA        FAT        98.1%   96.0%

D.5 ROBUSTNESS WITH RESPECT TO NATURAL PERTURBATIONS

We also evaluate the robustness of k-WTA networks under (non-adversarial) natural perturbations. We evaluate various types of perturbations, including adding Gaussian noise to the input image (std = 0.05/0.1), random translation (maximum 5 pixels), random rotation (maximum 10 degrees), and color jittering (i.e., randomly changing the brightness, contrast, and saturation of an image, with a maximum perturbation of 0.4), following Hendrycks & Dietterich (2019). The results are summarized in Table 5, in which all models are ResNet18. We find that under all the tested perturbations, the accuracy drops of k-WTA networks are no worse than those of ReLU networks. Note that in these tests, all our models are trained with standard data augmentations (e.g., random crop and random flip); they are not specifically trained to avert the tested perturbations.

Table 5: Robustness under natural perturbations.
Model            Clean    GN(0.05)   GN(0.1)   Translation   Rotation   Color Jitter
ReLU             92.9%    69.7%      27.0%     92.3%         88.9%      90.7%
k-WTA-0.1        89.3%    80.2%      50.9%     89.1%         86.8%      87.4%
k-WTA-0.2        91.7%    77.9%      39.6%     91.2%         88.9%      89.4%
ReLU (AT)        83.5%    80.3%      73.9%     80.5%         79.8%      74.9%
k-WTA-0.1 (AT)   78.9%    77.8%      69.8%     75.9%         74.7%      68.7%
k-WTA-0.2 (AT)   81.4%    80.6%      70.9%     79.8%         78.6%      74.7%

We also highlight an interesting finding here. Adding Gaussian noise leads to a large accuracy drop (e.g., from 92.9% to 69.7%/27.0%, as shown in the table) for the naturally trained ReLU network, but for k-WTA networks (especially k-WTA-0.1) the corresponding accuracy drop is much smaller (e.g., from 89.3% to 80.2%/50.9%). We conjecture that the dense discontinuities in k-WTA networks (recall Figure 5) effectively add noise to the input distribution, thus making the model more robust against input noise.
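As a concrete reference, the sketch below shows one way to generate the perturbation types listed above with torchvision. Only the parameters stated above (noise std, 5-pixel translation, 10-degree rotation, color jitter strength 0.4) are taken from the paper; all remaining details, including the custom noise transform and the assumption of 32x32 CIFAR-10 images, are our own choices for illustration.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add pixel-wise Gaussian noise to a [0, 1] image tensor (sketch)."""
    def __init__(self, std=0.05):
        self.std = std
    def __call__(self, img):
        return (img + self.std * torch.randn_like(img)).clamp(0.0, 1.0)

# One transform per perturbation type evaluated in Table 5 (sketch).
gaussian_noise = transforms.Compose([transforms.ToTensor(), AddGaussianNoise(std=0.05)])
translation = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(5 / 32, 5 / 32)),  # up to 5 pixels on 32x32 images
    transforms.ToTensor(),
])
rotation = transforms.Compose([transforms.RandomAffine(degrees=10), transforms.ToTensor()])
color_jitter = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```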
D.6 LOSS LANDSCAPE VISUALIZATION

In addition to the experiments shown in Figure 5 and Sec. 4.3 of the main text, we further visualize the loss landscapes of k-WTA networks when different sparsity ratios γ are used. The plots are shown in Figure 7 and are produced in the same way as Figure 5, described in Sec. 4.3. As analyzed in Sec. 3, a larger γ tends to smooth the loss surface of the k-WTA network with respect to the input, while a smaller γ renders the loss surface more discontinuous and "spiky". In addition, adversarial training tends to reduce the range of the loss values (a similar phenomenon in ReLU networks has already been reported by Madry et al. (2017) and Tramèr et al. (2017)), but that does not mean that the loss surface becomes smoother: the loss surface remains spiky.

Figure 7: Visualization of ResNet18 and DenseNet121 loss landscapes with different γ values (panels: ResNet18-0.3, ResNet18-0.2, ResNet18-0.1, DenseNet121-0.2, DenseNet121-0.1, DenseNet121-0.1+adv). The last one (DenseNet121-0.1+adv) is the result of adversarial training; the others are optimized using natural (non-adversarial) training.
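To make the visualization procedure concrete, below is a minimal sketch of how such a loss surface can be rendered: the classification loss is evaluated on a 2D grid of perturbed inputs x + a*d1 + b*d2 around a test image. The paper's Figures 5 and 7 follow the procedure described in Sec. 4.3 of the main text; the particular choice of directions here (two random, normalized directions in input space) and the grid range are our own assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_surface(model, x, y, radius=0.04, steps=41):
    """Evaluate the loss on a 2D grid around a single input x (sketch).

    x: image tensor (C, H, W); y: its label as a scalar tensor.
    Assumption: d1, d2 are random unit directions in input space.
    Returns a (steps, steps) tensor of loss values.
    """
    d1 = torch.randn_like(x); d1 /= d1.norm()
    d2 = torch.randn_like(x); d2 /= d2.norm()
    coords = torch.linspace(-radius, radius, steps)
    surface = torch.zeros(steps, steps)
    model.eval()
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            x_ab = (x + a * d1 + b * d2).clamp(0.0, 1.0).unsqueeze(0)
            surface[i, j] = F.cross_entropy(model(x_ab), y.view(1))
    return surface
```

Plotting such a grid for a small-γ k-WTA network versus a ReLU network makes the "spiky", discontinuous structure discussed above directly visible.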