# Efficient Activation Function Optimization through Surrogate Modeling

Garrett Bingham, The University of Texas at Austin and Cognizant AI Labs, San Francisco, CA 94105, garrett@gjb.ai
Risto Miikkulainen, The University of Texas at Austin and Cognizant AI Labs, San Francisco, CA 94105, risto@cs.utexas.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in several real-world tasks, with a surprising finding: a sigmoidal design that outperformed all other activation functions was discovered, challenging the status quo of always using rectifier nonlinearities in deep learning. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization.

## 1 Introduction

Activation functions are an important choice in neural network design [2, 46]. In order to realize the benefits of good activation functions, researchers often design new functions based on characteristics like smoothness, groundedness, monotonicity, and limit behavior. While these properties have proven useful, humans are ultimately limited by design biases and by the relatively small number of functions they can consider. On the other hand, automated search methods can evaluate thousands of unique functions, and as a result, often discover better activation functions than those designed by humans. However, such approaches do not usually have a theoretical justification, and instead focus only on performance. This limitation results in computationally inefficient ad hoc algorithms that may miss good solutions and may not scale to large models and datasets.

This paper addresses these drawbacks in a data-driven way through three steps. First, in order to provide a foundation for theory and algorithm development, convolutional, residual, and vision-transformer-based architectures were trained from scratch with 2,913 different activation functions, resulting in three activation function benchmark datasets: Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT. These datasets make it possible to analyze activation function properties at a large scale in order to determine which are most predictive of performance.

(GB is currently a research scientist at Google DeepMind. AQuaSurF code is available at https://github.com/cognizant-ai-labs/aquasurf, and the benchmark datasets are at https://github.com/cognizant-ai-labs/act-bench.)
The second step was to characterize the activation functions in these benchmark datasets analytically, leading to a surrogate performance measure. Exploratory data analysis revealed two activation function properties that are highly indicative of performance: (1) the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization, and (2) the activation function's output distribution. Both sets of features contribute unique information. Both are predictive of performance on their own, but they are most powerful when used in tandem. These features were combined to create a metric space where a low-dimensional representation of the activation functions was learned. This space was then used as a surrogate in the search for good activation functions.

In the third step, this surrogate was evaluated experimentally, first by verifying that it can discover known good functions in the benchmark datasets efficiently and reliably, and second by demonstrating that it can discover improved activation functions in new tasks involving different datasets, search spaces, and architectures. The representation turned out to be so powerful that an out-of-the-box regression algorithm was able to search it effectively. This algorithm improved performance on various tasks, and also discovered a sigmoidal activation function that outperformed all baselines, a surprising discovery that challenges the common practice of using ReLU and its variants.

The approach, called AQuaSurF (Activation Quality with a Surrogate Function), is orders of magnitude more efficient than past work. Indeed, whereas previous approaches evaluated hundreds or thousands of activation functions, AQuaSurF requires only tens of evaluations in order to discover functions that outperform a wide range of baseline activation functions in each context. Code implementing the AQuaSurF algorithm is available at https://github.com/cognizant-ai-labs/aquasurf.

Prior research on activation function optimization and Fisher information matrices is reviewed in Section A. This work extends it in three ways. First, the benchmark collections are made available at https://github.com/cognizant-ai-labs/act-bench, providing a foundation for further research on activation function optimization. Second, the low-dimensional representation of the Fisher information matrix makes it a practical surrogate measure, making it possible to apply it not only to activation function design, but potentially also to other applications in the future. Third, the already-discovered functions can be used immediately to improve performance in image processing tasks, and potentially in other tasks in the future.

Figure 1: Distribution of validation accuracies with 2,913 unique activation functions from the three benchmark datasets (histograms for Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT; functions are marked as kept or removed). Many activation functions result in failed training (indicated by the chance accuracy of 0.1), suggesting that searching for activation functions is a challenging problem. However, most of these functions have invalid FIM eigenvalues, and can thus be filtered out effectively.

## 2 Activation Function Benchmarks

As the first step, three activation function benchmark datasets are introduced: Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT.
Each dataset contains training results for 2,913 unique activation functions when paired with different architectures and tasks: All-CNN-C on CIFAR-10, ResNet-56 on CIFAR-10, and MobileViTv2-0.5 on Imagenette [22, 24, 31, 41, 51]. These functions were created using the main three-node computation graph from PANGAEA [5]. Details are in Appendix B.

Figure 2: Distribution of validation accuracies across the benchmark datasets (pairwise scatter plots of Act-Bench-CNN vs. Act-Bench-ResNet, Act-Bench-CNN vs. Act-Bench-ViT, and Act-Bench-ResNet vs. Act-Bench-ViT). Each point represents a unique activation function's performance on two of the three datasets. Some functions perform well on all tasks, while others are specialized.

Figure 1 shows the distribution of validation accuracies in these datasets. In all three datasets, the distribution is highly skewed towards functions that result in failed training. The plots suggest that it is difficult to design good activation functions, and explain why existing methods are computationally expensive. Notwithstanding this difficulty, the histograms show that many unique functions do achieve good performance. Thus, searching for new activation functions is a worthwhile task that requires a smart approach.

Figure 2 shows the same data as Figure 1, but with scatter plots that show how performance varies across different tasks. All three plots contain linearly correlated clusters of points in the upper right corner, suggesting that there are modifications to activation functions that make them more powerful across tasks. However, the best results come from discovering functions specialized to individual tasks, indicated by the clusters of points in the upper left and lower right corners.

The three benchmark datasets form a foundation for developing and evaluating methods for automated activation function design. In the next two sections, they are used to develop a surrogate performance metric, making it possible to scale up activation function optimization to large networks and datasets.

## 3 Features and Distance Metrics

To make efficient search for activation functions possible, the surrogate space needs to be low-dimensional, represent informative features, and have an appropriate distance metric. In the second step, an approach is developed based on (1) the eigenvalues of the Fisher information matrix and (2) the outputs of the activation function. This section motivates each feature type and develops a metric for computing distances between activation functions. They form a surrogate in the next section.

**FIM Eigenvalues** The Fisher information matrix (FIM) is an important concept in characterizing neural network models. Viewed from various perspectives, the FIM determines a neural network's capacity for learning, ability to generalize, the robustness of the network to small perturbations of its parameters, and the geometry of the loss function near the global minimum [16, 21, 27–29, 33, 34]. Consider a neural network f with weights θ. Given inputs x drawn from a training distribution Q_x, the network defines the conditional distribution R_{y|f(x;θ)}. The FIM associated with this model is

$$F = \mathbb{E}_{\substack{x \sim Q_x \\ y \sim R_{y|f(x;\theta)}}}\left[\nabla_\theta \mathcal{L}(y, f(x;\theta))\, \nabla_\theta \mathcal{L}(y, f(x;\theta))^\top\right], \quad (1)$$

where L(y, z) is the loss function representing the negative log-likelihood associated with R_{y|f(x;θ)}. The FIM has |θ| eigenvalues.
The distribution of eigenvalues can be represented by binning the eigenvalues into an m-bucket histogram, and this m-dimensional vector serves as a computational characterization of the network. To calculate the FIM and its eigenvalues, this paper uses the K-FAC approach [20, 38]. Full details are in Appendix C.

Different activation functions induce different FIM eigenvalues for a given neural network. They can be calculated at initialization without training; they can thus serve as a low-dimensional feature vector representation of the activation function. The FIM eigenvalues are immediately useful for filtering out poor activation functions; if they are invalid, the activation function is likely to fail in training (Figure 1). However, in order to use them as a surrogate, a distance metric needs to be defined.

Given a neural network architecture f, let f_ϕ and f_ψ be two instantiations with different activation functions ϕ and ψ. Let µ_l and ν_l represent the distributions of eigenvalues corresponding to the weights in layer l of neural networks f_ϕ and f_ψ, respectively, and let w_l be the number of weights in layer l of the networks. The distance between f_ϕ and f_ψ is then computed as a weighted layer-wise sum of 1-Wasserstein distances

$$d(f_\phi, f_\psi) = \sum_{l=1}^{L} W_1(\mu_l, \nu_l)/w_l. \quad (2)$$

With this distance metric, the FIM eigenvalue vector representations encode a low-dimensional embedding space for activation functions, making efficient search possible. Because the FIM eigenvalues depend on several factors (Equation 1), including the activation function ϕ, network architecture f, data distribution Q, and loss function L, they are susceptible to more potential sources of noise. Fortunately, incorporating activation function outputs helps to compensate for this noise.

**Activation Function Outputs** The shape of an activation function ψ can be described by a vector of n sample values ψ(x). If the network's weights are appropriately initialized, the input activations to its neurons are initially distributed as N(0, 1) [4]. Therefore, sampling x ∼ N(0, 1) provides an n-dimensional feature vector that represents the expected use of the activation function at initialization. A distance metric in this feature vector space can be defined naturally as the Euclidean distance

$$d(f_\phi, f_\psi) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} \left(\phi(x_i) - \psi(x_i)\right)^2}, \quad x_i \sim \mathcal{N}(0, 1). \quad (3)$$

Functions with similar shapes will have a small distance between them, while those with different shapes will have a large distance. Because these output feature vectors depend only on the activation function, they are reliable and inexpensive to compute. Most importantly, together with the FIM eigenvalues, they constitute a powerful surrogate search space, demonstrated in the next section.
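As a concrete reading of Equations 2 and 3, the following minimal sketch computes both distances with NumPy and SciPy. The helper names are illustrative, and the per-layer eigenvalue lists and weight counts are assumed to come from the K-FAC computation described in Appendix C; this is not the released AQuaSurF implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def fim_distance(eigs_a, eigs_b, layer_weights):
    """Eq. 2: weighted layer-wise sum of 1-Wasserstein distances.

    eigs_a, eigs_b: lists with one array of (log-scaled) FIM eigenvalues per
    layer, for networks f_phi and f_psi; layer_weights: weight counts w_l.
    """
    return sum(
        wasserstein_distance(mu, nu) / w
        for mu, nu, w in zip(eigs_a, eigs_b, layer_weights)
    )

def output_distance(phi, psi, n=1_000, seed=0):
    """Eq. 3: Euclidean distance between outputs on samples x ~ N(0, 1)."""
    x = np.random.default_rng(seed).standard_normal(n)
    return float(np.sqrt(np.mean((phi(x) - psi(x)) ** 2)))

# Example: ReLU vs. the identity under Eq. 3.
print(output_distance(lambda z: np.maximum(z, 0.0), lambda z: z))
```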
## 4 Using the Features as a Surrogate

Figure 3: UMAP embedding of the 2,913 activation functions in the benchmark datasets. Each point stands for a unique activation function, represented by an 80-dimensional output feature vector. The embedding locations of four common activation functions (relu(x), identity(x), tanh(x), and abs(x)) are labeled. The black x's mark coordinates interpolating between these four functions, and the grid of plots on the bottom shows reconstructed activation functions at each of these points. UMAP interpolates smoothly between different kinds of functions, suggesting that it is a good approach for learning low-dimensional representations of activation functions.

In this section, the UMAP dimensionality reduction technique is used to visualize the FIM and output features across the benchmark datasets. This visualization leads to a combined surrogate space that can be used to accelerate the search for good activation functions.

**Visualization with UMAP** The features developed above can be visualized using the UMAP algorithm [40]. UMAP is a dimension reduction approach similar to t-SNE, but is better at scaling to large sample sizes and preserving global structure [55]. As a first demonstration, Figure 3 shows a 2D representation of the 2,913 activation functions in the benchmark datasets. Each function was represented as an 80-dimensional vector of output values. Interpolating between embedded points confirms that UMAP learns a good underlying representation.

UMAP was also used to project the activation functions to nine two-dimensional spaces according to the distance metrics in Equations 2 and 3. In Figure 4, each column represents a different benchmark dataset (Act-Bench-CNN, Act-Bench-ResNet, or Act-Bench-ViT) and each row a different distance metric (FIM eigenvalues with m = |θ|/100, activation function outputs with n = 1,000, or both). The plots only show activation functions that were not filtered out. Each point represents a unique function, colored according to its validation accuracy on the benchmark task. Although the performance of each activation function is already known, this information was not given to UMAP; the embeddings are entirely unsupervised. Thus, Figure 4 illustrates how predictive each feature type is of activation function performance in each dataset. The next subsections evaluate each feature type in this role in detail, and show that utilizing both features provides better results than either feature alone. Details are in Appendix D.

**FIM Eigenvalues** The first row of Figure 4 shows the 2D UMAP embeddings of the FIM eigenvalue vectors associated with each activation function. There are clusters in these plots where the points share similar colors, indicating distinct activation functions with similar FIM eigenvalues. Such functions induce similar training dynamics in the neural network and lead to similar performance. On the other hand, some clusters contain activation functions with a wide range of performances, and some points do not belong to any cluster at all. Overall, the plots suggest that FIM eigenvalues are a useful predictor of performance, but that incorporating additional features could lead to better results.

**Activation Function Outputs** The middle row of Figure 4 shows the 2D UMAP embeddings of the output vectors associated with each activation function. Points are close to each other in this space if the corresponding activation functions have similar shapes. These plots are demonstrably more informative than the plots based on the FIM eigenvalues in three ways. First, the purple points are better separated from the others. This separation means that activation functions that fail (those achieving 0.1 chance accuracy) are better separated from those that do well. Second, most points' immediate neighbors have similar colors. This similarity means that activation functions with similar shapes lead to similar accuracy, and analyzing activation function outputs on their own is more informative than analyzing the FIM eigenvalues. Third, the plots include multiple regions where there are one-dimensional manifolds that exhibit smooth transitions in accuracy, from purple to blue to green to yellow. Thus, not only does UMAP successfully embed similar activation functions near each other, but it is also able to organize the activation functions in a meaningful way.
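As an illustration of how such embeddings can be constructed, the sketch below fits one UMAP model per feature type and combines them, mirroring the three rows of Figure 4. It is a minimal example built on the umap-learn package with random placeholder features, and it assumes umap-learn's model-composition operators (`+` for union of the fuzzy topological representations) yield a fitted model with an `embedding_` attribute; it is not the paper's released code.

```python
import numpy as np
import umap  # umap-learn package

rng = np.random.default_rng(0)
num_functions = 500
# Random stand-ins for the real features: eig_feats would hold concatenated
# per-layer FIM eigenvalue histograms, out_feats the outputs phi(x) on
# n samples x ~ N(0, 1), with one row per candidate activation function.
eig_feats = rng.standard_normal((num_functions, 64))
out_feats = rng.standard_normal((num_functions, 1000))

eig_map = umap.UMAP(n_components=2, random_state=0).fit(eig_feats)  # top row of Fig. 4
out_map = umap.UMAP(n_components=2, random_state=0).fit(out_feats)  # middle row of Fig. 4

# Union of the two fuzzy topological representations (bottom row of Fig. 4):
# functions land close together if they are similar under either feature.
combined = eig_map + out_map
coords = combined.embedding_  # 2D surrogate coordinates, one row per function
print(coords.shape)           # (500, 2)
```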
Figure 4: UMAP embeddings of activation functions for each dataset (columns: Act-Bench-CNN, Act-Bench-ResNet, Act-Bench-ViT) and feature type (rows: FIM eigenvalues, activation function outputs, eigenvalues & outputs). Each point represents a unique activation function; the points are colored by validation accuracy on the given dataset. The colored triangles identify the locations of six well-known activation functions: ELU(x), −ELU(x), tanh(x), −tanh(x), abs(x), and −abs(x). The areas of similar performance are more continuous in the bottom row; that is, using both FIM eigenvalues and activation function outputs provides a better low-dimensional representation than either feature alone.

There is one drawback to this approach: the performant activation functions (those represented by yellow dots) are often in distinct clusters. This dispersion means that a search algorithm would have to explore multiple areas of the search space in order to find all of the best functions. As the next subsection suggests, this issue can be alleviated by utilizing both FIM eigenvalues and activation function outputs.

**Combining Eigenvalues & Outputs** The UMAP algorithm uses an intermediate fuzzy topological representation to represent relationships between data points, similar to a neighborhood graph. This property makes it possible to combine multiple sources of data by taking intersections or unions of the representations in order to yield new representations [40]. The bottom row of Figure 4 utilizes both FIM eigenvalues and activation function outputs by taking the union of the two representations. Thus, activation functions are embedded close to each other in this space if they have similar shapes, if they induce similar FIM eigenvalues, or both.

The bottom row of Figure 4 shows the benefits of combining the two features. Unlike the activation function output plots, which contain multiple clusters of high-performing activation functions in different locations in the embedding space, the combined UMAP model embeds all of the best activation functions in similar regions. The combined UMAP model also places poor activation functions (purple points) at the edge of the embedding space, and brings good functions (yellow points) to the center. Thus, the embedding space is more convex, and therefore easier to optimize.

In general, activation functions with similar shapes lead to similar performances, and those with different shapes often produce different results. This property is why the middle row of Figure 4 appears locally smooth. However, in some cases the shape of the activation function does not tell the whole story, and additional information is needed to ascertain its performance. For example, the colored triangles in Figure 4 identify the location of six activation functions in the low-dimensional space. In the activation function output space (middle row), all of these functions are mapped to different regions of the space. The points are spread apart because an activation function and its negative have very different shapes, i.e. their output will be different for every nonzero input (Figure 5). In contrast, in the FIM eigenvalue space (top row), the points for these pairs of functions overlap because the FIM eigenvalues are comparable (Figure 5).
Indeed, assuming the weights are initialized from a distribution symmetric about zero, negating an activation function does not change the training dynamics of a neural network, and the two functions are functionally equivalent. This issue complicates the search process in two ways. First, good activation functions are mapped to different regions of the embedding space, and so a search algorithm must explore multiple areas in order to find the best function. Second, distinct regions of the space may contain redundant information: if ELU(x) is known to be a good activation function, it is not helpful to spend compute resources evaluating −ELU(x) only to discover that it achieves the same performance.

Negating an activation function is a clear example of a modification that changes the shape of the activation function, but does not affect the training of a neural network. More broadly, it is likely that there exist activation functions that differ in other ways (besides just negation), but that still induce similar training dynamics in neural networks. Fortunately, utilizing FIM eigenvalues and activation function outputs together provides enough information to tease out these relationships. FIM eigenvalues take into account the activation function, the neural network architecture, the loss function, and the data distribution. The eigenvalues are more meaningful features than activation function outputs, which only depend on the shape of the function. However, as Figure 4 shows, the FIM eigenvalues are noisier features, while the activation function outputs are quite reliable. Thus, utilizing both features is a natural way to combine their strengths and address their weaknesses.

Figure 5: FIM eigenvalue distributions (log scale) for different architectures and activation functions. The legends show the activation function and the corresponding validation accuracy in different tasks (first panel: ELU(x) 0.89, −ELU(x) 0.89, tanh(x) 0.72, −tanh(x) 0.73, abs(x) 0.10, −abs(x) 0.10; second panel: ELU(x) 0.91, −ELU(x) 0.91, tanh(x) 0.85, −tanh(x) 0.86, abs(x) 0.35, −abs(x) 0.23; MobileViTv2-0.5: ELU(x) 0.90, −ELU(x) 0.89, tanh(x) 0.84, −tanh(x) 0.84, abs(x) 0.75, −abs(x) 0.74). Although negating an activation function changes its shape, it does not substantially change its behavior nor its performance. FIM eigenvalues capture this relationship between activation functions. The eigenvalues are thus useful for finding activation functions that appear different but in fact behave similarly, and these discoveries in turn improve the efficiency of activation function search.

**Constructing a Surrogate** These observations suggest an opportunity for an effective surrogate measure: the UMAP coordinates in the bottom row of Figure 4 have the information needed to predict how well an activation function will perform. They capture the essence of the m- and n-dimensional feature vectors, and distill it into a 2D representation that can be computed efficiently and used to guide the search for good functions. As the third step in this research, the next two sections evaluate this process experimentally, demonstrating that it is efficient and reliable, and that it scales to new and challenging datasets and search spaces.

## 5 Searching on the Benchmarks

Searching for activation functions typically requires training a neural network from scratch in order to evaluate each candidate function fully, which is often computationally expensive.
With the benchmark datasets, the results are already precomputed. This information makes it possible to experiment with different search algorithms and conduct repeated trials to understand the statistical significance of the results. These results serve to inform both algorithm design and feature selection, as shown in this section.

**Setup** Three algorithms were evaluated: weighted k-nearest regression with k = 3 (KNR), random forest regression (RFR), and support vector regression (SVR). Gaussian process regression (GPR) was also evaluated but found to be inconsistent in preliminary experiments (Appendix E). Random search (RS) was included as a baseline comparison; it did not utilize the FIM eigenvalue filtering mechanism. The algorithms were used out of the box with default hyperparameters from the scikit-learn package [47].

They were provided different activation function features in order to understand their potential to predict performance. The features included FIM eigenvalues, activation function outputs, or both. The features were preprocessed and embedded in a two-dimensional space by UMAP. These representations are visualized in Figure 4; the coordinates of each point correspond exactly to the information given to the regression algorithms.

The ReLU activation function is ubiquitous in machine learning. For many neural network architectures, the performance with ReLU is already known [2, 45, 46], which makes it a good starting point for search. For this reason, the search algorithms began by evaluating ReLU and seven other randomly chosen activation functions. In general, such evaluation requires training from scratch, but with the benchmark datasets, it requires only looking up the precomputed results. The algorithms then used the validation accuracy of these eight functions to predict the performance of all unevaluated functions in the dataset. The activation function with the highest predicted accuracy was then evaluated. The performance of this new function was then added to the list of known results, and this process continued until 100 activation functions had been evaluated. Each experiment comprising a different search algorithm, activation function feature set, and benchmark dataset was repeated 100 times. Full details are in Appendix E.

Figure 6: Search results on the three benchmark datasets (best accuracy so far vs. number of functions evaluated on Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT; reference lines mark the benchmark optimum and the accuracy with ReLU). Each curve represents a different search algorithm (KNR, RFR, or SVR) utilizing a different UMAP feature (FIM eigenvalues, function outputs, or both; these features are visualized in Figure 4). The curves represent the validation accuracy of the best activation function discovered so far, averaged across 100 independent trials, and the shaded areas show the 95% confidence interval around the mean. In all cases, regression with UMAP features outperforms random search, and searching with both eigenvalues and outputs outperforms searching with either feature alone. Of the three regression algorithms, KNR performs the best, rapidly surpassing ReLU and quickly discovering near-optimal activation functions in all benchmark tasks. Thus, the features make it possible to find good activation functions efficiently and reliably even with off-the-shelf search methods; the benchmark datasets make it possible to demonstrate these conclusions with statistical reliability.
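The setup above reduces the search to regression over 2D coordinates plus a lookup of precomputed accuracies. The following is a minimal sketch of that loop, assuming the UMAP coordinates and benchmark accuracies are already available; the function and variable names are illustrative, not the released AQuaSurF code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def benchmark_search(coords, accuracy, start_ids, budget=100):
    """Surrogate-guided search on a precomputed benchmark.

    coords:    (num_functions, 2) UMAP coordinates (eigenvalues & outputs).
    accuracy:  (num_functions,) precomputed validation accuracies; indexing
               into this array replaces training a network from scratch.
    start_ids: indices of ReLU plus seven randomly chosen functions.
    """
    evaluated = list(start_ids)
    while len(evaluated) < budget:
        # Weighted k-nearest-neighbors regression with k = 3, as in the paper.
        knr = KNeighborsRegressor(n_neighbors=3, weights="distance")
        knr.fit(coords[evaluated], accuracy[evaluated])
        preds = knr.predict(coords)
        preds[evaluated] = -np.inf               # never re-evaluate a function
        evaluated.append(int(np.argmax(preds)))  # "evaluate" the best prediction
    best = max(evaluated, key=lambda i: accuracy[i])
    return best, [accuracy[i] for i in evaluated]
```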
**Results** Figure 6 shows the results of the searches. Importantly, the curves do not depict just one search trial. Instead, they represent the average performance aggregated from 100 independent runs, which is made possible by the benchmark datasets. As indicated by the shaded confidence intervals, the results are reliable and are not simply due to chance.

A number of conclusions can be drawn from Figure 6. First, all search algorithms, even random search, reliably discover activation functions that outperform ReLU. This finding is supported by previous work (reviewed in Section A): Although ReLU is a good activation function that performs well in many different tasks, better performance can be achieved with novel activation functions. Therefore, continuing to use ReLU in the future is unlikely to lead to the best results; the choice of activation function should be an important part of the design, similar to the choice of the network architecture or the selection of its hyperparameters.

Second, all regression algorithms outperform random search. This finding holds across the three types of activation function features and across the three benchmark datasets. The FIM eigenvalues and activation function outputs are thus important in predicting performance of activation functions.

Third, regression algorithms trained on both FIM eigenvalues and activation function outputs outperform algorithms trained on just eigenvalues or outputs alone. This result is consistent across the regression algorithms and benchmark datasets. It suggests that the FIM eigenvalues and activation function outputs contribute complementary pieces of information. The finding quantitatively reinforces the qualitative visualization in Figure 4: FIM eigenvalues are useful for matching activation functions that induce similar training dynamics in neural networks, activation function outputs enable a low-dimensional representation where search is more practical, and combining the two features results in a problem that is more convex and easier to optimize.

Fourth, the searches are efficient. Previous approaches require hundreds or thousands of evaluations to discover good activation functions [5, 6, 49]. In contrast, this paper leverages FIM eigenvalues and activation function outputs to reduce the problem to simple two-dimensional regression; the features are powerful enough that out-of-the-box regression algorithms can discover good functions with only tens of evaluations. This efficiency makes it possible to search for better functions directly on large datasets such as ImageNet [11], demonstrated next.

## 6 Searching with New Settings

Figure 7: Progress of activation function searches (validation accuracy vs. number of functions evaluated for All-CNN-C on CIFAR-100, ResNet-56 on CIFAR-100, and MobileViTv2-0.5; the legend marks the baselines elu(x), relu(x), selu(x), sigmoid(x), softplus(x), softsign(x), swish(x), and tanh(x), each newly evaluated function, and the best found so far). Each point represents the validation accuracy with a unique activation function, and the solid line indicates the performance of the best activation function found so far.
AQuaSurF discovers new activation functions that outperform all baseline functions in every case.

The experiments in Section 5 used precomputed datasets and search spaces to demonstrate that UMAP embeddings are predictive of activation function performance, and that KNR can find good functions based on them. To verify that these conclusions extend beyond the benchmark tasks, this section contains three experiments demonstrating that AQuaSurF scales up to more challenging datasets and search spaces, that the activation functions can be transferred to other tasks, and that AQuaSurF extends to new architectures and baseline functions.

**Scaling Up the Datasets and Search Space** In the first experiment, the tasks involve larger and more challenging datasets: All-CNN-C on CIFAR-100, ResNet-56 on CIFAR-100, and MobileViTv2-0.5 on ImageNet. Additionally, a larger space with 425,896 unique activation functions was searched, based on four-node computation graphs (Appendix B). This space is large, diverse, and not precomputed, putting the conclusions from the benchmark experiments to test in a production setting.

Based on the benchmark results, KNR with k = 3 was used as the search algorithm. The searches all began by evaluating the same eight existing activation functions: ELU, ReLU, SELU, sigmoid, Softplus, Softsign, Swish, and tanh. From this starting point, eight workers operated in parallel, evaluating the functions with the highest predicted performance. Details are in Appendix E.

Figure 7 shows that all three searches find improved activation functions over time, and Figure 10 in Appendix B shows how the searches navigate the search space. In every experiment, new activation functions were discovered that outperform all baseline functions. Although the search space is large, the searches are efficient, requiring only tens of evaluations to improve performance. Impressively, the search with ResNet-56 on CIFAR-100 produced an activation function that outperformed all baselines on just the second evaluation. Table 1 shows the final results from AQuaSurF. The results reinforce the fact that substantial gains can be obtained when using better activation functions than the default ReLU, and especially those optimized specifically for the task.

**Transferring to a New Task** In the second experiment, the best activation functions from Table 1 were transferred to a new task: ResNet-50 on ImageNet. As demonstrated in Table 2, good functions can be discovered efficiently in smaller tasks and then used to improve performance in larger ones.

**New Architectures and Baseline Functions** In the third experiment, the CoAtNet architecture was trained on Imagenette [10]. As a hybrid convolution and attention architecture, CoAtNet presents a new challenge for AQuaSurF. The activation functions ELiSH, GELU, HardSigmoid, LeakyReLU, and Mish [3, 23, 37, 43] were added to the original set of baseline functions (Table 1), as well as to the set of unary operators, forming a new search space for AQuaSurF to explore (Appendix B). The results show that AQuaSurF extends to architectures and baseline functions not considered in the benchmark tasks (Table 3). AQuaSurF discovered multiple activation functions that substantially outperform all baseline functions. Although the extended list of baseline functions presents a more challenging task, it also provides the surrogate with more information to use for performance prediction, resulting in the discovery of even better functions.
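Appendix B defines these search spaces through computation graphs over sets of unary and binary operators. As a rough illustration of how such a candidate pool can be enumerated, the sketch below builds functions of the three-node form binary(unary(x), unary(x)) from a deliberately tiny, illustrative operator subset (Table 4 lists the full sets used in the paper).

```python
import numpy as np
from itertools import product
from scipy.special import expit as sigmoid  # logistic sigma(x)

# Tiny illustrative operator subsets; the paper's Table 4 defines the real ones.
UNARY = {
    "relu": lambda z: np.maximum(z, 0.0),
    "tanh": np.tanh,
    "sigma": sigmoid,
    "abs": np.abs,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": np.maximum,
}

def three_node_candidates():
    """Yield (name, callable) for every binary(unary(x), unary(x)) graph."""
    for (bn, b), (un, u), (vn, v) in product(
        BINARY.items(), UNARY.items(), UNARY.items()
    ):
        yield f"{bn}({un}(x), {vn}(x))", (lambda z, b=b, u=u, v=v: b(u(z), v(z)))

candidates = dict(three_node_candidates())
print(len(candidates))  # |BINARY| * |UNARY|**2 graphs, before deduplication
```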
## 7 Understanding the Discoveries

Aside from the raw performance improvements afforded by AQuaSurF, the experiments on the new settings are particularly interesting because they illustrate both the process of refining existing activation functions and the process of discovering novel designs.

Table 1: Accuracy with different activation functions. The CIFAR-100 results show the median test accuracy from three runs, and the ImageNet results show the validation accuracy from a single run. AQuaSurF discovers novel activation functions that outperform all baselines in every case. This result demonstrates both that good functions matter, and the power of optimizing them to the task.

All-CNN-C on CIFAR-100:

| Activation function | Accuracy |
| --- | --- |
| HardSigmoid(HardSigmoid(x)) · ELU(x) | 0.6990 |
| σ(Softsign(x)) · ELU(x) | 0.6950 |
| Swish(x)/SELU(1) | 0.6931 |
| ELU | 0.6312 |
| ReLU | 0.6897 |
| SELU | 0.0100 |
| sigmoid | 0.0100 |
| Softplus | 0.6563 |
| Softsign | 0.2570 |
| Swish | 0.6913 |
| tanh | 0.3757 |

ResNet-56 on CIFAR-100:

| Activation function | Accuracy |
| --- | --- |
| Swish(−2x) | 0.7469 |
| SELU(sinh(e^arctan(x) − 1)) | 0.7458 |
| x · erfc(ELU(x)) | 0.7419 |
| ELU | 0.7411 |
| ReLU | 0.7348 |
| SELU | 0.6967 |
| sigmoid | 0.5766 |
| Softplus | 0.7397 |
| Softsign | 0.6624 |
| Swish | 0.7401 |
| tanh | 0.6754 |

MobileViTv2-0.5 on ImageNet:

| Activation function | Accuracy |
| --- | --- |
| x · σ(x) · HardSigmoid(x) | 0.6396 |
| ELU(Swish(−x)) | 0.6394 |
| Swish(x) · erfc(bessel_i0e(x)) | 0.6336 |
| ELU | 0.6233 |
| ReLU | 0.6139 |
| SELU | 0.6096 |
| sigmoid | 0.5032 |
| Softplus | 0.5853 |
| Softsign | 0.5710 |
| Swish | 0.6383 |
| tanh | 0.6098 |

Table 2: ResNet-50 top-1 accuracy on ImageNet. Results are the median of three runs. The best activation functions discovered in the searches (Table 1) successfully transfer to this new task, with eight of the nine functions outperforming ReLU.

| Activation function | Accuracy |
| --- | --- |
| x · σ(x) · HardSigmoid(x) | 0.7776 |
| Swish(x)/SELU(1) | 0.7771 |
| Swish(x) · erfc(bessel_i0e(x)) | 0.7755 |
| σ(Softsign(x)) · ELU(x) | 0.7734 |
| SELU(sinh(e^arctan(x) − 1)) | 0.7719 |
| HardSigmoid(HardSigmoid(x)) · ELU(x) | 0.7718 |
| ELU(Swish(−x)) | 0.7679 |
| Swish(−2x) | 0.7664 |
| x · erfc(ELU(x)) | 0.7635 |
| ReLU(x) | 0.7660 |

**Refinement and Novelty** Figure 8 shows different activation functions discovered during the searches. (Plots of all 100 functions evaluated in each search are included in Figures 13–16 in Appendix E.) Visually, many of the best functions (shown in Figure 8a) are similar to existing functions like ELU and Swish, with subtle changes in their saturation value, the slope of the positive segment, and the width and depth of the negative bump. This result is not surprising, since these functions formed the starting point for the search. Indeed, after a few good functions were found, much of the search process focused on refining their design (Figure 10 in Appendix B). Although these refinements appear small, they were not known ahead of time and they are significant, as evidenced by the final results (Tables 1–3).

Table 3: CoAtNet validation accuracy on Imagenette. AQuaSurF finds novel functions that outperform all baselines.

| Activation function | Accuracy |
| --- | --- |
| erfc(Softplus(x))² | 0.8907 |
| min{Softplus(x)², x} | 0.8861 |
| arcsinh(ELU(Swish(x))) | 0.8828 |
| ELiSH | 0.1000 |
| ELU | 0.8629 |
| GELU | 0.8841 |
| HardSigmoid | 0.8487 |
| LeakyReLU | 0.8815 |
| Mish | 0.8762 |
| ReLU | 0.8772 |
| SELU | 0.8194 |
| sigmoid | 0.8586 |
| Softplus | 0.8678 |
| Softsign | 0.8530 |
| Swish | 0.8736 |
| tanh | 0.8415 |

However, some of the best discovered activation functions, including the top function for the CoAtNet experiment, employ properties uncommon among the usual deep learning activation functions (Figure 8b): Some of them have discontinuous derivatives at x = 0; some do not saturate, but diverge as x → ∞; some of them contain positive bumps (in contrast to e.g. Swish, which features a negative bump).
Many of these functions performed comparably to the existing best functions, and all of them outperformed ReLU. In the future, these designs may provide a comprehensive foundation for discovering better activation functions for specific new tasks. Together, the plots show that AQuaSurF is capable of both exploitation (Figure 8a) and exploration (Figure 8b). In the future, it will be interesting to explore tradeoffs between these concepts. A more comprehensive discussion of this and other future research directions is included in Appendix F.

**Discovering a Hybrid Rectifier-Sigmoidal Activation Function** In the past, sigmoidal nonlinearities like sigmoid and tanh were often used because they saturate and thus prevent exploding signals. However, currently these functions are usually discarded in favor of rectifier nonlinearities like ReLU and its variants, as these functions give better performance on modern deep learning benchmarks [2]. Indeed, in Tables 1 and 3, sigmoid, tanh, HardSigmoid, and Softsign all perform relatively poorly. It is therefore surprising to see that the very best function discovered in the CoAtNet experiment, erfc(Softplus(x))² (bottom left of Figure 8a), is sigmoidal in shape.

Figure 8: Sample activation functions discovered with AQuaSurF in the four searches in Section 6. HS stands for HardSigmoid. (a) The top three functions (columns) discovered in each search (rows); these are the discovered functions listed in Tables 1 and 3. Many of these functions are refined versions of existing activation functions like ELU and Swish. (b) Selected novel activation functions. All of these functions outperformed ReLU and are distinct from existing activation functions. Such designs may serve as a foundation for further improvement and specialization in new settings.

Why does this function perform so well? As shown in Figure 9, the function saturates to 1 as x → −∞ and to 0 as x → ∞, and has an approximately linear region in between. The regions of the function that the neural network actually utilizes in its feedforward pass are superimposed as histograms on this plot. Interestingly, at initialization, the network does not use the saturation regimes. The inputs to the function are tightly concentrated around x = 0 for all instances of the activation function throughout the network. As training progresses, the network makes use of a larger domain of the activation function, and by the time training has concluded the network uses the saturation regimes at approximately x < −4 and x > 1.

Figure 9: The best discovered function in the CoAtNet experiment, erfc(Softplus(x))², and its utilization by the network. The red curve shows the activation function itself, and the two histograms show the distributions of inputs to the activation function at initialization and after training, aggregated across all instances of the activation function in the entire network. The network uses the function like a rectifier at initialization and like a sigmoidal activation function after training. This result suggests that sigmoidal designs may be powerful after all, thus challenging the conventional wisdom.
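For reference, the discovered function is straightforward to express directly. The sketch below uses NumPy and SciPy; the function name is illustrative, and a framework implementation would use that framework's corresponding ops instead.

```python
import numpy as np
from scipy.special import erfc

def erfc_softplus_squared(x):
    """erfc(Softplus(x))**2, the top function from the CoAtNet search.

    It saturates to 1 as x -> -inf and to 0 as x -> +inf, with a roughly
    linear transition in between (Figure 9).
    """
    softplus = np.logaddexp(0.0, x)  # log(1 + exp(x)), numerically stable
    return erfc(softplus) ** 2

print(erfc_softplus_squared(np.array([-8.0, -4.0, 0.0, 1.0, 8.0])))
```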
Thus, Figure 9 shows that erfc(Softplus(x))² serves a dual purpose. At initialization, it performs like a rectifier nonlinearity, but by the end of training, it acts like a sigmoidal nonlinearity. This discovery challenges conventional wisdom about activation function design. It shows that neural networks use activation functions in different ways in the different stages of training, and suggests that sigmoidal designs may play an important role after all.

## 8 Conclusion

This paper introduced three benchmark datasets, Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT, to support research on activation function optimization. Experiments with these datasets showed that FIM eigenvalues and activation function outputs, and their low-dimensional UMAP embeddings, predict activation function performance accurately, and can thus be used as a surrogate for finding better functions, even with out-of-the-box regression algorithms. These conclusions extended from the benchmark datasets to challenging real-world tasks, where better functions were discovered with a variety of datasets, search spaces, and architectures. AQuaSurF also discovered a highly performant sigmoidal activation function, challenging the conventional wisdom of using ReLU-like functions exclusively in deep learning. The study reinforces the idea that activation function design is an important part of deep learning, and shows that AQuaSurF is an efficient and flexible mechanism for doing it.

## References

[1] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv:1412.6830, 2014.
[2] A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete. A survey on modern trainable activation functions. Neural Networks, 2021.
[3] M. Basirat and P. M. Roth. The quest for the golden activation function. arXiv:1808.00783, 2018.
[4] G. Bingham and R. Miikkulainen. AutoInit: Analytic signal-preserving weight initialization for neural networks. arXiv:2109.08958, 2021.
[5] G. Bingham and R. Miikkulainen. Discovering parametric activation functions. Neural Networks, 148:48–65, 2022.
[6] G. Bingham, W. Macke, and R. Miikkulainen. Evolutionary optimization of deep learning activation functions. In Genetic and Evolutionary Computation Conference (GECCO '20), July 8–12, 2020, Cancún, Mexico, 2020.
[7] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.
[8] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. arXiv:1805.09501, 2018.
[9] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[10] Z. Dai, H. Liu, Q. V. Le, and M. Tan. CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
[13] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
[14] A. F. Emery and A. V. Nenarokomov. Optimal experiment design. Measurement Science and Technology, 9(6):864, 1998.
[15] Y. Furusho and K. Ikeda. Effects of skip-connection in ResNet and batch-normalization on Fisher information matrix. In INNS Big Data and Deep Learning Conference, pages 341–348. Springer, 2019.
[16] Y. Furusho and K. Ikeda. Theoretical analysis of skip connections and batch normalization from generalization and optimization perspectives. APSIPA Transactions on Signal and Information Processing, 9, 2020.
[17] S. Gonzalez and R. Miikkulainen. Evolving loss functions with multivariate Taylor polynomial parameterizations. arXiv:2002.00059, 2020.
[18] S. Gonzalez and R. Miikkulainen. Improved training speed, accuracy, and data utilization through loss function optimization. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1–8, 2020. doi: 10.1109/CEC48606.2020.9185777.
[19] M. Goyal, R. Goyal, and B. Lall. Learning activation functions: A new paradigm of understanding neural networks. arXiv:1906.09529, 2019.
[20] R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582. PMLR, 2016.
[21] T. Hayase and R. Karakida. The spectrum of Fisher information of deep networks achieving dynamical isometry. In International Conference on Artificial Intelligence and Statistics, pages 334–342. PMLR, 2021.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[23] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016.
[24] J. Howard. Imagenette. https://github.com/fastai/imagenette, 2019. Accessed: 2022-07-27.
[25] L. Huang, J. Qin, L. Liu, F. Zhu, and L. Shao. Layer-wise conditioning analysis in exploring the learning dynamics of DNNs. In European Conference on Computer Vision, pages 384–401. Springer, 2020.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[27] S. Jastrzebski, D. Arpit, O. Astrand, G. B. Kerg, H. Wang, C. Xiong, R. Socher, K. Cho, and K. J. Geras. Catastrophic Fisher explosion: Early phase Fisher matrix impacts generalization. In International Conference on Machine Learning, pages 4772–4784. PMLR, 2021.
[28] R. Karakida, S. Akaho, and S.-i. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1032–1041. PMLR, 2019.
[29] R. Karakida, S. Akaho, and S.-i. Amari. Pathological spectra of the Fisher information metric and its variants in deep neural networks. Neural Computation, 33(8):2274–2307, 2021.
[30] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017.
[31] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[32] J. Lehman and K. O. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011.
[33] T. Liang, T. Poggio, A. Rakhlin, and J. Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 888–896. PMLR, 2019.
[34] Z. Liao, T. Drummond, I. Reid, and G. Carneiro. Approximate Fisher information matrix to characterize the training of deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):15–26, 2018.
[35] H. Liu, A. Brock, K. Simonyan, and Q. V. Le. Evolving normalization-activation layers. arXiv:2004.02967, 2020.
[36] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
[37] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), page 3, 2013.
[38] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417. PMLR, 2015.
[39] J. Martens, J. Ba, and M. Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018.
[40] L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, 2018.
[41] S. Mehta and M. Rastegari. Separable self-attention for mobile vision transformers. arXiv:2206.02680, 2022.
[42] J. Mellor, J. Turner, A. Storkey, and E. J. Crowley. Neural architecture search without training. arXiv:2006.04647, 2020.
[43] D. Misra. Mish: A self regularized non-monotonic neural activation function. arXiv:1908.08681, 2019.
[44] A. Molina, P. Schramowski, and K. Kersting. Padé activation units: End-to-end learning of flexible activation functions in deep networks. arXiv:1907.06732, 2019.
[45] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[46] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall. Activation functions: Comparison of trends in practice and research for deep learning. arXiv:1811.03378, 2018.
[47] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[48] J. Pennington and P. Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In NeurIPS, pages 5415–5424, 2018.
[49] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Workshop Track Proceedings, 2018.
[50] Y. Shen, Y. Li, J. Zheng, W. Zhang, P. Yao, J. Li, S. Yang, J. Liu, and B. Cui. ProxyBO: Accelerating neural architecture search via Bayesian optimization with zero-cost proxies. arXiv:2110.10423, 2021.
[51] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR (Workshop Track), 2015.
[52] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[53] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[54] M. Tavakoli, F. Agostinelli, and P. Baldi. SPLASH: Learnable activation functions for improving accuracy and adversarial robustness. arXiv:2006.08947, 2020.
[55] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[56] C. White, A. Zela, R. Ru, Y. Liu, and F. Hutter. How powerful are performance predictors in neural architecture search? Advances in Neural Information Processing Systems, 34:28454–28469, 2021.
[57] M. Wistuba, A. Rawat, and T. Pedapati. A survey on neural architecture search. arXiv:1905.01392, 2019.
[58] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
[59] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv:1710.09412, 2017.
[60] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv:1611.01578, 2016.

## A Related Work

The techniques in this paper were inspired by prior research in multiple areas, including neural architecture and activation function search, as well as research on the FIM.

**Neural Architecture Search** In neural architecture search [NAS; 13, 57, 60], the goal is to design a neural network architecture automatically. NAS approaches typically focus on optimizing the type and location of the layers and the connections between them, but often use standard activation functions like ReLU. This work is complementary to NAS approaches, because it uses standard architectures but optimizes the design of the activation function.

**Zero-cost NAS Proxies** Recently, zero-cost NAS proxies have received increased attention [42, 50, 56]. These approaches aim to accelerate neural architecture search by using cheap surrogate calculations in place of expensive full training of architectures. This paper adopts a similar approach, using FIM eigenvalues and activation function outputs to predict which activation functions are likely to be most promising before dedicating resources to evaluating them.

**Activation Function Search** Methods for automatically discovering activation functions include reinforcement learning [49], evolutionary computation [3, 5, 6, 35], and gradient-based methods [1, 5, 19, 44, 54]. This paper builds upon existing work, focusing on efficient search and on understanding the properties that make activation functions effective.

**Other Uses of the FIM** This paper used FIM eigenvalues to predict the performance of different activation functions. The FIM is an important quantity in machine learning with several uses. One important example is optimal experiment design [14], where experiments are designed to be optimal with respect to some criterion. The criteria vary, but are often functions of the eigenvalues of the FIM, such as the maximum or minimum eigenvalue, the trace of the FIM (sum of the eigenvalues), or the determinant of the FIM (product of the eigenvalues).
Instead of choosing one optimality criterion and only considering one summary statistic, this paper keeps all of the eigenvalues of the FIM and learns an optimal distribution experimentally. Past work has also used the eigenvalues of the FIM to determine suitable values of the batch size or learning rate for neural networks [15, 16, 21, 29, 34]. The FIM provides insights into the learning dynamics of SGD [27] and the dynamics of signal propagation at different layers in networks with and without batch normalization layers [25]. The FIM has also been used to develop second-order optimization algorithms for neural networks [20, 38, 39]. Applying it to activation function design is thus a compelling further opportunity.

## B Activation Function Search Spaces

The activation functions in this paper were implemented as computation graphs from the PANGAEA search space [5]. The space includes unary and binary operators, in addition to existing activation functions [7, 12, 30, 45, 49]. This approach allows specifying families of functions in a compact manner. It is thus possible to focus the search on a space where good functions are likely to be located, and also to search it comprehensively.

**Benchmark Datasets** The benchmark datasets introduced in Section 2 contain every activation function of the three-node form binary(unary(x),unary(x)) using the operators in Table 4. The result is 5,103 activation functions, of which 2,913 are unique. This space is visualized in Figure 4. For Act-Bench-CNN and Act-Bench-ResNet, the accuracies are the median from three runs. For Act-Bench-ViT, the results are from single runs due to computational costs.

**New Settings** The first experiment in Section 6 utilized a larger search space. Specifically, it was based on the following four-node computation graphs: binary(unary(unary(x)),unary(x)), binary(unary(x),unary(unary(x))), n-ary(unary(x),unary(x),unary(x)), unary(binary(unary(x),unary(x))), and unary(unary(unary(unary(x)))). The unary and binary nodes used the operators in Table 4, and the n-ary node used the sum, product, maximum, and minimum operators. Together, these computation graphs create a search space with 1,023,516 functions, of which 425,896 are unique. This space is visualized in Figures 10 and 11.

Table 4: Activation function search spaces were defined through computation graphs consisting of basic unary and binary operators as well as existing activation functions [5].

Unary: 0, 1, x, −x, |x|, x⁻¹, x², eˣ, eˣ − 1, erf(x), erfc(x), sinh(x), cosh(x), tanh(x), arcsinh(x), arctan(x), bessel_i0e(x), bessel_i1e(x), σ(x), log(σ(x)), ReLU(x), ELU(x), SELU(x), Swish(x), Softplus(x), Softsign(x), HardSigmoid(x)

Binary: x₁ + x₂, x₁ − x₂, x₁ · x₂, x₁/x₂, x₁^x₂, max{x₁, x₂}, min{x₁, x₂}

The third experiment in Section 6, i.e. CoAtNet on Imagenette, added ELiSH, GELU, HardSigmoid, LeakyReLU, and Mish as unary operators to the original benchmark search space. In this experiment, the search space comprised functions of the form binary(unary(unary(x)),unary(x)) and unary(unary(unary(x))). This search space contains 238,341 activation functions, of which 146,779 are unique.

## C Fisher Information Matrix Details

In order to calculate the FIM, this paper uses the K-FAC approach [20, 38, 39]. This technique is summarized in this appendix, with notation similar to that of Grosse and Martens [20].

**Preliminaries** A feedforward neural network maps an input a₀ = x to an output a_L = f(x; θ) through a series of L layers.
C Fisher Information Matrix Details

In order to calculate the FIM, this paper uses the K-FAC approach [20, 38, 39]. This technique is summarized in this appendix, with notation similar to that of Grosse and Martens [20].

Preliminaries  A feedforward neural network maps an input $a_0 = x$ to an output $a_L = f(x; \theta)$ through a series of $L$ layers. Each layer $l \in \{1, \ldots, L\}$ consists of a weight matrix $W_l$, a bias vector $b_l$, and an element-wise activation function $\phi_l$. With $\bar{W}_l = (b_l \; W_l)$ and $\bar{a}_l = \begin{pmatrix} 1 \\ a_l \end{pmatrix}$, each layer implements the transformation
$$s_l = \bar{W}_l \bar{a}_{l-1}, \qquad (4)$$
$$a_l = \phi_l(s_l). \qquad (5)$$
Let $\theta = [\operatorname{vec}(\bar{W}_1)^\top \cdots \operatorname{vec}(\bar{W}_L)^\top]^\top$ represent the vector of all network parameters. Parameterized by $\theta$ and given inputs $x$ drawn from a training distribution $Q_x$, the neural network defines the conditional distribution $R_{y|f(x;\theta)}$. The Fisher information matrix associated with this model is
$$F = \mathbb{E}_{\substack{x \sim Q_x \\ y \sim R_{y|f(x;\theta)}}}\left[ \nabla_\theta \mathcal{L}(y, f(x; \theta))\, \nabla_\theta \mathcal{L}(y, f(x; \theta))^\top \right]. \qquad (6)$$
As usual in deep learning, the loss function $\mathcal{L}(y, z)$ represents the negative log-likelihood associated with $R_{y|f(x;\theta)}$ and quantifies the discrepancy between the model's prediction $z = f(x; \theta)$ and the true label $y$. The network is trained to minimize the loss by updating its parameters according to the gradient $\nabla_\theta \mathcal{L}(y, f(x; \theta))$.

Approximations  For ease of notation, write $\mathcal{D}v = \nabla_v \mathcal{L}(y, f(x; \theta))$. Recalling that $\theta = [\operatorname{vec}(\bar{W}_1)^\top \cdots \operatorname{vec}(\bar{W}_L)^\top]^\top$, the FIM can be expressed as an $L \times L$ block matrix:
$$F = \begin{pmatrix} \mathbb{E}[\operatorname{vec}(\mathcal{D}\bar{W}_1)\operatorname{vec}(\mathcal{D}\bar{W}_1)^\top] & \cdots & \mathbb{E}[\operatorname{vec}(\mathcal{D}\bar{W}_1)\operatorname{vec}(\mathcal{D}\bar{W}_L)^\top] \\ \vdots & \ddots & \vdots \\ \mathbb{E}[\operatorname{vec}(\mathcal{D}\bar{W}_L)\operatorname{vec}(\mathcal{D}\bar{W}_1)^\top] & \cdots & \mathbb{E}[\operatorname{vec}(\mathcal{D}\bar{W}_L)\operatorname{vec}(\mathcal{D}\bar{W}_L)^\top] \end{pmatrix}. \qquad (7)$$

Figure 10: Low-dimensional UMAP representation of the 425,896-function search space. The activation functions are embedded according to their outputs; each point represents a unique function. The larger points represent activation functions that were evaluated during the searches; they are colored according to their validation accuracy. Although the space is vast, the searches require only tens of evaluations to discover good activation functions.

Figure 11: A photograph of a three-dimensional scatter plot laser-engraved into a physical crystal cube. Each point represents one of the 425,896 unique activation functions in the search space. Points are arranged according to a 3D UMAP projection of the activation function outputs; the points are the same as those shown in Figure 10. The cube shows the size and complexity of the search space, and the 1D and 2D manifolds reveal the underlying structure.

Note that $\mathcal{D}\bar{W}_l = \mathcal{D}s_l\, \bar{a}_{l-1}^\top$, and recall that $\operatorname{vec}(uv^\top) = v \otimes u$. Each block of the FIM can be written as
$$F_{i,j} = \mathbb{E}\big[\operatorname{vec}(\mathcal{D}\bar{W}_i)\operatorname{vec}(\mathcal{D}\bar{W}_j)^\top\big] \qquad (8)$$
$$= \mathbb{E}\big[\operatorname{vec}(\mathcal{D}s_i\, \bar{a}_{i-1}^\top)\operatorname{vec}(\mathcal{D}s_j\, \bar{a}_{j-1}^\top)^\top\big] \qquad (9)$$
$$= \mathbb{E}\big[(\bar{a}_{i-1} \otimes \mathcal{D}s_i)(\bar{a}_{j-1} \otimes \mathcal{D}s_j)^\top\big] \qquad (10)$$
$$= \mathbb{E}\big[(\bar{a}_{i-1} \otimes \mathcal{D}s_i)(\bar{a}_{j-1}^\top \otimes \mathcal{D}s_j^\top)\big] \qquad (11)$$
$$= \mathbb{E}\big[\bar{a}_{i-1}\bar{a}_{j-1}^\top \otimes \mathcal{D}s_i\, \mathcal{D}s_j^\top\big]. \qquad (12)$$
Two approximations are necessary in order to make representation of the FIM practical. First, assume that different layers have uncorrelated weight derivatives. The FIM can then be approximated as a block-diagonal matrix, with $F_{i,j} = 0$ if $i \neq j$. Second, if one approximates the pre-activation derivatives $\mathcal{D}s_l$ and the activations $\bar{a}_{l-1}$ as independent, then the diagonal blocks of the FIM can be further decomposed into the Kronecker product of two smaller matrices:
$$F_{l,l} = \mathbb{E}\big[\bar{a}_{l-1}\bar{a}_{l-1}^\top \otimes \mathcal{D}s_l \mathcal{D}s_l^\top\big] \approx \mathbb{E}\big[\bar{a}_{l-1}\bar{a}_{l-1}^\top\big] \otimes \mathbb{E}\big[\mathcal{D}s_l \mathcal{D}s_l^\top\big]. \qquad (13)$$
Let $\Omega_l = \mathbb{E}[\bar{a}_l \bar{a}_l^\top]$ and $\Gamma_l = \mathbb{E}[\mathcal{D}s_l \mathcal{D}s_l^\top]$. The approximate empirical FIM is then written as
$$\hat{F} = \begin{pmatrix} \Omega_0 \otimes \Gamma_1 & & 0 \\ & \ddots & \\ 0 & & \Omega_{L-1} \otimes \Gamma_L \end{pmatrix}. \qquad (14)$$

Layer-Specific Implementation  The above example illustrates FIM approximation for a simple feedforward network. However, most modern architectures contain several different kinds of layers. Some layers, such as pooling, normalization, or dropout layers, do not have trainable weights, and therefore these layers are not included in the FIM [26, 52]. Each diagonal entry $\Omega_{l-1} \otimes \Gamma_l$ corresponds to one layer with weights. The calculation differs slightly depending on the layer type, but otherwise the example above can be straightforwardly extended to more complicated networks.
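To make the Kronecker factorization in Equation 13 concrete, the following minimal sketch (an illustration, not the paper's implementation) computes the two factors for a single fully connected layer from one mini-batch of activations and pre-activation gradients:

```python
import numpy as np

def kfac_block_factors(a_prev, ds):
    """Kronecker factors of one diagonal FIM block (Equation 13).

    a_prev: (batch, d_in) activations entering the layer.
    ds:     (batch, d_out) gradients of the loss w.r.t. the layer's pre-activations.
    Both arrays would come from a single forward/backward pass on a mini-batch.
    """
    batch = a_prev.shape[0]
    a_bar = np.concatenate([np.ones((batch, 1)), a_prev], axis=1)  # homogeneous coordinate
    omega = a_bar.T @ a_bar / batch   # Omega_{l-1} = E[a_bar a_bar^T]
    gamma = ds.T @ ds / batch         # Gamma_l     = E[Ds Ds^T]
    return omega, gamma

# The full block would be np.kron(omega, gamma); in practice only the factors'
# eigenvalues are needed (see the eigenvalue calculation below).
```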
Calculations for three common layer types are presented below.

Dense Layers  For dense layers, the matrices $\Omega_{l-1}$ and $\Gamma_l$ can be readily computed with one forward and backward pass through the network using a mini-batch of data. The eigenvalues are then computed using standard techniques.

Convolutional Layers  Convolutional layers require special consideration when calculating $\Omega_{l-1}$ and $\Gamma_l$. For a given layer, let $M$ represent the batch size, $T$ the set of spatial locations (typically a two-dimensional grid), $\Delta$ the set of spatial offsets from the center of the filter, and $I$ and $J$ the number of output and input maps, respectively. The activations are represented by the $M \times |T| \times J$ array $A_{l-1}$. The weights are represented by the $I \times |\Delta| \times J$ array $W_l$, which is interpreted as an $I \times |\Delta|J$ matrix. The expansion operator $\llbracket \cdot \rrbracket$ extracts patches around each spatial location and flattens them into vectors that become the rows of a matrix: $\llbracket A_{l-1} \rrbracket$ is an $M|T| \times J|\Delta|$ matrix. As with feedforward networks, the bias (if used) can be prepended to the weight matrix as $\bar{W}_l = (b_l \; W_l)$, and a homogeneous column of ones can be prepended to the expanded activations as $\llbracket A_{l-1} \rrbracket_H = (1 \; \llbracket A_{l-1} \rrbracket)$. This construction allows the forward pass to be written as
$$S_l = \llbracket A_{l-1} \rrbracket_H \bar{W}_l^\top, \qquad (15)$$
$$A_l = \phi(S_l), \qquad (16)$$
from which the factors are computed as
$$\Omega_l = \mathbb{E}\big[\llbracket A_l \rrbracket_H^\top \llbracket A_l \rrbracket_H\big], \qquad (17)$$
$$\Gamma_l = \frac{1}{|T|}\,\mathbb{E}\big[\mathcal{D}S_l^\top \mathcal{D}S_l\big]. \qquad (18)$$

Depthwise Convolutional Layers  Depthwise convolutional layers utilize separate kernels for each channel. In this case, $\llbracket A_{l-1} \rrbracket$ is an $M|T|J \times |\Delta|$ matrix. Otherwise, the factors $\Omega_{l-1}$ and $\Gamma_l$ are calculated in the same way as they are for standard convolutional layers.

Eigenvalue Calculation  Because $\hat{F}$ is a block-diagonal matrix, its eigenvalues are simply the combined eigenvalues of each block: $\lambda(\hat{F}) = \{\lambda(\hat{F}_l)\}_{l=1}^{L}$. The eigenvalue calculation for one block $\hat{F}_l = \Omega_{l-1} \otimes \Gamma_l$ is further simplified by first computing the eigenvalues $\lambda(\Omega_{l-1})$ and $\lambda(\Gamma_l)$ for each Kronecker factor separately and then returning all pairwise products of the two sets of eigenvalues. For numerical stability, the eigenvalues can first be log-scaled, and then all pairwise sums of the two sets are returned. Calculating the eigenvalues requires one forward and backward pass through the network with a mini-batch of data. The computational cost is therefore relatively cheap, especially compared with the cost of fully training a network from scratch.

It is possible for the FIM eigenvalues to be invalid. For example, if the forward-propagated activations or backward-propagated gradients explode or vanish, then the diagonal entries $\Omega_{l-1} \otimes \Gamma_l$ may be undefined. Such invalid values result from activation functions that are unstable. Therefore, invalid FIM eigenvalues provide a good way to filter out bad activation functions.

D Features and Surrogate Details

This section describes how the activation function features were implemented and how the surrogate was constructed.

Calculating FIM Eigenvalues  The FIM eigenvalues were calculated for each activation function as discussed in Section 3. The eigenvalues were log-scaled for numerical stability. By definition, the number of eigenvalues is the same as the number of weights in the neural network. To save space, the eigenvalues were binned into histograms. For a layer $l$ with $|\theta_l|$ weights, $|\theta_l|/100$ equally sized bins from $-100$ to $100$ were used. One histogram was computed for each layer in a network, and all of the histograms were concatenated together into a single feature vector for a given activation function. In this manner, the total dimensionality was 13,692 for All-CNN-C, 16,500 for ResNet-56, and 11,013 for MobileViTv2-0.5.
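The eigenvalue and histogram computations above can be sketched in a few lines. This is an illustration under stated assumptions: the Kronecker factors come from the previous sketch, and the handling of very small layers is an arbitrary choice. Unstable activation functions typically surface here as NaN or infinite log-eigenvalues, which is the signal used for filtering.

```python
import numpy as np

def block_log_eigenvalues(omega, gamma):
    """Log-scaled eigenvalues of Omega_{l-1} (x) Gamma_l via pairwise sums of logs."""
    log_w = np.log(np.linalg.eigvalsh(omega))   # nonpositive eigenvalues -> nan/-inf,
    log_g = np.log(np.linalg.eigvalsh(gamma))   # flagging an unstable activation function
    return (log_w[:, None] + log_g[None, :]).ravel()

def layer_histogram(log_eigs, n_weights):
    """Per-layer feature: |theta_l|/100 equally sized bins on [-100, 100]."""
    n_bins = max(1, n_weights // 100)
    hist, _ = np.histogram(log_eigs, bins=n_bins, range=(-100.0, 100.0))
    return hist

# The per-layer histograms are concatenated into one feature vector, e.g.:
# features = np.concatenate([layer_histogram(e, n) for e, n in per_layer_eigs])
```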
Calculating Activation Function Outputs  The activation function outputs $y = f(x)$ were calculated for each activation function $f$ by sampling $n = 1{,}000$ values $x \sim N(0, 1)$ and truncating them to the range $[-5, 5]$. The same random inputs were used for all activation functions.

Per-Layer FIM Eigenvalues  In Figure 5, the eigenvalues for the entire network are shown for completeness. However, the UMAP representations shown in Figure 4 were produced by keeping the eigenvalues at each layer separate and computing a weighted distance between them (according to Equation 2). As pointed out in the main text, FIM eigenvalues are informative but noisy features. In preliminary experiments, keeping the eigenvalues separate at each layer reduced some of this noise, resulting in a more informative Figure 4 and consequently improving the performance of the search algorithms.

FIM Eigenvalue Features  Preliminary experiments aimed to predict activation function performance using common features from the literature, including the maximum eigenvalue, minimum eigenvalue, sum of the eigenvalues, and product of the eigenvalues [14]. More recently proposed features, such as (second moment)/(first moment)^2, were also considered [48]. Ultimately, learning the relevant features from the entire eigenvalue distribution was found to be the most flexible and powerful approach.

UMAP Settings  UMAP exposes a number of parameters that can be used to customize its behavior [40]. The metric parameter determines how distances are computed between points, the n_neighbors parameter adjusts the tradeoff between the local and global structure of the data, and the min_dist parameter controls the minimum distance between points in the embedding space. The plots in Figure 4 were produced by computing the distances between FIM eigenvalues and between activation function outputs. For the FIM eigenvalues, UMAP(metric='manhattan', n_neighbors=3, min_dist=0.1) was used, and for the activation function outputs, UMAP(metric='euclidean', n_neighbors=15, min_dist=0.1) was used. The distance metrics were chosen to implement Equations 2 and 3. In preliminary experiments, decreasing n_neighbors from the default of 15 down to 3 for the FIM eigenvalues qualitatively improved the embedding for the combined features. The combined features were visualized with a union model, i.e. umap_combined = umap_fim_eigs + umap_fn_outputs [40].
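A minimal sketch of this feature pipeline is shown below. It assumes the umap-learn package and its model-composition API (the + operator referenced above); the placeholder feature arrays stand in for the real precomputed features and are not from the released code.

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
# Placeholder arrays standing in for the real features described in Appendix D:
#   fim_features:    concatenated per-layer eigenvalue histograms
#   output_features: activation outputs on the shared random inputs
fim_features = rng.random((500, 256))
output_features = rng.random((500, 1000))

umap_fim_eigs = umap.UMAP(metric="manhattan", n_neighbors=3,
                          min_dist=0.1).fit(fim_features)
umap_fn_outputs = umap.UMAP(metric="euclidean", n_neighbors=15,
                            min_dist=0.1).fit(output_features)

# Union of the two fuzzy topological representations; the composed model's
# embedding gives one 2-D point per activation function.
umap_combined = umap_fim_eigs + umap_fn_outputs
embedding_2d = umap_combined.embedding_
```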
E Experiment Details

This section specifies the details for the experiments in the main text of the paper. Several variations of the approach presented in the main text were also evaluated in preliminary experiments. The approach turned out to be robust to most of them, but the results also justify the choices used for the main experiments.

Training Details  For CIFAR-10 and CIFAR-100, balanced validation sets were created by sampling 5,000 images from the training set. Full training details and hyperparameters are listed in Tables 5 and 6.

Table 5: Training details and hyperparameter values used in the CIFAR-10 and CIFAR-100 experiments.

All-CNN-C on CIFAR-10 and CIFAR-100
Batch Size: 128
Dropout: 0.5
Epochs: 25 for Act-Bench-CNN and search (Figure 7), 50 for full evaluation (Table 1)
Image Size: 32 × 32
Learning Rate: Linear warmup to 0.1 for five epochs, then linear decay
Mean/Std. Normalization: Yes
Momentum: 0.9
Optimizer: SGD
Random Crops: 32 × 32 crops of images padded with four pixels on all sides
Random Flips: Yes
Weight Decay: 1e-4
Weight Initialization: AutoInit [4]

ResNet-56 on CIFAR-10 and CIFAR-100
Batch Size: 128
Dropout: 0.0
Epochs: 25 for Act-Bench-ResNet and search (Figure 7), 50 for full evaluation (Table 1)
Image Size: 32 × 32
Learning Rate: Linear warmup to 0.1 for five epochs, then linear decay
Mean/Std. Normalization: No
Momentum: 0.9
Optimizer: SGD
Random Crops: 32 × 32 crops of images padded with five pixels on all sides
Random Flips: Yes
Weight Decay: 1e-4
Weight Initialization: AutoInit [4]

CoAtNet  A smaller variant of the CoAtNet architecture² was used in order to fit the model and data in the available GPU memory. The architecture has three convolutional blocks with 64 channels, four convolutional blocks with 128 channels, six transformer blocks with 256 channels, and three transformer blocks with 512 channels. This architecture is slightly deeper but thinner than the original CoAtNet-0 architecture, which has two convolutional blocks with 96 channels, three convolutional blocks with 192 channels, five transformer blocks with 384 channels, and two transformer blocks with 768 channels [10]. The models are otherwise identical.

Search Implementation  In order to predict performance for an unevaluated activation function, the function outputs and FIM eigenvalues must first be computed. Thus, the searches in Section 6 were implemented in three steps. First, activation function outputs for all 425,896 activation functions in the search space were calculated. This computation is inexpensive and easily parallelizable. Second, eight workers operated in parallel to sample activation functions uniformly at random from the search space and calculate their FIM eigenvalues. Third, once the number of activation functions with calculated FIM eigenvalues reached 5,000, seven of the workers began the search by evaluating the functions with the highest predicted performance. The eighth worker continued calculating FIM eigenvalues for new functions so that their performance could be predicted during the search. This setup made the best use of the available compute for the regression-type search methods.

The experiments on ImageNet required substantially more compute than the experiments on CIFAR-100. For this reason, all eight workers evaluated activation functions once the number of functions with FIM eigenvalues reached 7,000. Computing FIM eigenvalues took approximately 26 seconds, 84 seconds, and 37 seconds per activation function for All-CNN-C, ResNet-56, and MobileViTv2-0.5, respectively. This cost is not trivial, but it is well worth it, as the experiments in the main paper show.

²https://github.com/leondgarse/keras_cv_attention_models/blob/v1.3.0/keras_cv_attention_models/coatnet/coatnet.py#L199
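One step of the regression-based search can be sketched as follows. This is a simplification, not the released implementation: fit a k-nearest-neighbors regressor (k = 3, as described later in this appendix) on the combined 2-D embeddings of the functions evaluated so far, predict validation accuracy for the candidates whose FIM eigenvalues are ready, and evaluate the most promising candidate next. The function and argument names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def propose_next(embedding_2d, accuracies, candidates):
    """Pick the next activation function to train.

    embedding_2d: (n_functions, 2) combined UMAP coordinates (see sketch above).
    accuracies:   dict mapping already-evaluated function index -> validation accuracy.
    candidates:   indices whose FIM eigenvalues are computed but which are untrained.
    """
    evaluated = sorted(accuracies)
    knr = KNeighborsRegressor(n_neighbors=3)
    knr.fit(embedding_2d[evaluated], [accuracies[i] for i in evaluated])

    pool = [i for i in candidates if i not in accuracies]
    preds = knr.predict(embedding_2d[pool])
    return pool[int(np.argmax(preds))]   # evaluate the most promising function next
```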
Table 6: Training details and hyperparameter values used in the Imagenette and ImageNet experiments.

MobileViTv2-0.5 on Imagenette and ImageNet
Batch Size: 256
CutMix Alpha [58]: 1.0
Epochs: 105
Evaluation Center Crop: 95%
Image Size: 160 × 160
Learning Rate: Linear warmup from 1e-4 to 4e-3 for five epochs, then cosine decay to 1e-6
Mixup Alpha [59]: 0.1
Optimizer: AdamW [36]
RandAugment [9]: Magnitude six, applied twice
Random Resized Crop [53]: Minimum 8% of the original image
Weight Decay: 0.02 × current learning rate

ResNet-50 on ImageNet
Batch Size: 256
CutMix Alpha [58]: 1.0
Epochs: 105
Evaluation Center Crop: 95%
Image Size: 160 × 160
Learning Rate: Linear warmup from 1e-4 to 2e-3 for five epochs, then cosine decay to 1e-6
Mixup Alpha [59]: 0.1
Optimizer: AdamW [36]
RandAugment [9]: Magnitude six, applied twice
Random Resized Crop [53]: Minimum 8% of the original image
Weight Decay: 0.02 × current learning rate
Weight Initialization: AutoInit [4]

CoAtNet on Imagenette
Batch Size: 256
CutMix Alpha [58]: 1.0
Epochs: 105
Evaluation Center Crop: 95%
Image Size: 160 × 160
Learning Rate: Linear warmup from 1e-4 to 4e-4 for five epochs, then cosine decay to 1.6e-7
Mixup Alpha [59]: 0.1
Optimizer: AdamW [36]
RandAugment [9]: Magnitude six, applied twice
Random Resized Crop [53]: Minimum 8% of the original image
Weight Decay: 0.02 × current learning rate

Figure 12: UMAP projections of FIM eigenvalues for Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT (points colored by validation accuracy), using the default hyperparameter n_neighbors=15. The embedding is informative but also noisy. Using n_neighbors=3, as shown in the main text, improved performance.

Unique Activation Functions  Different computation graphs can result in the same activation function (e.g. max{x, 0} and max{0, x}). In the benchmark dataset and in the larger search space of Section 6, repeated activation functions were filtered out: 1,000 inputs were sampled from N(0, 1) and truncated to [-5, 5], and two activation functions were considered the same if their outputs were identical.

Improving the Combined UMAP Projection  Figure 12 displays a projection of FIM eigenvalues using default UMAP hyperparameters. The plots show the eigenvalues organized in multiple distinct one-dimensional manifolds. Again, FIM eigenvalues are noisy features; there are some clusters of activation functions achieving similar performance, but there are also regions where performance varies widely. As mentioned in the main text, this issue was addressed by reducing the UMAP parameter n_neighbors to 3. This change reduced the connectivity of the low-dimensional FIM eigenvalue representation, resulting in a space with many distinct clusters (as seen in Figure 4). On its own, this setting did not improve the search on the benchmark datasets. However, it did improve performance when the FIM eigenvalues were combined with activation function outputs (as discussed in Section 4). The reason is that the UMAP model for the activation function outputs did not decrease n_neighbors, and so the combined UMAP model relied more on the activation function outputs than on the FIM eigenvalues. As Figure 4 shows, the activation function outputs are reliable but sometimes project good activation functions to distinct regions of the search space. Introducing extra connectivity into the fuzzy topological representation via the FIM eigenvalues was sufficient to address this issue, bringing good activation functions to common regions of the space.

Increasing the Dimension of the UMAP Projections  The UMAP plots show two-dimensional projections of FIM eigenvalues and activation function outputs.
Regression algorithms were also trained on five- and ten-dimensional projections. These runs resulted in comparable or worse performance. Therefore, the two-dimensional projections were selected in the paper for simplicity and for consistency between the algorithm implementation and the figure visualizations.

Gaussian Process Regression  As an alternative search method, Gaussian process regression (GPR) was evaluated in activation function search. Several different acquisition mechanisms were used, including expected improvement, probability of improvement, maximum predicted value, and upper confidence bound. The approach worked well, but the results were inconsistent across the different acquisition mechanisms. GPR was also more expensive to run than the algorithms in the main text (KNR, RFR, SVR), and so those algorithms were used instead for simplicity and efficiency.

Adjusting k in KNR  The initial experiments with the KNR algorithm used k = 3. Experimenting with k ∈ {1, 5, 8} did not reliably improve performance, so k = 3 was kept.

Uniformly Spaced Inputs for Activation Function Outputs  In an alternative implementation, equally spaced inputs from -5 to 5 were given to the activation functions instead of normally distributed inputs. This variation did not noticeably change the quality of the embeddings or the performance of the search algorithms. Therefore, normal inputs were used for consistency with Equation 3. Figure 3 is the only exception; it used 80 inputs equally spaced from -5 to 5 and increased the UMAP parameter min_dist to 0.5. These settings improved the quality of the reconstructed activation functions in the plot.

Evaluated Functions  Figures 13, 14, 15, and 16 show plots and the validation accuracy of every candidate activation function evaluated in the searches for All-CNN-C on CIFAR-100, ResNet-56 on CIFAR-100, MobileViTv2-0.5 on ImageNet, and CoAtNet on Imagenette, respectively.

F Future Work

This paper demonstrated that FIM eigenvalues and activation function outputs are efficient and reliable features that can predict the performance of activation functions accurately. This finding enabled discovering better activation functions for various tasks, improving the state of the art in machine learning. Because the technique is efficient, it was possible to scale it up to large datasets such as ImageNet. These discoveries inspire several avenues for future research, discussed below.
Figure 13: Activation functions evaluated in the search for All-CNN-C on CIFAR-100. [Figure: a grid of subplots, one per candidate function, each plotting the function on the range -5 to 5 together with its validation accuracy; accuracies range from 0.6628 for mul(hard_sigmoid(hard_sigmoid(x)),elu(x)) down to 0.0074.]
Figure 14: Activation functions evaluated in the search for ResNet-56 on CIFAR-100. [Figure: a grid of subplots, one per candidate function, each plotting the function on the range -5 to 5 together with its validation accuracy; accuracies range from 0.7270 for selu(sinh(expm1(arctan(x)))) down to 0.0088.]
Figure 15: Activation functions evaluated in the search for MobileViTv2-0.5 on ImageNet. [Figure: a grid of subplots, one per candidate function, each plotting the function on the range -5 to 5 together with its validation accuracy; accuracies range from 0.6396 for prod_n(sigmoid(x),negative(x),hard_sigmoid(x)) down to 0.0008.]
Figure 16: Activation functions evaluated in the search for CoAtNet on Imagenette. [Figure: a grid of subplots, one per candidate function, each plotting the function on the range -5 to 5 together with its validation accuracy; accuracies range from 0.8907 for square(erfc(softplus(x))) down to 0.0810.]

New Search Spaces  The PANGAEA search space was used in this paper because it is known to work well for deep architectures [5].
In the future, it will be interesting to explore search spaces with different unary, binary, and n-ary operators. Beyond computation graphs, it may also be possible to apply the techniques in this paper to optimize continuous vector representations of activation functions [1, 44].

Exploration vs. Exploitation  The KNR approach was utilized to search for new activation functions because it performed well on the benchmark datasets (Section 5). In the future, it will be interesting to consider other algorithms and analyze their tradeoffs between exploration and exploitation. For example, in a resource-constrained environment where improvement is needed quickly, a more exploitative approach could be used to find an improved activation function in a short time. On the other hand, if substantial compute is available, an approach that focuses on exploration could be used to discover activation functions that perform well but are maximally different from the functions used in modern architectures (Figure 8b). Novelty search [32] could serve as a suitable approach, and such discoveries could further our understanding of how neural networks utilize different kinds of activation functions to learn.

Optimizing Multiple Activation Functions  In a typical neural network design, the same activation function is used throughout the network. However, recent work has shown that it may be beneficial to have different activation functions at different locations, and further, that it may be useful to have different activation functions in the early and late stages of training [5]. Indeed, many hybrid architectures use Swish in convolutional layers and ReLU in attention layers [41]. Unfortunately, it is difficult to design these strategies manually, and so practitioners often use a single activation function for simplicity. The techniques proposed in this paper may provide an avenue toward optimizing multiple activation functions in tandem. For example, the features for multiple candidate activation functions could be concatenated into a single feature vector, and this vector could be projected with UMAP to a low-dimensional space where performance prediction is more straightforward.

Optimizing Parametric Activation Functions  Parametric activation functions have learnable parameters that allow them to refine their shape via gradient descent. In some tasks, this extra flexibility results in better performance than fixed activation functions [5]. The techniques introduced in this paper can be readily extended to optimizing the design of parametric activation functions as well. Because the surrogate considers the state of the network and activation function at initialization, it is possible to predict performance by treating the activation function parameters as fixed to their initial values. However, it may be possible to extend this idea further. Because the activation function parameters are implemented as neural network weights, each parameter will have a corresponding FIM eigenvalue. These extra eigenvalues will provide the surrogate with additional information that may help predict performance more accurately. For simplicity, current parametric activation functions usually initialize their parameters either to 1.0 or to approximate some existing activation function, and the initialization is usually the same throughout the network. This method is likely suboptimal; the surrogate introduced in this paper could provide a smarter approach.
By adjusting the initial parameter values and observing the change in predicted performance, the surrogate can be used to find better initializations, including different ones at different layers in the network. This contribution could make parametric activation functions even more powerful.

Optimizing Other Aspects of Neural Network Design  By fixing the neural network architecture and varying the activation function, this paper showed that it is possible to use FIM eigenvalues to infer future performance. As the FIM is a fundamental quantity in machine learning, it may be possible to apply a similar strategy to optimize other aspects of neural network design, such as normalization layers, loss functions, or data augmentation strategies [8, 17, 18, 35]. If a meaningful distance metric between such objects can be defined, then UMAP could be used to map them to a low-dimensional space where performance prediction is much simpler. Similarly, one could use the FIM eigenvalues to optimize alternate objectives beyond accuracy. Robustness is a particularly interesting objective, because the FIM can be used to describe a neural network's robustness to small parameter perturbations. Other objectives, such as interpretability, fairness, or inference cost, could also be considered. For example, one could consider a multidimensional regression approach where, instead of just predicting accuracy, the surrogate would predict each of these quantities separately. Such a method could present the user with a Pareto front of activation functions involving tradeoffs between these quantities.

Reverse Engineering Activation Functions  UMAP was used to project activation functions to a low-dimensional space, and regression algorithms were used to predict the performance of activation functions in this space, i.e. to serve as a fitness function for the search. However, it is possible that there is no activation function that maps to the optimum of this fitness landscape. Indeed, because such search spaces are finite, the activation functions do not completely fill them. For example, there are empty regions in Figure 4, corresponding to activation functions outside of the predefined search space. What should be done if an empty region of the embedding space has a higher predicted fitness than any of the candidate activation functions? In the paper, these regions were simply ignored, and the activation function with the highest predicted fitness was used. However, in the future, it may be possible to create activation functions that map to these empty spaces, and in so doing improve performance. One approach could be based on inverse transforms: given a coordinate in the low-dimensional embedding space, UMAP can apply an inverse transform and return an object that would have mapped to those coordinates. This technique was already used for visualization in Figure 3. Using this approach, UMAP could generate a hypothetical desired FIM eigenvalue distribution, or a list of activation function outputs. There are two challenges to this approach. First, because UMAP is a dimensionality-reduction algorithm, different activation functions can map to the same location in the embedding space. Thus, the mapping from embedding space back to activation functions is not well defined. Second, even if UMAP prescribes a FIM eigenvalue distribution that is predicted to result in good performance, it may be difficult to manually design an activation function to satisfy that distribution. However, a generated list of prescribed activation function outputs is already a good start. From this list, it is possible to construct an activation function that interpolates through these points, either in a piecewise linear fashion, with splines, or using some other standard technique. Even without the corresponding FIM eigenvalues, such an approach could potentially improve the efficiency of novel activation function discovery, and lead to better designs for activation functions in the future.
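A rough sketch of this idea follows. It is an illustration only: it reuses the hypothetical umap_fn_outputs model and input sample x from the earlier sketches, the target coordinate is made up, and it assumes umap-learn's inverse_transform is available for the fitted embedding (which depends on the UMAP settings used).

```python
import numpy as np

# Pick a promising but empty coordinate in the 2-D output embedding (hypothetical values).
target = np.array([[2.0, -1.0]])

# inverse_transform returns the high-dimensional object that would map (approximately)
# to that coordinate: here, a prescribed vector of activation function outputs.
prescribed = umap_fn_outputs.inverse_transform(target)[0]

# Turn the prescribed outputs into a usable activation function by piecewise-linear
# interpolation through the (input, output) pairs on the shared sampled inputs x.
order = np.argsort(x)
def reverse_engineered_activation(z):
    return np.interp(z, x[order], prescribed[order])
```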
G Compute Infrastructure

The experiments in this paper were implemented using an AWS g5.48xlarge instance with eight NVIDIA A10G GPUs. The total compute cost for the search experiments in Section 6 was 14.49 GPU-hours for All-CNN-C on CIFAR-100, 21.67 GPU-hours for ResNet-56 on CIFAR-100, and 196.25 GPU-days for MobileViTv2-0.5 on ImageNet. This cost includes the time to train the eight baseline activation functions and then to evaluate 100 additional functions. The instance ran in Oregon (us-west-2) and was powered by renewable energy, so the experiments for this paper contributed no carbon emissions.