# Hidden Symmetries of ReLU Networks

J. Elisenda Grigsby* ¹, Kathryn Lindsey* ¹, David Rolnick* ² ³

*Equal contribution. ¹Department of Mathematics, Boston College, Boston, USA. ²School of Computer Science, McGill University, Montreal, Canada. ³Mila Quebec AI Institute, Montreal, Canada. Correspondence to: J. Elisenda Grigsby. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

The parameter space for any fixed architecture of feedforward ReLU neural networks serves as a proxy during training for the associated class of functions, but how faithful is this representation? It is known that many different parameter settings $\theta$ can determine the same function $f$. Moreover, the degree of this redundancy is inhomogeneous: for some networks, the only symmetries are permutation of neurons in a layer and positive scaling of parameters at a neuron, while other networks admit additional hidden symmetries. In this work, we prove that, for any network architecture where no layer is narrower than the input, there exist parameter settings with no hidden symmetries. We also describe a number of mechanisms through which hidden symmetries can arise, and empirically approximate the functional dimension of different network architectures at initialization. These experiments indicate that the probability that a network has no hidden symmetries decreases towards 0 as depth increases, while increasing towards 1 as width and input dimension increase.

## 1. Introduction

The success of deep learning relies upon the effectiveness of neural networks in expressing a wide variety of functions. However, it is generally impractical to explicitly write down the function computed by a network, so networks of a given architecture are described and learned via parameter vectors (encompassing weights and biases). The space of parameter vectors serves as a convenient proxy for the space of functions represented by a given network architecture, but it is an imperfect proxy, since it is possible for two different parameter vectors to map to the same function. Indeed, for any fully connected neural network with ReLU activation, it has been observed that the following transformations to the parameters are symmetries, i.e., they do not change the function computed by the network (Rolnick & Kording, 2020; Bui Thi Mai & Lampert, 2020):

- Permutation (P). Reordering the neurons in any hidden layer, along with the corresponding permutation of the weights and biases associated with them.
- Scaling (S). For any neuron in any hidden layer, multiplying the incoming weights and the bias by any $c > 0$, while dividing the outgoing weights by $c$.

Such symmetries can have important implications for gradient-based learning algorithms that operate on parameters. Many authors have considered methods to optimize neural networks accounting for scaling symmetries (see e.g. Neyshabur et al. (2015)). While networks trained on the same data from different initializations are far apart in parameter space, they express similar functions; recent work suggests that such networks may in fact be close in parameter space if one accounts for permutation symmetries (see e.g. Ainsworth et al. (2023)). It remains unknown, however, in what cases permutation and scaling are the only symmetries admitted by the parameters of a neural network, and how often there are other hidden symmetries (formalization in Definition F.4).
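As a quick sanity check of the permutation and scaling symmetries described above, the following NumPy sketch builds a small random ReLU network, applies a hidden-layer permutation and a positive rescaling at one neuron, and verifies that the computed function is unchanged on random inputs. This is only an illustration; the layer sizes and random seed are arbitrary choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, params):
    """Evaluate a fully connected ReLU network; the final layer is affine (no ReLU)."""
    *hidden, (W_out, b_out) = params
    for W, b in hidden:
        x = np.maximum(0.0, x @ W.T + b)
    return x @ W_out.T + b_out

# Small (2, 5, 3, 1) network with random weights and biases.
widths = [2, 5, 3, 1]
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(widths[:-1], widths[1:])]

X = rng.standard_normal((100, widths[0]))
y0 = forward(X, params)

# Permutation symmetry: reorder the neurons of hidden layer 1,
# permuting their incoming rows and the corresponding outgoing columns.
perm = rng.permutation(widths[1])
(W1, b1), (W2, b2) = params[0], params[1]
p_params = [(W1[perm], b1[perm]), (W2[:, perm], b2)] + params[2:]
assert np.allclose(y0, forward(X, p_params))

# Scaling symmetry: scale neuron 0 of hidden layer 1 by c > 0,
# multiplying its incoming weights and bias and dividing its outgoing weights.
c = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= c; b1s[0] *= c; W2s[:, 0] /= c
s_params = [(W1s, b1s), (W2s, b2)] + params[2:]
assert np.allclose(y0, forward(X, s_params))

print("permutation and scaling leave the network function unchanged")
```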
Rolnick & Kording (2020) prove that under certain conditions, no hidden symmetries exist, and indeed that under these conditions it is possible to reverse-engineer a network's parameters up to permutation and scaling. Bui Thi Mai & Lampert (2020) prove that for all architectures with non-increasing widths, there exist parameter settings with no hidden symmetries. Work by Grigsby et al. (2022b) on the functional dimension of networks suggests that a wide variety of hidden symmetries may exist, depending on the parameter setting.

Our key results in this paper are as follows:

- We prove (Theorem 4.1) that if all layers in a fully connected ReLU network are at least as wide as the input layer, then there exists some setting of the parameters such that the network has no hidden symmetries. Indeed, we show that a positive-measure subset of parameter space admits no hidden symmetries.
- We describe four mechanisms through which hidden symmetries can arise (Section 5). In particular, we prove (Proposition 5.1) that if the image of the domain in a hidden layer is contained in a subspace of positive codimension, then there is ambiguity in the neuron of the next layer map.
- We experimentally estimate the functional dimension of randomly initialized network parameter settings. Our results suggest that the probability that a network has no hidden symmetries decreases with depth, but increases as input dimension and width increase together.

## 2. Related Work

Several important lines of work have considered the symmetries of the parametric representations of deep ReLU networks and their implications for learning.

One focus area has been designing optimization methods for neural networks that are invariant to scaling symmetries at individual neurons. Approaches for achieving this goal have included path normalization (Neyshabur et al., 2015), manifold optimization (Badrinarayanan et al., 2015), proceeding in a different vector space (Meng et al., 2019), and projection onto a normalized manifold (Huang et al., 2020).

Another fruitful direction of work has been understanding how permutation symmetries in parameter space affect connectivity of the loss landscape. Kuditipudi et al. (2019) consider when different parameter permutations of a trained network are connected via piecewise linear paths in parameter space with low loss. Brea et al. (2019) show that linearly interpolating between different permutations of a network leads to flat regions of the loss landscape. Several recent works have shown that, if permutation symmetries are taken into account, then it is possible to interpolate between networks trained from different initializations while maintaining a low loss barrier (Entezari et al., 2022; Ainsworth et al., 2023; Jordan et al., 2023).

In Armenta & Jodoin (2021), the authors define and study the moduli space of neural network functions using quiver representation theory. This theory provides a framework for extracting global symmetries of the parameter space of a network architecture from symmetries of the computational graph and the activation functions involved (see also Ganev & Walters (2022)). Armenta et al. (2023) and Zhao et al. (2022) build on these ideas to define neural teleportation algorithms aimed at using symmetries in the loss landscape to improve the efficiency of gradient descent in finding a minimal-loss solution.
In Kunin et al. (2021), the authors argue that symmetries in the loss landscape have associated conserved quantities that impact training dynamics.

A number of works have explicitly considered which symmetries are admitted by different ReLU networks. Several authors consider the relationship between the parameters of a ReLU network and the geometry of its bent hyperplane arrangement (aka fold set); Milli et al. (2019), Rolnick & Kording (2020), and Carlini et al. (2020) use these properties to reverse-engineer the parameters of certain networks up to permutation and scaling symmetries, and Bui Thi Mai & Lampert (2020) prove that for certain architectures there exist parameter settings without hidden symmetries. In particular, in Rolnick & Kording (2020), the authors provide a geometric condition on the bent hyperplane arrangement of a parameter ensuring that the parameters can be reverse-engineered up to permutation and scaling, hence that the parameter has no hidden symmetries. It follows nearly immediately that all depth-2 networks and a positive-measure subset of parameters of any depth-3 architecture have no hidden symmetries. In Bui Thi Mai & Lampert (2020), the authors prove that a positive-measure subset of parameters in every non-widening architecture ($n_0 \geq n_1 \geq \cdots \geq n_d$) has no hidden symmetries. In the present work, we prove the complementary result that a positive-measure subset of parameters in every architecture whose hidden layers are at least as wide as the input layer (that is, $n_0 \leq n_\ell$ for all $\ell < d$) has no hidden symmetries. An example of a family of architectures for which the question of the existence of parameters without hidden symmetries remains unresolved after the present work is an architecture of the form $(n_0, n_1, n_2, n_3, n_4)$ with $n_0 < n_1$ and $n_2 < n_0$.

Grigsby et al. (2022b) study the functional dimension of a network parameter setting, i.e., the dimension of the space of functions that can be achieved by infinitesimally perturbing the parameters, and prove an upper bound on functional dimension that we conjecture is achieved for almost all parameter settings without hidden symmetries (cf. Lemma F.6).

## 3. Notation and Background

We consider fully connected neural networks with ReLU activation, denoting by $(n_0, \ldots, n_d)$ the architecture with input dimension $n_0$, hidden layer widths $n_1, n_2, \ldots, n_{d-1}$, and output dimension $n_d$. Formally, let $\sigma : \mathbb{R}^n \to \mathbb{R}^n$ denote the function that applies the activation function $\mathrm{ReLU}(x) := \max\{0, x\}$ component-wise. For an architecture $(n_0, \ldots, n_d)$, we define a parameter space $\Omega := \mathbb{R}^D$, where a parameter $\theta := (W^1, b^1, \ldots, W^d) \in \Omega$ consists of weight matrices $W^i \in \mathbb{R}^{n_i \times n_{i-1}}$ for $i = 1, \ldots, d$ and bias vectors $b^i \in \mathbb{R}^{n_i}$ for $i = 1, \ldots, d-1$. Accordingly, $D := n_d n_{d-1} + \sum_{i=1}^{d-1} n_i (n_{i-1} + 1)$.

From a parameter $\theta$ we define a neural network function

$$F_\theta := F^d \circ \cdots \circ F^1 : \mathbb{R}^{n_0} \to \mathbb{R}^{n_d}, \tag{1}$$

with layer maps given by

$$F^i(x) := \begin{cases} \sigma(W^i x + b^i) & \text{for } 1 \leq i < d, \\ W^i x & \text{for } i = d. \end{cases} \tag{2}$$

Note that for any $\theta \in \Omega$, $F_\theta$ is a finite piecewise-linear function, that is, a continuous function for which the domain may be decomposed as the union of finitely many closed, convex pieces, on each of which the function is affine.

Following notation in Definition 4 of Masden (2022), let $F_{(\ell)} := F^\ell \circ \cdots \circ F^1$ denote the composition of layer maps from the domain, ending with the $\ell$th layer map, and let $F^{(\ell)} := F^d \circ \cdots \circ F^\ell$ denote the composition of the layer maps ending at the codomain, beginning with the $\ell$th layer map. In particular, $F_\theta = F^{(\ell+1)} \circ F_{(\ell)}$. We refer to the components of $F_{(\ell)}$ as the neurons in the $\ell$th layer.
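To make the notation concrete, here is a minimal NumPy sketch of the layer maps $F^i$, the partial compositions $F_{(\ell)}$, and the full network map $F_\theta$. It is a sketch under the conventions above (ReLU on all layers except the last, no bias in the final layer); the architecture and parameter values at the bottom are arbitrary illustrations, not anything from the paper.

```python
import numpy as np

def layer_map(i, d, W, b, x):
    """F^i: ReLU-affine for 1 <= i < d, purely linear for the output layer i = d."""
    z = W @ x + (b if b is not None else 0.0)
    return z if i == d else np.maximum(0.0, z)

def F_lower(theta, ell, x):
    """F_(ell) = F^ell o ... o F^1, the composition of the first ell layer maps."""
    d = len(theta)
    for i, (W, b) in enumerate(theta[:ell], start=1):
        x = layer_map(i, d, W, b, x)
    return x

def F_theta(theta, x):
    """The full network map F_theta = F^d o ... o F^1."""
    return F_lower(theta, len(theta), x)

# Example: architecture (2, 4, 3, 1); theta = (W^1, b^1, W^2, b^2, W^3) with no final bias.
rng = np.random.default_rng(1)
widths = [2, 4, 3, 1]
theta = [(rng.standard_normal((m, n)), rng.standard_normal(m))
         for n, m in zip(widths[:-2], widths[1:-1])]
theta.append((rng.standard_normal((widths[-1], widths[-2])), None))

x = rng.standard_normal(widths[0])
print(F_lower(theta, 2, x))   # image of x in layer 2, i.e., F_(2)(x)
print(F_theta(theta, x))      # network output F_theta(x)
```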
The pre-activation map $z_{(\ell),i} : \mathbb{R}^{n_0} \to \mathbb{R}$ associated to the $i$th neuron in the $\ell$th layer is given by

$$z_{(\ell),i}(x) = \pi_i\!\left( W^\ell F_{(\ell-1)}(x) + b^\ell \right), \tag{3}$$

where $\pi_i : \mathbb{R}^{n_\ell} \to \mathbb{R}$ denotes the projection onto the $i$th component. Following Hanin & Rolnick (2019), we refer to the zero set of the pre-activation map for the $i$th neuron in the $\ell$th layer as its associated bent hyperplane, $\hat{H}^\ell_i := z_{(\ell),i}^{-1}\{0\}$.

The following notions from Rolnick & Kording (2020) (see also Hanin & Rolnick (2019), Grigsby & Lindsey (2022), and Masden (2022)) will play a crucial role in the proofs of our results. A (ternary) activation pattern (aka neural code or sign sequence) for a network architecture $(n_0, \ldots, n_d)$ is an $N$-tuple $s \in \{-1, 0, +1\}^N$ of signs, for $N = \sum_{i=1}^d n_i$ (Def. A.1). Each activation pattern determines a (frequently empty) subset of $\mathbb{R}^{n_0}$ called its associated activation region. Informally, an activation region is the collection of points $x$ for which the pre-activation sign of each non-input neuron matches the sign of the corresponding component of $s$ (Def. A.3, or Def. 1 of Hanin & Rolnick (2019) and Def. 13 of Masden (2022)).

A linear region of a finite piecewise-linear function is a maximal connected set on which the function is affine-linear. A ReLU network map $F_\theta : \mathbb{R}^{n_0} \to \mathbb{R}^{n_d}$ for an architecture $(n_0, \ldots, n_d)$ is said to satisfy the Linear Regions Assumption (LRA) (Def. D.2) if each linear region is the closure of a single non-empty activation region corresponding to an activation pattern with all nonzero entries.

Grigsby & Lindsey (2022) proved that for almost all parameters in any fixed architecture $(n_0, \ldots, n_d)$, the bent hyperplane (zero set of the pre-activation output) associated to a non-input neuron has codimension 1 (i.e., dimension $n_0 - 1$) in the domain (note that it may also be empty). Moreover, it is proved in Masden (2022) that for almost all parameters in any fixed architecture $(n_0, \ldots, n_d)$, the intersection of $k$ bent hyperplanes has codimension $k$ in the domain. Following Masden (2022), we shall call a network whose bent hyperplanes satisfy this enhanced condition supertransversal. (The formal definition of supertransversality, given in Definition B.7, is stronger than what is stated here, but implies it.)

It is proved in Theorem 2 of Rolnick & Kording (2020) that if a supertransversal ReLU network map $F_\theta$ satisfies LRA (in fact, Rolnick & Kording (2020) do not need the LRA to be satisfied on the entire domain, only on a relevant subset for their algorithm; see Section D in the Appendix), and each pair of bent hyperplanes associated to each pair of neurons in each pair of adjacent layers has non-empty intersection, then the network admits no hidden symmetries. Indeed, the authors detail an algorithm allowing the parameters to be extracted from the local geometry of the intersections, up to permutation and scaling.

Accordingly, we will say that a network map $F_\theta$ associated to a parameter $\theta \in \Omega$ satisfies the transverse pairwise-intersection condition, abbreviated TPIC, if its associated bent hyperplane arrangement is supertransversal and each pair of bent hyperplanes associated to each pair of neurons in each pair of adjacent layers has non-empty intersection. See Figure 1.

## 4. Main Result

**Theorem 4.1.** Let $(n_0, \ldots, n_d)$ be a feedforward ReLU network architecture satisfying $(n_0 = k) \leq n_\ell$ for all $\ell$, and let $\Omega$ denote its parameter space. A positive-measure subset of $\Omega$ has no hidden symmetries.

*Proof Sketch:* A more explicit version of Theorem 2 of Rolnick & Kording (2020), stated in Lemma D.15, tells us that any network map $F_\theta$ satisfying TPIC and LRA on a neighborhood of the intersections admits no hidden symmetries. By Proposition D.13, TPIC is an open condition. That is, any parameter satisfying TPIC has an open neighborhood of parameters also satisfying TPIC.
Lemma D.14 furthermore tells us that any parameter satisfying LRA in a neighborhood of the intersections has an open neighborhood on which LRA is satisfied in a neighborhood of the intersections.

We proceed by induction on the depth, $d$. When $d = 1$, there is nothing to prove, so our true base case is $d = 2$. In this case, we need only show that we can choose a combinatorially stable parameter $\theta \in \Omega$ that satisfies (i) supertransversality, (ii) each bent hyperplane from $\ell = 2$ has non-empty intersection with each hyperplane from $\ell = 1$, and (iii) LRA on a neighborhood of the pairwise intersections. Since the set of parameters $\theta$ satisfying (i) and (iii) has full measure in $\Omega$, it is routine though technical to guarantee that these conditions are satisfied once we find a single parameter satisfying (ii). Details are given in the appendix.

To arrange that each bent hyperplane from $\ell = 2$ has non-empty intersection with each hyperplane from $\ell = 1$, we make use of so-called positive-axis hyperplanes. Generically, an affine hyperplane intersects each coordinate axis of $\mathbb{R}^n$ on either the positive or negative side. A positive-axis hyperplane intersects all coordinate axes on the positive side. Equivalently, a positive-axis hyperplane is describable as the zero set of an affine-linear equation with positive weights and negative bias: $H := \{x \in \mathbb{R}^n \mid w \cdot x + b = 0\}$ for $w = (w_1, \ldots, w_n)$ satisfying $w_i > 0$ for all $i$, and $b < 0$. The important property of a positive-axis hyperplane is that it has non-empty intersection with every origin-based ray contained in the non-negative orthant, $O_{\geq 0} \subseteq \mathbb{R}^n$.

We now use the fact (Lemma C.4) that for almost all parameters $\theta$, the image under $F^1$ of almost every unbounded ray in $\mathbb{R}^{n_0}$ is a ray in $O_{\geq 0} \subseteq \mathbb{R}^{n_1}$ (not necessarily based at the origin). Since every affine hyperplane in $\mathbb{R}^{n_0}$ contains an $(n_0 - 2)$-dimensional sphere of unbounded rays, we can ensure, by perturbing the parameters if necessary, that the image of every hyperplane $H^1 \subseteq \mathbb{R}^{n_0}$ associated to $F^1$ has non-empty intersection with any given positive-axis hyperplane $H^2 \subseteq \mathbb{R}^{n_1}$ with sufficiently high bias. We then apply the following lemma, with $G = F^1$ and $H = F^2$, to each pair $S = F^1(H^1)$, for $H^1 \subseteq \mathbb{R}^{n_0}$ a hyperplane associated to the first layer map $F^1$, and $S' = H^2 \subseteq \mathbb{R}^{n_1}$ a sufficiently-high-bias positive-axis hyperplane associated to the second layer map $F^2$, to ensure that the bent hyperplanes in the domain, $\mathbb{R}^{n_0}$, have non-empty pairwise intersection, as desired.

**Lemma 4.2.** Let $G : A \to B$ and $H : B \to C$ be functions, let $F = H \circ G$ be their composition, and let $S, S' \subseteq B$ be subsets of the intermediate domain. Then $G^{-1}(S) \cap G^{-1}(S')$ is non-empty iff $G(A) \cap S \cap S'$ is non-empty.

*Proof.* Immediate from the fact that $G^{-1}(S) \cap G^{-1}(S') = G^{-1}\!\left( G(A) \cap S \cap S' \right)$.
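The following NumPy sketch illustrates the role of sufficiently-high-bias positive-axis hyperplanes in the construction above: it pushes a ray through a random first-layer map $F^1$ (whose image lies in the non-negative orthant) and checks whether the resulting curve crosses a positive-axis hyperplane once the negative bias is large enough in magnitude. The network sizes, the particular ray, and the bias schedule are arbitrary illustrative choices, and the check is numerical rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n1 = 3, 5
W1, b1 = rng.standard_normal((n1, n0)), rng.standard_normal(n1)

def F1(x):
    # First layer map; its image always lies in the non-negative orthant of R^{n1}.
    return np.maximum(0.0, W1 @ x + b1)

# An unbounded ray {x0 + t v : t >= 0} in the input space R^{n0}.
x0, v = rng.standard_normal(n0), rng.standard_normal(n0)
ts = np.linspace(0.0, 50.0, 2000)
image = np.array([F1(x0 + t * v) for t in ts])   # sampled points of F1(ray)

# A positive-axis hyperplane H = {y : w.y + b = 0} with w > 0 and b < 0.
# For a generic ray, the image is eventually an unbounded ray in O_{>=0},
# so for sufficiently negative b the sign of w.y + b changes along it.
w = rng.uniform(0.5, 1.5, size=n1)
for b in [-1.0, -10.0, -100.0]:
    signs = np.sign(image @ w + b)
    crosses = (signs.min() < 0) and (signs.max() > 0)
    print(f"b = {b:>7}: image of ray crosses the hyperplane? {crosses}")
```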
The inductive step in the construction of a combinatorially stable depth-$d$ network from a combinatorially stable depth-$(d-1)$ network satisfying TPIC is more intricate, but the key idea is to notice that adding a layer to the network preserves all previous bent hyperplanes and their intersections, so all that is needed is to choose parameters for the final layer ensuring that the new bent hyperplanes associated to the final layer have non-empty pairwise intersection with the bent hyperplanes from the penultimate layer. This is where we will need to use one more key fact about the images of activation regions under ReLU neural network maps, which, when combined with Lemma 4.2 above, will allow us to conclude that certain points of intersection between the images of hyperplanes in the penultimate layer and hyperplanes associated to the final layer can be pulled back to obtain points of intersection between the corresponding bent hyperplanes in the domain:

**Proposition 4.3.** Let $(n_0, \ldots, n_d)$ be a ReLU neural network architecture with $(n_0 = k) \leq n_\ell$ for all $\ell$. For almost all parameters $\theta \in \Omega$, if $C \subseteq \mathbb{R}^k$ is an activation region of $F_\theta$ with activation pattern $s_C = (s^1_C, \ldots, s^d_C)$, then $F_\theta(C)$ is a polyhedral set of dimension $\min\{\dim(s^1_C), \ldots, \dim(s^d_C), k\}$.

Here, $\dim(s^\ell_C)$ refers to the number of $+1$'s in the binary tuple that forms the activation pattern $s^\ell_C$ associated to the output neurons of the $\ell$th layer map, $F^\ell$, on the activation region $C$ (cf. Section A in the Appendix).

The above proposition has the following useful corollary, which allows us to conclude that any points of intersection between (bent) hyperplanes in the penultimate layer can be pulled back faithfully to points of intersection between the corresponding bent hyperplanes in the domain:

**Corollary 4.4.** Let $(n_0, \ldots, n_d)$ be a ReLU network architecture and $\theta \in \Omega$ as above. If $C$ is an activation region of $F_\theta$ with activation pattern satisfying $\dim(s^\ell_C) = k$ for all $\ell$, then $F_\theta$ restricted to the interior of $C$ is a homeomorphism onto its image.

The linear regions assumption is satisfied for this construction because sufficiently many neurons from previous layers are active, which implies that we can distinguish activation regions (cf. Lemma 8 of Hanin & Rolnick (2019)). The stability of this construction on a closed subset of the domain containing the pairwise intersections (Proposition D.12) follows from genericity and supertransversality, which allows us to find a positive-measure open neighborhood of $\theta$ that also admits no hidden symmetries.

See Figure 1 for an illustration of what a bent hyperplane arrangement produced by our construction looks like for architecture $(n_0, \ldots, n_3) = (2, 5, 3, 3)$.

Figure 1. An illustration of a bent hyperplane arrangement in the domain $\mathbb{R}^{n_0}$ satisfying TPIC, produced by our construction for a ReLU network of architecture (2, 5, 3, 3). The bent hyperplanes for layers 2 (resp., 3) are represented with dark (resp., light) dashed lines.

## 5. Mechanisms by Which Hidden Symmetries Arise

For any feedforward ReLU architecture $(n_0, \ldots, n_d)$ with at least one hidden layer (i.e., $d \geq 2$), the set of parameters admitting hidden symmetries has positive measure (Grigsby et al. (2022b)). There are numerous mechanisms by which hidden symmetries can arise. We give a partial list below; a small numerical check of the first two mechanisms appears at the end of this section.

i. A stably unactivated neuron (Def. A.2). The positive half-space in $\mathbb{R}^{n_\ell}$ associated to a neuron of the layer map $F^\ell$, for $1 < \ell < d$, could have empty intersection with $F_{(\ell-1)}(\mathbb{R}^{n_0})$, the image of $\mathbb{R}^{n_0}$ under the earlier layer maps. If this intersection remains empty under small perturbations of the parameters defining this neuron (while keeping the other neurons fixed), such perturbations will not alter the function. In particular, the image of $\mathbb{R}^{n_0}$ in any hidden layer $\mathbb{R}^{n_\ell}$ is contained in the closed positive orthant, so a neuron whose associated half-space has (stably) empty intersection with the positive orthant results in a hidden symmetry. See Theorem 7.3 and Lemma 7.4 of Grigsby et al. (2022b) for a probabilistic lower bound on this phenomenon, and Figure 2 for an illustration.

ii. A pair of neurons in consecutive layers that are never co-active. As noted in Rolnick & Kording (2020), it is possible for two neurons in adjacent layers of a network to never be simultaneously active (cf. Definition A.5). In such a case, the weight between these neurons can generically be perturbed without changing the function computed by the network. This is because either (a) the upstream neuron is inactive, in which case the downstream neuron receives zero input from it regardless of the weight between them, or (b) the downstream neuron is inactive, in which case it outputs zero regardless of the input received from the upstream neuron. See Figure 3.

iii. A ReLU of a later layer may collapse complexity constructed by earlier layers, negating the impact of the parameters from those layers. One way this could happen is if multiple linear regions of $F^{i-1} \circ \cdots \circ F^1(\mathbb{R}^{n_0})$, $1 < i < d$, are collapsed by one or more ReLUs in layer $F^i$ to form a single linear region (violating LRA). This collapsing could erase the effect of parameters from these earlier layers. See Figure 4.

iv. The (relevant part of the) image in a hidden layer may be contained in a subspace of positive codimension. Suppose, for $1 < i < d$, the image of the domain in the $i$th hidden layer, $F^{i-1} \circ \cdots \circ F^1(\mathbb{R}^{n_0})$, is contained in an affine-linear subspace $A \subseteq \mathbb{R}^{n_i}$ of positive codimension. Then for any neuron $N$ of the $(i+1)$th layer, denoting by $H$ the associated oriented and co-normed hyperplane in $\mathbb{R}^{n_i}$, there is a 1-parameter family of oriented, co-normed hyperplanes $\{H_t\}$ obtained by rotating $H$ around its intersection with $A$ that all give rise to the same map (Lemma F.7). See Figure 5.

Proposition 5.1 says that, given a neuron map of the $(k+1)$th layer, if the part of $\mathrm{Im}(k) := F^k \circ \cdots \circ F^1(\mathbb{R}^{n_0})$ where that neuron is nonnegative is contained in an affine-linear subspace of $\mathbb{R}^{n_k}$ of positive codimension, then the hyperplane associated to that neuron is not uniquely determined.

**Proposition 5.1.** Let $\eta : \mathbb{R}^{n_k} \to \mathbb{R}$ be a neuron given by $\eta = \sigma \circ A$ for some fixed affine-linear map $A : x \mapsto x \cdot n_H + b_H$. Suppose (a) there exists an affine-linear hyperplane $S \subseteq \mathbb{R}^{n_k}$ such that $\mathrm{Im}(k) \cap \{x \mid \eta(x) \geq 0\} \subseteq S$, (b) the hyperplanes $S$ and $H$ are in general position, and (c) no unbounded ray contained in $\mathrm{Im}(k)$ has a slope that is orthogonal to $n_H$. Then there is a 1-parameter family of (distinct) hyperplanes $\{H_t\}_t$ in $\mathbb{R}^{n_k}$ such that $\eta \circ F^k \circ \cdots \circ F^1 = f_{H_t} \circ F^k \circ \cdots \circ F^1$, where $f_{H_t} : \mathbb{R}^{n_k} \to \mathbb{R}$ is a neuron associated to the hyperplane $H_t$ (equipped with some suitable choice of oriented co-norm).

The proof of Proposition 5.1 is in F.1.

Figure 2. Illustration of mechanism (i) through which hidden symmetries arise. In the figure we see $\mathbb{R}^{n_{\ell-1}}$, the domain of the hidden layer map $F^\ell$. The image (represented in shades of orange) of $\mathbb{R}^{n_0}$ under the composition $F_{(\ell-1)}$ of all previous layer maps is always contained in the non-negative orthant. Since the positive half-space of $H^\ell_1$ misses the non-negative orthant, any sufficiently small perturbation of $H^\ell_1$ will have no impact on the overall function.

Figure 3. Illustration of mechanism (ii) through which hidden symmetries arise. In the figure we see two non-intersecting bent hyperplanes from adjacent layers, pulled back to $\mathbb{R}^{n_0}$, the input layer. Because their intersection is empty, they admit a co-orientation (pictured) for which they are never co-active. It follows that the overall function has no dependence on the weight $W^\ell_{ij}$ between them.

Figure 4. Illustration of mechanism (iii) through which hidden symmetries arise. The orange surface is the image of the domain inside $\mathbb{R}^{n_{\ell-1}}$. The blue hyperplane $H^\ell_1$, which is oriented upwards, is associated to a neuron of $F^\ell$; everything below the blue plane is zeroed out by the ReLU of this neuron. This neuron is not stably unactivated, because the orange surface has nonempty intersection with the positive half-space above the blue plane. However, perturbing parameters that are only expressed in the part of the orange surface below the blue plane does not change the output of the neuron.

Figure 5. Illustration of mechanism (iv) through which hidden symmetries arise. If the image of $\mathbb{R}^{n_0}$ in a hidden layer $\mathbb{R}^{n_\ell}$ is contained in the affine-linear subspace $S$, replacing the neuron $N$ defined by hyperplane $H$ with the neuron $N_t$ defined by hyperplane $H_t$ (with a suitably scaled co-norm) yields the same function.
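As a rough empirical companion to mechanisms (i) and (ii), the sketch below samples a random network, flags hidden neurons that appear never to activate on a large sample of inputs, and lists adjacent pairs of neurons that are never co-active on that sample; decreasing the weight between such a pair then leaves the sampled outputs unchanged. This is only a Monte Carlo heuristic on a bounded sample (it cannot certify stability over all of $\mathbb{R}^{n_0}$ or over all perturbations), and the architecture, seed, and sample size are arbitrary; both lists may well be empty for a given draw.

```python
import numpy as np

rng = np.random.default_rng(3)
widths = [2, 6, 6, 1]                       # (n0, n1, n2, n3); arbitrary example
Ws = [rng.standard_normal((m, n)) * np.sqrt(2.0 / n)
      for n, m in zip(widths[:-1], widths[1:])]
bs = [rng.standard_normal(m) * 0.01 for m in widths[1:]]

def forward(X, Ws, bs, keep_pre=False):
    pre = []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        Z = X @ W.T + b
        pre.append(Z)
        X = Z if i == len(Ws) - 1 else np.maximum(0.0, Z)
    return (X, pre) if keep_pre else X

X = rng.standard_normal((50_000, widths[0]))
Y, pre = forward(X, Ws, bs, keep_pre=True)

# Mechanism (i): hidden neurons whose pre-activation is negative at every sample point.
for ell, Z in enumerate(pre[:-1], start=1):
    dead = np.where((Z < 0).all(axis=0))[0]
    print(f"layer {ell}: neurons never active on the sample:", dead.tolist() or "none")

# Mechanism (ii): pairs (neuron i of layer 1, neuron j of layer 2) never co-active on the sample.
act1, act2 = pre[0] > 0, pre[1] > 0
never = [(i, j) for i in range(widths[1]) for j in range(widths[2])
         if not np.any(act1[:, i] & act2[:, j])]
print("never co-active pairs (layer 1 -> layer 2):", never or "none found on this sample")

# Decreasing the weight between a never co-active pair cannot re-activate the downstream
# neuron where the upstream one is active, so the sampled outputs are exactly unchanged.
if never:
    i, j = never[0]
    Ws2 = [W.copy() for W in Ws]
    Ws2[1][j, i] -= 10.0
    print("outputs unchanged after perturbation:", np.allclose(Y, forward(X, Ws2, bs)))
```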
## 6. Experiments

We conducted an empirical investigation of hidden symmetries at various parameter settings. It is computationally impractical to directly rule out the existence of hidden symmetries at a parameter using, e.g., the geometry of the bent hyperplane arrangement. Accordingly, in our experiments we rely on a relationship between the symmetries of a parameter and its functional dimension to probe the prevalence of hidden symmetries for a variety of architectures.

Recall that the functional dimension $\dim_{\mathrm{fun}}(\theta)$ of a parameter $\theta$ is, informally, the number of linearly independent ways the function $F_\theta$ can be altered by perturbing the parameter $\theta$. Following the formal Definition F.2, we can approximate $\dim_{\mathrm{fun}}(\theta)$ by evaluating the function $F_\theta$ at a finite subset $Z \subseteq \mathbb{R}^{n_0}$ of points in input space, stacking the outputs into a single vector of dimension $|Z| \cdot n_d$, and calculating the rank of the Jacobian of this vector with respect to $\theta$. The functional dimension $\dim_{\mathrm{fun}}(\theta)$ is the supremum of this rank over all sets $Z$. Intuitively, this is because we are evaluating the number of coordinates of $\theta$ that independently affect the value of the function $F_\theta$ at some point in its domain (recognizing that some weights and biases may not have an effect on $F_\theta$ except on a limited subset of input space).

In our experiments, we consider networks with $n_d = 1$, and to approximate the functional dimension, we evaluate the set of gradient vectors $\{\nabla_\theta F_\theta(z)\}_{z \in Z}$ over a finite subset of $m$ points $Z = \{z_1, \ldots, z_m\} \subseteq \mathbb{R}^{n_0}$ in input space (we use points sampled i.i.d. from the zero-centered unit normal). Then, for $m$ sufficiently large, we have

$$\dim_{\mathrm{fun}}(\theta) \approx \mathrm{rank} \begin{pmatrix} \nabla_\theta F_\theta(z_1) \\ \vdots \\ \nabla_\theta F_\theta(z_m) \end{pmatrix}. \tag{4}$$
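Here is a compact PyTorch sketch of the functional-dimension estimate in (4): it stacks the gradients $\nabla_\theta F_\theta(z_i)$ at i.i.d. standard normal sample points into a matrix and returns its numerical rank. It is a minimal illustration of the procedure described above, not the authors' experimental code; the architecture, the rank tolerance, and the number of sample points are placeholder choices (the experiments described below use a much larger $m$ tied to the theoretical upper bound).

```python
import torch

def approx_functional_dimension(model, n0, m=2000, tol=1e-6):
    """Estimate dim_fun(theta) as the rank of the stacked gradients of F_theta
    at m input points z_i drawn i.i.d. from a standard normal (cf. Eq. (4))."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for _ in range(m):
        z = torch.randn(n0)
        out = model(z).squeeze()                       # scalar output, n_d = 1
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                              # (m, D) matrix of gradients
    s = torch.linalg.svdvals(J)
    return int((s > tol * s.max()).sum())              # numerical rank

# Example: architecture (5, 10, 10, 10, 1), He-style weights, small-variance biases,
# and no bias on the final (linear) layer, mirroring the setup in the text.
torch.manual_seed(0)
widths = [5, 10, 10, 10, 1]
layers = []
for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
    is_last = (i == len(widths) - 2)
    lin = torch.nn.Linear(n_in, n_out, bias=not is_last)
    torch.nn.init.normal_(lin.weight, std=(2.0 / n_in) ** 0.5)
    if not is_last:
        torch.nn.init.normal_(lin.bias, std=0.1)       # variance 0.01
        layers += [lin, torch.nn.ReLU()]
    else:
        layers.append(lin)
model = torch.nn.Sequential(*layers)

print("approximate functional dimension:", approx_functional_dimension(model, widths[0]))
```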
We initialize networks with weights drawn i.i.d. from normal distributions with variance 2/fan-in, according to standard practice for ReLU networks (He et al., 2015; Hanin & Rolnick, 2018), and biases drawn i.i.d. from a normal distribution with very small variance (arbitrarily set to 0.01). To improve the quality of the approximation in (4), we use $m$ sample points for $m$ equal to 100 times the maximum possible value of $\dim_{\mathrm{fun}}(\theta)$, according to the upper bound given in Grigsby et al. (2022b) (in Appendix G, we show that our experimental conclusions are not dependent on the choice of $m$). Note that our approach yields approximations of $\dim_{\mathrm{fun}}(\theta)$ which are also necessarily lower bounds for it; in particular, any computed value that attains the theoretical upper bound on $\dim_{\mathrm{fun}}(\theta)$ is guaranteed to be accurate, and thus is consistent with the parameter admitting no hidden symmetries.

Figure 6. For each of various network architectures, we approximate the distribution of functional dimensions as the parameter $\theta \in \Omega$ varies. Different plots show networks of different depths, while different colors in a plot correspond to varying the input dimension and width $n_0 = n_1 = \cdots = n_{d-1}$. Black dots show the fraction of networks with full functional dimension (no hidden symmetries). We observe that the proportion of networks with full functional dimension increases with width and decreases with depth. Insets in the figure zoom in on certain multimodal distributions, showing that modes are spaced apart by approximately the width of the network.

In Figure 6, we plot the distribution of approximate functional dimensions for networks with depth $d = 4, 5, 6$ and with $n_0 = n_1 = \cdots = n_{d-1}$ equal to 5, 10, 15. For each architecture, we consider 5000 different choices of $\theta \in \Omega$, computing the fraction that lead to networks with a given approximated functional dimension. We find that for depth 4, the widths 5, 10, 15 result in respectively 25%, 48%, 66% of networks having the maximum possible functional dimension (marked with black dots in the figure), while for depth 6, the widths 5, 10, 15 result in respectively 1%, 3%, 5% of networks having the maximum possible functional dimension.

We observe that increasing the depth $d$ (while keeping the width fixed) results in a decreased probability of full functional dimension, and thus of the likely absence of hidden symmetries (cf. Lemma F.6). By contrast, increased width (with $n_0 = n_1 = \cdots = n_{d-1}$, i.e., varying the input dimension and width together, while keeping depth fixed) is associated with an increasing probability of full functional dimension.

We offer the following explanations of possible drivers of these observed phenomena. Increasing the depth increases the variance and higher moments associated with properties such as the activation of individual neurons (Hanin & Rolnick, 2018; Hanin, 2018), thereby increasing the likelihood that functional dimension is decreased via mechanisms (i) or (iii). By contrast, increasing the input dimension and width increases the chance that the bent hyperplanes associated with two neighboring neurons will intersect. (Intuitively, this is because a hyperplane will intersect a bent hyperplane unless the latter curves away in all dimensions, which becomes exponentially unlikely as the dimension increases, in the same way that the probability that a matrix is positive definite decays exponentially with dimension (Dauphin et al., 2014).)
However, further study is required, and it is worth noting that a counteracting factor as the width increases may be that the maximum functional dimension increases, so the support of the distribution also increases and the probability assigned to any individual functional dimension, including the maximum, is less than it might be for a distribution with smaller support.

We also note that the distributions of approximate functional dimensions appear to approach smooth unimodal curves if the probability of full functional dimension is low (as in the depth-6 plots), but are strongly multimodal when there is a high probability of full functional dimension. In the inset panels of the figure, we show zoomed-in versions of the upper ends of certain distributions, detailing the multiple peaks in the distribution. We note that for each such multimodal distribution, the peaks appear to be spaced by a value equal to the width of the network. Of the mechanisms we consider for hidden symmetries, mechanism (i) (a stably unactivated neuron) should reduce functional dimension by 2 × width (the number of incoming and outgoing weights of the neuron), while (ii) (two neurons that are never co-active) reduces functional dimension by one (the weight between the neurons). Thus, neither of these mechanisms should apply in this case, and mechanisms (iii) or (iv) may apply; the phenomenon bears further investigation.

In Figure 7, we show the results of a similar set of experiments, where the input dimension $n_0$ is instead kept fixed at 5 as the widths $n_1 = n_2 = \cdots = n_{d-1}$ of the hidden layers vary. While we again observe in this case that the probability of full functional dimension decreases with depth, this probability also slightly decreases with width, unlike in the previous scenario (Figure 6). As in Figure 6, we again note that when the distributions are multimodal, the gaps between the modes are spaced according to the width $n_1 = \cdots = n_{d-1}$.

## 7. Conclusions and Further Questions

We have performed both a theoretical and an empirical investigation of the following question: How faithfully does the parameter space of a feedforward ReLU network architecture model its associated function class?

Our investigation centers on a relationship between well-established symmetries of parameter space (operations on parameters that leave the resulting function unchanged) and the functional dimension of a parameter (informally, the true dimension of the local search space for any gradient-based optimization algorithm). It was established in Grigsby et al. (2022b) that the functional dimension is inhomogeneous across parameter space, but the prevalence of this inhomogeneity and specifics about its dependence on architecture were previously unknown.

In the theoretical component of this work, we significantly expand the collection of architectures containing parameters with no hidden symmetries beyond the restricted classes considered in Bui Thi Mai & Lampert (2020) and Rolnick & Kording (2020). We also provide a partial list of geometric mechanisms that give rise to positive-dimensional spaces of hidden symmetries.

Our empirical investigation strongly suggests that the probability distribution of the functional dimension at initialization is both interesting and architecture-dependent. In particular, under standard assumptions on the probability distribution on the parameters, the expected value of the functional dimension appears to scale positively with width and negatively with depth.
It also appears to be multimodal when the ratio of the width to the depth is high, with modes separated by integer multiples of the width. Further investigation of these effects may help us understand which mechanisms dominate in producing hidden symmetries at various depth vs. width scales.

In future work, we hope to investigate how functional dimension evolves during training, since parameters with lower functional dimension are associated to lower-complexity functions that are more likely to generalize well to unseen data. Since lower functional dimension corresponds to higher-dimensional spaces of local symmetries, low functional dimension should induce local flatness of the loss landscape. Comparing this conjecture to recent work suggesting that stochastic gradient descent favors flat minima of the loss landscape (critical points for which the Hessian of the loss has many eigenvalues close to 0), this could at least partially explain any implicit regularization behavior of SGD for feedforward ReLU neural network architectures.

Figure 7. This figure shows similar experiments as in Figure 6, but with input dimension fixed at 5 instead of equal to the width of the hidden layers. Here, we observe that the proportion of networks with full functional dimension decreases with depth as in Figure 6, while slightly decreasing with width.

## Acknowledgments

J.E.G. acknowledges support from Simons Collaboration grant 635578 and NSF grant DMS-2133822. K.L. acknowledges support from NSF grants DMS-2133822 and DMS-1901247. D.R. acknowledges support from the Canada CIFAR AI Chairs Program and an NSERC Discovery Grant.

## References

Ainsworth, S. K., Hayase, J., and Srinivasa, S. Git Re-Basin: Merging models modulo permutation symmetries. In International Conference on Learning Representations (ICLR), 2023.

Armenta, M., Judge, T., Painchaud, N., Skandarani, Y., Lemaire, C., Gibeau Sanchez, G., Spino, P., and Jodoin, P.-M. Neural teleportation. Mathematics, 11(2):480, 2023.

Armenta, M. A. and Jodoin, P. The representation theory of neural networks. Mathematics, 9(24):3216, 2021.

Badrinarayanan, V., Mishra, B., and Cipolla, R. Symmetry-invariant optimization in deep networks. Preprint arXiv:1511.01754, 2015.

Brea, J., Simsek, B., Illing, B., and Gerstner, W. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. Preprint arXiv:1907.02911, 2019.

Bui Thi Mai, P. and Lampert, C. Functional vs. parametric equivalence of ReLU networks. In International Conference on Learning Representations (ICLR), 2020.

Carlini, N., Jagielski, M., and Mironov, I. Cryptanalytic extraction of neural network models. In Advances in Cryptology (CRYPTO), 2020.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations (ICLR), 2022.

Ganev, I. and Walters, R. Quiver neural networks. Preprint arXiv:2207.12773, 2022.

Grigsby, J. E. and Lindsey, K. On transversality of bent hyperplane arrangements and the topological expressiveness of ReLU neural networks. SIAM Journal on Applied Algebra and Geometry, 6(2):216-242, 2022.
Grigsby, J. E., Lindsey, K., and Masden, M. Local and global topological complexity measures of ReLU neural network functions. Preprint arXiv:2204.06062, 2022a.

Grigsby, J. E., Lindsey, K., Meyerhoff, R., and Wu, C. Functional dimension of feedforward ReLU neural networks. Preprint arXiv:2209.04036, 2022b.

Guillemin, V. and Pollack, A. Differential Topology. AMS Chelsea Publishing, 2010. ISBN 9781470411350.

Hanin, B. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Hanin, B. and Rolnick, D. Deep ReLU networks have surprisingly few activation patterns. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), 2015.

Huang, L., Liu, X., Qin, J., Zhu, F., Liu, L., and Shao, L. Projection based weight normalization: Efficient method for optimization on oblique manifold in DNNs. Pattern Recognition, 105:107317, 2020.

Jordan, K., Sedghi, H., Saukh, O., Entezari, R., and Neyshabur, B. REPAIR: Renormalizing permuted activations for interpolation repair. In International Conference on Learning Representations (ICLR), 2023.

Kuditipudi, R., Wang, X., Lee, H., Zhang, Y., Li, Z., Hu, W., Ge, R., and Arora, S. Explaining landscape connectivity of low-cost solutions for multilayer nets. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Kunin, D., Sagastuy-Brena, J., Ganguli, S., Yamins, D. L. K., and Tanaka, H. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations (ICLR), 2021.

Masden, M. Algorithmic determination of the combinatorial structure of the linear regions of ReLU neural networks. Preprint arXiv:2207.07696, 2022.

Meng, Q., Zheng, S., Zhang, H., Chen, W., Ma, Z.-M., and Liu, T.-Y. G-SGD: Optimizing ReLU neural networks in its positively scale-invariant space. In International Conference on Learning Representations (ICLR), 2019.

Milli, S., Schmidt, L., Dragan, A. D., and Hardt, M. Model reconstruction from model explanations. In Conference on Fairness, Accountability, and Transparency (FAccT), 2019.

Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Rolnick, D. and Kording, K. P. Reverse-engineering deep ReLU networks. In International Conference on Machine Learning (ICML), 2020.

Schrijver, A. Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, Ltd., Chichester, 1986.

Stanley, R. P. An introduction to hyperplane arrangements. In Geometric Combinatorics, volume 13 of IAS/Park City Math. Ser., pp. 389-496. Amer. Math. Soc., Providence, RI, 2007.

Zhao, B., Dehmamy, N., Walters, R., and Yu, R. Symmetry teleportation for accelerated optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

## A. Combinatorial/Geometric Background and Notation

Let $O_{\geq 0} := \{(x_1, \ldots, x_n) \in \mathbb{R}^n \mid x_i \geq 0 \ \forall i\}$ denote the non-negative orthant in $\mathbb{R}^n$.
Letting $\mathbb{R}^n_k := \{(x_1, \ldots, x_n) \in \mathbb{R}^n \mid x_{k+1} = \cdots = x_n = 0\}$ denote the initial coordinate $k$-plane, we will denote by $O^k_{\geq 0} := O_{\geq 0} \cap \mathbb{R}^n_k$ the distinguished $k$-face of the non-negative orthant which is obtained by intersecting with the initial coordinate $k$-plane.

Let $\theta \in \Omega$ be a parameter in the parameter space of a feedforward ReLU network architecture $(n_0, \ldots, n_d)$, and let $F_\theta$ be defined as in Equations 1 and 2. Letting $z^\ell_i = \pi_i(W^\ell x + b^\ell)$ denote the $i$th component of the pre-activation output of the $\ell$th layer map $F^\ell : \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$ of $F_\theta$, we denote its zero set by $H^\ell_i := (z^\ell_i)^{-1}\{0\} \subseteq \mathbb{R}^{n_{\ell-1}}$. Note that for almost all parameters, $H^\ell_i$ is an affine hyperplane. Accordingly, we associate to each layer map $F^\ell$ the set $\mathcal{A}^\ell = \{H^\ell_1, \ldots, H^\ell_{n_\ell}\} \subseteq \mathbb{R}^{n_{\ell-1}}$, which for almost all parameters is a hyperplane arrangement.

In the course of the inductive proof of our main theorem, we will need notation for the preimages of the hyperplanes in the previous layer,

$$\tilde{\mathcal{A}}^\ell = \{\tilde{H}^\ell_i\}_{i=1}^{n_\ell} := \left\{ (F^{\ell-1})^{-1}(H^\ell_i) \right\}_{i=1}^{n_\ell} \subseteq \mathbb{R}^{n_{\ell-2}},$$

and in the domain (the reader easily checks that the latter are precisely the bent hyperplanes defined in Equation 3):

$$\hat{\mathcal{A}}^\ell = \{\hat{H}^\ell_i\}_{i=1}^{n_\ell} := \left\{ F_{(\ell-1)}^{-1}(H^\ell_i) \right\}_{i=1}^{n_\ell}.$$

$\mathcal{A}^\ell$ is said to be generic if for all subsets $\{H^\ell_{i_1}, \ldots, H^\ell_{i_p}\} \subseteq \mathcal{A}^\ell$, it is the case that $H^\ell_{i_1} \cap \cdots \cap H^\ell_{i_p}$ is an affine-linear subspace of $\mathbb{R}^{n_{\ell-1}}$ of dimension $n_{\ell-1} - p$, where a negative-dimensional intersection is understood to be empty. A layer map $F^\ell$ is said to be generic if $\mathcal{A}^\ell$ is generic. A parameter $\theta$, or the corresponding network map $F_\theta$, is said to be generic if all of its layer maps are generic. It is well established in the hyperplane arrangement literature (cf. Stanley (2007)) that generic arrangements are full measure. It follows (Grigsby & Lindsey, 2022) that generic network maps are full measure in parameter space.

### A.1. Decompositions of Polyhedral Sets

Recalling that a polyhedral set in $\mathbb{R}^{n_{\ell-1}}$ is an intersection of finitely many closed half-spaces, a hyperplane arrangement in $\mathbb{R}^{n_{\ell-1}}$ induces a polyhedral decomposition of $\mathbb{R}^{n_{\ell-1}}$ into finitely many polyhedral sets. The face structure on these polyhedral sets gives the decomposition the structure of a polyhedral complex. By pulling back these polyhedral complexes to the domain, $\mathbb{R}^{n_0}$, and taking intersections, we inductively obtain the canonical polyhedral complex $\mathcal{C}(F_\theta)$ as follows. For $\ell \in \{1, \ldots, d\}$, denote by $\mathcal{R}_\ell$ the polyhedral complex on $\mathbb{R}^{n_{\ell-1}}$ induced by the hyperplane arrangement associated to the $\ell$th layer map, $F^\ell$. Inductively define polyhedral complexes $\mathcal{C}(F_{(1)}), \ldots, \mathcal{C}(F_{(\ell)})$ on $\mathbb{R}^{n_0}$ as follows: set $\mathcal{C}(F_{(1)}) := \mathcal{R}_1$, and for $\ell = 2, \ldots, d$, set

$$\mathcal{C}(F_{(\ell)}) := \left\{ S \cap F_{(\ell-1)}^{-1}(Y) \;\middle|\; S \in \mathcal{C}(F_{(\ell-1)}),\; Y \in \mathcal{R}_\ell \right\}.$$

Set $\mathcal{C}(F_\theta) := \mathcal{C}(F_{(d)})$. See Grigsby & Lindsey (2022) and Masden (2022) for more details.

It was proved in Grigsby & Lindsey (2022) that on a full-measure subset of parameter space, the $n_0$-cells of the canonical polyhedral complex $\mathcal{C}(F)$ are the closures of the activation regions, and the $(n_0 - 1)$-skeleton of $\mathcal{C}(F)$ is the bent hyperplane arrangement associated to $F_\theta$.

We will also need the following terminology and results (cf. Schrijver (1986)) pertaining to the structure of polyhedral sets. See Grigsby & Lindsey (2022) and Grigsby et al. (2022a) for additional details. A polyhedral set $P \subseteq \mathbb{R}^n$ is said to be pointed if it has a face of dimension 0. The convex hull of a set $S \subseteq \mathbb{R}^n$ is the intersection of all convex subsets of $\mathbb{R}^n$ that contain $S$. A cone in $\mathbb{R}^n$ is a set $C$ such that if $x, y \in C$ and $\lambda, \mu \geq 0$, then $\lambda x + \mu y \in C$. Let $P$ and $Q$ be polyhedral sets embedded in $\mathbb{R}^n$.
The Minkowski sum of $P$ and $Q$ is $P + Q := \{p + q \mid p \in P,\ q \in Q\}$. The characteristic cone of $P$, denoted $\mathrm{Cone}(P)$, is the maximal set $\{y \mid x + y \in P \text{ for all } x \in P\}$ that also has the structure of a cone. Note that a polyhedral set is unbounded iff its characteristic cone is non-empty. Moreover, every polyhedral set $P$ has a decomposition as the Minkowski sum of a bounded polyhedral set (polytope) $P_B$ and $\mathrm{Cone}(P)$. If $P$ is pointed, we may take $P_B$ to be the convex hull of its 0-cells, cf. Theorem 8.5 of Schrijver (1986).

**Definition A.1.** A ternary activation pattern (aka ternary neural code or ternary sign sequence) for a network architecture $(n_0, \ldots, n_d)$ with $N$ neurons is a ternary tuple $s \in \{-1, 0, +1\}^N$. The ternary labeling of a point $x \in \mathbb{R}^{n_0}$ is the sequence of ternary tuples $s_x := (s^1_x, \ldots, s^d_x) \in \{-1, 0, +1\}^{n_1 + \cdots + n_d}$ indicating the sign of the pre-activation output of each neuron of $F_\theta$ at $x$. Explicitly, letting $F_\theta$ be defined as in Equations 1 and 2, $x \in \mathbb{R}^{n_0}$ be any input vector, and the pre-activation output $z_{(\ell),i}(x)$ of the $i$th neuron in the $\ell$th layer at $x$ be as in Equation 3, the tuples $s^\ell_x = (s^\ell_{x,1}, \ldots, s^\ell_{x,n_\ell})$ are defined by $s^\ell_{x,i} = \mathrm{sgn}(z_{(\ell),i}(x))$ (using the convention $\mathrm{sgn}(0) = 0$).

Moreover, for all parameters $\theta$ it follows immediately from the definitions that the ternary labeling is constant on the interior of each cell of $\mathcal{C}(F_\theta)$, inducing a ternary labeling $s_C$ on each cell $C$ of $\mathcal{C}(F_\theta)$ (Masden, 2022). If $s^\ell_{x,i} \leq 0$ at an input vector $x$ (resp., $s^\ell_{C,i} \leq 0$ on a cell $C$), we say that the $i$th neuron in the $\ell$th layer is off or turned off at $x$ (resp., on $C$).

**Definition A.2** (Grigsby et al. (2022b)). Fix a parameter $\theta \in \Omega$. A neuron (say it is the $i$th neuron of index $\ell$) is said to be stably unactivated at $\theta$ if there exists an open neighborhood $U \subseteq \Omega$ of $\theta$ such that $s^\ell_{x,i}(u) \leq 0$ for every $x \in \mathbb{R}^{n_0}$ and every $u \in U$, where $s^\ell_{x,i}(u)$ denotes the corresponding coordinate of the ternary coding with respect to the parameter $u$.

**Definition A.3.** The activation region of $F_\theta$ corresponding to a ternary activation pattern $s$ is a maximal connected component of the set of input vectors $x \in \mathbb{R}^{n_0}$ for which the ternary labeling $s_x$ equals $s$.

**Definition A.4.** A $\pm$-activation pattern is a ternary activation pattern in which every coordinate is nonzero, and a $\pm$-activation region is an activation region associated to a $\pm$-activation pattern. Note that any $\pm$-activation region is an open set.

**Definition A.5.** Neuron $i$ of layer $\ell$ and neuron $j$ of layer $\ell + 1$ are called never co-active if $\{x \in \mathbb{R}^{n_0} \mid s^\ell_{x,i} = s^{\ell+1}_{x,j} = 1\} = \emptyset$.

**Remark A.6.** For generic, supertransversal (Definition B.7) networks, it follows from Grigsby & Lindsey (2022) that the $\pm$-activation regions of $\mathcal{C}(F_\theta)$ are precisely the interiors of the $n_0$-cells of $\mathcal{C}(F_\theta)$.

**Definition A.7.** For $s = (s_1, \ldots, s_d) \in \{-1, 0, +1\}^d$ a ternary $d$-tuple, let $O^s_{\geq 0} := \{x \in O_{\geq 0} \mid x_i = \mathrm{ReLU}(s_i x_i)\}$ denote the face of the non-negative orthant consisting of points whose $i$th component is 0 when $s_i \leq 0$.

The following lemma is immediate from the definitions.

**Lemma A.8.** Let $F$ be a ReLU neural network map of architecture $(n_0, \ldots, n_d)$, $\mathcal{C}(F)$ its canonical polyhedral complex, and $C$ a cell of $\mathcal{C}(F)$ with $\ell$th ternary label $s^\ell_C$. Then $F_{(\ell)}(C)$ is contained in $O^{s^\ell_C}_{\geq 0}$.

**Definition A.9.** The dimension of a ternary label $s$, denoted $\dim(s)$, is the dimension of the face $O^s_{\geq 0}$. Equivalently, $\dim(s)$ is the number of $+1$'s in the tuple $s$.
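To make the ternary labeling concrete, the sketch below computes $s_x$ for a point $x$ by recording the sign of every pre-activation in every layer of a small random network, and uses it to check whether two sample points share an activation pattern (a necessary, but not sufficient, condition for lying in the same activation region, since regions are connected components). The architecture and the zero-tolerance convention are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
widths = [2, 4, 3, 1]                               # (n0, ..., nd); arbitrary example
Ws = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
bs = [rng.standard_normal(m) for m in widths[1:]]

def ternary_labeling(x, Ws, bs, zero_tol=0.0):
    """Return s_x = (s^1_x, ..., s^d_x): the sign of each pre-activation z_{(l),i}(x)."""
    labels = []
    for ell, (W, b) in enumerate(zip(Ws, bs), start=1):
        z = W @ x + b                                # pre-activation of layer ell
        s = np.where(np.abs(z) <= zero_tol, 0, np.sign(z)).astype(int)
        labels.append(tuple(s))
        if ell < len(Ws):                            # ReLU on all but the output layer
            x = np.maximum(0.0, z)
    return tuple(labels)

x1, x2 = rng.standard_normal(widths[0]), rng.standard_normal(widths[0])
s1, s2 = ternary_labeling(x1, Ws, bs), ternary_labeling(x2, Ws, bs)
print("s_{x1} =", s1)
print("s_{x2} =", s2)
print("same activation pattern:", s1 == s2)
```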
## B. Transversality

Recall the following classical notions (cf. Guillemin & Pollack (2010) and Section 4 of Grigsby & Lindsey (2022)):

**Definition B.1** (Guillemin & Pollack, 2010). Let $X$ be a smooth manifold with or without boundary, $Y$ and $Z$ smooth manifolds without boundary, $Z$ a smoothly embedded submanifold of $Y$, and $f : X \to Y$ a smooth map. We say that $f$ is transverse to $Z$, and write $f \pitchfork Z$, if

$$df_p(T_p X) + T_{f(p)} Z = T_{f(p)} Y \tag{5}$$

for all $p \in f^{-1}(Z)$.

**Definition B.2** (Definition 4.2 of Grigsby & Lindsey (2022)). Let $\mathcal{C}$ be a polyhedral complex in $\mathbb{R}^n$, let $f : |\mathcal{C}| \to \mathbb{R}^r$ be a map which is smooth on all cells of $\mathcal{C}$, and let $Z$ be a smoothly embedded submanifold (without boundary) of $\mathbb{R}^r$. We say that $f$ is transverse on cells to $Z$, and write $f \pitchfork_c Z$, if the restriction of $f$ to the interior, $\mathrm{int}(C)$, of every cell $C$ of $\mathcal{C}$ is transverse to $Z$ (in the sense of Definition B.1). Note that we use the convention that the interior of a 0-cell is the 0-cell itself.

The definition above can be extended so that $Z$ is the domain of a polyhedral complex in $\mathbb{R}^r$. This was essentially carried out in Masden (2022):

**Definition B.3** (Masden, 2022). Let $\mathcal{C}$ be a polyhedral complex in $\mathbb{R}^n$, let $f : |\mathcal{C}| \to \mathbb{R}^r$ be a map which is smooth on all cells of $\mathcal{C}$, and let $\mathcal{Z}$ be a polyhedral complex in $\mathbb{R}^r$. We say that $f$ and $\mathcal{Z}$ are transverse on cells, and write $f \pitchfork_c |\mathcal{Z}|$, if the restriction of $f$ to the interior of every cell of $\mathcal{C}$ is transverse to the interior of every cell of $|\mathcal{Z}|$ (in the sense of Definition B.2).

Transversality for intersections of polyhedral complexes implies that each non-empty cell in the intersection complex has the expected dimension. The following extension of the classical Map Transversality Theorem to polyhedral complexes is immediate (cf. Grigsby & Lindsey (2022), Cor. 4.7, and Masden (2022)):

**Corollary B.4.** Let $\mathcal{C}$ be a polyhedral complex in $\mathbb{R}^n$, let $f : |\mathcal{C}| \to \mathbb{R}^r$ be a map which is smooth on all cells of $\mathcal{C}$, and let $\mathcal{Z}$ be a polyhedral complex in $\mathbb{R}^r$ for which $f \pitchfork_c \mathcal{Z}$. Then for every pair of cells $C \in \mathcal{C}$ and $Z \in \mathcal{Z}$, $f^{-1}(Z) \cap \mathrm{int}(C)$ is a (possibly empty) smoothly embedded submanifold of $\mathrm{int}(C)$ whose codimension in $\mathrm{int}(C)$ equals the codimension of $Z$ in $\mathbb{R}^r$.

We use the standard convention that a manifold of negative dimension is empty. In particular, if $f : |\mathcal{C}| \to \mathbb{R}^r$ and $\mathcal{Z} \subseteq \mathbb{R}^r$ are transverse on cells as above, and $C \in \mathcal{C}$ is a 0-cell, then $f^{-1}(Z) \cap C = \emptyset$ for all cells $Z$ of positive codimension.

**Lemma B.5.** Let $\mathcal{C}$ be a polyhedral complex with domain $|\mathcal{C}| \subseteq \mathbb{R}^n$, and $F : |\mathcal{C}| \to \mathbb{R}^r$ a map that is affine-linear on cells. Then $F(\mathcal{C})$ is a polyhedral complex in $\mathbb{R}^r$. Note that $F(\mathcal{C})$ need not be imbedded, nor even immersed.

*Proof.* We begin by showing that the image, $F(C)$, of a $k$-dimensional cell (polyhedral set) $C \in \mathcal{C}$ is itself a polyhedral set in $\mathbb{R}^r$. By definition, $C$ is the solution set of finitely many affine-linear inequalities. That is, there exists (for some $m$) an $m \times n$ matrix $A$ and a vector $b \in \mathbb{R}^m$ such that

$$C := \{x \in \mathbb{R}^n \mid Ax \leq b\}. \tag{6}$$

Let $V = \mathrm{aff}(C)$ be the $k$-dimensional affine hull of $C$ and choose any point $p \in V$. Noting that $F$ is affine-linear on $V$, let $j$ denote the rank of $F$ restricted to $V$. Now choose an (affine) basis $B = \{v_1, \ldots, v_k\}$ for $V$ whose final $(k - j)$ vectors form a basis for the affine kernel of the map $F$ restricted to $V$; that is, all vectors $v$ of the form $v = p + \sum_{i=j+1}^{k} c_i v_i$ satisfy $F(v) = F(p)$. Let $W = \mathrm{Span}\{v_1, \ldots, v_j\}$. By construction, $F|_V = \overline{F} \circ \pi_{V \to W}$ can be realized as the composition of the projection map $\pi_{V \to W} : V \to W$ and an affine-linear isomorphism $\overline{F} : W \to F(V)$. It can be seen using the Fourier-Motzkin elimination method (cf. Sec. 12.2 in Schrijver (1986)) that the image of $C$ under $\pi_{V \to W}$ is a polyhedral set, and it is immediate that the image of a polyhedral set under an affine-linear isomorphism is a polyhedral set.
It follows that the image of $C$ under $F$ is a polyhedral set in $\mathbb{R}^r$. The continuity of $F$ ensures that if $C'$ is a face of $C$, then $F(C')$ will be a face of $F(C)$, so the image of $|\mathcal{C}|$ under $F$ will be the domain of a polyhedral complex, as desired.

**Lemma B.6.** Let $\mathcal{C}$ be a polyhedral complex in $\mathbb{R}^n$, $F : |\mathcal{C}| \to \mathbb{R}^r$ a map that is affine-linear on cells, and $\mathcal{Z}$ a polyhedral complex in $\mathbb{R}^r$. Let $F(\mathcal{C})$ denote the polyhedral complex in $\mathbb{R}^r$ that is the image of $\mathcal{C}$. If $i : |F(\mathcal{C})| \to \mathbb{R}^r$ is the inclusion map, then $i \pitchfork_c \mathcal{Z}$ iff $F \pitchfork_c \mathcal{Z}$.

*Proof.* Let $C$ be a cell of $\mathcal{C}$ with image $F(C) \in F(\mathcal{C})$, and let $Z \in \mathcal{Z}$. We will show that when $F$ (resp., $i$) is restricted to $\mathrm{int}(C)$ (resp., to $\mathrm{int}(F(C))$), $F|_{\mathrm{int}(C)} \pitchfork \mathrm{int}(Z)$ iff $i|_{\mathrm{int}(F(C))} \pitchfork \mathrm{int}(Z)$. In the following, choose an affine-linear extension of $F$ to all of $\mathbb{R}^n$ and call it $F|_{\mathbb{R}^n}$. Note that any such extension can be decomposed as a projection onto the affine hull of $C$ followed by an affine isomorphism onto the affine hull of $F(C)$, as described in the proof of Lemma B.5. Let $c := \dim(C)$, $c' := \dim(F(C))$, $z := \dim(F^{-1}(Z))$, and $z' := \dim(Z)$.

Begin by noting that $i|_{\mathrm{int}(F(C))} \pitchfork \mathrm{int}(Z)$ iff $F(C) \cap Z$ is a polyhedral set of dimension $(c' + z') - r$, iff either they have empty intersection, or the interiors of $F(C)$ and $Z$ have non-empty intersection and the affine hulls of $F(C)$ and $Z$ intersect in an affine-linear space of codimension $(r - c') + (r - z')$ (that is, of dimension $(c' + z') - r$). The statement that $F|_C \pitchfork Z$ iff $i|_{F(C)} \pitchfork Z$ is vacuous in the empty-intersection case. In the non-empty intersection case, since the affine-linear map $F$ restricted to $\mathrm{int}(C)$ is either a homeomorphism onto its image or a linear projection map onto a set homeomorphic to its image, $\mathrm{int}(C) \cap \mathrm{int}(F^{-1}(Z)) \neq \emptyset$ iff $\mathrm{int}(F(C)) \cap \mathrm{int}(Z) \neq \emptyset$. Moreover, the rank-nullity theorem applied to $F|_{\mathbb{R}^n}$ tells us that $\mathrm{aff}(F(C)) \cap \mathrm{aff}(Z)$ has dimension $(c' + z') - r$ iff $\mathrm{aff}(C) \cap F^{-1}(\mathrm{aff}(Z))$ has dimension $(c + z) - n$. It follows that for all cells $C$ of $\mathcal{C}$ and $Z$ of $\mathcal{Z}$, $F|_{\mathrm{int}(C)} \pitchfork Z$ iff $i|_{F(C)} \pitchfork Z$, and the statement follows.

**Definition B.7** (Definition 11 of Masden (2022)). Let $F$ be a ReLU neural network of depth $d$, and let the $(i-1)$st layer map, $F^{i-1} : \mathbb{R}^{n_{i-2}} \to \mathbb{R}^{n_{i-1}}$ (resp., the composition of maps from the $i$th layer map, $F^{(i)} : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_d}$), be viewed as a map that is affine-linear on cells of its canonical polyhedral decomposition, $\mathcal{C}(F^{i-1})$ (resp. $\mathcal{C}(F^{(i)})$). If, for all $2 \leq i \leq d$, we have $F^{i-1} \pitchfork_c \mathcal{C}(F^{(i)})$, then we call $F$ a supertransversal neural network.

Informally, we can think of supertransversality as the right generalization of the genericity condition for hyperplane arrangements to bent hyperplane arrangements. Recall that a hyperplane arrangement is generic if every $k$-fold intersection of hyperplanes in the arrangement is an affine-linear subspace of dimension $n - k$. Analogously, it follows from the definitions that every $k$-fold intersection of bent hyperplanes associated to a generic, supertransversal network is a (possibly empty) polyhedral complex of dimension $n - k$.

An important result proved in Masden (2022) (see also Theorem 3 of Grigsby & Lindsey (2022)) is the following:

**Proposition B.8** (Lemma 12 of Masden (2022)). For any neural network architecture, the set of parameters associated to generic, supertransversal marked neural network functions is full measure in parameter space, $\mathbb{R}^D$.

**Definition B.9.** Let $s \in (\Omega = \mathbb{R}^D)$ be a generic, supertransversal (Definition B.7) parameter for a ReLU neural network of architecture $(n_0, \ldots, n_d)$. We say that $s$ satisfies the transverse pairwise intersection condition (TPIC) for all adjacent layer maps if $\hat{H}^\ell_i \pitchfork \hat{H}^{\ell+1}_j \neq \emptyset$ for all $i, j, \ell$. That is, every pair of bent hyperplanes in adjacent layers has non-empty transverse intersection.

In the language of Rolnick & Kording (2020) and Bui Thi Mai & Lampert (2020), a generic, supertransversal parameter $s$ satisfies the transverse pairwise intersection condition (TPIC) for all adjacent layer maps iff every pair of nodes in every pair of adjacent layers of the dependency graph is connected by an edge.
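As a rough numerical illustration of TPIC (not a rigorous verification, since it samples a bounded grid rather than certifying transversality), the sketch below looks for grid points at which the pre-activations of a neuron in layer 1 and a neuron in layer 2 are simultaneously near zero, which is evidence that the corresponding pair of bent hyperplanes intersects in the domain. The architecture, grid, and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n0, n1, n2 = 2, 5, 3
W1, b1 = rng.standard_normal((n1, n0)), rng.standard_normal(n1)
W2, b2 = rng.standard_normal((n2, n1)), rng.standard_normal(n2)

# Dense grid in a box in the input space R^{n0}.
g = np.linspace(-10, 10, 800)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, n0)

Z1 = X @ W1.T + b1                         # layer-1 pre-activations (zero sets = hyperplanes)
Z2 = np.maximum(0.0, Z1) @ W2.T + b2       # layer-2 pre-activations, pulled back to the domain

tol = 0.1                                  # "near zero" threshold; scale-dependent heuristic
evidence = np.zeros((n1, n2), dtype=bool)
for i in range(n1):
    for j in range(n2):
        evidence[i, j] = np.any((np.abs(Z1[:, i]) < tol) & (np.abs(Z2[:, j]) < tol))

print("pairs (layer-1 neuron i, layer-2 neuron j) with apparent intersection:")
print(evidence.astype(int))
print("evidence consistent with TPIC on this box:", bool(evidence.all()))
```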
## C. Unbounded Polyhedral Sets and Sufficiently High-Bias Positive-Axis Hyperplanes

In order to choose parameters whose bent hyperplane arrangement satisfies TPIC, we will need to establish some results about the images of unbounded polyhedral sets under generic, supertransversal ReLU neural network layer maps. We will also need to understand the intersections of these images with sufficiently high-bias positive-axis hyperplanes.

The following proposition ensures that the images of the nested unbounded polyhedral sets $S_1, S_2, \ldots, S_d$ referenced in the proof of the main theorem are unbounded in the layers of the neural network.

**Proposition C.1.** Let $F_\theta : \mathbb{R}^{n_0} \to \mathbb{R}^{n_d}$ be a generic, supertransversal ReLU network map of architecture $(n_0 = k, n_1, \ldots, n_d)$ with $n_\ell \geq k$ for all $\ell$, and let $S$ be an unbounded polyhedral set of dimension $k$ in the canonical polyhedral complex $\mathcal{C}(F_\theta)$. If the sign sequence $s_S = (s^1, \ldots, s^d)$ associated to $S$ satisfies

$$s^i = (\underbrace{+1, \ldots, +1}_{k}, \underbrace{-1, \ldots, -1}_{n_i - k}) \quad \text{for all } i \leq \ell,$$

then $F_{(\ell)}(S)$ is an unbounded polyhedral set of dimension $k$ contained in $O^k_{\geq 0} \subseteq \mathbb{R}^{n_\ell}$.

*Proof.* The fact that $F_{(\ell)}(S)$ is a polyhedral set contained in $O^k_{\geq 0} \subseteq \mathbb{R}^{n_\ell}$ is a consequence of Lemmas A.8 and B.5, so we need only prove that its image is unbounded, of dimension $k$. We will prove this by induction on $\ell$.

When $\ell = 1$, $\mathcal{C}(F_\theta) = \mathcal{C}(\mathcal{A}^1)$ for the generic hyperplane arrangement $\mathcal{A}^1$ associated to $F^1$. Since $S$ is a pointed (since $n_1 \geq k$) unbounded polyhedral set of dimension $k$, its boundary contains unbounded 1-cells (rays) $R_i = \{x_i + t v_i \mid t \geq 0\}$ based at $\{x_i\}$ with slopes $\{v_i\}$. Note that because $\mathcal{A}^1$ is generic, each 0-cell $x_i$ is a $k$-fold intersection of distinct hyperplanes from $\mathcal{A}^1$, and each $R_i$ is contained in a $(k-1)$-fold intersection of distinct hyperplanes from $\mathcal{A}^1$. It follows that $S$ has at least $k$ unbounded facets, each contained in a different hyperplane of $\mathcal{A}^1$. Reindex if necessary so that $k$ of these are $H^1_1, \ldots, H^1_k$, and then flip co-orientations so that the sign sequence on $S$ is $s^1_S = (+1, \ldots, +1, -1, \ldots, -1)$ with $k$ entries equal to $+1$. We have chosen the first $k$ neurons of $F^1$ to be on throughout (the interior of) $S$. This implies that if we let $w_1, \ldots, w_{n_1}$ be the weight vectors and $b_1, \ldots, b_{n_1}$ the biases associated to $F^1$, we have $w_i \cdot x_j + b_i \geq 0$ for all $1 \leq i \leq k$, with $w_i \cdot x_j + b_i = 0$ iff $x_j \in H_i$.

As to the slopes $\{v_i\}$ of the unbounded 1-cells (rays) $\{R_i\}$, it is immediate that each $v_i \in \mathrm{Cone}(S)$ (cf. Schrijver (1986)), and by reindexing if necessary we may assume $\{v_1, \ldots, v_k\}$ is a basis for $\mathbb{R}^{n_0 = k}$, since $S$ has dimension $k$. Moreover, we claim that $w_i \cdot v_j \geq 0$ for all $1 \leq i, j \leq k$, with $w_i \cdot v_j = 0$ iff $R_j \subseteq H_i$. To see this claim, note first that if $w_i \cdot x_j + b_i > 0$, then $x_j \notin H_i$, so $R_j \not\subseteq H_i$. We also see that $w_i \cdot v_j \geq 0$, since otherwise there would be some $t > 0$ for which $w_i \cdot (x_j + t v_j) = 0$, contradicting the unboundedness of the cell $R_j$. But $w_i \cdot v_j \neq 0$, since $w_i \cdot v_j = 0$ would imply $v_j \in H_{i_1} \cap \cdots \cap H_{i_{k-1}} \cap w_i^{\perp}$, which contradicts the genericity of $\mathcal{A}^1$. If $w_i \cdot x_j + b_i = 0$, then $x_j \in H_i$, and hence $R_j \subseteq H_i$ iff $w_i \cdot v_j = 0$, as desired.
w T nℓ be the matrix whose row vectors are the weight vectors {wi} associated to A1 and let V := v1 vk . Then the ith column of WV is precisely the pre-activation image of the vector vi in Rn1. Moreover, since A is generic, each k k minor of WV has rank k. We also just saw above that the initial k k minor of WV is unaffected by the Re LU activation, since all entries are 0. Since the post-activation rank of WV is the dimension of Cone(F 1(S)), we conclude that F 1(S) is unbounded, of dimension k, as desired, and the base case is complete. Now suppose S satisfies the assumptions and we know that F(ℓ 1)(S) O 0 k Rnℓ 1 is unbounded of dimension k. Hidden Symmetries of Re LU Networks As in the proof of the base case, consider the unbounded rays Ri of F(ℓ 1)(S), their basepoints xi, and their slopes {vi} Cone F(ℓ 1)(S) . As before, assume that v1, . . . , vk gives a basis for the initial k plane of Rnℓ 1 and let w1, . . . , wnℓbe the weight vectors of F ℓ, and let W be the matrix whose rows are w T i and V the matrix whose columns are vj. By exactly the same argument as before, we see that the initial k k minor of WV is unaffected by the Re LU activation, and since F ℓis generic and super-transversal to all previous layer maps, each k k minor of WV is rank k, which as before implies that F(ℓ)(S) = F ℓ(F(ℓ 1)(S)) is also unbounded, of dimension k, as desired. Corollary C.2. Let Fθ : Rn0 Rnd be a generic, supertransversal Re LU network map of architecture (n0 = k, n1, . . . , nd) with nℓ k for all ℓ, and let S be a non-empty unbounded polyhedral set of dimension k in the canonical polyhedral complex C(Fθ). If the sign sequence s S = (s1, . . . , snd) associated to S satisfies sℓ= (+1, . . . , +1 | {z } k , 1, . . . 1 | {z } nℓ k ) for all i ℓ, then F(ℓ) restricted to the (non-empty) interior of S is a rank k affine-linear map, hence a homeomorphism onto its image for all i ℓ. Proposition C.3. Let S be a non-empty unbounded polyhedral set of dimension k in O 0 k Rnℓ 1, viewed as the domain of a polyhedral complex that also contains all of its faces. Suppose H Rnℓ 1 is a hyperplane for which H c S = . For any nℓ k, we can always find nℓhyperplanes H1, . . . , Hnℓsatisfying: i. Hi c S = , ii. the restricted hyperplane arrangement K = {K1, . . . Knℓ} is generic, and iii. the bounded subcomplex of the restricted hyperplane arrangement {K1, . . . Knℓ} is non-empty and contained in the interior of S. Proof. Let H1 := H. Choose a point p H S in the (non-empty) interior of S, and a small open neighborhood N(p) S. We can pick slight (non-generic) perturbations H2, . . . , Hnℓof H so that p Hi for all i. Since transverse intersection is an open condition, we can by further perturbation insure that H2, . . . , Hnℓstill intersect S transversely. Because generic hyperplane arrangements are dense and open in parameter space, we can by further perturbation insure that the restricted hyperplane arrangement {K1, . . . , Knℓ} is generic in the initial coordinate k plane Rnℓ k and that the bounded subcomplex of this arrangement is contained in N(p), as desired. Lemma C.4. Let F : Rnℓ 1 Rnℓbe a generic Re LU neural network layer map with associated co-oriented generic hyperplane arrangement A, and let C(F) = C(A) be the associated polyhedral decomposition of Rn. For almost all positive weight vectors w Rnℓthere exists a negative bias b R such that the corresponding positive-axis hyperplane: H := { x Rnℓ| w x + b = 0}. 
has non-empty transverse intersection with F(C) for each unbounded nℓ 1 cell C in C(A) whose ternary labeling has dimension 1. Proof. By an argument analogous to the one in the proof of Proposition C.1, we see that the image of any unbounded polyhedral set C in C(A) of dimension d 1 whose associated sign sequence s C has dimension dim(s C) (Definition A.9) is an unbounded polyhedral set of dimension min(d, dim(s C)), hence contains an unbounded ray in the non-negative orthant O 0. But for every ray R contained in the non-negative orthant and every positive weight vector there exists a sufficiently high negative bias such that the corresponding sufficiently-high bias positive-axis hyperplane intersects R. Since transverse intersection is an open condition, we can perturb w slightly so this intersection is transverse. Lemma C.5. Let B Rn be a bounded set, and let S Rn be any unbounded pointed polyhedral set of dimension n. There exists v Cone(S) such that the translate of B by v is in S. That is, there exists v Cone(S) such that b + v S for all b B. Proof. Let P be any polytope containing B and let V be its (finite) set of 0 faces, so P is the convex hull of V . Hidden Symmetries of Re LU Networks It suffices to prove that when Cone(S) is non-empty and has dimension n (as is the case for any unbounded polyhedral set S of dimension n), the set S + Cone(S) is all of Rn. This will imply that P, being the convex hull of finitely many points, can be realized as P + w for P the convex hull of finitely many points in S and w Cone(S). But the fact that Rn = S + Cone(S) follows immediately from the fact that there exist vectors v1, . . . , vn Cone(S) that form a basis for Rn. The following result is well-known (cf. Stanley (2007)), but we include a proof here for completeness. Lemma C.6. Let RD be the parameter space for a feedforward Re LU network architecture. There exists an algebraic set S of positive codimension (and Lebesgue measure 0) such that every parameter θ RD \ S is generic. Proof. A parameter is generic iff each hyperplane arrangement associated to each layer map is generic. Since there are finitely many layer maps, and the union of finitely many algebraic sets is an algebraic set, we need only prove that the statement of the lemma holds for a parameter associated to a single layer map. Classical results in linear algebra relating the solution sets of homogeneous and inhomogeneous linear systems tell us that an arrangement of nℓaffine hyperplanes in Rnℓ 1 is generic iff the associated bias-free (central) hyperplane arrangement has the property that every k fold intersection of (non-affine) hyperplanes intersects in a linear subspace of codimension k (where a linear subspace of codimension > nℓis by definition empty). The rank-nullity theorem tells us that this happens iff every weight matrix associated to a k fold subset of the central arrangement has full rank, min{k, nℓ 1}. Noting that the rank of a k nℓ 1 matrix is the maximum m for which some m m minor has nonzero determinant, and the determinant is a polynomial equation in the matrix entries, we conclude that away from an algebraic set the parameters are generic. Since a non-empty algebraic set always has positive codimension (and hence Lebesgue measure 0), the conclusion follows. Corollary C.7. Let θ0 RD be a generic parameter. There exists an open neighborhood U of θ0 such that every parameter θ U is generic. Proof. An algebraic set is closed, hence its complement is open. D. 
Linear Regions Assumption (LRA) and Transverse Pairwise Intersection Condition (TPIC)

Definition D.1. A linear region of a continuous, finitely piecewise linear function f : R^{n_0} → R^{n_d} is any maximal connected set S ⊆ R^{n_0} such that the restriction of f to S is affine-linear. Note that each linear region is a closed set.

Definition D.2. Let T ⊆ R^{n_0} be a union of the closures of activation regions for a ReLU network map F_θ : R^{n_0} → R^{n_d}. F_θ is said to satisfy the Linear Regions Assumption (LRA) on T if each linear region of the restriction of F_θ to T is its intersection with the closure of a single activation region of F_θ.

The following Lemma is an immediate consequence of the definition of LRA. (Compare Remark A.6.)

Lemma D.3. For a parameter that satisfies LRA on T ⊆ R^{n_0} as above, the intersection of T with the union of the bent hyperplanes coincides with the domain of the (n_0 − 1)-skeleton of the canonical polyhedral complex.

In order to deduce that there are no hidden symmetries on a positive measure subset of parameter space, it will be important for us to know that the required LRA condition is satisfied not just for the construction we give in Section A but also on a full measure subset of an open neighborhood of the associated parameter. We turn to establishing the foundations for proving this now.

Recall the following definition and result (cf. Hanin & Rolnick (2019), Grigsby et al. (2022b)), which tell us that for most pairs (θ, x) ∈ Ω × R^{n_0}, each coordinate of the realized function F_θ(x) is expressible as a polynomial in the coordinates of the parameter and the input.

Definition D.4. Let Ω be the parameter space for ReLU network architecture (n_0, . . . , n_d). Denote by F : Ω × R^{n_0} → R^{n_d} the function F(θ, x) := F_θ(x).

Figure 8. An augmented computational graph for architecture (2, 3, 3, 1). The ordinary vertices are black, and the distinguished vertices are red. Black edges are labeled with weights, and red edges are labeled with biases. A complete path is one that ends at an output vertex and begins either at an input vertex or at one of the distinguished vertices. In the diagram above, we have blurred out vertices corresponding to inactive neurons associated to an input vector x with ternary label s_x = (s^1_x, s^2_x, s^3_x) = ((−1, 0, +1), (+1, 0, 0), (+1)). The open paths associated to this ternary label are the ones in the diagram above with solid (non-dashed) edges. The reader can check that there are three open, complete paths γ, γ′, γ′′ ∈ Γ_{x,∗}, whose monomials are m(γ) = b^1_3 W^2_{13} W^3_{11}, m(γ′) = b^2_1 W^3_{11}, and m(γ′′) = b^3_1. There is a unique open, complete path γ_1 ∈ Γ_{x,1} with monomial m(γ_1) = W^1_{31} W^2_{13} W^3_{11} and a unique open, complete path γ_2 ∈ Γ_{x,2} with monomial m(γ_2) = W^1_{32} W^2_{13} W^3_{11}.

Lemma D.5 (Grigsby et al. (2022b), Lemma 3.4). Let Ω be the parameter space for ReLU network architecture (n_0, . . . , n_d). Let θ ∈ Ω, and suppose x ∈ R^{n_0} is in an activation region for F_θ. Then there is an open neighborhood of (θ, x) ∈ Ω × R^{n_0} on which each coordinate of F is a polynomial in the coordinates of θ and x.

Indeed, if θ is moreover a generic and supertransversal parameter and x ∈ R^{n_0} is any point in the domain, it is well-known that we can calculate this polynomial explicitly in terms of the ternary labeling of x and the parameters of θ, cf. Lemma 8 of Hanin & Rolnick (2019). For completeness, we describe this process here. We first establish some notation, then describe how the polynomial is calculated in Lemma D.9.

Definition D.6.
The augmented computational graph G for the feedforward ReLU network architecture (n_0, . . . , n_d) is the graded oriented graph with: n_ℓ ordinary vertices and 1 distinguished vertex of grading ℓ, for ℓ = 0, . . . , d − 1, and n_d ordinary vertices of grading d; and, for every ℓ = 0, . . . , d − 1, a single oriented edge from every vertex of grading ℓ to every ordinary vertex of grading ℓ + 1, oriented toward the vertex of grading ℓ + 1.

One obtains the augmented computational graph for an architecture from the standard computational graph for the architecture by adding an extra marked vertex for each non-output layer, whose purpose is to record the bias term in each affine-linear map. Accordingly, given a parameter θ one obtains a labeling of the edges of the augmented computational graph: the edge from the distinguished vertex of layer ℓ to the kth ordinary vertex of layer ℓ + 1 is labeled with b^{ℓ+1}_k, the kth component of the bias vector for F^{ℓ+1}, and the edge from the ith ordinary vertex of layer ℓ − 1 to the jth ordinary vertex of layer ℓ is labeled with W^ℓ_{ji}. Associated to every oriented path γ is a corresponding monomial, m(γ), in the parameters, obtained by taking the product of the parameters on the edges traversed along γ. See Figure 8.

Definition D.7. Let θ ∈ Ω be a generic, supertransversal parameter, and let x ∈ R^{n_0} be any point in the domain, with associated ternary labeling s_x = (s^1_x, . . . , s^d_x). A path γ is said to be open at x for parameter θ if every node along γ has ternary labeling +1.

Definition D.8. Let G be the augmented computational graph for the ReLU network architecture (n_0, . . . , n_d). A path γ is said to be complete if it ends at a vertex in the output layer and begins at either a vertex of the input layer or at one of the distinguished vertices in a non-input layer.

For θ ∈ Ω and each 1 ≤ k ≤ n_d we will denote by Γ^{θ,k}_x the set of complete paths that are open at x for the parameter θ and end at the kth node of the output layer, by Γ^{θ,k}_{x,i} ⊆ Γ^{θ,k}_x the subset of Γ^{θ,k}_x beginning at input node i, and by Γ^{θ,k}_{x,∗} ⊆ Γ^{θ,k}_x the subset of Γ^{θ,k}_x beginning at one of the distinguished vertices.

The following lemma is well-known to the experts (e.g. Lemma 8 of Hanin & Rolnick (2019)).

Lemma D.9. Let θ ∈ Ω be a generic, supertransversal parameter, and let x = (x_1, . . . , x_{n_0}) ∈ R^{n_0} be any point in the domain. For k = 1, . . . , n_d, the polynomial associated to the kth output component of F_θ at x is given by
$$F_k(\theta, x) \;=\; \sum_{\gamma \in \Gamma^{\theta,k}_{x,*}} m(\gamma) \;+\; \sum_{i=1}^{n_0} \; \sum_{\gamma \in \Gamma^{\theta,k}_{x,i}} m(\gamma)\, x_i.$$

Remark D.10. Fix θ and any ternary labeling s. Define (R^{n_0} × Ω)_s := {(x, θ) ∈ R^{n_0} × Ω | the ternary labeling of x with respect to θ is s}. Then F restricted to (R^{n_0} × Ω)_s is a vector of polynomial functions. In the proof of the Lemma below, we will use the notation F_s for this vector of polynomial functions.

In the Lemma below, recall from Section A that Ĥ^ℓ_i denotes the bent hyperplane in the domain R^{n_0} associated to the ith neuron of the ℓth layer map.

Lemma D.11. There exists an algebraic, measure 0 set B ⊆ Ω such that for any generic, supertransversal parameter θ ∈ Ω \ B and any point x ∈ Ĥ^{ℓ−1}_i ∩ Ĥ^ℓ_j (for parameter θ), if Γ_x \ Γ_{x,∗} is non-empty, then the LRA is satisfied on the union, T^ℓ_{ij}, of the closures of the four activation regions adjacent to x = p^ℓ_{ij}.

Proof. Fix i, j. Fix a ternary labeling s ∈ {−1, 0, 1}^{n_1} × · · · × {−1, 0, 1}^{n_d}. Suppose there exists a point x ∈ Ĥ^{ℓ−1}_i ∩ Ĥ^ℓ_j with ternary labeling s = s_x. By supertransversality, every non-empty bent hyperplane besides Ĥ^{ℓ−1}_i and Ĥ^ℓ_j has positive minimal distance to x.
It follows that a sufficiently small neighborhood of x contains points only in the closures of the four activation regions adjacent to x. By permuting neurons in layers ℓ − 1 and ℓ if necessary, we may assume without loss of generality that i = j = 1. It follows that the first component of each of the ternary labels s^{ℓ−1}_x and s^ℓ_x is 0. Accordingly, the ternary labelings of the four activation regions adjacent to x agree with those of x except at the first coordinates of s^{ℓ−1}, s^ℓ. It is therefore natural to label the adjacent activation regions by ++, +−, −+, −−, according to whether the 1st coordinate of s^{ℓ−1} (resp. s^ℓ) is +1 or −1. Let x_{++}, x_{+−}, x_{−+}, x_{−−} be points in the corresponding activation regions adjacent to x, and let s_{++}, s_{+−}, s_{−+}, s_{−−} be the corresponding ternary labelings.

The assumption that Γ_x \ Γ_{x,∗} is non-empty tells us that there is at least one open, complete path at x beginning at an input vertex, which implies that each of the ternary labelings s^1_x, . . . , s^d_x has at least one +1 component. In other words, there is at least one neuron active in each layer at the input x = p^ℓ_{ij}.

Let F_{s_{++}}, F_{s_{+−}}, F_{s_{−+}}, F_{s_{−−}} be the vectors of polynomials as in Remark D.10. Lemma D.9 tells us how to compute these four polynomials. We will show that the vectors of polynomials F_{s_{++}}, F_{s_{+−}}, F_{s_{−+}}, F_{s_{−−}} are pairwise distinct. That is, viewing the summands of the polynomial components as consisting of a coefficient that is an algebraic expression in θ and a variable x_i, we will show that different polynomials have different coefficients. Then we let B_{i,j,s} denote the set of parameters θ such that two or more of the restrictions F_{s_{++}}(θ, ·), F_{s_{+−}}(θ, ·), F_{s_{−+}}(θ, ·), F_{s_{−−}}(θ, ·) coincide. The set B_{i,j,s} is an algebraic set (a finite union of zero sets of polynomials) and hence has measure 0. It follows that the set B := ∪_{i,j,s} B_{i,j,s} is an algebraic set of measure 0 that has the desired properties.

Thus, we turn to proving that F_{s_{++}}, F_{s_{+−}}, F_{s_{−+}}, F_{s_{−−}} are pairwise distinct polynomials. Explicitly, let v^{ℓ−1}_1 (resp., v^ℓ_1) be the first ordinary vertex in layer ℓ − 1 (resp., in layer ℓ). Consider the paths in Γ_{x_{++}} according to whether they pass through v^{ℓ−1}_1 and/or v^ℓ_1. It is immediate that Γ_{x_{++}} can be decomposed into the disjoint union of:
the set of paths through both v^{ℓ−1}_1 and v^ℓ_1, which we will denote by Γ_{11},
the set of paths through v^{ℓ−1}_1 and not v^ℓ_1, which we will denote by Γ_{1−},
the set of paths through v^ℓ_1 and not v^{ℓ−1}_1, which we will denote by Γ_{−1}, and
the set of paths through neither v^{ℓ−1}_1 nor v^ℓ_1, which we will denote by Γ_{−−}.

Now note that Γ_{x_{−−}} = Γ_{−−}. Since Γ_{x_{−−}} is non-empty, each of Γ_{x_{−+}}, Γ_{x_{+−}}, Γ_{x_{++}} is non-empty as well, which implies that the polynomial in (θ, x) associated to each of these sets is nonzero. Next, note that Γ_{x_{+−}} = Γ_{x_{−−}} ⊔ Γ_{1−}. Since there is at least one neuron in each layer active at x_{+−}, the set Γ_{1−} is non-empty, and hence the polynomial
$$\sum_{\gamma \in \Gamma_{1-} \cap \Gamma_{x_{+-},*}} m(\gamma) \;+\; \sum_{i=1}^{n_0} \; \sum_{\gamma \in \Gamma_{1-} \cap \Gamma_{x_{+-},i}} m(\gamma)\, x_i$$
is nonzero. This tells us that the polynomials F_{s_{−−}} and F_{s_{+−}} associated to Γ_{x_{−−}} and Γ_{x_{+−}} are distinct (as functions of the two variables θ and x, and affine-linear in x). Similarly, the polynomial associated to Γ_{x_{−+}} is distinct from that associated to Γ_{x_{−−}}. Indeed, the fact that each of Γ_{11}, Γ_{1−}, Γ_{−1}, and Γ_{−−} contains a path (and hence a monomial containing a distinct weight) not present in the others implies, by an analogous argument, that the polynomials associated to Γ_{x_{++}}, Γ_{x_{+−}}, Γ_{x_{−+}}, Γ_{x_{−−}} are all pairwise distinct.
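To make the path-sum bookkeeping above concrete, the following minimal sketch (ours, in Python/NumPy) enumerates the open, complete paths of Definitions D.7–D.8 for a small random network and checks the expression in Lemma D.9 against a direct forward pass. The architecture, random seed, and helper names (forward, path_sum) are illustrative choices, not taken from the paper; we also assume the final layer map is affine (no ReLU at the output), so openness is imposed only at hidden vertices.

```python
import numpy as np
from itertools import product

# Minimal sketch (our own illustration): enumerate the open, complete paths of
# Definitions D.7-D.8 for a small random network and compare the path-sum of
# Lemma D.9 with a direct forward pass.  We assume the output layer map is
# affine (no ReLU at the output), so openness is only required at hidden nodes.

rng = np.random.default_rng(0)
arch = (2, 3, 3, 1)                                  # (n0, n1, n2, n3), as in Figure 8
weights = [rng.standard_normal((arch[l + 1], arch[l])) for l in range(len(arch) - 1)]
biases = [rng.standard_normal(arch[l + 1]) for l in range(len(arch) - 1)]
d = len(weights)

def forward(x):
    """Direct forward pass: returns F_theta(x) and the indices of active hidden neurons per layer."""
    h, active = x, []
    for l in range(d):
        pre = weights[l] @ h + biases[l]
        if l < d - 1:
            active.append(np.flatnonzero(pre > 0))   # neurons with ternary label +1
            h = np.maximum(pre, 0.0)
        else:
            h = pre                                  # affine output layer
    return h, active

def path_sum(x, active, k):
    """Sum of m(gamma) (resp. m(gamma)*x_i) over open, complete paths ending at output neuron k."""
    total = 0.0
    # paths starting at the distinguished (bias) vertex of layer `start`, start = 0, ..., d-1
    for start in range(d):
        for mids in product(*active[start:]):
            js = list(mids) + [k]                    # ordinary vertices visited in layers start+1, ..., d
            mono = biases[start][js[0]]
            for a in range(len(js) - 1):
                mono *= weights[start + 1 + a][js[a + 1], js[a]]
            total += mono
    # paths starting at input vertex i
    for i in range(arch[0]):
        for mids in product(*active):
            js = list(mids) + [k]
            mono = weights[0][js[0], i]
            for a in range(len(js) - 1):
                mono *= weights[a + 1][js[a + 1], js[a]]
            total += mono * x[i]
    return total

x = rng.standard_normal(arch[0])
out, active = forward(x)
assert np.isclose(out[0], path_sum(x, active, k=0))  # agreement for a generic input x
```

The check is only delicate for inputs lying exactly on a bent hyperplane, where some pre-activation vanishes; for a randomly drawn x this almost surely does not occur.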
In Masden (2022), it is proved that for a generic, supertransversal parameter θ, the map s : C(Fθ) { 1, 0, +1}n1+...+nd that assigns to each cell C of the polyhedral complex, C(Fθ), its ternary activation pattern, s C, is well-defined, injective, and has the property that C is a k-cell of C(F) if and only if s(C) has exactly n0 k entries which are 0. That is, there is no ambiguity in defining a ternary activation pattern for a cell C of C(Fθ), each possible ternary activation pattern s { 1, 0, +1}n1+...+nd is in the image of at most one polyhedral set C in C(Fθ), and the dimension of C as a polyhedral set is n0 k, where k is the number of 0 s in s C. Moreover, we state the following additional result (implicit in Masden (2022)), which tells us that the presence of a ternary activation pattern is stable under (almost all) sufficiently small perturbations of the parameter:5 Proposition D.12. Let θ Ωbe a generic, supertransversal parameter, C C(Fθ ) a cell in the corresponding polyhedral complex and s C its associated ternary activation pattern. There exists an open neighborhood N of θ Ωsuch that for each θ N, θ is generic, supertransversal, and there exists a non-empty cell C in C(Fθ) with ternary activation pattern s C = s C . Proof. The assumption that θ is generic and supertransversal tells us that the ternary activation pattern of a cell C in C(Fθ ) gives us a precise recipe for realizing C as the intersection of bent hyperplanes and bent half-spaces.6 Moreover, if C has dimension (n0 k), s C will have k 0 s (corresponding to k intersecting bent hyperplanes) and (n0 k) 1 s (corresponding to (n0 k) intersecting bent half-spaces). We also note (cf. Lem. 12 of Masden (2022) and Lemma C.6 that the set of generic, supertransversal parameters is full measure in parameter space. Now suppose that C is a non-empty cell in C(Fθ ), and p is a point in int(C ). Let H0 (resp., H ) denote the set of bent hyperplanes of Fθ associated to ternary label 0 (resp., ternary labels 1) in s C . By transversality on cells, there is an open neighborhood N0 of the subset of the parameters defining the bent hyperplanes in H0 for which every parameter θ N0 defines a collection of |H0| bent hyperplanes with intersection that is both transverse on cells and non-empty. Moreover, since p has positive minimal distance to every bent hyperplane in H , and every such bent hyperplane is closed (though not necessarily compact), there is some positive δ for which a neighborhood of p of radius δ contains only the bent 5Note that the absence of a ternary activation pattern is not necessarily stable in this way. See the definition of combinatorial stability and related discussion in (Grigsby et al., 2022b). 6The complement of a bent hyperplane is the union of at most two open connected components. These are what we mean by bent half-spaces. Note that a bent hyperplane may be empty, in which case exactly one of the two bent half-spaces is also empty. Hidden Symmetries of Re LU Networks hyperplanes in H0. This implies that there is a sufficiently small open neighborhood N of the parameters defining the bent hyperplanes in H for which C is in the same bent half-space for the bent hyperplanes in H for every parameter in N . Letting N = N0 N and further restricting to a neighborhood with generic parameters (Lemma C.6) if necessary, we obtain a neighborhood of θ Ωfor which every θ N is generic, supertransversal, and contains a non-empty cell C with s C = s C, as desired. 
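The following small sketch illustrates, numerically, the stability phenomenon behind Proposition D.12: the ternary activation pattern computed at a generic point x (which labels the top-dimensional cell containing x) is unchanged under a sufficiently small perturbation of the parameter. The architecture, tolerance, and perturbation size are arbitrary illustrative choices of ours, and checking the pattern at a single point is of course only a proxy for the cell-level statement proved above.

```python
import numpy as np

# Small numerical illustration (ours) of the stability in Proposition D.12: the
# ternary activation pattern at a generic point x is unchanged under a
# sufficiently small perturbation of the parameter theta.

rng = np.random.default_rng(1)
arch = (3, 4, 4, 2)
Ws = [rng.standard_normal((arch[l + 1], arch[l])) for l in range(len(arch) - 1)]
bs = [rng.standard_normal(arch[l + 1]) for l in range(len(arch) - 1)]

def ternary_pattern(Ws, bs, x, tol=1e-9):
    """Ternary label (+1 / -1 / 0) of every neuron at input x; 0 means x lies on that bent hyperplane."""
    pattern, h = [], x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ h + b
        pattern.append(np.sign(np.where(np.abs(pre) < tol, 0.0, pre)).astype(int))
        h = np.maximum(pre, 0.0) if l < len(Ws) - 1 else pre   # affine output layer
    return pattern

x = rng.standard_normal(arch[0])                     # a generic point: expect no 0 labels
s_x = ternary_pattern(Ws, bs, x)

eps = 1e-4                                           # small perturbation of the parameter
Ws_p = [W + eps * rng.standard_normal(W.shape) for W in Ws]
bs_p = [b + eps * rng.standard_normal(b.shape) for b in bs]
s_x_perturbed = ternary_pattern(Ws_p, bs_p, x)

print(all(np.array_equal(a, b) for a, b in zip(s_x, s_x_perturbed)))   # True for small enough eps
```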
Recall (Definition B.9 that a generic, supertransversal parameter θ satisfies TPIC if every pair of bent hyperplanes in adjacent layers has non-empty transverse intersection. Corollary D.13. For any architecture, TPIC is an open condition. That is, if θ Ωis a generic, supertransversal parameter satisfying TPIC, then there exists an open neighborhood of θ Ωon which all parameters satisfy TPIC. The proof of Lemma D.11 showed that for a parameter θ and point x in the intersection of two bent hyperplanes, the polynomials for the four associated sign sequences are distinct. However, it did not show that there is a neighborhood N Ωof θ such that all four sign sequences are actually realized on Rn0 near x for all θ N. The next Lemma combines the persistence of cells realizing sign sequences given by Proposition D.12 with Lemma D.11. Lemma D.14. Suppose θ Ωis a generic, supertransversal parameter, let x = pℓ ij be a point in ˆHℓ 1 i ˆHℓ j, and let T ℓ ij be the union of the closures of the four -activation regions adjacent to x = pℓ ij (as in Lemma D.11). If Γx \ Γx, is non-empty, then there is an open neighborhood N of θ Ωand an algebraic, measure 0 set S N for which every θ N \ S satisfies: (i) There are four non-empty -activation regions of Fθ with the same ternary labelings as those in T ℓ ij (ii) Letting T ℓ ij(θ) denote the (non-empty) closure of these four activation regions, LRA is satisfied on T ℓ ij(θ). Proof. Proposition D.12 tells us that there is an open neighborhood of θ for which each θ in the neighborhood satisfies (i). Moreover Lemma 14 of Masden (2022) and Lemmas C.6 and D.11 tell us that away from a closed (algebraic) set of measure 0, the parameters θ in this open neighborhood satisfy LRA on T ℓ ij(θ), as desired. Lemma D.15 (Rolnick & Kording (2020), Thm. 2). Let θ Ωbe a generic, supertransversal parameter satisfying TPIC. For each i, j, ℓ, choose a point pℓ ij ˆH(ℓ 1) i ˆHℓ j. Let T ℓ i,j denote the union of the closures of the four activation regions adjacent to pℓ ij, and let T = S i,j,ℓT ℓ i,j. If Fθ satisfies LRA on T, then θ can be recovered from Fθ up to permutation and positive-rescaling. Proof. The proof of Theorem 2 and associated algorithm in Rolnick & Kording (2020) require only that the LRA is satisfied in the (closures of) the four adjacent -activation regions for one point in each relevant transverse pairwise intersection. Remark D.16. Lemma D.15 tells us that in order for a parameter satisfying TPIC to admit no hidden symmetries, it need not satisfy LRA everywhere but only on the union of the closures of all activation regions near the intersection points (the set T in the statement of Lemma D.15. Accordingly, we will say that a parameter θ Ωsatisfying the assumptions of Lemma D.15 satisfies TPIC and LRA on a neighborhood of the pairwise intersections. E. Main Theorem and Proof Theorem E.1. Let (n0, . . . , nd) be a neural network architecture satisfying (n0 = k) nℓfor all ℓ, and let Ω= RD denote its parameter space. There exists a positive measure subset Y Ωfor which each θ Y satisfies TPIC and LRA on a neighborhood of the pairwise intersections, hence has no hidden symmetries. Proof of Theorem E.1. In the course of the proof we will need the following additional notation. Let Aℓ= {Hℓ 1, . . . , Hnℓ} be a hyperplane arrangement in Rnℓ 1. If the initial coordinate k plane Rnℓ k is transverse to Hℓ i , then the intersection, Hℓ i Rnℓ 1 k , is a hyperplane in Rnℓ 1 k . 
In this case we will use: Kℓ i := Hℓ i Rnℓ 1 k (7) Hidden Symmetries of Re LU Networks to denote this hyperplane in Rnℓ 1 k . If Aℓ c Rnℓ 1 k , then this implies that all of the hyperplanes in Aℓintersect Rnℓ 1 k transversely, and we will use Kℓto denote the corresponding hyperplane arrangement in Rnℓ 1 k . We now proceed to prove the theorem by construction. Our strategy will be to find a particular generic, supertransversal parameter θ Ωsatisfying TPIC and LRA on a neighborhood of the intersections. It will then follow by Lemmas D.14 and D.15 that every parameter θ away from a measure zero set in a neighborhood of θ satisfies TPIC and LRA on a neighborhood of the intersections. The conclusion of the theorem will then follow. Our construction will be by induction on d. We will find it convenient to prove the following strictly stronger (and unavoidably technical) conclusion, since it helps with the inductive construction: There exists a generic, supertransversal choice of parameters whose associated sequence of polyhedral refinements C(F1 = F(1)) . . . C(F = F(d)) contains a nested sequence of unbounded k cells S1 . . . Sd for which the ℓ-th ternary labeling on Sℓis sℓ= (+1, . . . , +1 | {z } k , 1, . . . 1 | {z } nℓ k ) for all ℓ d and which satisfies the additional conditions that (i) Sℓ c ( ˆHℓ+1 i c ˆHℓ+2 j ) = for all i, j when ℓ d 2, and (ii) Rnℓ k c C(Aℓ+1), and the preimage of the bounded subcomplex of C(Kℓ+1) under the map F(ℓ) is contained in the interior of Sℓfor all ℓ d 1. Note that condition (i) above implies TPIC, and the assumption that the ℓth ternary labeling on Sℓhas k +1 s for all ℓis enough to guarantee that for each pairwise intersection x = pℓ ij the set Γx \ Γx, referenced in Lemma D.11 is non-empty, hence LRA will be satisfied in a neighborhood of the intersections. When d = 1, choose A1 = {H1 1, . . . , H1 n1} to be any generic arrangement of hyperplanes. Choose any unbounded k cell of C A1 and call it S1. Note that we can alter the ordering and co-orientations on the hyperplanes of A1 (without affecting the arrangement) to ensure that the first k neurons of F 1 are active on S1 and the remainder are inactive. That is, the ternary labeling on S1 is s1 = (+1, . . . , +1 | {z } k , 1, . . . 1 | {z } n1 k ). The rest of the conditions are vacuously true. Now suppose d = 2. By Corollary C.2, we know that F 1 restricted to the interior of S1 is a homeomorphism onto the interior of the k dimensional polyhedral set F 1(S1) in O 0 k Rn1. Moreover, F 1(S1) is unbounded by Proposition C.1. Lemma C.4 then tells us that there exists a positive-axis sufficiently high-bias hyperplane H Rn1 for which F 1(S1) c H = . Proposition C.3 ensures we can choose sufficiently small perturbations H2 1, . . . , H2 n2 of H so that all of the hyperplanes in A2 are transverse to S1 Rn1, the parameters associated to A1, A2 are generic and supertransversal, and the bounded subcomplex of the restricted hyperplane arrangement K2 = A2 Rn1 k is contained in int(F 1(S1)). Another application of Corollary C.2 allows us to conclude that the preimage of the bounded subcomplex of C(K2) is contained in the interior of S1, and hence the technical inductive conclusion (ii) is satisfied. Since d = 2, the technical inductive conclusion (i) is vacuous, so the d = 2 case is proven. Now assume d > 2. By the inductive assumption, there exists a generic, supertransversal choice of parameters for F 1, . . . , F d 1 for which the technical inductive conditions above are satisfied for all ℓup through d 1. 
Therefore, we need only choose generic parameters for F d that are supertransversal to all previous choices of parameters and for which in the polyhedral refinement C(F(d 1)) of C(F(d 2)) we have identified an unbounded k cell Sd 1 Sd 2 with ternary labeling sd 1 = (+, . . . , + | {z } k , , . . . | {z } nd 1 k ), i. Sd 2 c ( ˆHd 1 i c ˆHd j ) = for all i, j, and ii. Rnd 1 k c C(Ad), and the preimage of the bounded subcomplex of C(Kd) under the map F(d 1) is contained in the interior of Sd 1. Proceed by choosing any unbounded k cell contained in Sd 2 in the polyhedral refinement C(F(d 1)) and call it Sd 1. As before, we are free to choose the ordering and co-orientations on the hyperplanes of Ad 1 without altering ˆ Ad 1 or the polyhedral refinement C(F(d 1)). So we can arrange for Sd 1 to have the desired ternary labeling. Hidden Symmetries of Re LU Networks By Corollary C.2 we know that F(d 1) restricted to the interior of Sd 1 is a homeomorphism onto the interior of an unbounded k dimensional polyhedral set in O 0 k Rnd 1, and we can therefore choose a sufficiently high bias positiveaxis hyperplane H in Rnd 1 so that F(d 1)(Sd 1) c H = . By Proposition C.3 we can find small perturbations Hd 1, . . . , Hd nd such that the preimage of the bounded subcomplex of the restricted generic hyperplane arrangement is contained in F(d 1)(Sd 1), which we may assume is unbounded by Proposition C.1. Moreover, Lemma C.4 tells us that since H Rnd 1 is a sufficiently high bias positive-axis hyperplane, F d 1(Hd 1 i ) H = for all i. Since non-empty transverse intersection is an open condition, we also know that F d 1(Hd 1 i ) c Hd j = for all i, j. An application of Lemma B.6 then tells us that Hd 1 i Hd j = for all i, j. We would now like to conclude that condition (i) is satisfied for ℓ= d 2, but although we know that F(d 1)(Sd 2) c Hd 1 i = and Hd 1 i c Hd j = we don t yet know that Hd 1 i Hd j are contained in the unbounded polyhedral set F(d 2)(Sd 2), so we cannot yet conclude that the three-fold intersections Sd 2 c ( ˆHd 1 i c ˆHd j ) are non-empty. To arrange for this, choose one point of intersection from Hd 1 i Hd j for each i, j. This is a finite, hence bounded, set. Call it B. We now appeal to Lemma C.5, which tells us that there is some vector v Cone(F(d 2)(Sd 2)) that can be added to B so that B + v Cone(F(d 2)(Sd 2)). Since we can achieve this translation by altering just the biases of F d and F d 1, we have now arranged that condition (i) is satisfied. The inductive proof that our construction satisfies TPIC is now complete, and hence a positive measure subset of Ωhas no hidden symmetries, as desired. F. Functional dimension, Fibers, and Symmetries Roughly speaking, the functional dimension of a parameter θ Ωn0,...,nd is the dimension of the space of functions Fθ realizable by infinitesimally perturbing θ. We will now make this more precise. Suppose Z = {z1, . . . , zk} is a finite collection of points in the domain Rn0 of Fθ. For any θ, we may record the result of evaluating the map Fθ at all k points of Z as a single unrolled vector, i.e. EZ(θ) := (Fθ(z1), . . . , Fθ(zk)) Rknd. We can then measure how many degrees of freedom we have to vary this data locally at θ by considering the rank of the total (Jacobian) derivative of the map θ 7 EZ(θ), rank(JEz(θ)). Of course, because Re LU is not differentiable at 0, there is a (Lebesgue) measure 0 set of pairs (θ, x) in Ωn0,...,nd Rn0 at which the total derivative does not exist. 
Consequently, we will wish to restrict our attention to pairs (θ, x) at which all relevant partial derivatives exist.

Definition F.1 (Grigsby et al. (2022b)). A point x ∈ R^{n_0} is parametrically smooth for a parameter θ ∈ Ω_{n_0,...,n_d} if (θ, x) is a smooth point for the parameterized family F : Ω_{n_0,...,n_d} × R^{n_0} → R^{n_d} defined by F(θ, x) = F_θ(x), where F_θ : R^{n_0} → R^{n_d} is the neural network map determined by the parameter θ.

Definition F.2. The functional dimension of a parameter θ ∈ Ω_{n_0,...,n_d} is
dim_fun(θ) := sup { rank(J E_Z(θ)) | Z ⊂ R^{n_0} is a finite set of parametrically smooth points for θ },
where the supremum is taken over all sets Z consisting of finitely many points in R^{n_0}.

In practice, in experiments such as those in Section 6, we ignore the issue of differentiability, assuming that all points in a randomly selected set Z ⊂ R^{n_0} are parametrically smooth for the parameter; the assumption is supported by the fact that F is smooth except on a set of measure 0.

Definition F.3. The fiber (with respect to the realization map) of a parameter θ ∈ Ω is the set {θ̃ ∈ Ω | F_θ̃ = F_θ}.

Recall that elements of (P), the set of permutations, and elements of (S), the scalings, act on Ω. Moreover, these actions commute, so the semigroup generated by (P) and (S) equals {p ∘ s | p ∈ (P), s ∈ (S)}. The following definition formalizes the notion that a hidden symmetry describes two parameters that define the same function but do not differ by elements of (P) and (S).

Definition F.4. A parameter θ ∈ Ω has no hidden symmetries if the semigroup generated by (P) and (S) acts transitively on the fiber of θ. In other words, θ has no hidden symmetries if, given any two parameters θ_1 and θ_2 such that F_{θ_1} = F_{θ_2} = F_θ, there exist p ∈ (P) and s ∈ (S) such that p ∘ s(θ_1) = θ_2.

Definition F.5. i. A pointwise symmetry of θ is a permutation (i.e. a bijection, not necessarily continuous) of the fiber of θ. ii. A local symmetry of θ is a homeomorphism of the fiber of θ equipped with the subspace topology (as a subset of Ω). iii. A global symmetry of Ω is a homeomorphism T : Ω → Ω such that F_{T(θ)} = F_θ for all θ ∈ Ω.

As shown in Grigsby et al. (2022b), there exists a fiber and a permutation of that fiber that cannot be extended to a homeomorphism of an open neighborhood of that fiber. This is because it is possible for a fiber to contain two parameters θ_1 and θ_2 (of the same architecture) such that arbitrarily small neighborhoods of θ_1 give rise to functions that cannot be realized by parameters in arbitrarily small neighborhoods of θ_2. Said otherwise, there exist pointwise symmetries that cannot be extended to local symmetries. Our definition of no hidden symmetries (Definition F.4) is a statement about pointwise symmetries: that every pointwise symmetry of θ is the restriction to the fiber of θ of an element p ∘ s of the group of global symmetries generated by (P) and (S).

The theoretical upper bound on functional dimension (Grigsby et al. (2022b)) comes from taking account of (only) the well-known global symmetries (P) and (S). The intuition is that the set of functions realizable by parameters in a small neighborhood U of θ should be modeled by the quotient of U by the equivalence relation defined by (P) and (S).

Lemma F.6. Let U be an open ball in Ω such that any two points u_1, u_2 ∈ U satisfy F_{u_1} = F_{u_2} if and only if u_2 = p ∘ s(u_1) for some p ∈ (P) and s ∈ (S). Then the functional dimension of any parameter θ ∈ U attains the theoretical upper bound.

Proof. Denote by ∼ the equivalence relation on U defined by the semigroup generated by (P) and (S).
Denote by F|_U the set of functions realizable by parameters in U, equipped with the compact-open topology. The condition that F_{u_1} = F_{u_2} if and only if u_2 = p ∘ s(u_1) for some p ∈ (P) and s ∈ (S) is equivalent to the statement that the quotient space U/∼ is homeomorphic to F|_U. It follows that for any parameter θ ∈ U, dim_fun(θ) is the Euclidean dimension of the quotient space U/∼. The result follows.

F.1. Proof of Proposition 5.1

The following Lemma will be used to prove Proposition 5.1.

Lemma F.7. Let S be a hyperplane in R^n. Let A : R^n → R^1 be an affine-linear map given by A(x) = x · n_H − b_H. Let o ∈ R^n be a vector orthogonal to S (i.e. if s_1, s_2 are two points in S, then (s_2 − s_1) · o = 0). For each t ∈ R, define the affine-linear map A_t : R^n → R by
$$A_t(x) := x \cdot (n_H + t\, o) - b_H - t\,(s_H \cdot o)$$
for any fixed point s_H ∈ S ∩ {x | A(x) = 0}. Then A and A_t coincide on S.

Proof. Since A, A_t are affine-linear maps, it suffices to show that they agree at a point (in particular, at the point s_H) and that A(s_1) − A(s_2) = A_t(s_1) − A_t(s_2) for all s_1, s_2 ∈ S. First, observe that A(s_H) = s_H · n_H − b_H = 0 by definition, and
A_t(s_H) = s_H · (n_H + t o) − b_H − t (s_H · o) = (s_H · n_H − b_H) + t (s_H · o) − t (s_H · o) = 0.
Next, for any points s_1, s_2 ∈ S,
A(s_1) − A(s_2) = (s_1 − s_2) · n_H
and
A_t(s_1) − A_t(s_2) = (s_1 − s_2) · n_H + t (s_1 − s_2) · o = (s_1 − s_2) · n_H,
since o is perpendicular to (s_1 − s_2).

Proof of Proposition 5.1. Fix a nonzero vector o that is orthogonal to the hyperplane S. For t ∈ R, let A_t be the map constructed in Lemma F.7, set f_{H_t} := σ ∘ A_t, and denote the co-oriented hyperplane associated to A_t by H_t. By construction, f_{H_t} and η coincide on S. Denote by H^+_t (resp. H^+) the closed nonnegative half-spaces associated to A_t (resp. A). It suffices to show that there exists ϵ > 0 such that |t| < ϵ implies
H^+_t ∩ Im(k) = H^+ ∩ Im(k).   (8)
(The desired one-parameter family is then the family parametrized by t with |t| < ϵ.)

For convenience, define G_k := F^k ∘ · · · ∘ F^1. Consider the complex M := G_k(C(G_k)) ∩ H^−. (Here C(G_k) denotes the canonical polyhedral complex for G_k; we take the image (in R^{n_k}) of this complex under G_k, which is itself a polyhedral complex, and then intersect it with the closed half-space H^−.) By condition (iva), Im(k) ∩ {x | η(x) = 0} = Im(k) ∩ H ⊆ S. Consequently, each cell of M either contains a cell in H ∩ S as a face, or is a positive distance away from H. Consequently, there exists ϵ_1 > 0 such that |t| < ϵ_1 implies the intersection of any bounded cell of M with H^+_t is contained in S ∩ H or is empty. By condition (ivc), |M| does not contain any unbounded geometric rays parallel to H. Consequently, there exists ϵ_2 such that |t| < ϵ_2 implies the intersection of any unbounded cell of M with H^+_t is empty. Set ϵ := min{ϵ_1, ϵ_2}. Then when |t| < ϵ, H^+_t ∩ Im(k) = H^+ ∩ Im(k).

G. Supplementary Experiments

In this appendix, we consider the effect of varying the number m of sample points when approximating the functional dimension. For a fixed network architecture of depth 4 and width 5, Figure 9 shows the fraction of networks attaining the maximum possible value for dim_fun(θ), as a function of m, where m is shown as a multiple of the maximum possible functional dimension. We observe that the curve is very flat in the region of x = 100 (the value used in Section 6), suggesting that further increasing m would likely not change the results meaningfully.
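For readers who want to reproduce the flavor of these experiments, here is a rough sketch of the estimation procedure suggested by Definition F.2 and Section 6: sample m points, stack the parameter-gradients of the outputs at those points, and report the numerical rank of the resulting matrix. The architecture, initialization, and tolerance below are our own illustrative choices, and D minus the number of hidden neurons is our reading of the upper bound cited from Grigsby et al. (2022b) (one positive-scaling symmetry per hidden neuron).

```python
import numpy as np

# Rough sketch (ours) of the estimation procedure behind Definition F.2 and Section 6:
# sample m points z, stack the parameter-gradients of every output coordinate F_k(z)
# into an (m * n_d) x D matrix, and take its numerical rank as an estimate of
# dimfun(theta).  Architecture, initialization and m are illustrative assumptions.

rng = np.random.default_rng(2)
arch = (5, 5, 5, 1)
Ws = [rng.standard_normal((arch[l + 1], arch[l])) / np.sqrt(arch[l]) for l in range(len(arch) - 1)]
bs = [0.1 * rng.standard_normal(arch[l + 1]) for l in range(len(arch) - 1)]
d = len(Ws)
D = sum(W.size + b.size for W, b in zip(Ws, bs))
upper_bound = D - sum(arch[1:-1])                    # D minus the number of hidden neurons

def param_gradients(z):
    """Rows dF_k(z)/dtheta for k = 1..n_d, with the activation pattern frozen at z
    (non-differentiability is ignored, as in the experiments)."""
    hs, pres, h = [z], [], z
    for l in range(d):
        pre = Ws[l] @ h + bs[l]
        h = np.maximum(pre, 0.0) if l < d - 1 else pre
        pres.append(pre)
        hs.append(h)
    rows = []
    for k in range(arch[-1]):
        delta = np.zeros(arch[-1])
        delta[k] = 1.0
        grads = [None] * d
        for l in reversed(range(d)):
            grads[l] = (np.outer(delta, hs[l]), delta.copy())      # (dF_k/dW, dF_k/db) for layer map l+1
            if l > 0:
                delta = (Ws[l].T @ delta) * (pres[l - 1] > 0)      # backprop through the ReLU of layer l
        rows.append(np.concatenate([np.concatenate([gW.ravel(), gb]) for gW, gb in grads]))
    return rows

m = 100 * upper_bound                                # 100x the upper bound, as in Section 6
J = np.vstack([row for _ in range(m) for row in param_gradients(rng.standard_normal(arch[0]))])
print(np.linalg.matrix_rank(J), upper_bound)         # estimated dimfun(theta) vs. upper bound
```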
Figures 10 and 11 below consider the effect of choosing a much smaller value of m. These figures are analogous to Figures 6 and 7, but with m equal to twice instead of 100 times the maximum possible value for dim_fun(θ). Each figure aggregates results for 20,000 different choices of θ ∈ Ω. We note that the bounds obtained on the functional dimension are, unsurprisingly, somewhat weaker for this very low value of m, but the distributions of approximate functional dimension still show the same patterns as observed in Section 6, indicating the robustness of our conclusions.

Figure 9. This figure shows the fraction of networks attaining the maximum possible value for dim_fun(θ), as a function of the number of sample points m, where m is shown as a multiple of the maximum possible functional dimension, according to the upper bound given in Grigsby et al. (2022b).

Figure 10. For each of various network architectures, we approximate the distribution of functional dimensions as the parameter θ ∈ Ω varies. Different plots show networks of different depths, while different colors in a plot correspond to varying the input dimension and width n_0 = n_1 = · · · = n_{d−1}. Black dots show the fraction of networks with full functional dimension (no hidden symmetries). We observe as in Figure 6 that the proportion of networks with full functional dimension increases with width and decreases with depth. Insets in the figure zoom in on certain multimodal distributions, showing that modes are again spaced apart by approximately the width of the network.

Figure 11. This figure shows similar experiments as in Figure 10, but with input dimension fixed at 5 instead of equal to the width of the hidden layers. We observe as in Figure 7 that the proportion of networks with full functional dimension decreases with depth and decreases slightly with width.
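Finally, as a concrete companion to Definition F.4 above, the short sketch below applies a permutation (P) and a positive scaling (S) to a randomly initialized parameter and verifies numerically that the realized function is unchanged; hidden symmetries are precisely the elements of a fiber not accounted for by such moves. All names and constants here are illustrative choices of ours.

```python
import numpy as np

# A concrete companion to Definition F.4 (ours, for illustration): permuting the
# neurons of a hidden layer (P) and positively rescaling a single neuron (S)
# leaves the realized function F_theta unchanged.

rng = np.random.default_rng(3)
arch = (4, 6, 6, 2)
Ws = [rng.standard_normal((arch[l + 1], arch[l])) for l in range(len(arch) - 1)]
bs = [rng.standard_normal(arch[l + 1]) for l in range(len(arch) - 1)]

def realize(Ws, bs, X):
    """Evaluate F_theta on a batch of inputs X (rows); ReLU at hidden layers only."""
    H = X
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = H @ W.T + b
        H = np.maximum(pre, 0.0) if l < len(Ws) - 1 else pre
    return H

# (P): permute the neurons of the first hidden layer.
perm = rng.permutation(arch[1])
Wp = [W.copy() for W in Ws]
bp = [b.copy() for b in bs]
Wp[0], bp[0], Wp[1] = Ws[0][perm], bs[0][perm], Ws[1][:, perm]

# (S): rescale neuron 0 of the first hidden layer by c > 0.
c = 2.7
Wp[0][0] *= c
bp[0][0] *= c
Wp[1][:, 0] /= c

X = rng.standard_normal((1000, arch[0]))
print(np.allclose(realize(Ws, bs, X), realize(Wp, bp, X)))   # True: same function, different parameter
```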