# Symmetric Single Index Learning

Aaron Zweig
Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
az831@nyu.edu

Joan Bruna
Courant Institute of Mathematical Sciences and Center for Data Science, New York University, New York, NY 10012, USA
bruna@cims.nyu.edu

ABSTRACT

Few neural architectures lend themselves to provable learning with gradient-based methods. One popular model is the single-index model, in which labels are produced by composing an unknown linear projection with a possibly unknown scalar link function. Learning this model with SGD is relatively well understood, whereby the so-called information exponent of the link function governs a polynomial sample complexity rate. However, extending this analysis to deeper or more complicated architectures remains challenging. In this work, we consider single index learning in the setting of symmetric neural networks. Under analytic assumptions on the activation and maximum degree assumptions on the link function, we prove that gradient flow recovers the hidden planted direction, represented as a finitely supported vector in the feature space of power sum polynomials. We characterize a notion of information exponent adapted to our setting that controls the efficiency of learning.

1 INTRODUCTION

Quantifying the advantage of neural networks over simpler learning systems remains a primary question in deep learning theory. Specifically, understanding their ability to discover relevant low-dimensional features out of high-dimensional inputs is a particularly important topic of study. One facet of the challenge is explicitly characterizing the evolution of neural network weights under gradient-based methods, owing to the nonconvexity of the optimization landscape.

The single index setting, long studied in economics and biostatistics (Radchenko, 2015), offers the simplest setting where non-linear feature learning can be characterized explicitly. In this setting, functions of the form $x \mapsto f(\langle x, \theta_* \rangle)$, where $\theta_* \in \mathbb{S}^{d-1}$ represents a hidden direction in high-dimensional space and $f$ a certain non-linear link function, are learned via a student with an identical architecture $x \mapsto f(\langle x, \theta \rangle)$, under certain data distribution assumptions, such as Gaussian data. Gradient flow and gradient descent (Yehudai & Ohad, 2020; Arous et al., 2021; Dudeja & Hsu, 2018) in this setting can be analyzed by reducing the high-dimensional dynamics of $\theta$ to dimension-free dynamics of appropriate summary statistics, given in this case by the scalar correlation $\langle \theta, \theta_* \rangle$.

The efficiency of gradient methods in this setting, measured either in continuous time or in independent samples, is controlled by two main properties. First, the correlation at initialization, which typically scales as $1/\sqrt{d}$ under standard assumptions. Second, the information exponent $s_f$ of $f$ (Arous et al., 2021; Dudeja & Hsu, 2018; Bietti et al., 2022; Damian et al., 2023; 2022; Abbe et al., 2023), which measures the number of effective vanishing moments of the link function, leading to a sample complexity of the form $O(d^{s-1})$ for generic values of $s$.

While this basic setup has been extended along certain directions, e.g. relaxing the structure on the input data distribution (Yehudai & Ohad, 2020; Bruna et al., 2023), considering the multi-index counterpart (Damian et al., 2022; Abbe et al., 2022; 2023; Arnaboldi et al., 2023), or learning the link function with semi-parametric methods
(Bietti et al., 2022; Mahankali et al., 2023), they are all fundamentally associated with fully-connected shallow neural networks. Such an architecture, for all its rich mathematical structure, also comes with important shortcomings. In particular, it is unable to account for predefined symmetries in the target function that the learner wishes to exploit. This requires specialized neural architectures enforcing particular invariances, setting up novel technical challenges to carry out the program outlined above.

In this work, we consider arguably the easiest form of symmetry, given by permutation invariance. The primary architecture for this invariance is Deep Sets (Zaheer et al., 2017), which is necessarily three layers by definition and therefore not a simple extension of the two-layer setting. In order to quantify the notion of symmetric feature learning in this setting, we introduce a symmetric single index target, and analyze the ability of gradient descent over a Deep Sets architecture to recover it. Under appropriate assumptions on the model, initialization and data distribution, we combine the previous analyses with tools from symmetric polynomial theory to characterize the dynamics of this learning problem. Our primary theorem is a proof of efficient learning under gradient flow, with explicit polynomial convergence rates controlled by an analogue of the information exponent adapted to the symmetric setting. Combined with other contemporary works, this result solidifies the remarkable ability of gradient descent to perform feature learning under a variety of high-dimensional learning problems.

2.1 NOTATION

For $z \in \mathbb{C}$, we will use $\bar{z}$ to denote the complex conjugate, with the notation $z_*$ always being reserved to denote a special value of $z$ rather than an operation. For complex matrices $A$ we will use $A^*$ to denote the conjugate transpose. The standard inner product on $\mathbb{C}^N$ is written as $\langle \cdot, \cdot \rangle$, whereas inner products on $L^2(\gamma)$ spaces for some probability measure $\gamma$ will be written as $\langle \cdot, \cdot \rangle_\gamma$. Furthermore, for $h$ a vector and $p(x)$ a vector-valued function, we will use $\langle h, p \rangle_\gamma$ as shorthand for the notation $\langle h, p(\cdot) \rangle_\gamma$.

2.2 REGRESSION SETTING AND TEACHER FUNCTION

We consider a typical regression setting, where given samples $(x, y) \in \mathcal{X} \times \mathbb{C}$ with $y = F(x)$, we seek to learn a function $F_w$ with parameter $w \in \mathbb{C}^M$ by minimizing some expected loss $\mathbb{E}_{x \sim \nu}[\mathcal{L}(F(x), F_w(x))]$. Note that we consider complex-valued inputs and parameters because they greatly simplify the symmetric setting (see Proposition 2.3), hence we will also assume $\mathcal{X} \subseteq \mathbb{C}^N$.

Both $F$ and $F_w$ will be permutation invariant functions, meaning that $F(x_{\pi(1)}, \ldots, x_{\pi(N)}) = F(x_1, \ldots, x_N)$ for any permutation $\pi : \{1, \ldots, N\} \to \{1, \ldots, N\}$. Typically the single index setting assumes that the trained architecture will exactly match the true architecture (e.g. as in Arous et al. (2021)), but below we will see why it is necessary to consider separate architectures. For that reason, we will consider separately defining the teacher $F$ and the student $F_w$.

The first ingredient is the power sum polynomials:

Definition 2.1. For $k \in \mathbb{N}$ and $x \in \mathbb{C}^N$, the normalized power sum polynomial is defined as
$$p_k(x) := \frac{1}{\sqrt{k}} \sum_{n=1}^{N} x_n^k .$$

Let $p(x) = [p_1(x), p_2(x), \ldots]$ be an infinite-dimensional vector of power sums, and consider a fixed vector $h_* \in \mathbb{C}^{\infty}$ of unit norm. Then our teacher function $F$ will be of the form
$$F : \mathcal{X} \to \mathbb{C} \qquad (1)$$
$$x \mapsto F(x) := f(\langle h_*, p(x) \rangle) \qquad (2)$$
for some scalar link function $f : \mathbb{C} \to \mathbb{C}$. $F$ may thus be understood as a single-index function in the feature space of power sum polynomials.
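To make the feature map concrete, the following is a minimal numerical sketch of the normalized power sums and the teacher function above. The normalization $p_k(x) = k^{-1/2}\sum_n x_n^k$ and the conjugation convention inside $\langle h_*, p(x)\rangle$ are our reading of Definition 2.1, and the truncation depth and link coefficients are illustrative choices rather than values from the paper.

```python
import numpy as np
from math import factorial

def powersums(x, K):
    """Normalized power sums p_k(x) = k^{-1/2} * sum_n x_n^k for k = 1..K (assumed normalization)."""
    return np.array([(x ** k).sum() / np.sqrt(k) for k in range(1, K + 1)])

def teacher(x, h_star, alphas):
    """F(x) = f(<h_*, p(x)>) with f(z) = sum_j (alpha_j / sqrt(j!)) z^j, alphas[j-1] = alpha_j."""
    z = np.sum(h_star * np.conj(powersums(x, len(h_star))))   # <h_*, p(x)> under one conjugation convention
    return sum(a / np.sqrt(factorial(j)) * z ** j for j, a in enumerate(alphas, start=1))

# Example: N = 6 points on the unit circle, h_* supported on the first N coordinates.
rng = np.random.default_rng(0)
x = np.exp(2j * np.pi * rng.random(6))
h_star = rng.normal(size=6) + 1j * rng.normal(size=6)
h_star /= np.linalg.norm(h_star)
print(teacher(x, h_star, alphas=np.array([0.0, 0.0, 1.0])))   # link f(z) = z^3 / sqrt(3!)
```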
2.3 DEEP SETS STUDENT FUNCTION

Let us recall the typical structure of a Deep Sets network (Zaheer et al., 2017): for some maps $\Phi : \mathcal{X} \to \mathbb{C}^M$ and $\rho : \mathbb{C}^M \to \mathbb{C}$, the standard Deep Sets architecture is of the form
$$x \mapsto \rho(\Phi_1(x), \ldots, \Phi_M(x)) . \qquad (3)$$
The essential restriction is that $\Phi$ is a permutation invariant mapping, typically of the form $\Phi_m(x) = \sum_{n=1}^{N} \phi_m(x_n)$ for some map $\phi_m : \mathbb{C} \to \mathbb{C}$.

In order to parameterize our student network as a Deep Sets model, we will make the simplest possible choices, while preserving its non-linear essence. To define our student network, we consider the symmetric embedding $\Phi$ as a one-layer neural network with no bias terms:
$$\Phi_m(x) = \sum_{n=1}^{N} \sigma(a_m x_n) , \qquad (4)$$
for i.i.d. complex weights sampled uniformly from the complex circle, $a_m \sim \mathbb{S}^1$, and some activation $\sigma : \mathbb{C} \to \mathbb{C}$. And given some link function $g : \mathbb{C} \to \mathbb{C}$, we consider the mapping $\rho$ as
$$\rho_w(\cdot) = g(\langle w, \cdot \rangle) , \qquad (5)$$
where $w \in \mathbb{C}^M$ are our trainable weights. Putting it all together, our student network thus becomes
$$F_w : \mathcal{X} \to \mathbb{C}, \qquad x \mapsto F_w(x) := g(\langle w, \Phi(x) \rangle) . \qquad (6)$$
Explicitly, this corresponds to a Deep Sets network with only one trainable vector $w$, while the first layer weights $\{a_m\}_{m=1}^M$ and the third layer weights that parameterize the function $g$ are frozen.

The first fact we need is that, through simple algebra, the student may be rewritten in the form of a single-index model.

Proposition 2.2. There is a matrix $A \in \mathbb{C}^{\infty \times M}$, depending only on the activation $\sigma$ and the frozen weights $\{a_m\}_{m=1}^M$, such that
$$g(\langle w, \Phi(x) \rangle) = g(\langle Aw, p(x) \rangle) . \qquad (7)$$

2.4 HERMITE-LIKE IDENTITY

In the vanilla single index setting, the key to giving an explicit expression for the expected loss (for Gaussian inputs) is a well-known identity of Hermite polynomials (O'Donnell, 2021; Jacobsen, 1996). If $h_k$ denotes the Hermite polynomial of degree $k$, this identity takes the form
$$\langle h_k(\langle \cdot, u \rangle), h_l(\langle \cdot, v \rangle) \rangle_{\gamma_n} = \delta_{kl}\, k!\, \langle u, v \rangle^k , \qquad (8)$$
where $u, v \in \mathbb{R}^n$ and $\gamma_n$ is the standard Gaussian distribution in $n$ dimensions. In our setting, as it turns out, one can establish an analogous identity, by considering a different input probability measure and a bound on the degree of the link function.

We will choose our input domain $\mathcal{X} = (\mathbb{S}^1)^N$, and the input distribution we consider is that of the eigenvalues of a Haar-distributed unitary matrix in dimension $N$ (Diaconis & Shahshahani, 1994), or equivalently the squared Vandermonde density over $N$ copies of the complex unit circle (Macdonald, 1998). We will interchangeably use the notation $\mathbb{E}_{x \sim V}[f(x)\overline{g(x)}] = \langle f, g \rangle_V$.

Proposition 2.3. Consider $h, h' \in \mathbb{C}^{\infty}$ with bounded $L^2$ norm. For exponents $k, l$ with $k \le N$, if $h$ is only supported on the first $N$ elements, then
$$\langle \langle h, p \rangle^k, \langle h', p \rangle^l \rangle_V = \delta_{kl}\, k!\, \langle h, h' \rangle^k . \qquad (9)$$

The crucial feature of this identity is that the assumptions on support and bounded degree only apply to $\langle h, p \rangle^k$, with no restrictions on the other term. In our learning problem, we can use this property to make these assumptions on the teacher function, while requiring no bounds on the terms of the student Deep Sets architecture. In order to take advantage of the assumptions on the support of $h$ and the degree in the above proposition, we make the following assumptions on our teacher link function $f$ and our true direction $h_*$:

Assumption 2.4. The link function $f$ is analytic and only supported on the first $N$ degree monomials, i.e.
$$f(z) = \sum_{j \le N} \frac{\alpha_j}{\sqrt{j!}}\, z^j . \qquad (10)$$
Furthermore, the vector $h_*$ is only supported on the first $N$ elements.
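As a sanity check on Proposition 2.3, the identity can be probed by Monte Carlo over eigenvalues of Haar-distributed unitary matrices, which are available via `scipy.stats.unitary_group`. This is a sketch under the same normalization and conjugation conventions assumed above; the two printed quantities should agree up to Monte Carlo error.

```python
import numpy as np
from math import factorial
from scipy.stats import unitary_group

rng = np.random.default_rng(1)
N, K = 8, 3                    # set size N and feature truncation K, kept small so all degrees stay <= N

def powersums(x, K):
    return np.array([(x ** k).sum() / np.sqrt(k) for k in range(1, K + 1)])

h1 = rng.normal(size=K) + 1j * rng.normal(size=K); h1 /= np.linalg.norm(h1)
h2 = rng.normal(size=K) + 1j * rng.normal(size=K); h2 /= np.linalg.norm(h2)

k = l = 2                      # exponents with k <= N
acc, trials = 0.0 + 0.0j, 20_000
for _ in range(trials):
    x = np.linalg.eigvals(unitary_group.rvs(N))    # eigenvalues of a Haar unitary: the squared-Vandermonde law
    p = powersums(x, K)
    z1 = np.sum(h1 * np.conj(p))                   # <h1, p(x)> under one conjugation convention
    z2 = np.sum(h2 * np.conj(p))
    acc += z1 ** k * np.conj(z2) ** l

print("Monte Carlo estimate   :", acc / trials)
print("Prediction k!<h1,h2>^k :", factorial(k) * np.sum(h1 * np.conj(h2)) ** k)
```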
Although this assumption is required to apply the orthogonality property for our loss function in the following sections, we note that in principle, including exponentially small terms of higher degree in $f$ or higher index in $h_*$ should have a negligible effect. Moreover, one should interpret this assumption as silently disappearing in the high-dimensional regime $N \to \infty$. For simplicity, we keep this assumption to make cleaner calculations and leave the issue of these small perturbations to future work.

2.5 INFORMATION EXPONENT

Because Proposition 2.3 takes inner products of monomials, it suggests a very simple characterization of the information exponent. Namely:

Definition 2.5. Consider an analytic function $f : \mathbb{C} \to \mathbb{C}$ that can be written in the form
$$f(z) = \sum_{j=0}^{\infty} \frac{\alpha_j}{\sqrt{j!}}\, z^j . \qquad (11)$$
Then the information exponent is defined as $s = \inf\{j \ge 1 : \alpha_j \neq 0\}$.

Similar to the Gaussian case (Arous et al., 2021; Bietti et al., 2022), the information exponent $s$ will control the efficiency of learning. Assuming $|\alpha_s|$ is some non-negligible constant, the value of $s$ will be far more important in governing the convergence rate.

2.6 CHOOSING A LEARNABLE LOSS

There are two subtleties in choosing an appropriate loss function: the necessity of a correlational loss (with regularization), and the necessity of choosing the student and teacher link functions to be distinct.

At first glance, it is tempting to simply define a loss of the form
$$L(w) = \mathbb{E}_{x \sim V}\big[\,|F(x) - F_w(x)|^2\,\big] = \mathbb{E}_{x \sim V}\big[\,|f(\langle h_*, p(x) \rangle) - f(\langle Aw, p(x) \rangle)|^2\,\big] . \qquad (12)$$
However, the Deep Sets student model is not degree limited, that is, the support of $Aw$ is not restricted to the first $N$ terms of the power sum expansion. In other words, expanding this loss will require calculating the term $\|f(\langle Aw, p \rangle)\|_V^2$, which will contain high degree terms that cannot be controlled with Proposition 2.3. One could avoid this issue by choosing the activation such that $Aw$ only contains low-index terms, but we want to consider larger classes of activations and enforce fewer restrictions.

One can instead consider a correlational loss. In this case, in order to make the objective have a bounded global minimum, it is necessary either to regularize $w$, or to project at every step of SGD, which is the strategy taken in Damian et al. (2023). In our setting, this projection would correspond to projecting $w$ to the ellipsoid surface $\|Aw\| = 1$. This projection would require solving an optimization problem at every timestep (Pope, 2008). To avoid this impracticality, we instead consider regularization. Then, with complete knowledge of the link function $f$, specifically its monomial coefficients, we can define the correlational loss
$$\hat{L}(w) = -\mathbb{E}_{x \sim V}\Big[\operatorname{Re}\big\{f(\langle h_*, p(x) \rangle)\,\overline{f(\langle Aw, p(x) \rangle)}\big\}\Big] + \sum_{j \le N} \frac{|\alpha_j|^2}{2}\, \|Aw\|^{2j} . \qquad (13)$$
This loss enjoys benign optimization properties, as shown by the following proposition:

Proposition 2.6. If there exist coprime $k, l$ with $\alpha_k, \alpha_l \neq 0$, and $h_*$ is in the range of $A$, then $\hat{L}$ exclusively has global minima at all $w$ such that $Aw = h_*$.

However, unlike the real case, complex weights cause issues for learning this objective. Namely, this objective can be written as a non-convex polynomial in $\cos\theta$, where $\theta$ is the angle of $\langle Aw, h_* \rangle$ in polar coordinates. Therefore, we consider a different choice of student link function that will enable a simpler analysis of the dynamics. For the choice of $g(z) = \frac{\alpha_s}{|\alpha_s|} \frac{1}{\sqrt{s!}}\, z^s$, we instead consider the loss
$$L(w) = -\mathbb{E}_{x \sim V}\Big[\operatorname{Re}\big\{f(\langle h_*, p(x) \rangle)\,\overline{g(\langle Aw, p(x) \rangle)}\big\}\Big] + \frac{|\alpha_s|}{2}\, \|Aw\|^{2s} \qquad (14)$$
$$= -|\alpha_s| \operatorname{Re}\big\{\langle Aw, h_* \rangle^s\big\} + \frac{|\alpha_s|}{2}\, \|Aw\|^{2s} . \qquad (15)$$
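Under the reconstruction of Eq. (14)–(15) above, the population loss depends on $w$ only through the summary statistics $r = |\langle Aw, h_*\rangle|$, $\theta = \arg\langle Aw, h_*\rangle$ and $v = \|P_{h_*}^{\perp} Aw\|^2$, since $\|Aw\|^2 = r^2 + v$. The sketch below evaluates this closed form; the function name and test values are ours.

```python
import numpy as np

def correlational_loss(r, theta, v, s, alpha_s_abs):
    """L(w) = -|alpha_s| Re{<Aw, h_*>^s} + (|alpha_s|/2) ||Aw||^{2s}, written through
    r = |<Aw, h_*>|, theta = arg<Aw, h_*>, v = ||P_perp Aw||^2, with ||Aw||^2 = r^2 + v."""
    corr = -alpha_s_abs * (r ** s) * np.cos(s * theta)
    reg = 0.5 * alpha_s_abs * (r ** 2 + v) ** s
    return corr + reg

# Minimized at r = 1, v = 0, cos(s*theta) = 1, i.e. Aw = h_* up to an s-th root of unity.
print(correlational_loss(r=1.0, theta=0.0, v=0.0, s=3, alpha_s_abs=1.0))          # -0.5
print(correlational_loss(r=0.1, theta=np.pi / 6, v=0.99, s=3, alpha_s_abs=1.0))   # close to +0.5
```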
We note that Dudeja & Hsu (2018) used a similar trick of a correlational loss containing a single orthogonal polynomial in order to simplify the learning landscape. The global minima of this loss, and in fact the dynamics of gradient flow on it, will be explored in the sequel.

3 RELATED WORK

3.1 SINGLE INDEX LEARNING

The conditions under which single-index model learning is possible have been well explored in previous literature. The main assumptions that enable provable learning under gradient flow / gradient descent are monotonicity of the link function (Kakade et al., 2011; Kalai & Sastry, 2009; Shalev-Shwartz et al., 2010; Yehudai & Ohad, 2020) or a Gaussian input distribution (Arous et al., 2021). The former assumption essentially corresponds to the setting where the information exponent $s = 1$, as the link function will have positive correlation with a linear term. Under the latter assumption, the optimal sample complexity was achieved in Damian et al. (2023), with a study of learning when the link function is not known in Bietti et al. (2022). When both assumptions are broken, conditions on the input distribution such as rotation invariance or approximate Gaussianity are nevertheless sufficient for learning guarantees (Bruna et al., 2023). But more unusual distributions, especially in the complex domain that is most convenient for symmetric networks, are not well studied.

3.2 SYMMETRIC NEURAL NETWORKS

The primary model for symmetric neural networks was introduced in Zaheer et al. (2017) as the Deep Sets model. There are many similar models that enforce permutation invariance (Qi et al., 2017; Santoro et al., 2017; Lee et al., 2019), though we focus on Deep Sets because of its relationship with the power sum polynomials and orthogonality (Zweig & Bruna, 2022). We are not aware of any other works that demonstrate provable learning of symmetric functions under gradient-based methods.

4 PROVABLY EFFICIENT RECOVERY WITH GRADIENT FLOW

4.1 DEFINING THE DYNAMICS

The gradient methods considered in Arous et al. (2021); Ben Arous et al. (2022) are analyzed by reducing to a dimension-free dynamical system of so-called summary statistics. For instance, in the vanilla single-index model, the summary statistics reduce to the scalar correlation between the learned weight and the true weight. In our case, we have three variables, owing to the fact that the correlation is complex and represented by two scalars, with a third variable controlling the norm of the weight since we are not using projection.

Note that although our weight vector $w$ is complex, we still apply regular gradient flow to the pair of weight vectors $w_R, w_C$ where $w = w_R + i w_C$. Furthermore, we use the notation $\nabla := \nabla_w = \nabla_{w_R} + i \nabla_{w_C}$. With that in mind, we can summarize the dynamics of our gradient flow in the following theorem.

Theorem 4.1. Given a parameter $w$, consider the summary statistics $m = \langle Aw, h_* \rangle \in \mathbb{C}$ and $v = \|P_{h_*}^{\perp} Aw\|^2$, where $P_{h_*}^{\perp}$ is the projection onto the orthogonal complement of $h_*$. Let the polar decomposition of $m$ be $r e^{i\theta}$. Then, given the preconditioned gradient flow
$$\dot{w} = -\frac{1}{s|\alpha_s|}\,(A^*A)^{-1} \nabla L(w) , \qquad (16)$$
the summary statistics obey the following system of ordinary differential equations:
$$\dot{r} = (1-\delta)\, r^{s-1} \cos s\theta - (v + r^2)^{s-1} r , \qquad (17)$$
$$\frac{d}{dt} \cos s\theta = (1-\delta)\, s\, r^{s-2} (1 - \cos^2 s\theta) , \qquad (18)$$
$$\dot{v} = 2\delta\, r^{s} \cos s\theta - 2 (v + r^2)^{s-1} v , \qquad (19)$$
where $\delta := 1 - \|P_A h_*\|^2$ and $P_A$ is the projection onto the range of $A$.

The proof is in Appendix D.
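The system (17)–(19) is low-dimensional enough to integrate directly. Below is a small sketch using `scipy.integrate.solve_ivp` with illustrative values of $s$, $\delta$ and the initial statistics (not taken from the paper); it exhibits the qualitative behavior discussed later: a brief dip in $r$, followed by convergence $r \to 1$, $\cos s\theta \to 1$, $v \to O(\delta)$.

```python
import numpy as np
from scipy.integrate import solve_ivp

s, delta = 3, 1e-4               # information exponent and approximation error (illustrative values)

def flow(t, y):
    r, c, v = y                  # c stands for cos(s * theta)
    dr = (1 - delta) * r ** (s - 1) * c - (v + r ** 2) ** (s - 1) * r
    dc = (1 - delta) * s * r ** (s - 2) * (1 - c ** 2)
    dv = 2 * delta * r ** s * c - 2 * (v + r ** 2) ** (s - 1) * v
    return [dr, dc, dv]

r0 = 0.05                        # small initial correlation
y0 = [r0, 0.6, 1 - r0 ** 2]      # cos(s*theta_0) >= 1/2 and ||Aw_0|| = 1, matching Theorem 4.2 below
sol = solve_ivp(flow, t_span=(0.0, 2000.0), y0=y0, rtol=1e-8, atol=1e-10)
r_T, c_T, v_T = sol.y[:, -1]
print(f"r_T = {r_T:.5f}, cos(s*theta)_T = {c_T:.5f}, v_T = {v_T:.2e}")   # expect r ~ 1, v ~ O(delta)
```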
The main technical details come from using Wirtinger calculus to determine how the real and imaginary parts of $w$ evolve under the flow. Additionally, the correct preconditioner (intuitive from the linear transform of $w$) is crucial for reducing the dynamics to only three summary statistics, and converting to dynamics on $\cos s\theta$ rather than $\theta$ itself simplifies the description of the learning in the next section dramatically.

4.2 PROVABLE LEARNING

These dynamics naturally motivate the question of learning efficiency, measured in convergence rates in time in the case of gradient flow. Our main result is that, under some assumptions on the initialization of the frozen weights $\{a_m\}_{m=1}^M$ and the initialized weight vector $w_0$, the efficiency is controlled by the initial correlation with the true direction and the information exponent, just as in the Gaussian case.

Theorem 4.2. Consider a fixed $\epsilon > 0$. Suppose the initialization of $w_0$ and $(a_m)_{m=1}^M$ is such that:
(i) Small correlation and anti-concentration at initialization: $0 < r_0 \le 1$,
(ii) Initial phase condition: $\cos s\theta_0 \ge 1/2$,
(iii) Initial magnitude condition for $Aw$: $\|Aw_0\| = 1$, or equivalently $v_0 = 1 - r_0^2$,
(iv) Small approximation of optimal error: $\delta \le \min(\epsilon/2,\ O(s^{-s} r_0^4))$.
Then, if we run the gradient flow given in Theorem 4.1, we have $\epsilon$ accuracy in the sense that
$$r_T \ge 1 - \epsilon , \qquad \cos s\theta_T \ge 1 - \epsilon , \qquad v_T \le \epsilon \qquad (20)$$
after time $T$, where, depending on the information exponent $s$,
$$T = O\!\left(2^{s^2} r_0^{-4s} + \log\frac{1}{\epsilon}\right) \quad \text{for } s > 1 . \qquad (21)$$

Remark 4.3. We note that we only recover $\cos s\theta \approx 1$, rather than a guarantee that $\theta \approx 0$, and so the hidden direction is only determined up to scaling by an $s$th root of unity. This limitation may appear to be an issue with the choice of the student link function $g$, but it is unavoidable: if the teacher link function is $f(z) = \frac{1}{\sqrt{s!}} z^s$, one can calculate that for any choice of $g$, $L(w)$ is invariant to scaling $w$ by an $s$th root of unity.

4.3 INITIALIZATION GUARANTEES

In order to apply the gradient flow bound proved in Theorem 4.2, it only remains to understand when the assumptions on initialization are met. Unlike the single-index setting with Gaussian inputs, the initial correlation is not guaranteed to be on the scale of $1/\sqrt{N}$, but will depend on the activation function and the random weights in the first layer. Let us introduce the assumptions we will need:

Assumption 4.4. We assume an analytic activation $\sigma(z) = \sum_{k=0}^{\infty} c_k z^k$, with the notation $\sigma_+ := \max_{1 \le k \le N} |c_k|\sqrt{k}$ and $\sigma_- := \min_{1 \le k \le N} |c_k|\sqrt{k}$. We further assume:
(i) $c_k = 0$ iff $k = 0$,
(ii) $\sigma$ is analytic on the unit disk,
(iii) $1/\sigma_- = O(\mathrm{poly}(N))$,
(iv) $\sum_{k=N+1}^{\infty} k |c_k|^2 \le e^{-\Omega(\sqrt{N})}$.

The first two conditions are simply required for the application of Proposition 2.3, as the power sum vector $p$ is built out of polynomials induced by the activation and does not include a constant term. The second two conditions concern the decay of the coefficients of $\sigma$, in the sense that the decay must start slow but eventually become very rapid. These conditions are necessary mainly for ensuring the small approximation of optimal error condition:

Lemma 4.5. Let $\sigma$ satisfy Assumption 4.4, and assume $M = O(N^3)$. Consider any unit norm $h_*$ that is only supported on the first $N$ elements. If we sample the weights i.i.d. uniformly on the complex unit circle, $a_m \sim \mathbb{S}^1$, then with probability $1 - 2\exp(-\Omega(N))$,
$$1 - \|P_A h_*\|^2 \le e^{-\Omega(\sqrt{N})} .$$
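With Lemma 4.5 supplying the bound on $\delta$, the four initialization conditions of Theorem 4.2 (as reconstructed above) can be gathered into a small helper that also reports the corresponding $s > 1$ time bound of Eq. (21). The absolute constant `C` is a placeholder, since the statement only specifies the bound up to constants.

```python
import numpy as np

def theorem_4_2_bound(r0, cos_s_theta0, v0, delta, s, eps, C=1.0):
    """Check the initialization conditions (i)-(iv) of Theorem 4.2 as reconstructed above,
    and return the s > 1 time bound of Eq. (21); C stands in for the unspecified constants."""
    conditions = [
        0 < r0 <= 1,                                   # (i) small but non-zero correlation
        cos_s_theta0 >= 0.5,                           # (ii) initial phase condition
        np.isclose(v0, 1 - r0 ** 2),                   # (iii) ||Aw_0|| = 1
        delta <= min(eps / 2, C * s ** (-s) * r0 ** 4) # (iv) small approximation error
    ]
    if not all(conditions):
        return None
    return C * (2 ** (s ** 2) * r0 ** (-4 * s) + np.log(1 / eps))

print(theorem_4_2_bound(r0=0.05, cos_s_theta0=0.6, v0=1 - 0.05 ** 2,
                        delta=1e-9, s=3, eps=1e-2))    # large: the r0^{-4s} factor dominates
```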
Lastly, we can choose an initialization scheme for $w$ which handily ensures the remaining assumptions we need to apply Theorem 4.2. The crucial features of $\sigma$ are similar to the previous result: namely, we want the initial correlation $r_0$ to be non-negligible, because this directly controls the runtime of gradient flow. Slow initial decay with fast late decay of the coefficients of $\sigma$ directly implies that $Aw_0$ has a lot of mass in the first $N$ indices and very little mass past the first $N$ indices. These requirements rule out, say, $\exp$ as an analytic activation, because its coefficients decay too rapidly.

Lemma 4.6. Suppose $w$ is sampled from a standard complex Gaussian on $M$ variables. It follows that if we set $w_0 = \frac{w}{\|Aw\|}$, and use the summary statistics from Theorem 4.1, then with probability $1/3 - 2\exp(-\Omega(N))$, for any $h_*$ as in Lemma 4.5,
(i) $1 \ge r_0 \ge \frac{c\, \sigma_-}{\sigma_+ \sqrt{M}}$ for some universal constant $c > 0$,
(ii) $\cos s\theta_0 \ge 1/2$,
(iii) $v_0 = 1 - r_0^2$.

Finally, we consider a straightforward choice of $\sigma$ that meets Assumption 4.4, so that we can arrive at an explicit complexity bound on learning:

Corollary 4.7 (Non-asymptotic Rates for Gradient Flow). Consider $\xi = 1 - \frac{1}{\sqrt{N}}$ and the specific choice of activation $\sigma(z) = \arctan(\xi z) + \xi z \arctan(\xi z)$. Suppose we initialize $w$ from a standard complex Gaussian in dimension $M$ with $M = O(N^3)$, and $\{a_m\}_{m=1}^M \sim \mathbb{S}^1$ i.i.d. Furthermore, treat $s$ and $\epsilon$ as constants relative to $N$. Then with probability $1/3 - 2\exp(-\Omega(N))$, we will recover $\epsilon$ accuracy in time
$$T = O\!\left(2^{s^2} N^{7s} + \log\frac{1}{\epsilon}\right) \quad \text{for } s > 1 . \qquad (22)$$

Proof. By Proposition H.5, the activation $\sigma$ given in the corollary statement satisfies Assumption 4.4, so we can apply Lemma 4.6 and Lemma 4.5 to satisfy the requirements of Theorem 4.2. In particular, the fourth condition is given by assuming $e^{-\Omega(\sqrt{N})} \le \min(\epsilon/2, O(s^{-s} r_0^4))$, which is true when $s$ is constant, and $\epsilon$ and $r_0$ are at most polynomial compared to $N$. Note that $\sigma_+ = O(1)$ and $\sigma_- = \Omega(N^{-1/4})$, so it follows that $r_0 = \Omega(N^{-7/4})$ with probability $1/3 - 2\exp(-\Omega(N))$. Conditioning on this bound gives the desired bound on the time for $\epsilon$ accuracy.

Hence, we have a rate that, for $s = O(1)$, is not cursed by dimensionality to recover the true hidden direction $h_*$. As mentioned above, there are two caveats to this recovery: $w$ is only recovered up to an $s$th root of unity, and to directly make predictions of the teacher model would require using the teacher link function rather than using the student model directly.

Since this result concerns gradient flow over the population loss, a natural question is what barriers exist that stymie the SGD analysis of recent single index papers (Arous et al., 2021; Damian et al., 2023; Bruna et al., 2023). These works treat the convergence of SGD by a standard drift and martingale argument, where the drift follows the population gradient flow, and the martingales are shown to be controlled via standard concentration inequalities and careful arguments around stopping times. Applying these tactics to a discretized version of the dynamics given in Theorem 4.1 mainly runs into an issue during the first phase of training. Unlike in Arous et al. (2021), where the drift dynamics have the correlation monotonically increasing towards 1, at the start of our dynamics the correlation magnitude $r$ and the orthogonal part of the learned parameter $v$ are both decreasing (with high probability over the initialization). Showing that this behavior does not draw the model towards the saddle point where $r = 0$ requires showing that $v$ decreases meaningfully faster than $r$, i.e. showing that $\frac{d}{dt}\frac{r}{v}$ is positive. It is not clear what quality of bounds the martingale concentration inequalities would provide for this quantity, and we leave for future work whether the six-stage proof of the dynamics' behavior could be successfully discretized.
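Before moving to the experiments, here is a toy illustration of the initialization scheme of Lemma 4.6. The entrywise form of $A$ used below (proportional to $c_k \sqrt{k}\, a_m^k$, with stand-in coefficients $c_k = \xi^k / k$) is our assumption about the construction behind Proposition 2.2, not a formula from the paper, and the rows are truncated for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, s = 200, 40, 10, 3

# Toy stand-in for the matrix A of Proposition 2.2: entries assumed of the form c_k sqrt(k) a_m^k.
xi = 1 - 1 / np.sqrt(N)
ks = np.arange(1, K + 1)
c = xi ** ks / ks
a = np.exp(2j * np.pi * rng.random(M))                     # frozen first-layer weights on the unit circle
A = (c * np.sqrt(ks))[:, None] * a[None, :] ** ks[:, None]

h = np.zeros(K, dtype=complex)
h[:N] = rng.normal(size=N) + 1j * rng.normal(size=N)       # h_* supported on the first N coordinates
h /= np.linalg.norm(h)

# Initialization scheme of Lemma 4.6: complex Gaussian w, rescaled so that ||A w_0|| = 1.
w = rng.normal(size=M) + 1j * rng.normal(size=M)
w0 = w / np.linalg.norm(A @ w)

m = np.sum((A @ w0) * np.conj(h))                          # m = <A w_0, h_*> under one conjugation convention
r0, theta0 = np.abs(m), np.angle(m)
v0 = np.linalg.norm(A @ w0) ** 2 - r0 ** 2
print(f"r0 = {r0:.3e}, cos(s*theta0) = {np.cos(s * theta0):+.2f}, v0 = {v0:.4f}")
```

Note that condition (ii) of Lemma 4.6, $\cos s\theta_0 \ge 1/2$, only holds with constant probability, so a given seed may well report a negative value and require resampling.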
5 EXPERIMENTS

To study our setting experimentally, we consider the student-teacher setup outlined above with gradient descent. We take $N = 25$, $M = 100$, and approximate the matrix $A$ by capping its infinite number of rows at 150, which was sufficient for $1 - \|P_A h_*\|^2 \le 0.001$ in numerical experiments. For the link function $f$, we choose its only non-zero monomial coefficients to be $\alpha_3 = \alpha_4 = \alpha_5 = \frac{1}{\sqrt{3}}$. Correspondingly, $g$ simply has $\alpha_3 = 1$ and all other coefficients equal to zero.

We choose for convenience an activation function such that $A_{km} = \left(\frac{N-1}{N}\right)^{k/2} a_m^k$. We make this choice because, while obeying all the requirements of Assumption 4.4, it implies that the action of $A$ on the elementary basis vectors $e_j$ for $1 \le j \le N$ is approximately distributed the same. This means that $P_A h_*$ is less dependent on the choice of $h_*$, and therefore reduces the variance in our experiments when we choose $h_*$ uniformly among unit norm vectors with support on the first $N$ elements, i.e. uniformly from the complex unit sphere in dimension $N$.

Under this setup, we train full gradient descent on 50000 samples from the Vandermonde distribution $V$ for 20000 iterations. The only parameter to be tuned is the learning rate, and we observe over the small grid $[0.001, 0.0025, 0.005]$ that a learning rate of 0.0025 performs best for both models in terms of the probability of $r$ reaching approximately 1, i.e. strong recovery.

As described in Theorem 4.1, we use preconditioned gradient descent with $(A^*A)^{-1}$ as the preconditioner, which can be calculated once at the beginning of the algorithm and is an easy alteration to vanilla gradient descent to implement. We use the pseudoinverse for improved stability in calculating this matrix, although we note that this preconditioner does not introduce stability issues into the updates of our summary statistics, even in the case of gradient descent. Indeed, even if one considers the loss $L(w)$ under an empirical expectation rather than the full expectation, the gradient $\nabla L(w)$ can still be seen to be written in the form $A^* v$ for some vector $v$. If one preconditions this gradient by $(A^*A)^{-1}$, and observes that the summary statistics $m$ and $v$ both depend on $Aw$ rather than $w$ directly, it follows that the gradient update on these statistics always factors through $A(A^*A)^{-1}A^* = P_A$, so even in the empirical case this preconditioner does not introduce exploding gradients.

Figure 1: The learning trajectory, over ten independent runs, of the three summary statistics in the case of our chosen loss function $L$, and the trajectory of the $r$ statistic for the more complicated loss function $\hat{L}$.

6 DISCUSSION

6.1 EXPERIMENTAL RESULTS

The outcomes of our experiments are given in Figure 1. We observe very high rates of strong recovery using the loss $L$. For the loss $\hat{L}$, we note that $r$ often becomes stuck, indicating the model has reached a local minimum.

We note that our analysis is somewhat pessimistic, as the experimental gradient descent on $L(w)$ will often achieve near perfect accuracy even if $\cos s\theta_0 < 0$. This is mainly an issue of proof technique: although $\cos s\theta$ is always increasing under the dynamics, $r$ is necessarily decreasing for as long as $\cos s\theta$ is negative.
It is quite subtle to control whether $\cos s\theta$ will become positive before $r$ becomes extremely small, and the initialization of $r$ is the main feature that controls the runtime of the model. The empirical results suggest, however, that a chance of success greater than $1/2$ is possible under a more delicate analysis. On the other hand, the analysis given in the proof of Theorem 4.2 does accurately capture the brief dip in the value of $r$ in the initial part of training, when the regularization contributes more to the gradient than the correlation, until $\cos s\theta$ becomes positive.

Because we can only run experiments with gradient descent rather than gradient flow, we observe the phenomenon of search vs. descent studied in Arous et al. (2021), where the increase in the correlation term $r$ is very slow at first and then abrupt. For the model trained with $\hat{L}$, we observe a much greater likelihood of failure in the recovery, as $r$ appears to become stuck below the optimal value of 1.

6.2 EXTENSIONS

The success of this method of analysis depends heavily on the Hermite-like identity in Proposition 2.3. In general, many of the existing results analyzing single index models need to assume either Gaussian inputs, or uniformly distributed inputs on the Boolean hypercube (see for example Abbe et al. (2023)). In some sense, this work cements the inclusion of the Vandermonde distribution in the set of measures that enable clean analysis. The proof techniques for these three measures are quite disparate, so it remains open to determine whether there is a wider class of nice distributions where gradient dynamics can be successfully analyzed.

Additionally, the success of the multi-layer training in Bietti et al. (2022); Mahankali et al. (2023) suggests that simultaneously training the frozen first layer weights may not prohibit the convergence analysis. The matrix $A$ depends on the first layer weights through a Vandermonde matrix (see $X$ in the proof of Lemma 4.5), and the simple characterization of the derivative of a Vandermonde matrix alludes to further possibilities for clean analysis.

6.3 LIMITATIONS

A first limitation is the focus of this work on complex inputs, analytic activations, and a fixed input distribution (namely the squared Vandermonde density). Although complex analytic functions are less commonly studied in the literature, they do still appear in settings like quantum chemistry (Beau & del Campo, 2021; Langmann, 2005). Regarding the focus on the Vandermonde distribution, we note this is similar to the restriction to Gaussian inputs in the vanilla single-index setting, under which the theory is particularly powerful and simple, while the understanding of non-Gaussian data is still nascent.

A second limitation is that this work focuses on input distributions over sets of scalars, whereas typically symmetric neural networks are applied to sets of high-dimensional vectors. Proposition 2.3 does not work out of the box for these settings without a high-dimensional analogue of the inner product $\langle \cdot, \cdot \rangle_V$ with similar orthogonality properties. It is possible to define such inner products on the so-called multisymmetric power sums with similar orthogonality (Zweig & Bruna, 2022), and we leave to future work the question of whether such inner products could grant similar guarantees about the learning dynamics in this more realistic setting.
7 CONCLUSION

In this work we have shown a first positive result that quantifies the ability of gradient descent to perform symmetric feature learning, by adapting and extending the tools of two-layer single index models. In essence, this is made possible by a "miracle", namely the fact that certain power sum expansions under the Vandermonde measure enjoy the same semigroup structure as Hermite polynomials under the Gaussian measure (Proposition 2.3), leading to a dimension-free summary statistic representation of the loss. Although the resulting dynamics are more intricate than in the Euclidean setting, we are nonetheless able to establish quantitative convergence rates to escape the mediocrity of initialization, recovering the same main ingredients as in previous works (Arous et al., 2021; Abbe et al., 2022), driven by the information exponent. To our knowledge, this is the first work to show how learning with gradient-based methods necessarily succeeds in this fully non-linear (i.e. not in the NTK regime) setting. Nevertheless, there are many lingering questions.

As discussed, one limitation of the analysis is the reliance on gradient flow rather than gradient descent. We hope that in future work we will be able to effectively discretize the dynamics, made more challenging by the fact that one must track three parameters rather than simply the correlation. Still, we observe theoretically and empirically that the symmetric single index setting demands a number of unusual choices, such as a correlational loss and distinct student and teacher link functions, in order to enable efficient learning. And in a broader scheme, if one remembers the perspective of Deep Sets as a very limited form of a three-layer architecture, the issue of provable learning for deeper, more realistic architectures stands as a very important and unexplored research direction, where Transformers with planted low-dimensional structures appear as the next natural question.

REFERENCES

Emmanuel Abbe, Enric Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. arXiv preprint arXiv:2202.08658, 2022.

Emmanuel Abbe, Enric Boix-Adsera, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. arXiv preprint arXiv:2302.11055, 2023.

Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. arXiv preprint arXiv:2302.05882, 2023.

Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. The Journal of Machine Learning Research, 22(1):4788–4838, 2021.

Mathieu Beau and Adolfo del Campo. Parent Hamiltonians of Jastrow wavefunctions. SciPost Physics Core, 4(4):030, 2021.

Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. arXiv preprint arXiv:2206.04030, 2022.

Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. In Advances in Neural Information Processing Systems, 2022.

Imre Bihari. A generalization of a lemma of Bellman and its application to uniqueness problems of differential equations. Acta Mathematica Hungarica, 7(1):81–94, 1956.
Joan Bruna, Loucas Pillaud-Vivien, and Aaron Zweig. On single index models beyond Gaussian data. arXiv preprint arXiv:2307.15804, 2023.

Alex Damian, Eshaan Nichani, Rong Ge, and Jason D. Lee. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. arXiv preprint arXiv:2305.10633, 2023.

Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, 2022.

Persi Diaconis and Mehrdad Shahshahani. On the eigenvalues of random matrices. Journal of Applied Probability, 31(A):49–62, 1994.

Rishabh Dudeja and Daniel Hsu. Learning single-index models in Gaussian space. In Conference on Learning Theory, pp. 1887–1930. PMLR, 2018.

Robert F. H. Fischer. Precoding and Signal Shaping for Digital Transmission. John Wiley & Sons, 2005.

Martin Jacobsen. Laplace and the origin of the Ornstein-Uhlenbeck process. Bernoulli, 2(3):271–286, 1996.

Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.

Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

Edwin Langmann. A method to derive explicit formulas for an elliptic generalization of the Jack polynomials. arXiv preprint math-ph/0511015, 2005.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. PMLR, 2019.

Ian Grant Macdonald. Symmetric Functions and Hall Polynomials. Oxford University Press, 1998.

Arvind Mahankali, Jeff Z. Haochen, Kefan Dong, Margalit Glasgow, and Tengyu Ma. Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time. arXiv preprint arXiv:2306.16361, 2023.

Ryan O'Donnell. Analysis of Boolean functions. arXiv preprint arXiv:2105.10386, 2021.

Stephen B. Pope. Algorithms for ellipsoids. Cornell University Report No. FDA 08-01, 2008.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660, 2017.

Peter Radchenko. High dimensional single index models. Journal of Multivariate Analysis, 139:266–282, 2015.

Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. Advances in Neural Information Processing Systems, 30, 2017.

Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the zero-one loss. arXiv preprint arXiv:1005.3681, 2010.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Gilad Yehudai and Shamir Ohad. Learning a single neuron with gradient methods. In Conference on Learning Theory, pp. 3756–3786. PMLR, 2020.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. Deep Sets. Advances in Neural Information Processing Systems, 30, 2017.

Aaron Zweig and Joan Bruna. Exponential separations in symmetric neural networks.
Advances in Neural Information Processing Systems, 35:33134–33145, 2022.