# Generalizability of Memorization Neural Networks

Lijia Yu1, Xiao-Shan Gao2,3,*, Lijun Zhang1,3, Yibo Miao2,3

1 Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
* Corresponding author.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

The neural network memorization problem is to study the expressive power of neural networks to interpolate a finite dataset. Although memorization is widely believed to have a close relationship with the strong generalizability of deep learning when using over-parameterized models, to the best of our knowledge, there exists no theoretical study on the generalizability of memorization neural networks. In this paper, we give the first theoretical analysis of this topic. Since using i.i.d. training data is a necessary condition for a learning algorithm to be generalizable, memorization and its generalization theory for i.i.d. datasets are developed under mild conditions on the data distribution. First, algorithms are given to construct memorization networks for an i.i.d. dataset, which have the smallest number of parameters and even a constant number of parameters. Second, we show that, in order for the memorization networks to be generalizable, the width of the network must be at least equal to the dimension of the data, which implies that the existing memorization networks with an optimal number of parameters are not generalizable. Third, a lower bound for the sample complexity of general memorization algorithms and the exact sample complexity for memorization algorithms with a constant number of parameters are given. It is also shown that there exist data distributions such that, to be generalizable for them, the memorization network must have an exponential number of parameters in the data dimension. Finally, an efficient and generalizable memorization algorithm is given when the number of training samples is greater than the efficient memorization sample complexity of the data distribution.

1 Introduction

Memorization is to study the expressive power of neural networks to interpolate a finite dataset [9]. The main focus of the existing work is to study how many parameters are needed to memorize. For any dataset Dtr of size N and neural networks of the form F : R^n → R, memorization networks with O(N) parameters have been given with various model structures and activation functions [31, 50, 30, 29, 26, 47, 56, 11, 65]. On the other hand, it is shown that, in order to memorize an arbitrary dataset of size N, the network must have at least Ω(N) parameters [64, 56], so the above algorithms are approximately optimal. Under certain assumptions, it is shown that a sublinear number O(N^{2/3}) of parameters is sufficient to memorize Dtr [49]. Furthermore, Vardi et al. [55] give a memorization network with the optimal number of parameters Õ(√N).

Recently, it has been shown that memorization is closely related to one of the most surprising properties of deep learning, that is, over-parameterized neural networks are trained to nearly memorize noisy data and yet can still achieve very nice generalization on the test data [45, 7, 4]. More precisely, the
double descent phenomenon [45] indicates that when the networks reach the interpolation threshold, larger networks tend to have more generalizability [41, 10]. It is also noted that memorizing helps generalization in complex learning tasks, because data with the same label have quite diversified features and need to be nearly memorized [19, 20]. A line of research that harvests the help of memorization to generalization is interpolation learning. Most of the recent work in interpolation learning shows generalizability of memorization models in linear regimes [7, 12, 38, 53, 59, 66]. As far as we know, the generalizability of memorization neural networks has not been studied theoretically, which is more challenging compared to the linear models, and this paper provides a systematic study of this topic.

In this paper, we consider datasets that are sampled i.i.d. from a data distribution, because an i.i.d. training dataset is a necessary condition for learning algorithms to have generalizability [54, 44]. More precisely, we consider binary data distributions D over R^n × {−1, 1} and use Dtr ∼ D^N to mean that Dtr is sampled i.i.d. from D and |Dtr| = N. All neural networks are of the form F : R^n → R. The main contributions of this paper include four aspects.

First, we give the smallest number of parameters required for a network to memorize an i.i.d. dataset.

Theorem 1.1 (Informal. Refer to Section 4). Under mild conditions on D, if Dtr ∼ D^N, it holds that
(1) There exists an algorithm to obtain a memorization network of Dtr with width 6 and depth (equivalently, number of parameters) Õ(√N).
(2) There exists a constant ND ∈ Z+ depending on D only, such that a memorization network of Dtr with at most ND parameters can be obtained algorithmically.

ND is named the memorization parameter complexity of D, which measures the complexity of D under which a memorization network with ND parameters exists for almost all Dtr ∼ D^N. Theorem 1.1 allows us to give the memorization network for an i.i.d. dataset with the optimal number of parameters. When N is small so that √N ≤ ND, the memorization network needs at least Ω(√N) parameters as proved in [6], and (1) of Theorem 1.1 gives the optimal construction. When N is large, (2) of Theorem 1.1 shows that a constant number of parameters is enough to memorize.

Second, we give a necessary condition on the structure of memorization networks for them to be generalizable, and show that even if there is enough data, memorization networks may not have generalizability.

Theorem 1.2 (Informal. Refer to Section 5). Under mild conditions on D, if Dtr ∼ D^N, it holds that
(1) Let H be a set of neural networks with width w. Then, there exist an integer n > w and a data distribution D over R^n × {−1, 1} such that any memorization network of Dtr in H is not generalizable.
(2) For almost any D, there exists a memorization network of Dtr which has O(√N) parameters and is not generalizable.

Theorem 1.2 indicates that memorization networks with the optimal number of parameters O(√N) may have poor generalizability, and commonly used algorithms for constructing fixed-width memorization networks have poor generalization for some distributions. These conclusions demonstrate that the commonly used network structures for memorization are not generalizable and new network structures are needed to achieve generalization.

Third, we give a lower bound for the sample complexity of general memorization networks and the exact sample complexity for certain memorization networks.

Theorem 1.3 (Informal. Refer to Section 6).
Let ND be the memorization parameter complexity defined in Theorem 1.1. Under mild conditions on D, we have
(1) Lower bound. In order for a memorization network of any Dtr ∼ D^N to be generalizable, N must be Ω̃(ND²/ln²(ND)).²
(2) Upper bound. For any memorization network with at most ND parameters for Dtr ∼ D^N, if N = O(ND² ln ND), then the network is generalizable.

² Here, Ω̃ and Õ mean that certain small quantities are omitted. Also, we keep the logarithm factor of ND for comparison with the upper bound.

Notice that the lower bound is for general memorization networks and the upper bound is for memorization networks with ND parameters, which always exist by (2) of Theorem 1.1. In the latter case, the lower and upper bounds are approximately the same, which gives the exact sample complexity O(ND²) in this case. In other words, a necessary and sufficient condition for the memorization network in (2) of Theorem 1.1 to be generalizable is N = O(ND²).

Remark 1.4. Unfortunately, these generalizable memorization networks cannot be computed efficiently, as shown by the following results proved by us.
(1) If P ≠ NP, then the networks in (2) of Theorem 1.3 cannot be computed in polynomial time.
(2) For some data distributions, an exponential (in the data dimension) number of samples is required for memorization networks to achieve generalization.

Finally, we want to know whether there exists a polynomial-time memorization algorithm that can ensure generalization, and what the sample complexity of such a memorization algorithm is. An answer is given in the following theorem.

Theorem 1.5 (Informal. Refer to Section 7). There exists an SD ∈ Z+ depending on D only such that, under mild conditions on D, if N = O(SD), then we can construct a generalizable memorization network with O(N²n) parameters for any Dtr ∼ D^N in polynomial time.

SD is named the efficient memorization sample complexity of D, which measures the complexity of D so that the generalizable memorization network of any Dtr ∼ D^N can be computed efficiently if N = O(SD). The memorization network in Theorem 1.5 has more parameters than the optimal number O(√N) of parameters required for memorization. The main reason is that building memorization networks with O(√N) parameters requires special technical skills that may break the generalization. On the other hand, as mentioned in [7], over-parameterization is good for generalization, so it is reasonable for us to use more parameters for memorization to achieve generalization.

Remark 1.6. We explain the relationship between our results and interpolation learning [7]. Interpolation learning uses optimization to achieve memorization, which is a more practical approach, while our approach gives a theoretical foundation for memorization networks. Once an interpolation is achieved, Theorem 1.2, (1) of Theorem 1.3, and Theorem 1.5 are valid for interpolation learning. For example, according to (1) of Theorem 1.3, Ω(ND²) is a lower bound for the sample complexity of interpolation learning, and by Theorem 1.5, O(SD) is an upper bound for the sample complexity of efficient interpolation learning.

Main Contributions. Under mild conditions on the data distribution D, we have:
- We define the memorization parameter complexity ND ∈ Z+ of D such that a memorization network for any Dtr ∼ D^N can be constructed, which has Õ(√N) or ND parameters. Here, the memorization network has the optimal number of parameters.
- We give two necessary conditions for the construction of generalizable memorization networks for any Dtr in terms of the width and the number of parameters of the memorization network.
- We give a lower bound Ω(ND²) on the sample complexity for general memorization networks as well as the exact sample complexity O(ND²) for memorization networks with ND parameters. We also show that for some data distributions, an exponential (in n) number of samples is required to achieve generalization.
- We define the efficient memorization sample complexity SD ∈ Z+ of D, so that a generalizable memorization network of any Dtr ∼ D^N can be computed in polynomial time if N = O(SD).

2 Related work

Memorization. The problem of memorization has a long history. In [9], it is shown that networks with depth 2 and O(N) parameters can memorize a binary dataset of size N. In subsequent work, it is shown that networks with O(N) parameters can memorize any dataset [31, 50, 11, 30, 65, 29, 64, 56, 26, 47], and such memorization networks are approximately optimal for generic datasets [64, 56]. Since the VC dimension of ReLU neural networks with W parameters and depth D is at most Õ(WD) [24, 5, 6], memorizing some special datasets of size N requires at least Ω(√N) parameters, and there exists a gap between this lower bound Ω(√N) and the upper bound O(N). Park et al. [49] show that a network with O(N^{2/3}) parameters is enough for memorization under certain assumptions. Vardi et al. [55] further give a memorization network with the optimal number of parameters Õ(√N). In [22], the strengths of both generalization and memorization are combined in a single neural network. Recently, robust memorization has been studied [35, 62]. As far as we know, the generalizability of memorization neural networks has not been studied theoretically.

Interpolation Learning. Another line of related research is interpolation learning, that is, learning under the constraint of memorization, which can be traced back to [52]. Most recent works establish various generalizability results for interpolation learning in linear regimes [7, 12, 38, 53, 59, 66]. For instance, Bartlett et al. [7] prove that over-parameterization allows gradient methods to find generalizable interpolating solutions in the linear regime. In relation to this, how to achieve memorization via gradient descent is studied in [13, 14]. Results of this paper can be considered to give sample complexities for interpolation learning.

Generalization Guarantee. There exist several ways to ensure generalization of networks. The common way is to estimate the generalization bound or sample complexity of learning algorithms. Generalization bounds for neural networks are given in terms of the VC dimension [24, 5, 6], under the normal training setting [27, 44, 8], under the differential privacy training setting [1], and under the adversarial training setting [60, 58]. In most cases, these generalization bounds imply that when the training set is large enough, a well-trained network with a fixed structure has good generalizability. On the other hand, the relationship between memorization and generalization has also been extensively studied [45, 41, 10, 19, 20]. In [25], the sample complexity of neural networks is given when the norms of the weight matrices are bounded; in [36], the sample complexity of shallow transformers is considered. This paper gives the lower bound and the upper bound (in certain cases) of the sample complexities for interpolation learning.
In this paper, we use O(A) to mean a value not greater than cA for some constant c, and Õ to mean that small quantities, such as logarithmic factors, are omitted. We use Ω(A) to mean a value not less than cA for some constant c, and Ω̃ to mean that small quantities, such as logarithmic factors, are omitted. We say that an event A holds for all (x, y) ∼ D to mean that P_{(x,y)∼D}(A) = 1.

3.1 Neural network

In this paper, we consider feedforward neural networks of the form F : R^n → R, where the l-th hidden layer of F(x) can be written as X_l = σ(W_l X_{l−1} + b_l) ∈ R^{n_l}, σ = ReLU is the activation function, X_0 = x, and n_0 = n. The last layer of F is F(x) = W_{L+1} X_L + b_{L+1} ∈ R, where L is the number of hidden layers in F. The depth of F is depth(F) = L + 1, the width of F is width(F) = max_{i=1}^{L} n_i, and the number of parameters of F is para(F) = Σ_{i=0}^{L} n_{i+1}(n_i + 1), where n_{L+1} = 1. Denote H(n) to be the set of all neural networks of the above form.

3.2 Data distribution

In this paper, we consider binary classification problems and use D to denote a joint distribution on D(n) = [0, 1]^n × {−1, 1}. To avoid extreme cases, we focus mainly on a special kind of distribution defined in the following.

Definition 3.1. For n ∈ Z+ and c ∈ R+, D(n, c) is the set of distributions D on D(n) which have a positive separation bound: inf_{(x,1),(z,−1)∼D} ||x − z||_2 ≥ c.

The accuracy of a network F on a distribution D is defined as AD(F) = P_{(x,y)∼D}(Sgn(F(x)) = y). We use Dtr ∼ D^N to mean that Dtr is a set of N data sampled i.i.d. according to D. For convenience, "dataset under distribution" means that the dataset is sampled i.i.d. from a data distribution.

Remark 3.2. We consider distributions with a positive separation bound for the following reasons.
(1) If Dtr ∼ D^N and D ∈ D(n, c), then x_i ≠ x_j when y_i ≠ y_j. This property ensures that Dtr can be memorized.
(2) Proposition 3.3 shows that there exists a D such that no network is generalizable over D, and this should be avoided. Therefore, the distribution D needs to meet certain requirements for a dataset sampled from D to have generalizability. The proof of Proposition 3.3 is given in Appendix A.
(3) Most commonly used classification distributions should have a positive separation bound.

Proposition 3.3. There exists a distribution D such that AD(F) ≤ 0.5 for any neural network F.

3.3 Memorization neural network

Definition 3.4. A neural network F ∈ H(n) is a memorization of a dataset Dtr over D(n) if Sgn(F(x)) = y for any (x, y) ∈ Dtr.

Remark 3.5. Memorization networks can also be defined more strictly as F(x) = y for any (x, y) ∈ Dtr. In Proposition 4.10 of [62], it is shown that these two types of memorization networks need essentially the same number of parameters.

To be more precise, we treat memorization as a learning algorithm in this paper, as defined below.

Definition 3.6. L : ∪_{n∈Z+} 2^{D(n)} → ∪_{n∈Z+} H(n) is called a memorization algorithm if, for any n and any finite Dtr ⊂ D(n), L(Dtr) is a memorization network of Dtr. Furthermore, a memorization algorithm L is called an efficient memorization algorithm if there exists a polynomial poly : R → R such that L(Dtr) can be computed in time poly(size(Dtr)), where size(Dtr) is the bit-size of Dtr.

Remark 3.7. It is clear that if L is an efficient memorization algorithm, then para(L(Dtr)) is also polynomial in size(Dtr).

There exist many methods which can construct memorization networks in polynomial time, and all these memorization methods are efficient memorization algorithms, as summarized in the following proposition.

Proposition 3.8. The methods given in [9, 62] are efficient memorization algorithms.

The methods given in [55, 49] are probabilistic efficient memorization algorithms, which can be proved similarly to Theorem 4.1. More precisely, they are Monte Carlo polynomial-time algorithms.
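To make the notation of Sections 3.1 and 3.3 concrete, the following minimal NumPy sketch represents a ReLU network F : R^n → R as a list of (W_l, b_l) pairs, computes depth(F), width(F), and para(F) as defined above, and checks the memorization condition Sgn(F(x)) = y of Definition 3.4 on a toy dataset. The layer sizes and the random dataset are illustrative choices and are not part of the paper.

```python
import numpy as np

def forward(layers, x):
    """Evaluate F(x) for a ReLU network given as [(W_1, b_1), ..., (W_{L+1}, b_{L+1})]."""
    h = x
    for W, b in layers[:-1]:           # hidden layers: X_l = ReLU(W_l X_{l-1} + b_l)
        h = np.maximum(W @ h + b, 0.0)
    W, b = layers[-1]                  # output layer: F(x) = W_{L+1} X_L + b_{L+1}
    return (W @ h + b).item()

def depth(layers):                     # depth(F) = L + 1
    return len(layers)

def width(layers):                     # width(F) = max over hidden layers of n_l
    return max(W.shape[0] for W, _ in layers[:-1])

def para(layers):                      # para(F) = sum_l n_{l+1} * (n_l + 1)
    return sum(W.size + b.size for W, b in layers)

def is_memorization(layers, Dtr):
    """Definition 3.4: Sgn(F(x)) = y for every (x, y) in Dtr."""
    return all(np.sign(forward(layers, x)) == y for x, y in Dtr)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]                   # n_0 = n = 4, two hidden layers, scalar output
layers = [(rng.normal(size=(m, k)), rng.normal(size=m))
          for k, m in zip(sizes[:-1], sizes[1:])]
Dtr = [(rng.uniform(size=4), int(rng.choice([-1, 1]))) for _ in range(5)]
print(depth(layers), width(layers), para(layers), is_memorization(layers, Dtr))
```

A randomly initialized network will usually not memorize the toy dataset; the point of the sketch is only to show how the quantities depth(F), width(F), para(F), and the memorization condition are computed.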
4 Optimal memorization network for dataset under distribution

By the term "dataset under distribution", we mean a dataset that is sampled i.i.d. from a data distribution, which is denoted as Dtr ∼ D^N. In this section, we show how to construct the memorization network with the optimal number of parameters for a dataset under distribution.

4.1 Memorization network with optimal number of parameters

To memorize N samples, Ω̃(√N) parameters are necessary [6]. In [55], a memorization network is given which has Õ(√N) parameters under certain conditions, where Õ means that some logarithmic factors in N and polynomial factors of other values are omitted. Therefore, Õ(√N) is the optimal number of parameters for a network to memorize certain datasets. In the following theorem, we show that such a result can be extended to datasets under distribution.

Theorem 4.1. Let D ∈ D(n, c) and Dtr ∼ D^N. Then there exists a memorization algorithm L such that L(Dtr) has width 6 and depth (equivalently, number of parameters) O(√N ln(Nn/c)). Furthermore, for any ϵ ∈ (0, 1), L(Dtr) can be computed in time poly(size(Dtr), ln(1/ϵ)) with probability 1 − ϵ.

Proof Idea. This theorem can be proven using the idea from [55]. Let Dtr = {(x_i, y_i)}_{i=1}^N. The main difference is that [55] requires ||x_i − x_j|| ≥ c for all i ≠ j, which is no longer valid when Dtr is sampled i.i.d. from a distribution D. Since D has separation bound c > 0, we have ||x_i − x_j|| ≥ c for all i, j satisfying y_i ≠ y_j, which is weaker. Despite this difference, the idea of [55] can still be modified to prove the theorem. In constructing such a memorization network, we need to randomly select a vector, and each selection has a probability of at least 0.5 of giving a correct vector. So, repeating the selection ln(1/ϵ) times, with probability 1 − ϵ we can get at least one correct vector. Then we can construct the memorization network based on this vector. The detailed proof is given in Appendix B.

Remark 4.2. The algorithm in Theorem 4.1 is a Monte Carlo polynomial-time algorithm, that is, it gives a correct answer with arbitrarily high probability. The algorithm given in [55] is also a Monte Carlo algorithm.

4.2 Memorization network with constant number of parameters

In this section, we prove an interesting fact of memorization for datasets under distribution. We show that for a distribution D ∈ D(n, c), there exists a constant ND ∈ Z+ such that for all datasets sampled i.i.d. from D, there exists a memorization network with ND parameters.

Theorem 4.3. There exists a memorization algorithm L such that for any D ∈ D(n, c), there is an N'_D ∈ Z+ satisfying that for any N > 0, with probability 1 over Dtr ∼ D^N, we have para(L(Dtr)) ≤ N'_D. The smallest such N'_D of the distribution D is called the memorization parameter complexity of D, written as ND.

Proof Idea. It suffices to show that we can find a memorization network of Dtr with a constant number of parameters, which depends on D only. The main idea is to take a subset D'_tr of Dtr such that Dtr is contained in a neighborhood of D'_tr. It can be proven that the number of elements in this subset is bounded. Then, constructing a robust memorization network of D'_tr with a certain budget [62], we obtain a memorization network of Dtr, which has a constant number of parameters. The proof is given in Appendix C.
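To illustrate the random step in the proof idea of Theorem 4.1 (detailed in Lemma B.1 of Appendix B), the following NumPy sketch repeatedly samples a unit vector u on the sphere and checks the separation condition |⟨u, x − z⟩| ≥ (c_0/(4N²))·√(8/(nπ)) for all oppositely labeled pairs; each trial succeeds with probability at least 1/2, so O(ln(1/ϵ)) trials suffice. This is only an illustration of the Monte Carlo selection step, not the full network construction; the toy data and the retry budget are assumptions made for the example.

```python
import numpy as np

def find_direction(X, y, eps=1e-6, rng=None):
    """Sample u uniformly on S^{n-1} until |<u, x - z>| >= c0/(4 N^2) * sqrt(8/(n pi))
    holds for every pair (x, +1), (z, -1) in the dataset; each trial succeeds with
    probability at least 1/2 (Result R1 in Appendix B), so O(ln(1/eps)) trials
    suffice with probability at least 1 - eps."""
    rng = rng or np.random.default_rng(0)
    N, n = X.shape
    diffs = X[y == 1][:, None, :] - X[y == -1][None, :, :]  # all x - z with opposite labels
    c0 = np.min(np.linalg.norm(diffs, axis=-1))             # empirical separation bound
    thr = c0 / (4 * N ** 2) * np.sqrt(8 / (n * np.pi))
    for _ in range(int(np.ceil(np.log(1 / eps))) + 1):
        u = rng.normal(size=n)
        u /= np.linalg.norm(u)                              # uniform direction on the sphere
        if np.all(np.abs(diffs @ u) >= thr):
            return u          # the data can now be compressed to the line t -> <u, t>
    return None               # unlucky run; the caller may simply retry

# toy usage: two separated clusters in [0, 1]^n
rng = np.random.default_rng(1)
X = np.vstack([rng.uniform(0.0, 0.3, (10, 5)), rng.uniform(0.7, 1.0, (10, 5))])
y = np.array([1] * 10 + [-1] * 10)
print(find_direction(X, y))
```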
Combining Theorems 4.1 and 4.3, we can give a memorization network with the optimal number of parameters.

Remark 4.4. What we have proven in Theorem 4.3 is that a memorization algorithm with a constant number of parameters can be found, but in most cases we have N'_D > ND. Furthermore, if N'_D is large for the memorization algorithm, the algorithm can be efficient. Otherwise, if N'_D is close to ND, the algorithm is usually not efficient.

Remark 4.5. It is obvious that the memorization parameter complexity ND is the minimum number of parameters required to memorize any dataset sampled i.i.d. from D. ND is mainly determined by the characteristics of D ∈ D(n, c), so ND may be related to n and c. It is an interesting problem to estimate ND.

5 Condition on the network structure for generalizable memorization

In the preceding section, we showed that for a dataset under distribution, there exists a memorization algorithm that generates memorization networks with the optimal number of parameters. In this section, we give some conditions for generalizable memorization networks in terms of the width and the number of parameters of the network. As a consequence, we show that the commonly used memorization networks with fixed width are not generalizable.

First, we show that networks with fixed width do not have generalizability in some situations. Reducing the width and increasing the depth is a common way for parameter reduction, but it inevitably limits the network's power, making it unable to achieve good generalization for specific distributions, as shown in the following theorem.

Theorem 5.1. Let w ∈ Z+ and let L be a memorization algorithm such that L(Dtr) has width not more than w for all Dtr. Then, there exist an integer n > w, c ∈ R+, and a distribution D ∈ D(n, c) such that, for any Dtr ∼ D^N, it holds that AD(L(Dtr)) ≤ 0.51.

Proof Idea. As shown in [40, 48], networks with small width are not dense in the space of measurable functions, but this is not enough to estimate the upper bound of the generalization. In order to further measure the upper bound of the generalization, we define a special class of distributions. Then, we calculate the upper bound of the generalization of networks with fixed width on this class of distributions. Based on the calculation results, it is possible to find a specific distribution within this class such that fixed-width networks exhibit poor generalization on this distribution. The proof is given in Appendix D.

It is well known that the width of the network is important for the network to be robust [2, 17, 18, 37, 67]. Theorem 5.1 further shows that large width is a necessary condition for generalizability.

Note that Theorem 5.1 is for a specific data distribution. We will show that for most distributions, providing enough data does not necessarily mean that the memorization algorithm has generalization ability. This highlights the importance of constructing appropriate memorization algorithms to ensure generalization. We need to introduce another parameter for data distributions.

Definition 5.2. The distribution D is said to have density r if P_{x∼D}(x ∈ A)/V(A) ≤ r for any closed set A ⊂ [0, 1]^n, where V(A) is the volume of A.

Loosely speaking, the density of a distribution is the upper bound of its density function.

Theorem 5.3. For any n ∈ Z+ and r, c ∈ R+, if a distribution D ∈ D(n, c) has density r, then for any N ∈ Z+ and Dtr ∼ D^N, there exists a memorization network F for Dtr such that para(F) = O(n + √N ln(Nnr/c)) and AD(F) ≤ 0.51.

Proof Idea.
We refer to the classical memorization construction idea [55]. The main body includes three parts. Firstly, compress the data in Dtr into one dimension. Secondly, map the compressed data to some specific values. Finally, use such a value to obtain the label of the input. Moreover, we pay more attention to points outside the dataset. We use certain techniques to control the classification results of points that do not appear in the dataset Dtr, so that the memorization network gives the wrong label to as many points outside Dtr as possible, which reduces its accuracy. The general approach is the following:
(1) Find a set in which each point does not appear in Dtr and has the same label under the distribution D. Without loss of generality, let them have label −1.
(2) In the second step mentioned above, ensure that the mapped values of the points in the set from (1) are similar to those of the samples with label 1. This will cause the third step to output the label 1, leading to an erroneous classification result for the points in the set.
The proof is given in Appendix E.

Remark 5.4. Theorem 5.1 shows that the width of a generalizable memorization network needs to increase with the data dimension. Theorem 5.3 shows that when para(F) = O(√N), the memorization network may have poor generalizability for most distributions. The above two theorems indicate that no matter how large the dataset is, there always exist memorization networks with poor generalization. In terms of sample complexity, this means that for the hypothesis spaces of neural networks with fixed width or with the optimal number of parameters, the sample complexity is infinite, in contrast to the uniform generalization bound for feedforward neural networks [63, Lemma D.16].

Remark 5.5. It is worth mentioning that the two theorems in this section cannot be obtained from the lower bound of the generalization gap [44]; more details are given in Appendix E.

6 Sample complexity for memorization algorithm

As said in the preceding section, generalization of memorization inevitably requires certain conditions. In this section, we give a necessary and sufficient condition for generalization for the memorization algorithm in Section 4 in terms of sample complexity. We first give a lower bound for the sample complexity of general memorization algorithms and then an upper bound for memorization algorithms which output networks with an optimal number of parameters. The lower and upper bounds are approximately the same, thus giving the exact sample complexity in this case.

6.1 Lower bound for sample complexity of memorization algorithm

Roughly speaking, the sample complexity of a learning algorithm is the number of samples required to achieve generalizability [44]. The following theorem gives a lower bound for the sample complexity of memorization algorithms based on ND, which has been defined in Theorem 4.3.

Theorem 6.1. There exists no memorization algorithm L which satisfies the following: for any n ∈ Z+, c ∈ R+, ϵ, δ ∈ (0, 1), if D ∈ D(n, c) and N ≤ v·ND²·(1 − 2ϵ − δ)/ln²(ND), it holds that P_{Dtr∼D^N}(AD(L(Dtr)) ≥ 1 − ϵ) ≥ 1 − δ, where v is an absolute constant which does not depend on N, n, c, ϵ, δ.

Proof Idea. The main idea is the following: for a dataset Dtr ⊂ [0, 1]^n × {−1, 1} with |Dtr| = N, we can find some distributions D_1, D_2, . . . , such that if Dtr,i ∼ (D_i)^N, then with a positive probability it holds that Dtr,i = Dtr. In addition, each distribution has a certain degree of difference from the others.
It is easy to see that L(Dtr) is a fixed network for a given L, so L(Dtr) cannot fit all D_i well, because the D_i are different to some degree. So, if a memorization algorithm L satisfies the condition in the theorem, we construct some distributions {D_i}_{i=1}^n and use the above idea to prove that L cannot fit one of the distributions in {D_i}_{i=1}^n, obtaining a contradiction. The proof of the theorem is given in Appendix F.

Remark 6.2. In general, the sample complexity depends on the data distribution, the hypothesis space, the learning algorithm, and ϵ, δ. Since ND is related to n and c, the lower bound in Theorem 6.1 also depends on n and c. Here, the hypothesis space is the set of memorization networks, which is implicitly reflected in ND.

Remark 6.3. Roughly speaking, if we consider interpolation learning, that is, training networks under the constraint of memorizing the dataset, then Theorem 6.1 also provides a lower bound for the sample complexity.

This theorem shows that if we want memorization algorithms to have guaranteed generalization, then about O(ND²) samples are required. As a consequence, we show that, for some data distributions, an exponential number of samples is needed to achieve generalization. The proof is also in Appendix F.

Corollary 6.4. For any memorization algorithm L and any ϵ, δ ∈ (0, 1), there exist n ∈ Z+, c > 0, and a distribution D ∈ D(n, c) such that, in order for L to have generalizability on D, that is, for all N ≥ N_0 it holds that P_{Dtr∼D^N}(AD(L(Dtr)) ≥ 1 − ϵ) ≥ 1 − δ, N_0 must be more than v·2^{2[n/c²]}·c⁴·(1 − 2ϵ − δ)/n², where v is an absolute constant not depending on N, n, c, ϵ, δ.

6.2 Exact sample complexity of memorization algorithm with ND parameters

In Theorem 6.1, it is shown that Ω(ND²) samples are necessary for the generalizability of memorization. The following theorem shows that there exists a memorization algorithm that can reach generalization with O(ND²) samples.

Theorem 6.5. For any memorization algorithm L satisfying that, with probability 1 over Dtr ∼ D^N, L(Dtr) has at most ND parameters, we have:
(1) For any c ∈ R+, ϵ, δ ∈ (0, 1), n ∈ Z+, if D ∈ D(n, c) and N ≥ v·ND²·ln(ND/(ϵ²δ)), then P_{Dtr∼D^N}(AD(L(Dtr)) ≥ 1 − ϵ) ≥ 1 − δ, where v is an absolute constant which does not depend on N, n, c, ϵ, δ.
(2) If P ≠ NP, then all such algorithms are not efficient.

Proof Idea. For the proof of (1), we use ND to bound the VC dimension [6] and plug this dimension into the generalization bound theorem [44] to obtain the result. For the proof of (2), we show that, if such an algorithm were efficient, then we could solve the reversible 6-SAT problem [43], which is defined below and is NP-complete. The proof of the theorem is given in Appendix G.

Definition 6.6. Let φ be a Boolean formula and φ' the formula obtained from φ by negating each variable. The Boolean formula φ is called reversible if either both φ and φ' are satisfiable or both are not satisfiable. The reversible satisfiability problem is to recognize the satisfiability of reversible formulae in conjunctive normal form (CNF). By the reversible 6-SAT, we mean the reversible satisfiability problem for CNF formulae with six variables per clause. In [43], it is shown that the reversible 6-SAT is NP-complete.

Combining Theorems 6.1 and 6.5, we see that N = O(ND²) is the necessary and sufficient condition for the memorization algorithm to generalize, and hence O(ND²) is the exact sample complexity for memorization algorithms with ND parameters over the distributions in D(n, c).
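To make the proof idea of (1) of Theorem 6.5 concrete, the following back-of-the-envelope derivation combines the VC-dimension bound of [6] with the standard VC generalization bound of [44]. It is only a sketch under these standard results; the exact constants and the ϵ, δ dependence of the bound proved in Appendix G may differ.

```latex
\begin{align*}
& d \;=\; \mathrm{VCdim}\big(\{\text{ReLU networks with at most } N_D \text{ parameters}\}\big)
      \;=\; O\!\big(N_D^2 \log N_D\big) \quad \text{[6]},\\
& \text{with probability } 1-\delta \text{ over } D_{tr}\sim D^N
  \text{ (a memorization network has zero empirical error):}\\
& \qquad A_D(\mathcal{L}(D_{tr})) \;\ge\; 1 -
     O\!\left(\sqrt{\tfrac{d\,\ln N + \ln(1/\delta)}{N}}\right) \quad \text{[44]},\\
& \text{so } A_D(\mathcal{L}(D_{tr})) \ge 1-\epsilon \text{ holds once }
  N = \widetilde{O}\!\big(N_D^2/\epsilon^2\big),\\
& \text{which is consistent with the condition } N \ge v\,N_D^2 \ln\!\big(N_D/(\epsilon^2\delta)\big)
  \text{ of Theorem 6.5 up to logarithmic and } 1/\epsilon^2 \text{ factors}.
\end{align*}
```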
Unfortunately, by (2) of Theorem 6.5, this memorization algorithm is not efficient when the memorization network has no more than ND parameters. Furthermore, we conjecture that there exist no efficient memorization algorithms that can use O(ND²) samples to reach generalization in the general case, as stated in the following conjecture.

Conjecture 6.7. If P ≠ NP, there exist no efficient memorization algorithms that can reach generalization with O(ND²) samples for all D ∈ D(n, c).

Remark 6.8. This result also provides a certain theoretical explanation for the over-parameterization mystery [45, 7, 4]: for memorization algorithms with ND parameters, the exact sample complexity O(ND²) is greater than the number of parameters. Thus, the network is under-parameterized, and for such a network, even if it is generalizable, it cannot be computed efficiently.

7 Efficient memorization algorithm with guaranteed generalization

In the preceding section, we showed that there exist memorization algorithms that are generalizable when N = O(ND²), but such algorithms are not efficient. In this section, we give an efficient memorization algorithm with guaranteed generalization. First, we define the efficient memorization sample complexity of D.

Definition 7.1. For (x, y) ∼ D, let L_{(x,y)} = min_{(z,−y)∼D} ||x − z||_2 and B((x, y)) = B_2(x, L_{(x,y)}/3.1) = {z ∈ R^n : ||z − x||_2 ≤ L_{(x,y)}/3.1}. The nearby set S of D is a set of samples (x, y) from the distribution D which satisfies: (1) for any (x, y) ∼ D, x ∈ ∪_{(z,w)∈S} B((z, w)); (2) |S| is minimal. Evidently, for any D ∈ D(n, c), its nearby set is finite, as shown by Proposition 7.7. SD = |S| is called the efficient memorization sample complexity of D, the meaning of which is given in Theorem 7.3.

Remark 7.2. In the above definition, we use L_{(x,y)}/3.1 as the radius of B((x, y)). In fact, when 3.1 is replaced by any real number greater than 3, the following theorem is still valid.

Theorem 7.3. There exists an efficient memorization algorithm L such that for any c ∈ R+, ϵ, δ ∈ (0, 1), n ∈ Z+, and D ∈ D(n, c), if N ≥ SD ln(SD/δ), then P_{Dtr∼D^N}(AD(L(Dtr)) ≥ 1 − ϵ) ≥ 1 − δ. Moreover, for any Dtr ∼ D^N, L(Dtr) has at most O(N²n) parameters.

Proof Idea. For a given dataset Dtr ⊂ [0, 1]^n × {−1, 1}, we use the following two steps to construct a memorization network.
Step 1. Find suitable convex sets {C_i} in [0, 1]^n such that each sample in Dtr is in at least one of these convex sets. Furthermore, if x, z ∈ C_i and (x, y_x), (z, y_z) ∈ Dtr, then y_x = y_z, and define y(C_i) = y_x.
Step 2. Construct a network F such that for any x ∈ C_i, Sgn(F(x)) = y(C_i).
This network must be a memorization of Dtr, because each sample in Dtr is in at least one of the {C_i}. Hence, if x ∈ C_i and (x, y_x) ∈ Dtr, then Sgn(F(x)) = y(C_i) = y_x. The proof of the theorem is given in Appendix H.

Remark 7.4. Theorem 7.3 shows that there exists an efficient and generalizable memorization algorithm when N = O(SD). Thus, SD is an intrinsic complexity measure of D describing whether it is easy to learn and generalize. By Theorem 6.1, SD ≥ ND² for some D, but for some nice D, SD could be small. It is an interesting problem to estimate SD.

Remark 7.5. The memorization network in Theorem 7.3 uses O(N²n) parameters, highlighting the importance of over-parameterization [45, 7, 4]. Interestingly, Remark 6.8 shows that if the network has O(√N) parameters, even if it is generalizable, it cannot be computed efficiently.
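The following NumPy sketch conveys, under simplifying assumptions, the flavor of the construction behind Definition 7.1 and the proof idea of Theorem 7.3: it covers the training samples with balls whose radii are a fraction of the distance to the nearest oppositely labeled sample (mirroring L_{(x,y)}/3.1) and labels a point by the first ball that contains it. It is not the paper's actual algorithm, which realizes such a region classifier as a ReLU network with O(N²n) parameters; the fallback label for uncovered points and the toy data are illustrative choices.

```python
import numpy as np

def build_cover(X, y, shrink=3.1):
    """Radius for each training point: (distance to the nearest opposite-label
    point) / shrink, mirroring the balls B((x, y)) of Definition 7.1."""
    radii = np.empty(len(X))
    for i in range(len(X)):
        opp = X[y != y[i]]
        radii[i] = np.min(np.linalg.norm(opp - X[i], axis=1)) / shrink
    return radii

def predict(x, X, y, radii):
    """Label of the first ball containing x; every training point lies in its own
    ball and in no opposite-label ball, so the training set is memorized.
    Points covered by no ball get a default label (-1, an arbitrary choice)."""
    inside = np.linalg.norm(X - x, axis=1) <= radii
    return int(y[np.argmax(inside)]) if inside.any() else -1

# toy usage: the cover memorizes Dtr and labels nearby points consistently
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.0, 0.4, (20, 3)), rng.uniform(0.6, 1.0, (20, 3))])
y = np.array([1] * 20 + [-1] * 20)
radii = build_cover(X, y)
print(all(predict(x, X, y, radii) == yi for x, yi in zip(X, y)))  # True: memorization
```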
The experimental results of the memorization algorithm mentioned in Theorem 7.3 are given in Appendix I. Unfortunately, for commonly used datasets such as CIFAR-10, this algorithm cannot surpass the networks obtained by training with SGD in terms of test accuracy. Thus, the main purpose of the algorithm is theoretical, that is, it provides a polynomial-time memorization algorithm that can achieve generalization when the training dataset contains O(SD) samples. In comparison, among theoretical works, training networks is NP-hard even for small networks [32, 51, 39, 15, 3, 42, 16, 23, 21], and guarantees of generalization need strong assumptions on the loss function [46, 27, 34, 61, 60, 58].

Finally, we give an estimate for SD. From Corollary 6.4 and Theorem 7.3, we obtain a lower bound for SD.

Corollary 7.6. There exists a distribution D ∈ D(n, c) such that SD ln(SD/δ) ≥ Ω((c⁴/n²)·2^{2[n/c²]}).

We give an upper bound for SD in the following proposition, and the proof is given in Appendix H.1. From the proposition, it is clear that SD is finite.

Proposition 7.7. For any D ∈ D(n, c), we have SD ≤ ([6.2n/c] + 1)^n.

Remark 7.8. The above proposition gives an upper bound on SD when D ∈ D(n, c), and this does not mean that SD is exponential for all D ∈ D(n, c). Determining the conditions under which SD is small for a given D is a compelling problem.

8 Conclusion

Memorization originally focused on the theoretical study of the expressive power of neural networks. Recently, memorization has been believed to be a key reason why over-parameterized deep learning models have excellent generalizability, and thus the more practical interpolation learning approach has been extensively studied. However, a generalizability theory of memorization algorithms was not yet available, and this paper fills this theoretical gap in several aspects. We first show how to construct memorization networks for datasets sampled i.i.d. from a data distribution, which have the optimal number of parameters, and then show that some commonly used memorization networks do not have generalizability even if the dataset is drawn i.i.d. from a data distribution and contains a sufficiently large number of samples. Furthermore, we establish the sample complexity of memorization algorithms in several situations, including a lower bound for the memorization sample complexity and an upper bound for the efficient memorization sample complexity.

Limitation and future work

Two numerical complexities ND and SD for a data distribution D are introduced in this paper, which are used to describe the size of the memorization networks and the efficient memorization sample complexity for any i.i.d. dataset of D. ND is also a lower bound for the sample complexity of memorization algorithms. However, we do not know how to compute ND and SD, which is an interesting future work. Conjecture 6.7 tries to give a lower bound for the efficient memorization sample complexity. More generally, can we write ND and SD as functions of the probability density function p(x, y) of D?

Corollary 6.4 indicates that, even for the nice data distributions in D(n, c), achieving generalization for some data distributions requires an exponential number of parameters. This indicates that there exists a "data curse of dimensionality", that is, to achieve generalizability for certain data distributions, neural networks with an exponential number of parameters are needed.
Considering the practical success of deep learning and the double descent phenomenon [45], the data distributions used in practice should have better properties than D(n, c), and finding data distributions with a polynomial-size efficient memorization sample complexity SD is an important problem. Finally, finding a memorization algorithm that can achieve SOTA results in solving practical image classification problems is also a challenging problem.

Acknowledgments

This work is supported by CAS Project for Young Scientists in Basic Research, Grant No. YSBR-040, ISCAS New Cultivation Project ISCAS-PYFX-202201, and ISCAS Basic Research ISCAS-JCZD202302. This work is also supported by NKRDP grant No. 2018YFA0704705, grant GJ0090202, and NSFC grant No. 12288201. The authors thank the anonymous referees for their valuable comments.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 242–252. PMLR, 09–15 Jun 2019.

[3] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

[4] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242. PMLR, 2017.

[5] Peter Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC dimension bounds for piecewise polynomial networks. In M. Kearns, S. Solla, and D. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998.

[6] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(1):2285–2301, 2019.

[7] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021.

[8] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:4381–4391, 2020.

[9] Eric B. Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193–215, 1988.

[10] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

[11] Sebastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and size of the weights in memorization with two-layers neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4977–4986. Curran Associates, Inc., 2020.

[12] Niladri S Chatterji and Philip M Long. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime.
Journal of Machine Learning Research, 22(129):1–30, 2021.

[13] Amit Daniely. Neural networks learning and memorization with (almost) no overparameterization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9007–9016. Curran Associates, Inc., 2020.

[14] Amit Daniely. Memorizing gaussians with no over-parameterizaion via gradient decent on neural networks. arXiv preprint arXiv:2003.12895, 2020.

[15] Bhaskar DasGupta, Hava T. Siegelmann, and Eduardo Sontag. On a learnability question associated to neural networks with continuous activations (extended abstract). In Proceedings of the Seventh Annual Conference on Computational Learning Theory, COLT '94, pages 47–56, New York, NY, USA, 1994. Association for Computing Machinery.

[16] Santanu S. Dey, Guanyi Wang, and Yao Xie. Approximation algorithms for training one-node ReLU neural networks. IEEE Transactions on Signal Processing, 68:6696–6706, 2020.

[17] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1675–1685. PMLR, 09–15 Jun 2019.

[18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

[19] Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 954–959, New York, NY, USA, 2020. Association for Computing Machinery.

[20] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2881–2891. Curran Associates, Inc., 2020.

[21] Vincent Froese, Christoph Hertrich, and Rolf Niedermeier. The computational complexity of ReLU network training parameterized by data dimensionality. Journal of Artificial Intelligence Research, 74:1775–1790, 2022.

[22] Prachi Garg, Shivang Agarwal, Alexis Lechervy, and Frederic Jurie. Memorization and generalization in deep CNNs using soft gating mechanisms. https://prachigarg23.github.io/reports/Report-GREYC.pdf, 2019.

[23] Surbhi Goel, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. Tight hardness results for training depth-2 ReLU networks. arXiv preprint arXiv:2011.13550, 2020.

[24] Paul Goldberg and Mark Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 361–369, 1993.

[25] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 297–299. PMLR, 06–09 Jul 2018.

[26] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

[27] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent.
In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[29] Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.

[30] Guang-Bin Huang and H.A. Babri. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks, 9(1):224–229, Jan 1998.

[31] Shih-Chi Huang and Yih-Fang Huang. Bounds on number of hidden neurons of multilayer perceptrons in classification and recognition. In Proceedings of 1990 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2500–2503, 1990.

[32] Adam R. Klivans and Alexander A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009.

[33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical Report TR-2009, 2009.

[34] Ilja Kuzborskij and Christoph Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2815–2824. PMLR, 2018.

[35] Binghui Li, Jikai Jin, Han Zhong, John E Hopcroft, and Liwei Wang. Why robust generalization in deep learning is difficult: Perspective of expressive power. arXiv preprint arXiv:2205.13863, 2022.

[36] Hongkang Li, Meng Wang, Sijia Liu, and Pin-Yu Chen. A theoretical understanding of shallow vision transformers: Learning, generalization, and sample complexity. arXiv preprint arXiv:2302.06015, 2023.

[37] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[38] Tengyuan Liang and Benjamin Recht. Interpolating classifiers make few mistakes. Journal of Machine Learning Research, 24:1–27, 2023.

[39] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

[40] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[41] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3325–3334. PMLR, 10–15 Jul 2018.

[42] Pasin Manurangsi and Daniel Reichman. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207, 2018.

[43] Nimrod Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3:325–337, 1988.
[44] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

[45] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.

[46] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30, 2017.

[47] Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3730–3739. PMLR, 10–15 Jul 2018.

[48] Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. arXiv preprint arXiv:2006.08859, 2020.

[49] Sejun Park, Jaeho Lee, Chulhee Yun, and Jinwoo Shin. Provable memorization via deep neural networks using sub-linear parameters. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3627–3661. PMLR, 15–19 Aug 2021.

[50] Michael A. Sartori and Panos J. Antsaklis. A simple method to derive bounds on the size and to train multilayer neural networks. IEEE Transactions on Neural Networks, 2(4):467–471, 1991.

[51] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[52] Rahul Sharma, Aditya V. Nori, and Alex Aiken. Interpolants as classifiers. In CAV 2012, LNCS 7358, pages 71–87, 2012.

[53] Ryan Theisen, Jason M. Klusowski, and Michael W. Mahoney. Good classifiers are abundant in the interpolating regime. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pages 15532–15543, 2021.

[54] Vladimir N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.

[55] Gal Vardi, Gilad Yehudai, and Ohad Shamir. On the optimal memorization power of ReLU neural networks. arXiv preprint arXiv:2110.03187, 2021.

[56] Roman Vershynin. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004–1033, 2020.

[57] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[58] Yihan Wang, Shuang Liu, and Xiao-Shan Gao. Data-dependent stability analysis of adversarial training. arXiv preprint arXiv:2401.03156, 2021.

[59] Zhen Wang, Lan Bai, and Yuanhai Shao. Generalization memorization machine with zero empirical risk for classification. Pattern Recognition, 152:110469, 2024.

[60] Jiancong Xiao, Yanbo Fan, Ruoyu Sun, Jue Wang, and Zhi-Quan Luo. Stability analysis and generalization bounds of adversarial training. Advances in Neural Information Processing Systems, 35:15446–15459, 2022.

[61] Yue Xing, Qifan Song, and Guang Cheng. On the algorithmic stability of adversarial training. Advances in Neural Information Processing Systems, 34:26523–26535, 2021.

[62] Lijia Yu, Xiao-Shan Gao, and Lijun Zhang. Optimal robust memorization with ReLU neural networks. In International Conference on Learning Representations, 2024.

[63] Lijia Yu, Shuang Liu, Yibo Miao, Xiao-Shan Gao, and Lijun Zhang.
Generalization bound and new algorithm for clean-label backdoor attack. In Proceedings of the 41st International Conference on Machine Learning, PMLR 235:57559–57596, 2024.

[64] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. In Advances in Neural Information Processing Systems, volume 32, pages 15532–15543, 2019.

[65] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

[66] Lijia Zhou, Frederic Koehler, Danica J. Sutherland, and Nathan Srebro. Optimistic rates: A unifying theory for interpolation learning and regularization in linear regression. ACM/JMS Journal of Data Science, 1(2):1–51, 2024.

[67] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.

A Proof of Proposition 3.3

Using the following steps, we construct a distribution D on [0, 1] × {−1, 1}. We use (x, y) ∼ D to mean that:
(1) Randomly select a number in {−1, 1} as the label y.
(2) If we get 1 as the label, then randomly select an irrational number in [0, 1] as the sample x; if we get −1 as the label, then randomly select a rational number in [0, 1] as the sample x.

Then Proposition 3.3 follows from the following lemma.

Lemma A.1. For any neural network F, we have AD(F) ≤ 0.5.

Proof. Let F be a network. Firstly, we show that F can be written as

F(x) = Σ_{i=1}^{M} L_i(x) I(x ∈ A_i),   (1)

where the L_i are linear functions and I(s) = 1 if the statement s is true, I(s) = 0 otherwise. In addition, each A_i is an interval, A_j ∩ A_i = ∅ when j ≠ i, and L_i(x) I(x ∈ A_i) is a non-negative or non-positive function for any i ∈ [M].

It is obvious that the network is a piecewise linear function with a finite number of linear regions, so we can write

F(x) = Σ_{i=1}^{M'} L'_i(x) I(x ∈ A'_i),   (2)

where the L'_i are linear functions, each A'_i is an interval, and A'_j ∩ A'_i = ∅ when j ≠ i. Consider that L'_i(x) I(x ∈ A'_i) = L'_i(x) I(x ∈ A'_i, L'_i(x) > 0) + L'_i(x) I(x ∈ A'_i, L'_i(x) < 0), where L'_i(x) I(x ∈ A'_i, L'_i(x) > 0) is a non-negative function and {x ∈ A'_i : L'_i(x) > 0} is an interval which is disjoint from {x ∈ A'_i : L'_i(x) < 0}; similarly for L'_i(x) I(x ∈ A'_i, L'_i(x) < 0). So, replacing each term L'_i(x) I(x ∈ A'_i) in (2) by L'_i(x) I(x ∈ A'_i, L'_i(x) > 0) + L'_i(x) I(x ∈ A'_i, L'_i(x) < 0), we get equation (1).

By equation (1), we have that

P_{(x,y)∼D}(Sgn(F(x)) = y) = P_{(x,y)∼D}(Sgn(Σ_{i=1}^{M} L_i(x) I(x ∈ A_i)) = y)
= Σ_{i=1}^{M} P_{(x,y)∼D}(Sgn(L_i(x) I(x ∈ A_i)) = y, x ∈ A_i)
= Σ_{i=1}^{M} P_{(x,y)∼D}(Sgn(L_i(x) I(x ∈ A_i)) = y | x ∈ A_i) P_{(x,y)∼D}(x ∈ A_i).

The second equality uses A_i ∩ A_j = ∅. For convenience, we use x ∈ Rr to mean that x is an irrational number and x ∉ Rr to mean that x is a rational number. Then, if L_i(x) I(x ∈ A_i) is a non-negative function, we have

P_{(x,y)∼D}(Sgn(L_i(x) I(x ∈ A_i)) = y | x ∈ A_i) ≤ P_{(x,y)∼D}(x ∈ Rr | x ∈ A_i).

Moreover, we have that

P_{(x,y)∼D}(x ∈ Rr | x ∈ A_i)
= P_{(x,y)∼D}(x ∈ Rr, x ∈ A_i) / P_{(x,y)∼D}(x ∈ A_i)
= 0.5 P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) / P_{(x,y)∼D}(x ∈ A_i)
= 0.5 P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) / (P_{(x,y)∼D}(x ∈ Rr) P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) + P_{(x,y)∼D}(x ∉ Rr) P_{(x,y)∼D}(x ∈ A_i | x ∉ Rr))
= P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) / (P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) + P_{(x,y)∼D}(x ∈ A_i | x ∉ Rr)).   (3)

By (2) in the definition of D, we have P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) = P_{(x,y)∼D}(x ∈ A_i | x ∉ Rr). Substituting this into equation (3), we have that

P_{(x,y)∼D}(Sgn(L_i(x) I(x ∈ A_i)) = y | x ∈ A_i) ≤ P_{(x,y)∼D}(x ∈ Rr | x ∈ A_i)
= P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) / (P_{(x,y)∼D}(x ∈ A_i | x ∈ Rr) + P_{(x,y)∼D}(x ∈ A_i | x ∉ Rr)) = 0.5.

The proof is similar when L_i(x) I(x ∈ A_i) is a non-positive function.
Using this in equation (1), we have that

P_{(x,y)∼D}(Sgn(F(x)) = y) = Σ_{i=1}^{M} P_{(x,y)∼D}(Sgn(L_i(x) I(x ∈ A_i)) = y | x ∈ A_i) P_{(x,y)∼D}(x ∈ A_i) ≤ Σ_{i=1}^{M} 0.5 P_{(x,y)∼D}(x ∈ A_i) ≤ 0.5.

The lemma is proved.

B Proof of Theorem 4.1

For the proof of this theorem, we mainly follow the constructive approach to memorization networks in [55]. Our proof is divided into four parts.

B.1 Data Compression

The general method of constructing memorization networks compresses the data into a low-dimensional space at first, and we follow this approach. We compress the data into a one-dimensional space, and we require the compressed data to satisfy some conditions, as shown in the following lemma.

Lemma B.1. Let D be a distribution on [0, 1]^n × {−1, 1} with separation bound c and Dtr ∼ D^N. Then, there exist w ∈ R^n and b ∈ R such that: (1) O(nN²/c) ≥ wx + b ≥ 1 for all x ∈ [0, 1]^n; (2) |wx − wz| ≥ 4 for all (x, 1), (z, −1) ∈ Dtr.

To prove this lemma, we need the following lemma.

Lemma B.2. For any v ∈ R^n and T ≥ 1, let u ∈ R^n be uniformly randomly sampled from the hypersphere S^{n−1}. Then we have P(|⟨u, v⟩| < (||v||_2/T)·√(8/(nπ))) < 2/T. This is Lemma 13 in [49].

Now we prove Lemma B.1.

Proof. Let c_0 = min_{(x,1),(z,−1)∈Dtr} ||x − z||_2. Then, we prove the following result:

Result R1: Let u ∈ R^n be uniformly randomly sampled from the hypersphere S^{n−1}; then P(|⟨u, x − z⟩| ≥ (c_0/(4N²))·√(8/(nπ)) for all (x, 1), (z, −1) ∈ Dtr) > 0.5.

By Lemma B.2 with T = 4N², for any x, z satisfying (x, 1), (z, −1) ∈ Dtr, we have that, if u ∈ R^n is uniformly randomly sampled from the hypersphere S^{n−1}, then P(|⟨u, x − z⟩| < (c_0/(4N²))·√(8/(nπ))) < 2/(4N²), using ||x − z||_2 ≥ c_0 here. So, it holds that

P(|⟨u, x − z⟩| ≥ (c_0/(4N²))·√(8/(nπ)) for all (x, 1), (z, −1) ∈ Dtr)
≥ 1 − Σ_{(x,1),(z,−1)∈Dtr} P(|⟨u, x − z⟩| < (c_0/(4N²))·√(8/(nπ)))
> 1 − N²·(2/(4N²)) = 0.5.

We have proved Result R1.

In practice, to find such a vector, we can randomly select a vector u on the hypersphere S^{n−1} and verify whether it satisfies |⟨u, x − z⟩| ≥ (c_0/(4N²))·√(8/(nπ)) for all (x, 1), (z, −1) ∈ Dtr. Verifying this fact takes poly(size(Dtr)) time. If such a u is not what we want, we randomly select another vector u and verify it again. In each selection, with probability at least 0.5 we get a vector we need, so with ln(1/ϵ) selections, we get a vector we need with probability at least 1 − ϵ.

Constructing w, b and verifying their properties. By the above result, there exists a u ∈ R^n such that ||u||_2 = 1 and |⟨u, x − z⟩| ≥ (c_0/(4N²))·√(8/(nπ)) ≥ c_0/(4√n N²) for all (x, 1), (z, −1) ∈ Dtr, and we can find such a u in poly(size(Dtr), ln(1/ϵ)) time. Now, let w = (16√n N²/c_0)·u and b = ||w||_2·√n + 1. Then we show that w and b are what we want:

(1) We have O(nN²/c) ≥ wx + b ≥ 1 for all x ∈ [0, 1]^n. Firstly, because D is defined on [0, 1]^n × {−1, 1}, it holds that ||x||_2 ≤ √n for any x ∈ [0, 1]^n, and consequently wx + b ≥ b − ||w||_2·√n ≥ 1. On the other hand, |wx| ≤ ||w||_2·√n = O(nN²/c_0), so wx + b ≤ |wx| + b ≤ O(nN²/c_0) ≤ O(nN²/c).

(2) We have |w(x − z)| ≥ 4 for all (x, 1), (z, −1) ∈ Dtr. It is easy to see that |w(x − z)| = |(16√n N²/c_0)·u(x − z)| = (16√n N²/c_0)·|u(x − z)|. Because |u(x − z)| ≥ c_0/(4√n N²), we have |w(x − z)| = (16√n N²/c_0)·|u(x − z)| ≥ (16√n N²/c_0)·(c_0/(4√n N²)) = 4.

By Definition 3.1, we know that c_0 ≥ c. So, w and b are what we want. The lemma is proved.

B.2 Data Projection

The purpose of this part is to map the compressed data to appropriate values. Let w ∈ R^n and b ∈ R be given and Dtr = {(x_i, y_i)}_{i=1}^N. Without loss of generality, we assume that 0 < wx_i < wx_{i+1}. In this section, we show that, after compressing the data into one dimension, we can use a network F to map wx_i + b to v_{[i/√N]}, where {v_j}_{j=0}^{[√N]} ⊂ R+ are given values. This network has O(√N) parameters and width 4, as shown in the following lemma.
Lemma B.3. Let {xi}N i=1 R+, {vj} [ N [ j=0 R+. Assume that xi < xi+1. Then a network F with width 4 and depth O( N) (at most O( N) parameters) can be obtained such that F(xi) = v[ i N ] for all i [N]. Proof. Let Fi(x) be the i-th hidden layer of network F, (Fi)j be the j-th nodes of i-th hidden layer of network F. Let qi = xi+1 xi and t(i) = argmaxj [N]{[j/ N] = i}. Consider the following network F: The 2i + 1 hidden layer has width 4, and each node is: (F2i+1)1(x) = Relu((F2i)2(x) (xt(i)+1) + 2qt(i)/3); (F2i+1)2(x) = Relu((F2i)2(x) (xt(i)+1) + qt(i)/3); (F2i+1)3(x) = Relu((F2i)1(x)); (F2i+1)4(x) = Relu((F2i)2(x)). For the case i = 0, let (F0)2(x) = x and (F1)3(x) = v0. The (2i + 2)-th hidden layer is: (F2i+2)1(x) = Relu((F2i+1)3(x) + vi+1 vi qt(i)/3 ((F2i+1)1(x) (F2i+1)2(x))); (F2i+2)2(x) = Relu((F2i+1)4(x)). The output is F(x) = (F2[N/ This network has width 4 and O( N) hidden layers. We can verify that such a network is what we want as follows. Firstly, it is easy to see that (F2i+2)2(x) = Relu((F2i+1)4(x)) = Relu((F2i)2(x)) = Relu((F2i 1)4(x)) = = Relu((F1)4(x)) = Relu(x) = x. Then, for vi+1 vi qt(i)/3 ((F2i+1)1(x) (F2i+1)2(x)) = vi vi 1 qt(i)/3 (Relu(x xt(i)+1+2qt(i)/3) Relu(x xt(i)+1 + qt(i)/3), easy to verify that, when x xt(i), it is 0; when x xt(i+1), it is vi+1 vi. By the above two results, we have that (F2i+2)1(x) = Relu((F2i+1)3(x)+ vi vi 1 qt(i)/3 ((F2i+1)1(x) (F2i+1)2(x))) = Relu((F2i)1(x)) when x xt(i); and (F2i+2)1(x) = Relu((F2i+1)3(x) + vi vi 1 qt(i)/3 ((F2i+1)1(x) (F2i+1)2(x))) = Relu((F2i)1(x) + vi+1 vi) when x xt(i)+1. So, we have that, if t(i 1)+1 j t(i), there are (F2)1(xj) = v0, (F4)1(xj) = v1 v0+v0 = v1, (F6)1(xj) = v2 v1 + v1 = v2, . . . , (F2i)1(xj) = vi vi 1 + vi 1 = vi; and F(xj) = (F2[N/ N])1(xj) = (F2[N/ N] 2)1(xj) = = (F2i)1(xj) = vi. So, by the definition of t(i), we have that F(xj) = v[ j N ], such F is what we what and the lemma is proved. B.3 Label determination The purpose of this part is to use the values to which the compressed data are mapped, mentioned in the above section, to determine the labels of the data. Assuming xi is compressed to ci where ci 1 is given in section B.1. Value vi in section B.2 is designed as: vi = [ci[ N]+1] . . . [c(i+1)[ N]], where we treat [cj] as a w digit number for all j (w is a given number). If there exist not enough digits for some cj, we fill in 0 before it, and we use ab to denote the integer by putting a and b together. First, prove a lemma. Lemma B.4. For a given N, there exists a network f : R R2 with width 4 and at most O(w) parameters such that, for any w digit number ai > 0, we have f(a1a2 . . . a N) = (a1, a2 . . . a N). Proof. Firstly, we show that, for any a > b > 0, there exists a network Fa,b(x) : R+ R+ with depth 2 and width 3, such that Fa,b(x) = x when x [0, a], and Fa,b(x) = x a when x [a + b, 2a]. We just need to take Fa,b(x) = Relu(x) a/b Relu(x a) + a/b Relu(x (a + b)).l It is easy to verify that this is what we want. Now, let q N + satisfy 2q 10w+1 1 an 2q+1 > 10w+1 and p < 1 10w N . We consider the following network: F = F20,p F21,p F2q 1,p F2q,p, and show that, F(a1a2 . . . a N/10w(N 1)) = a2 . . . a N/10w(N 1). Firstly, we have F2q,p(a1a2 . . . a N/10w(N 1)) = a1(q)a2 . . . a N/10w(N 1), where a1(q) = a1 if a1a2 . . . a N/10w(N 1) 2q and a1(q) = a1 2q if a1a2 . . . a N/10w(N 1) > 2q + p. Just by the definition of q, we know that there must be a1a2 . . . a N/10w(N 1) 2q+1. Further by the definition of p, one of the following two inequalities is true: a1a2 . . . a N/10w(N 1) < 2q or a1a2 . . . 
a N/10w(N 1) > 2q + p. So using the definition of F2q,p, we get the desired result. Similarly as before, for k = q 1, q 2, . . . , 0, we have F2k,p(a1(k + 1)a2 . . . a N/10w(N 1)) = a1(k)a2 . . . a N/10w(N 1), where a1(k) = a1(k + 1) if a1(k + 1)a2 . . . a N/10w(N 1) 2k and a1(k) = a1(k + 1) 2k if a1(k + 1)a2 . . . a N/10w(N 1) > 2k + p. Then we have the following result: a1(k) < 2k for any k = 0, 1, . . . , q. By the definition, it is easy to see that a1 < 2q+1. If a1 < 2q, then a1(q) a1 < 2q; if a1 2q, then a1a2 . . . a N/10w(N 1) > 2q + p, so a1(q) = a1 2q < 2q+1 2q = 2q. Thus a1(q) < 2q. When a1(t) < 2t for a t [q], similar as before, we have a1(t 1) < 2t 1. And t = q is proved, so we get the desired result. It is easy to see that, a1(k) are non negative integers, so there must be F(a1a2 . . . a N/10w(N 1)) = a1(0)a2 . . . a N/10w(N 1) = a2 . . . a N/10w(N 1), by a1(0) < 20 = 1, which implies a1(0) = 0. Now we construct a network Fb as follows: Fb(x) = Fb1 Fb1(x) such that: Fb1(x) : R R2 and Fb1(x) = (F(x/10w(N 1)), x) where x is defined as before. Fb2(x) : R2 R2 and Fb2((x1, x2)) = (x2/10w(N 1) x1, x1 10w(N 1)). Now we verify that Fb is what we want. By the structure of F, Fb has width 4 and depth O(w), so there are at most O(w) parameters. It is easy to see that Fb1(a1a2 . . . a N) = (a2 . . . a N/10w(N 1), a1a2 . . . a N). Then by the definition of Fb2(x), we have Fb(x) = (a1, a2 . . . a N), this is what we want. The lemma is proved. By the preceding lemma, we have the following lemma. Lemma B.5. There is a network R2 R with at most O(Nw) parameters and width 6, and for any {ai}N i=1 where aj is a w digit number and aj 1, which satisfies f(x, a1a2 . . . a N) > 0.1 if |x ak| < 1 for some k [N], and f(x, a1a2 . . . a N) = 0 if |x ak| 1.1 for all k [N]. Proof. The proof idea is as follows: First, we use x and a1a2 . . . a N to judge if |x a1| < 1 as follows: Using lemma B.4, we calculate a1 and a2 . . . a N and then calculate |x a1|. If |x a1| < 1, then we let the network output a positive number; if |x a1| 1, then calculate a2 . . . a N, and use x and a2 . . . a N to repeat the above process until all |x ai| have been calculated. The specific structure of the network is as follows: step 1: Firstly, for a given N, we introduce a sub-network fs : R2 R2, which satisfies (fs)1(x, a1a2 . . . a N) > 0.1 if |x a1| < 1, and fs(x, a1a2 . . . a N) = 0 if |x a1| 1.1, and (fs)2(x, a1a2 . . . a N) = a2 . . . a N. And fs has O(w) parameters and width 5. The first part of fs is to calculate a1 and a2 . . . a N by lemma B.4. We also need to keep x, and the network has width 5. The second part of fs is to calculate |x a1| and keep a2 . . . a N by using |x| = Relu(x) + Relu( x), which has width 4. The output of fs is Relu(1.1 |x a1|). Easy to check that this is what we want. step 2: Now we build the f mentioned in the lemma. Let f = g f N f N 1 f1. For each i [N], we will let the input of fi which is also the output of fi 1 when i > 1 be the form (x, aiai+1 . . . a N, qi), where q1 = 0. The detail is as follows: For i [N], in fi, construct fs(x, aiai+1 . . . a N) at first, and then let qi+1 = qi + (fs)1(x, aiai+1 . . . a N), to keep qi in each layer, where we need one more width than fs. Then, output (x, ai+1ai+2 . . . a N, qi+1), which is also the input of (i + 1)-th part. The output of f is q N+1, that is, g(x, 0, q N+1) = q N+1. Now, we show that, f is what we want. 
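The scanning mechanism just described can be mirrored in plain Python: peel off the leading $w$-digit block, test whether the query is within distance 1 of it, and accumulate. The sketch below is ours and reproduces only the intended behaviour of $f$, not the ReLU construction built from the $F_{2^k,p}$ sub-networks; the formal verification follows.

```python
def scan_blocks(x, digits, w):
    """Plain-Python mirror of the intended behaviour of the network f in Lemma B.5.

    `digits` is the decimal string a_1 a_2 ... a_N obtained by concatenating N
    numbers of w digits each (zero-padded).  The network outputs a positive value
    iff x is within distance 1 of some a_k; here we return the analogous
    accumulator q_{N+1}.
    """
    assert len(digits) % w == 0
    q = 0.0
    rest = digits
    while rest:
        a_k, rest = int(rest[:w]), rest[w:]   # peel off the leading w-digit block
        if abs(x - a_k) < 1:                  # the sub-network f_s fires
            q += 0.2                          # any value > 0.1, as required in the lemma
    return q

# toy example: three 3-digit numbers 017, 250, 901 concatenated
blob = "017250901"
assert scan_blocks(250.3, blob, 3) > 0.1      # close to 250  -> positive output
assert scan_blocks(500.0, blob, 3) == 0.0     # far from all  -> zero output
```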
(1): f has at most O(Nw) parameters and width 6, which is obvious, because each part fi, fi has O(w) parameters by lemma B.4, and f has at most N parts, so we get the result. (2): f(x, a1a2 . . . a N) > 0.1 if |x ak| < 1 for some k. This is because when |x ak| < 1, the k-th part will make qk+1 = qk +fs(x, akak+1 . . . a N) > 0.1, because (fs)1(x, akak+1 . . . a N) > 0.1 as said in step 1. Since qj+1 = qj + (fs)1 qj, we have f(x, a1a2 . . . a N) = q N+1 qk+1 > 0.1. (3): f(x, a1a2 . . . a N) = 0 if |x ak| 1.1 for all k. This is because when |x ak| 1.1, the k-th part will make qk+1 = qk+fs(x, akak+1 . . . a N) = qk, because fs(x, akak+1 . . . a N) = 0 as said in step 1. Since fs(x, akak+1 . . . a N) = 0 for all k, we have f(x, a1a2 . . . a N) = q N+1 = q N + fs(x, a N) = q N = = q0 = 0. B.4 The proof of Theorem 4.1 Now, we will prove Theorem 4.1. As we mentioned before, three steps are required: data compression, data projection, and label determination. The proof is as follows. Proof. Assume that Dtr = {xi}N i=1, without loss of generality, let xi = xj. Now, we show that there is a memorization network F of Dtr with O( N) parameters. Part One, data compression. The part is to compress the data in Dtr into R. Let w, b satisfy (1) and (2) in lemma B.1. Then, the first part of F is f1(x) = Relu(wx + b). Part two, data projection. Let ci = f1(xi), without loss of generality, we assume ci ci+1 and y1 = 1. We define c i as: c i = ci if xi has label 1; otherwise c i = c1. Let t(i) = argmaxj [N]{[j/ N] = i} and vk = [c t(k 1)+1][c t(k 1)+2] . . . [c t(k)]. In this part, the second part of F(x), named as f2(x) : R R2, need to satisfy f2(ci) = (v[ i for any i [N]. By lemma B.3, a network with O( N) parameters and width 4 is enough to map xi to v[ i for keeping the input, and one node is needed at each layer. So f2 just need O( N) parameters and width 5. Part Three, Label determination. In this part, we will use the vk mentioned in part two to output the label of input. The third part, nameed as f3(v, c), should satisfy that: For f3(vk, c), where vk = [c t(k 1)+1][c t(k 1)+2] . . . [c t(k)] is defined above, if |c c q| < 1 for some q [t(k 1) + 1, t(k)], then f3(vk, c) > 0.1; and f3(vk, c) = 0 if |c c q| 1.1 for all q [t(k 1) + 1, t(k)]. Because the number of digits for ci is O(ln(n N/c)) by (1) in lemma B.1 and lemma B.5, we know that such a network need O( N ln(Nn/c)) parameters. Construction of F and verify it: Let F(x) = f3(f2(f1(x))) 0.05. We show that F is what we want. (1): By parts one, two, three, it is easy to see that F has at most O( N ln(Nn/c)) parameters and width 6. (2): F(x) is a memorization of Dtr. For any (xi, yi) Dtr, consider two sub-cases: (1.1: if yi = 1): Using the symbols in Part Two, f2(f1(xi)) will output (v[ i N ], f1(xi)). Since c i = ci because yi = 1, by part three, we have f3(f2(f1(x))) 0.05 0.1 0.05 > 0. (1.2 if yi = 1): By (2) in lemma B.1, for (z, 1) Dtr, we know that |f1(xi) [f1(x1)]| |f1(xi) f1(x1)| |f1(x1) [f1(x1)]| 4 1 = 3. So, by part three, we have f3(f2(f1(xi))) = 0 0.05 < 0. The Running Time: In Part One, it takes poly(B(Dtr), ln ϵ) times to find such w and b with probability 1 ϵ, as said in lemma B.1. In other parts, the parameters are calculated deterministically. We proved the theorem. C Proof of Theorem 4.3 Proof. It suffices to show that there exists a memorization algorithm L, such that if D D(n, c) and Dtr DN, then the network L(Dtr) has a constant number of parameters (independent of N). The construction has four steps. 
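Before going through the four steps, the selection performed in Steps One and Two below can be pictured with a short greedy sketch (illustrative NumPy code with names of our choosing; the proof only needs the existence of such a subset).

```python
import numpy as np

def greedy_net(X, y, c0):
    """Greedy selection of D_s from D_tr (Steps One and Two below): keep a sample
    only if it is more than c0/3 away from every sample kept so far.  The kept
    set is then c0/3-separated and every training sample lies within c0/3 of it."""
    kept_X, kept_y = [], []
    for xi, yi in zip(X, y):
        if all(np.linalg.norm(xi - z) > c0 / 3 for z in kept_X):
            kept_X.append(xi)
            kept_y.append(yi)
    return np.array(kept_X), np.array(kept_y)

# toy run in 2-D: remove a band around the decision line so that c0 is bounded away from 0
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 2))
X = X[np.abs(X[:, 0] + X[:, 1] - 1.0) > 0.1]
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)
diff = X[y == 1][:, None, :] - X[y == -1][None, :, :]
c0 = np.sqrt((diff ** 2).sum(-1)).min()               # Step One: smallest cross-label distance
Ds_X, Ds_y = greedy_net(X, y, c0)                     # Step Two: the subset D_s
print(len(X), "training samples ->", len(Ds_X), "net points, c0 =", round(float(c0), 3))
```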
Step One: Compute $c_0=\min_{(x,y_x),(z,y_z)\in\mathcal{D}_{tr},\,y_x\ne y_z}\|x-z\|_2$, the smallest distance between two training samples with different labels.

Step Two: Select a subset $\mathcal{D}_s\subset\mathcal{D}_{tr}$ such that:
(c1): for any two distinct $(x,y_x),(z,y_z)\in\mathcal{D}_s$, it holds $\|x-z\|_2>c_0/3$;
(c2): for any $(x,y_x)\in\mathcal{D}_{tr}$, it holds $\|x-z\|_2\le c_0/3$ for some $(z,y_z)\in\mathcal{D}_s$.
Such a $\mathcal{D}_s$ obviously exists: go through $\mathcal{D}_{tr}$ and keep a sample if and only if it is more than $c_0/3$ away from all samples kept so far.

Step Three: We prove that $|\mathcal{D}_s|\le\frac{(1+c_0/3)^n}{C_n(c_0/6)^n}$, where $C_n$ is the volume of the unit ball in $\mathbb{R}^n$. Let $Q=\frac{(1+c/3)^n}{C_n(c/6)^n}$; since $c_0\ge c$, this implies $|\mathcal{D}_s|\le Q$. Let $B_2(x,r)=\{z:\|z-x\|_2\le r\}$ and let $V(A)$ denote the volume of $A$. Since $\mathcal{D}_s\subset\mathcal{D}_{tr}\subset[0,1]^n\times\{-1,1\}$, we have $\cup_{(x,y)\in\mathcal{D}_s}B_2(x,c_0/6)\subset[-c_0/6,\,1+c_0/6]^n$. By condition (c1), the balls $B_2(x,c_0/6)$ and $B_2(z,c_0/6)$ are disjoint for distinct $(x,y_x),(z,y_z)\in\mathcal{D}_s$, so $\sum_{(x,y)\in\mathcal{D}_s}V(B_2(x,c_0/6))\le(1+c_0/3)^n$, which gives $|\mathcal{D}_s|\le\frac{(1+c_0/3)^n}{C_n(c_0/6)^n}\le Q$.

Step Four: There is a robust memorization network of $\mathcal{D}_s$ with robustness radius $c_0/3$ and at most $O(Qn)$ parameters, and this network is a memorization of $\mathcal{D}_{tr}$. By condition (c1) and [62], there is a robust memorization network $\mathcal{F}_{rm}$ of $\mathcal{D}_s$ with $O(|\mathcal{D}_s|n)$ parameters and robustness radius $c_0/3$; by Step Three, $|\mathcal{D}_s|\le Q$, so this network has at most $O(Qn)$ parameters. By condition (c2), for any $(x,y_x)\in\mathcal{D}_{tr}$ there is a $(z,y_z)\in\mathcal{D}_s$ with $\|x-z\|_2\le c_0/3$. First, we must have $y_x=y_z$: if $y_x\ne y_z$, then $\|x-z\|_2\ge c_0>c_0/3$ by the definition of $c_0$, a contradiction. Then, since $\mathcal{F}_{rm}$ has robustness radius $c_0/3$, we have $\mathrm{Sgn}(\mathcal{F}_{rm}(x))=\mathrm{Sgn}(\mathcal{F}_{rm}(z))=y_z=y_x$, so $\mathcal{F}_{rm}$ is a memorization network of $\mathcal{D}_{tr}$. The theorem is proved.

D Proof of Theorem 5.1

In this section, we prove that networks with small width cannot generalize well for some distributions. For a given width $w$, we construct a distribution on which any network of width $w$ has poor generalization. The proof consists of the following parts.

D.1 Disadvantages of networks with small width

In this section, we demonstrate that a network with a small width has some unfavorable properties. We have the following simple fact.

Lemma D.1. Let the first weight matrix of the network $\mathcal{F}$ be $W$. If $Wx=Wz$, then $\mathcal{F}(x)=\mathcal{F}(z)$.

If $W$ is not of full rank, then there exist $x\ne z$ satisfying $Wx=Wz$. Moreover, if $x$ and $z$ have different labels, then since $\mathcal{F}(x)=\mathcal{F}(z)$ by Lemma D.1, $\mathcal{F}$ must give an incorrect result on $x$ or $z$. By matrix decomposition, we also have the following fact.

Lemma D.2. Let the first weight matrix of the network $\mathcal{F}:\mathbb{R}^n\to\mathbb{R}$ be $W$, and let $\mathcal{F}$ have width $w<n$. Then there exists $W_1\in\mathbb{R}^{w\times n}$ with orthonormal rows such that $W_1x=W_1z$ implies $\mathcal{F}(x)=\mathcal{F}(z)$.

Proof. By matrix decomposition, we can write $W=NW_1$, where $N\in\mathbb{R}^{w\times w}$, $W_1\in\mathbb{R}^{w\times n}$, and the rows of $W_1$ are orthonormal. Then $W_1x=W_1z$ implies $Wx=NW_1x=NW_1z=Wz$, and the claim follows from Lemma D.1.

At this point, we can try to construct a distribution on which any network with small width has poor generalization.

D.2 Some useful lemmas

In this section, we introduce some lemmas that are used in the proof in Section D.3.

Lemma D.3. Let $B(r)$ be the ball with radius $r$ in $\mathbb{R}^n$. For any given $\delta>0$, let $\epsilon=2\delta/n$. Then $\frac{V(B(\sqrt{1-\epsilon}\,r))}{V(B(r))}\ge 1-\delta$.

Proof. We have $\frac{V(B(\sqrt{1-\epsilon}\,r))}{V(B(r))}=(1-\epsilon)^{n/2}\ge 1-n\epsilon/2=1-\delta$.

For $w\in\mathbb{R}^{a\times b}$ and $q\in\mathbb{R}^a$, let $q\odot w=\sum_{i=1}^a q_iw_i$, where $q_i$ is the $i$-th entry of $q$ and $w_i$ is the $i$-th row of $w$. Then we have

Lemma D.4. Let $W\in\mathbb{R}^{w\times n}$ have orthonormal rows.
(1): For any $q_1\ne q_2\in\mathbb{R}^w$, we have $\{x\in\mathbb{R}^n:Wx=W(q_1\odot W)\}\cap\{x\in\mathbb{R}^n:Wx=W(q_2\odot W)\}=\emptyset$.
(2): If S is the unit ball in Rn, then S = q Rw,||q||2 1{x Rn : Wx = W(q W), x S}. (3): For any q Rw, {x Rn : Wx = W(q W), x S} is a ball in Rn w with volume (1 ||q||2 2)(n w)/2Cn w, where Ci is the volume of the unit ball in Ri. Proof. First, we define an orthogonal coordinate system {Wi}n i=1 in Rn. Let Wi be the i-th row of W when i w. When i > w, let Wi be a unit vector orthogonal with all Wj where j < i. Then for all x Rn, we say exi is the i-th weight of x under such coordinate system. Then, Wx = Wz if and only exi = ezi for i [w]. Now, we can prove the lemma. (1): The first weight w of q1 W under orthogonal coordinate system {Wi}n i=1 is q1, so if x {x Rn : Wx = W(q1 W)}, we have exi = (q1)i for i [w]. The first w weight of q2 W under orthogonal coordinate system {Wi}n i=1 is q2, so if x {x Rn : Wx = W(q2 W)}, we have exi = (q2)i for i [w]. Because q1 = q2 Rw, we get the result. (2): For any x Rn, let q(x) = (f x1, f x2, . . . , f xw) Rw. It is easy to see that ||x||2 = q Pn i=1 exi 2, so ||q(x)||2 1 when ||x||2 1. Now we verify that: for any s S, we have s {x Rn : Wx = W(q(s) W), x S}. Firstly, we have Ws = Pw i=1 < wi, PN i=1 esiwi >= Pw i=1 esi. Secondly, we have W(q(s) W) = Pw i=1 < wi, Pw i=1 esiwi >= Pw i=1 esi. So Ws = W(q(s) W), resulting in s {x Rn : Wx = W(q(s) W), x S}, which implies that S = q Rw,||q||2 1{x Rn : Wx = W(q W), x S}. (3): By the proof of (2), we know that if x satisfies exi = qi for i [w], then x {x Rn : Wx = W(q W)}. By (1), {x Rn : Wx = W(q W)} will not intersect for different q. Therefore, x {x Rn : Wx = W(q W)} equals exi = qi for i [w]. Since ||x||2 = q Pn i=1 exi 2, when x {x Rn : Wx = W(q W)}, we have exi = qi for i [w], so Pn i=w+1 exi 2 = ||x||2 2 ||q||2 2, and such n w weight is optional. Therefore, {x Rn : Wx = W(q W), x S} is a ball in Rn w with radius p 1 ||q||2 2, so we get the result. Lemma D.5. Let r3 > r2 > r1, n 1 and x r1, then (r3 x)n (r2 x)n (r1 x)n rn 3 rn 2 rn 1 . Proof. Let f(x) = (r3 x)n (r2 x)n (r1 x)n . We just need to prove f(x) f(0) when x r1. We calculate the derivative f(x) at first: f (x) = ((r3 x)n (r2 x)n) (r1 x)n ((r3 x)n (r2 x)n)((r1 x)n) It is easy to calculate that ((r3 x)n (r2 x)n) = n((r3 x)n 1 (r2 x)n 1) and ((r1 x)n) = n(r1 x)n 1. Putting this into the above equation, we have f (x) = P(x)(((r3 x)n 1 (r2 x)n 1)(r1 x) ((r3 x)n (r2 x)n)) Where P(x) is a positive value about x. Since ((r3 x)n 1 (r2 x)n 1)(r1 x) ((r3 x)n (r2 x)n) = (r3 x)n 1(r3 r1) + (r2 x)n 1(r2 r1) 0 we have f (x) 0, resulting in f(x) f(0). The lemma is proved. Lemma D.6. Let a > b > 1, n > m 1. If an bn = 1. Then am bm 1. Proof. We have 1 = an bn bn m(am bm) > am bm. Lemma D.7. Let a > qb where q < 1 and a, b > 0. Then min{a, b} qb. Proof. When min{a, b} = b, by q < 1, the result is obvious. When min{a, b} = a, by a > qb, the result is obvious. Lemma D.8. For any w > 0, there exist r1, r2, r3 and n such that (1): rn 3 rn 2 = rn 1 ; (2): rn w 3 rn w 2 0.99rn w 1 . Proof. Because the equations are all homogeneous, without loss of generality, we assume that r1 = 1. We take α = 21/n 1, β + α = 31/n 1, and n to satisfy 3w/n < 1.001. Let r2 = 1 + α, r3 = 1 + α + β. We show that this is what we want. At first, we have rn 3 rn 2 = (1 + α + β)n (1 + α)n = 3 2 = 1 = rn 1 . We also have (1 + α + β)w < 1.001, named (k1). 
So we have rn w 3 rn w 2 = (1 + α + β)n w (1 + α)n w = (1+α+β)n w(1+α)w (1+α)n (1+α)w (1+α+β)n w(1+α)w (1+α)n 1.001 (by (k1)) = (1+α+β)n (1+α+β)n w((1+α+β)w (1+α)w) (1+α)n 1.001 (1+α+β)n (1+α)n 0.001(1+α+β)n 1.001 (by (k1)) = (1+α+β)n (1+α)n 0.003 1.001 = 1 0.003 1.001 0.99. The lemma is proved. D.3 Construct the distribution In this section, we construct the distribution in Theorem 5.1. Definition D.9. Let q be a point in [0, 1]n, 0 < r1 < r2 < r3, and we define Bk 2(z, t) = {x Rk : ||x z||2 t}, where k N +, z Rk and t 0. The distribution D(n, q, r1, r2, r3) is defined as: (1): This is a distirbution on Rn { 1, 1}. (2): A point has label 1 if and only if it is in Bn 2 (q, r1). A point has label -1 if and only if it is in Bn 2 (q, r3)/Bn 2 (q, r2). (3): The points with label 1 or -1 satisfy the uniform distribution, and let the density function be f(x) = λ = 1 V (Bn 2 (q,r3)) V (Bn 2 (q,r2))+V (Bn 2 (q,r1)). We now prove Theorem 5.1. Proof. Use the notations in Definition D.9. Now, we let ri, q, n, w satisfy: (c1): Bn 2 (q, r3) [0, 1]n; (c2): rn 3 rn 2 = rn 1 ; (c3): rn w 3 rn w 2 0.99rn w 1 . Lemma D.8 ensures that such ri, q, n exist. Let distribution D = D(n, q, r1, r2, r3), where D(n, q, r1, r2, r3) is given in Definition D.9. Now, we show that D is what we want. We prove that for any given F with width w, we have AD(F) < 0.51. Firstly, we define some symbols. Using lemma D.2, let W Rw n whose rows are unit and orthogonal and satisfy that Wx = Wz implying F(x) = F(z). Then define S1,x = {z : Wz = Wx, z Bn 2 (q, r1)} and S2,x = {z : Wz = Wx, z Bn 2 (q, r3)/Bn 2 (q, r2)}. By lemma D.2, we know that, for any given x, the points in S1,x S2,x have the same output after inputting to F, but the points in S1,x have label 1 and the points in S2,x have label -1. So F must give the wrong label to the point in S1,x or S2,x. The proof is then divided into two parts. Part One: Let h Bw 2 (0, r1), and x(h) = q + h W Rn, where is defined in section D.2. Consider that for any given h, F must give the wrong label to the point in S1x(h) or S2x(h), we have that F will give the wrong label with probability at least min{P(x,y) D(x S1x(h)), P(x,y) D(x S2x(h))}. So, now we only need to sum these values about h. For any different h1, h2 Bw 2 (0, r1), we have S1x(h1) S1x(h2) = , S2x(h1) S2x(h2) = , and h Bw 2 (0,r1)S1x(h) = Bn 2 (q, r1). By (1) and (2) in lemma D.4. Proof is similar for S2x(h). Then, by the volume of S1x(h), S2x(h) calculated in lemma D.4, we know that, the probability of F producing an error on distribution D is at least h Bw 2 (0,r1) min{P(x,y) D(x S1x(h)), P(x,y) D(x S2x(h))} = λCn w R x Bw 2 (0,r1) min{(r2 1 ||x||2 2)(n w)/2, (r2 3 ||x||2 2)(n w)/2 (r2 2 ||x||2)(n w)/2}dx where Cn w is the volume of the unit ball in Rn w as mentioned in lemma D.4. Next, we will estimate the lower bound of this value Part Two: Firstly, by lemma D.5, we know that (r2 3 ||x||2 2)(n w)/2 (r2 2 ||x||2 2)(n w)/2 (r2 1 ||x||2)(n w)/2 (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 . Then, by lemma D.6 and (c2) , we know that (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 1. 
Thus by lemma D.7, we have x Bw 2 (0,r1) min{(r2 1 ||x||2 2)(n w)/2, (r2 3 ||x||2 2)(n w)/2 (r2 2 ||x||2 2)(n w)/2}dx = λCn w R x Bw 2 (0,r1) min{(r2 1 ||x||2 2)(n w)/2, (r2 3 ||x||2 2)(n w)/2 (r2 2 ||x||2 2)(n w)/2 (r2 1 ||x||2 2)(n w)/2 (r2 1 ||x||2 2)(n w)/2}dx λCn w R x Bw 2 (0,r1) min{(r2 1 ||x||2 2)(n w)/2, (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 (r2 1 ||x||2 2)(n w)/2}dx λCn w (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 R x Bw 2 (0,r1)(r2 1 ||x||2 2)(n w)/2dx = (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 P(x,y) D(y = 1). From rn 3 rn 2 = rn 1 , we know that λV (Bn 2 (q, r1)) = λ(V (Bn 2 (q, r3)) V (Bn 2 (q, r2))) = 0.5, so P(x,y) D(y = 1) = P(x,y) D(y = 1) = 0.5, and further consider the (c3), we have (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 P(x,y) D(y = 1) 0.5 (r2 3)(n w)/2 (r2 2)(n w)/2 (r2 1)(n w)/2 0.49. The theorem is proved. E Proof of Theorem 5.3 Firstly, note that Theorem 5.3 cannot be proved by the following classic result. Theorem E.1 ([57]). Let D be any joint distribution over Rn { 1, 1}, Dtr a dataset of size N selected i.i.d. from D, and H = {h : Rn R} the hypothesis space. Then with probability at least 1 δ, sup h H |R(h, D) R(h, Dtr)| Rad N(H) where R(h, D) is the population risk, R(h, Dtr) is the empirical risk, and Rad N(H) is the Radermecher complexity of H. Theorem E.1 is the classical conclusion about the lower bound of generalization error, and theorem 5.3 and Theorem E.1 are different. Firstly, Theorem E.1 is established on the basis of probability, whereas Theorem 5.3 is not. Secondly, Theorem E.1 highlights the existence of a gap between the empirical error and the generalization error for certain functions within the hypothesis space, and does not impose any constraints on the value of empirical error. However, memorization networks, which perfectly fit the training set, will inherently have a zero empirical error, so Theorem E.1 cannot directly address Theorem 5.3. Lastly, Theorem E.1 relies on Radermacher complexity, which can be challenging to calculate, while Theorem 5.3 does not have such a requirement. For the proof of Theorem 5.3, we mainly follow the constructive approach of memorization network in [55], but during the construction process, we will also consider the accuracy of the memorization network. Our proof is divided into four parts. E.1 Data Compression The general method of constructing memorization networks compresses the data into a low dimensional space at first, and we adopt this approach. We are trying to compress the data into 1-dimension space. However, we require the compressed data to meet some conditions, as stated in the following lemma. Lemma E.2. Let D be a distribution in [0, 1]n { 1, 1} with separation bound c and density r, and Dtr DN. Then, there are w Rn and b R that satisfy: (1): O(n N 3r/c) wx + b 1 for all x [0, 1]n; (2): |wx wz| 4 for all (x, 1), (z, 1) Dtr; (3): P(x,y) D( (z, yz) Dtr, |wx wz| 3) < 0.01. Proof. Since distribution D is definition on [0, 1]n, we have c 1 and r 1. Because the density function of D is r, we have P(x,y) D(x B2(z, r1)) r V (B2(z, r1)) < r(2r1)n = 1 400N2 for all z Rn, where r1 = 1 2(400r N 2)1/n . It is easy to see that r1 1 because r 1. Then, we have the following two results: Result one: Let u Rn be uniformly randomly sampled from the hypersphere Sn 1. Then we have P(| u, (x z) | c 4N2 q 8 nπ, (x, 1), (z, 1) Dtr) > 0.5.The proof is similar to that of lemma B.1. Result Two: Let u Rn be uniformly randomly sampled from the hypersphere Sn 1. 
Then Pu(P(x,y) D( (xi, yi) Dtr, | u, (x xi) | < r 800N 2 q 8 nπ) < 0.01) > 0.5. Firstly, by lemma B.2, and take T = 800N 2, we can get that: for any given v Rn, if u Rn be uniformly randomly sampled from the hypersphere Sn 1, then P(| u, v | < ||v||2 800N2 q 8 nπ) < 1 400N2 . Thus, by such inequality, the density of D and the definition of r1, we have that: Pu,(x,y) D(| u, (x v) | < r1 800N 2 q = Pu,(x,y) D(| u, (x v) | < r1 800N 2 q 8 nπ| ||x v||2 r1)P(x,y) D(||x v||2 r1) +Pu,(x,y) D(| u, (x v) | < r1 800N2 q 8 nπ| ||x v||2 < r1)P(x,y) D(||x v||2 < r1) < Pu,(x,y) D(| u, (x v) | < ||x v||2 8 nπ| ||x v||2 r1) + P(x,y) D(||x v||2 < r1) Pu(| u, (x v) | < ||x v||2 8 nπ) + P(x,y) D(||x v||2 < r1) < 1 400N2 + 1 400N2 = 1/(200N 2). On the other hand, we have Pu,(x,y) D(| u, (x v) | < r1 800N2 q Pu(P(x,y) D(| u, (x v) | < r1 800N2 q 8 nπ) 0.01/N) 0.01/N. So, we have Pu(P(x,y) D(| u, (x v) | < r 800N2 q 8 nπ) 0.01/N) < 1 200N 2 /(0.01/N) = 1/(2N). Name this inequality as (*). On the other hand, we have Pu(P(x,y) D( (xi, yi) Dtr, | u, (x xi) | < r 800N 2 q 8 nπ) < 0.01) = 1 Pu(P(x,y) D( (xi, yi) Dtr, | u, (x xi) | < r 800N2 q 8 nπ) 0.01) Then, if a u Rn satisfies P(x,y) D( (xi, yi) Dtr, | u, (x xi) | < r 800N2 q 8 nπ) 0.01, then we have P(x,y) D(|u(x xi)| < r 800N 2 q 8 nπ) 0.01/N for some (xi, yi) Dtr. So taking v as xi in inequality (*) and using the above result, we have Pu(P(x,y) D( (xi, yi) Dtr, | u, (x xi) | < r 800N2 q 8 nπ) < 0.01) = 1 Pu(P(x,y) D( (xi, yi) Dtr, | u, (x xi) | < r 800N2 q 8 nπ) 0.01) 1 P (xi,yi) Dtr Pu(P(x,y) D(| u, (x xi) | < r 800N2 q 8 nπ) 0.01/N) > 1 N 1 2N = 0.5. So we get the result. This is what we want. Construct w, b and verify their property Consider the fact: if A(u), B(u) are two events about random variable u, and Pu(A(u) = True) > 0.5, Pu(B(u) = True) > 0.5, then there is a u, which makes events A(u) and B(u) occurring simultaneously. By the above fact and Results one and two, we have that there exist ||u||2 = 1 and u Rn such that | u, (x z) | c 4N2 q 8 nπ, (x, 1), (z, 1) Dtr and P(x,y) D( (xi, yi) Dtr, | u, (x xi) )| < r 800N2 q 8 nπ) < 0.01. Now, let w = max{ 2400 n N 2 r1 , 16 n N 2 c }u and b = ||w||2 n + 1, then we show that w and b are what we want: (1): we have O(n N 3) wx + b 1 for all x [0, 1]n. Firstly, because D is defined in [0, 1]n { 1, 1}, we have ||x||2 n, resulting in and wx + b b ||w||2 n 1. On the other hand, using c 1 and r1 1, we have |wx| ||w||2 n O( n N 2 r1c ), so wx + b |wx| + b O(n N 3r1/n/c). (2): We have |w(x z)| 4 for all (x, 1), (z, 1) Dtr. It is easy to see that |w(x z)| | 16 n N 2 c u(x z)| = 16 n N 2 c |u(x z)|. Because |u(x z)| c 4 n N 2 , so |w(x z)| = 16 n N 2 c |u(x z)| 16 n N 2 c c 4 n N 2 = 4. (3): we have P(x,y) D( (z, yz) Dtr, |wx wz| 3) < 0.01. Because |w(x z)| 2400 n N 2 r1 |u(x z)| |u(x z)|, and consider that P(x,y) D( (z, yz) Dtr, |u(x z)| < r1 800N2 q 8 nπ) < 0.01, we get the result. So, w and b are what we want. and the lemma is proved. E.2 Data Projection The purpose of this part is to map the compressed data to appropriate values. Let w Rn and b R be given, and Dtr = {(xi, yi)}N i=1. Without losing generality, we assume that wxi < wxi+1. In this section, we show that, after compressing the data into 1-dimension, we can use a network F to map wxi + b to v[ i [ N] ], where {vj} [ N [ j=0 are the given values. Furthermore, F should also satisfy F(wx + b) {vj} [ N [ j=0 for all x [0, 1]n except for a small portion. This network has O( N) parameters, as shown below. Lemma E.3. 
Let w Rn and b R be given, {vj} [ N [ j=0 R and 1 > ϵ > 0 be given. Let Dtr = {(xi, yi)}N i=1 and Dtr DN where D is a distribution, and assume that wxi + b < wxi+1 + b. Then a network F with width O( N), depth 2, and at most O( N) parameters, can satisfy that: (1): F(wxi + b) = v[ i N ] for all i [N]; (2): P(x,y) D(F(wx + b) {vj} [ N [ Proof. Let qi = (wxi+1 + b) (wxi + b) and q = mini{qi}. Then we consider the set of points Si = {wxi + b + qϵ 2N j}[N/ϵ]+1 j=1 , for any i. We have that: P s Si P(x,y) D(wx + b (s qϵ 2N /2, s + qϵ 2N /2)) = P(x,y) D( s Si, wx + b (s qϵ 2N /2, s + qϵ Consider that |Si| N/ϵ, so for any i, there is a si Si, makes that P(x,y) D(wx + b (si qϵ 2N /2, si + qϵ 2N /2)) ϵ N . And it is easy to see that Si satisfies the following result: if z Si, then: wxi + b < wxi + b + qϵ 2N z wxi + b + qϵ 2N ([N/ϵ] + 1) < wxi + b + qi = wxi+1 + b. So we have (si qϵ 2N /2, si + qϵ 2N /2) (wxi + b, wxi+1 + b), Name this inequality as ( ). Let k = [ N [ N]] and t(i) = argmaxj [N]{[j/ N] = i}. Now, we define such a network: F(x) = Pk i=1 vi vi 1 qϵ 2N (Relu(x st(i) + qϵ 2N /2) Relu(x st(i) qϵ 2N /2)) + v0. This network has width 2k, depth 2 and O( N) parameters. We can verify that such networks satisfy (1) and (2). Verify (1): For a given i [N], let c(i) = [ i N ]. Then, when j < c(i), we have t(j) < i, so st(j) + qϵ 2N /2 wxt(j)+1 + b wxi + b (this has been shown in ( )), resulting in: vj vj 1 qϵ 2N (Relu(wxi + b st(j) + qϵ 2N /2) Relu(wxi + b st(j) qϵ 2N /2) = vj vj 1. When j c(i), similar to before, we have st(j) qϵ 2N /2 > wxi +b, resulting in vj vj 1 qϵ 2N (Relu(wxi +b st(j) + qϵ 2N /2) Relu(wxi + b st(j) qϵ 2N /2) = 0. So F(xi) = v0 + (v1 v0) + + (vc(i) vc(i) 1) = vc(i), this is what we want. Verify (2): At first, we show that for any x [0, 1]n satisying wx+b / k i=1(si qϵ 2N /2, si + qϵ 2N /2), we have F(x) {vi}. This is because: for any x satisfies wx + b / k i=1(si qϵ 2N /2, si + qϵ 2N /2), we have F(wx + b) = v0 + (v1 v0) + + (vk vk 1) = vk, where k satisfies st(k) < wx + b and k is the maximum. The proof is similar as above. Second, we show that the probability of such x is at least 1 ϵ. By P(x,y) D(wx + b (si qϵ 2N /2, si + qϵ 2N /2)) ϵ N for any i, we have P(x,y) D( i, wx + b (si qϵ 2N /2, si + qϵ 2N /2)) Pk i=1 P(x,y) D(wx + b (si qϵ 2N /2, si + qϵ 2N /2)) ϵ/N N = ϵ, this is what we want. So F is what we want. The lemma is proved. E.3 Label determination This is the same as in section B.3. E.4 The proof of Theorem 5.3 Three steps are required: data compression, data projection, label determination. The specific proof is as follows. Proof. Assume that Dtr = {xi}N i=1, without loss of generality, let xi = xj. Now, we show that there is a memorization network F of Dtr with O( N) parameters but with poor generalization. Part One, data compression. The first part is to compress the data in Dtr into R, let w, b satisfy (1),(2),(3) in lemma E.2. Then, the first part of F is f1(x) = Relu(wx + b). On the other hand, not just samples in Dtr, all the data in Rn have been compressed into R by f1(x). By (3) in lemma E.2, we have P(x,y) D( (z, yz) Dtr, |wx wz| 3) < 0.01, resulting in, we have P(x,y) D(|wx wz| > 3 for (z, yz) Dtr) > 0.99. By the probability theory, we have P(x,y) D(|wx wz| > 3 for (z, yz) Dtr > 0.99) = P(x,y) D(|wx wz| > 3 for (z, yz) Dtr > 0.99, y = 1)+ P(x,y) D(|wx wz| > 3 for (z, yz) Dtr > 0.99, y = 1) > 0.99. Without losing generality, we assume that P(x,y) D( (z, yz) Dtr, |wx wz| > 3, y = 1) > 0.99/2, which represents the following fact. 
Define S = {x : x has label 1 and |wx wz| > 3 for (z, yz) Dtr}. Then the probability of points in S is at least 0.99/2. In the following proof, in order to make the network having bad generalization, we will make the network giving these points (the points in S) incorrect labels. Part two, data projection. Let ci = f1(xi)/ Without losing generality, we will assume ci ci+1. Now, assume that we have N0 samples in Dtr with label 1, and {ij}N0 j=1 [N] such that xij has label 1, and ij < ij+1. Let t(i) = argmaxj [N]{[j/ N0] = i} and vk = [cit(k 1)+1][cit(k 1)+2] . . . [cit(k)]. In this part, the second part of F(x), named as f2(x), need to satisfy f2(cij) = (v[ j Furthermore, we also hope that P(x,y) D(f2(f1(x))[1] {vi}) 0.999, where f2(f1(x))[i] is the i-th weight of f2(f1(x)), and P(x,y) D(f2(f1(x))[2]) = f1(x). By lemma B.3, a network with O( N) parameters and depth 2 is enough to calculate v[ j N ] by cij, and the output in {vi} has probability 0.999. Retaining ci just need one node. So f2 need O( N) parameters. Part Three, Label determination. In this part, we will use the vk mentioned in part two to output the label of inputs. The third part, named as f3(v, c), should satisfy that for f3(vk, c), where vk = [cit(k 1)+1][cit(k 1)+2] . . . [cit(k)] as mentioned above, if |c ciq| < 1 for some q [t(k 1) + 1, t(k)], then f3(vk, c) > 0.1, and f3(vk, c) = 0 when |c ciq| 1.1 for all q [t(k 1) + 1, t(k)]. This network need O( N0 ln(N0nr/c)) parameters, by (1) in lemma E.2 and lemma B.5. Construction of F and verify it: Let F(x) = f3(f2(f1(x))) 0.05. We show that F is what we want. (1): By parts one, two, three, and the fact N0 N, it is easy to see that F has at most O(n + N ln(Nnr/c)) parameters. (2): F(x) is a memorization of Dtr. For any (x, y) Dtr, two cases are consided. (1.1, if y = 1): using the symbols in Part two, because y = 1, so x = xik for some k. As mentioned in part two, f2(f1(x)) will output (vi[ k N0 ], f1(x)). Then, by part three, because |f1(x) [f1(x)]| < 1, so we have f3(f2(f1(x))) 0.05 0.1 0.05 > 0. (1.2 if y = 1): By (2) in lemma E.2, for (z, 1) Dtr, we know that |f1(x) [f1(z)]| |f1(x) f1(z)| |f1(z) [f1(z)]| 4 1 = 3. So, by part three, we have f3(f2(f1(x))) = 0 0.05 < 0. (3): AD(F) < 0.51. We show that, almost all x S (S is mentioned in part one) will be given wrong label. For x S, we have |wx wxi| 3, so |wx + b [wxi + b]| 2 for all (xi, yi) Dtr. Then for any vi, by part three and the definition of vi, we have f3(vi, wx + b) = 0 when x S. So, when f2(f1(x))[1] {vi} and x S, we have f3(f2(f1(x))) 0.05 = 0 0.05 < 0. Consider that for any x S, the label of x is 1 in distribution D. So when x S satisfies f2(f1(x))[1] {vi}, we find that f(x) gives the wrong label to x. Since P(x S) 0.99/2 and P(f2(f1(x))[1] {vi}) > 0.999, we have P(x S, f2(f1(x))[1] {vi}) 0.99/2 0.001 > 0.49. By the above result, we have that, with probability at least 0.49, Sgn(f(x)) = y, so AD(f) < 0.51. So, we prove the theorem. F Proof of Theorem 6.1 We first give three simple lemmas. Lemma F.1. We can find 2 [ n c2 ] points in [0, 1]n, and the distance between any two points shall not be less than c. Proof. Let t = [ n c2 ]. We just need to consider following points in [0, 1]n: For any given i1, i2, i3, . . . , it {0, 1}, let xi1,i2,i3,...,it be the vector in [0, 1]n satisfying: for any j [t], the (j 1) c2 + 1 to j c2 weights of xi1,i2,i3,...,it is ij; other weights are 0. We will show that, if {i1, i2, i3, . . . , it} = {j1, j2, j3, . . . , jt}, then it holds ||xi1,i2,i3,...,it xj1,j2,j3,...,jt||2 c. 
Without losing generality, let i1 = j1. Then the first c2 weights of xi1,i2,i3,...,it and xj1,j2,j3,...,jt are different: one is all 1, and the other is all 0. So, the distance between such two points is at least p Then {xi1,i2,i3,...,it}ij [0,1] is the 2t point we want, so we prove the lemma. Lemma F.2. If ϵ, δ (0, 1) and k, x Z+ satisfy that: x k(1 2ϵ δ), then 2x(P[kϵ] j=0 k x j ) < 2k(1 δ). Proof. We have 2x(P[kϵ] j=0 k x j ) 2x2k x [kϵ] k x < 2k(1 δ). The first inequality sign uses Pm j=0 n m m2n/n where m n/2, and by x k(1 2ϵ δ), so [kϵ] (k x)/2. The third inequality sign uses the fact x k(1 2ϵ δ). Lemma F.3. If k, v R+ such that kv > 3, and a = [kv] and 3 b k), then a (b/ ln(b))2v/2. k b/ ln(b), then b k ln(b) b, which is impossible. So b k), and then k b/ ln(b). Resulting in a kv 1 kv/2 (b/ ln(b))2v/2. Now, we prove Theorem 6.1 Proof. By Theorem 4.1, we know that there is a v1 > 1, when N n, for any distribution D D(n, c) and Dtr DN, Dtr has a memorization with v1 N ln(Nn/c) parameters. We will show that Theorem 6.1 is true for v = 1 32v2 1 . Assume Theorem 6.1 is wrong, then there exists a memorization algorithm L such that for any n Z+, c, ϵ, δ (0, 1), if D D(n, c) and N 1 32v2 1 N2 D ln2(ND)(1 2ϵ δ), we have PDtr DN (A(L(Dtr)) 1 ϵ) 1 δ. We will derive contradictions based on this L. Part 1: Find some points and values. We can find k, n, c, δ, ϵ satisfying (1): we have n, k Z+ and 12v1 n k. Let c = 1, and we can find k points in [0, 1]n and the distance between any pair of these points is greater than c; (2): δ, ϵ (0, 1) and q = [k(1 2ϵ δ)] 3. By lemma F.1, to make (1) valid, we just need n2 < k 2n, and (2) is easy to satisfy. Part 2: Construct some distribution Let {ui}k i=1 satisfy ui [0, 1]n and ||ui uj||2 c. By (1) mentioned in (1) in Part 1, such {ui}k i=1 must exist. Now, we consider the following types of distribution D: (c1): D is a distribution in D(n, c) and P(x,y) D(x {ui}k i=1) = 1. (c2): P(x,y) D(x = ui) = P(x,y) D(x = uj) = 1/k for any i, j [k]. It is obvious that, by ||ui uj||2 c, such a distribution exists. Let S be the set that contains all such distributions. We will show that for D S, it holds ND v1 k ln(kn/c). By Theorem 4.1 and definition of v1, we know that for any distribution D S, let yi be the label of ui in distribution D S. Then there is a memorization F of {(ui, yi)}k i=1 with at most v1 k ln(kn/c) parameters. Then by (c1), the above result implies AD(F(x)) = 1, so we know that ND v1 k ln(kn/c) for any D S. Moreover, by k n 3, c = 1 and it is easy to see that ND n. We thus have 3 ND 4v1 Part 3: A definition. Moreover, for D S, we define S(D) as the following set: Z S(D) if and only if Z [k]q is a vector satisfying: Define D(Z) as D(Z) = {(uzi, yzi)}q i=1, then AD(L(D(Z))) 1 ϵ, where zi is the i-th weight of Z and yzi is the label of uzi in distribution D. It is easy to see that, if we i.i.d select q samples in distribution D to form a dataset Dtr, then (1): By c2, with probability 1, Dtr only contains the samples (uj, yj) where j [k]; (2): Let Dtr has the form shown in (1). Then every time a sample is selected, it is in {(ui, yi)}k i=1. Now we construct a vector in [k]q as follows: the index of i-th selected samples as the i-th component of the vector. Then each selection situation corresponds to a vector in [k]q which is constructed as before. Then by the definition of S(D), we have AD(L(Dtr)) 1 ϵ if and only if the corresponding vector of Dtr is in S(D). 
Putting ND 4v1 k) and q = [k(1 2ϵ δ)] in lemma F.3, we have q ( ND/(4v1) ln(ND/(4v1)))2(1 2ϵ δ)/2 N2 D(1 2ϵ δ) 32v2 1 ln2(ND). By the above result and the by the assumption of L at the beginning of the proof, so that for any D S we have t PDtr Dq(A(L(Dtr)) 1 ϵ) = |S(D)| kq 1 δ. (4) Part 4: Prove the Theorem. Let Ss be a subset of S, and Ss = {Di1,i2,...,ik}ij { 1,1},j [k] S, where distribution Di1,i2,...,ik satisfies the label of uj is ij, where j [k]. We will show that there exists at least one D Ss, such that |S(D)| < (1 δ)kq, which is contrary to equation 4. To prove that, we just need to prove that P D Ss |S(D)| < (1 δ)2kkq, use |Ss| = 2k here. To prove that, for any vector Z [k]q, we estimate how many D Ss which makes Z to be included in S(D). Part 4.1, situation of a given vector Z and a given distribution D. For a Z = (zi)q i=1 and D such that Z S(D), let len(Z) = {c [k] : i, c = zi}. We consider the distributions in Ss that satisfy the following condition: for i len(Z), the label of ui is equal to the label of ui in D. Obviously, we have 2k |len(Z)| distributions that can satisfy the above condition in Ss. Let such distributions make up a set Sss(D, Z). Now, we estimate how many distributions Ds in Sss(D, Z) satisfy Z S(Ds). For any distribution G Ss, let y(G)i be the label of ui in distribution G, and define the dataset Dtr = {(uzi, y(D)zi)}q i=1. Then Z S(Ds) if and only if: for at least k [kϵ] of i [k], L(Dtr) gives the label y(Ds)i to ui. Firstly, consider that when i len(Z). For any Ds Sss(D, Z), we have y(Ds)i = y(D)i and L(Dtr) must give the label y(D)i to ui, so when i len(Z), L(Dtr) gives the label y(Ds)i to ui. Then, consider i / len(Z). Because Z is a given vector, so if Z S(Ds), the label y(Ds)i where i / len(Z) are at most [kϵ] different from the label of ui given by L(Dtr). So, by the above two results, this kind of Ds is at most P[kϵ] i=0 k |len(Z)| i . So, we have P[kϵ] i=0 k |len(Z)| i number of distributions Ds in Sss(D, Z) satisfy Z S(Ds). Part 4.2, for any vector Z and distribution D. Firstly, for a given Z, we have at most 2|len(Z)| different Sss(D, Z) for D DS. Because when D1 and D2 satisfy y(D1)i = y(D2)i for any i len(Z), we have Dss(D1, Z) = Dss(D2, Z), and 2|len(Z)| situations of label of ui where i len(Z), so there exist at most 2|len(Z)| different Sss(D, Z). By part 4.1, for a Sss(D, Z), at most P[kϵ] i=0 k |len(Z)| i of Ds Sss(D, Z) satsify Z S(Ds). So by the above result and consider that Ds = D Ds Sss(D, Z), at most 2|len(Z)| P[kϵ] i=0 k |len(Z)| i number of Ds Ss such that Z S(Ds). And there exist kq different Z, so P D Ss |S(D)| P Z 2|len(Z)| P[kϵ] i=0 k |len(Z)| i P δ) = kq2k(1 δ). For the last inequality, we use 2|len(Z)| P[kϵ] i=0 k |len(Z)| i < 2k(1 δ), which can be shown by |len(Z)| q and lemma F.2. This is what we want. we proved the theorem. We now prove Corollary 6.4. Proof. Using lemma F.1, we can find 2 [ n c2 ] points in [0, 1]n and the distance between any two points shall not be less than c. So we take a ϵ, δ such that 1 2ϵ δ > 0, n = 3[12v1/(1 2ϵ δ)] + 3, c = 1 and k = 2 [ n c2 ] in the (1) in the part 1 of the proof of Theorem 6.1, then similar as the proof of Theorem 6.1, and we get this corollary. G Proof of Theorem 6.5 G.1 The Existence Firstly, it is easy to show that there exists a memorization algorithm which satisfies L(Dtr) ND when Dtr DN with probability 1. We just consider the following memorization algorithm: For a given dataset D, let L(D) be the memorization of D with minimum parameters, as shown in Theorem 4.1. 
Then para(L(D)) O( p And if D is i.i.d selected from distribution D, where D D(n, c), then by the definition of L and ND in Theorem 4.3, we have para(L(D)) ND with probability 1. So L is what we want. G.2 The Sample Complexity of Generalization To prove (1) in the theorem, we need three lemmas. Lemma G.1 ([44]). Let H be a hypothesis space with VCdim h and D is distribution of data, if N h, then with probability 1 δ of Dtr DN, we have |ED(F) EDtr(F)| δ N for any F H. Here, ED(F) = E(x,y) D[I(F(x) = y)], EDtr(F) = P (x,y Dtr)[I(F(x) = y)] and I(x) = 1 if x is true or I(x) = 0. Moreover, when h 1, we have |ED(F) EDtr(F)| Lemma G.2. If e ba/c, then we have a ln(bu) cu when u 2a ln(ba/c)/c. Proof. Firstly, we have a ln(bu) cu = ln(ba/c(cu/a)) cu/a , and we just need to show ln(ba/c(cu/a)) Then, we show that there are 2 ln(ba/c) ba/c. Just consider the function g(x) = x 2 ln x, by g (x) = 1 2/x, so g (x) 0 when x 2, so g(ba/c) g(e) = e 2 > 0, this is what we want. Now we consider the function f(x) = ln((ba/c)x)/x, by the above result, we have that 1 2 ln(ba/c) ba/c, we have that f(2 ln(ba/c)) = ln(2(ba/c) ln(ba/c))/(2 ln(ba/c)) ln((ba/c) (ba/c))/(2 ln(ba/c)) = 1. And consider that f (x) = 1 ln((ba/c)x) x2 0 when x 1, so, when x 2 ln(ba/c), we have f(x) f(2 ln(ba/c)) 1, which means that when cu/a 2 ln(ba/c), it holds ln(ba/c(cu/a)) cu/a 1. The lemma is proved. Lemma G.3 ([6]). Let Hm be the hypothesis space composed of the networks with at most m parameters. Then the VCdim of Hm is not more than qm2 ln(m), where q is a constant not dependent on m. Then we can prove (1) in the theorem. Proof. Let Dtr DN. Because the algorithm satisfies the condition in theorem, then L(Dtr) HND, where HND is defined in lemma G.3. By lemma G.3, the VCdim of HDtr is not more than q N 2 D ln(ND) for some q 1. Using lemma G.1 to this fact, we have N 16q N2 D ln(ND) ln( 64qe N2 D ln(ND) δϵ2 ) ϵ2 . Take these values in lemma G.1, and considering that the memorization algorithm L must satisfy that EDtr(L(Dtr)) = 1, using lemma G.2 (just take a = 8q N 2 D ln(ND), b = 8e/δ and c = ϵ2 in lemma G.2), we have 1 ED(L(Dtr)) 8q N 2 D ln(ND) ln 8e N which implies 1 ϵ ED(L(Dtr)). The theorem is proved. G.3 More Lemmas We need three more lemmas to prove Theorem 6.5. Lemma G.4. Let D [0, 1]n { 1, 1}. Then D has a memorization with width 1 if and only if D is linearly separable. Proof. If D is linearly separable, then it obviously has a memorization with width 1. If D has a memorization with width 1, we show that D is linearly separable. Let F be the memorization network of D with width 1, and F1 the first layer of F. Part 1: We show that it is impossible to find any (x1, 1), (x2, 1), (x3, 1) D such that F1(x1) < F1(x2) < F1(x3). If we can, then contradiction will be obtained. Assume (x1, 1), (x2, 1), (x3, 1) D such that F1(x1) < F1(x2) < F1(x3). It is easy to see that, for any linear function wx+b and u v k, we have wu+b wv+b wk+b or wu + b wv + b wk + b, which implies Relu(wu + b) Relu(wv + b) Relu(wk + b) or Relu(wu + b) Relu(wv + b) Relu(wk + b). Because (x1, 1), (x2, 1), (x3, 1) D satisfy that F1(x1) < F1(x2) < F1(x3), and each layer of F is a linear function composite Relu, so after each layer, the order of F1(x1), F1(x2), F1(x3) is not changed or contrary. So there must be F(x1) F(x2) F(x3) or F(x1) F(x2) F(x3). Then F cannot classify x1, x2, x3 correctly, which contradicts to the fact that F is a memorization of D. 
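Before Part 2, the ordering argument used above can be restated as a small check on the sorted, labelled 1-D projections: width-1 memorizability requires the labels to form at most two constant runs (a single threshold), and the interval condition of Lemma G.5 below corresponds to at most three runs. The following sketch is ours (ties between projections are ignored) and is only an illustration of these criteria, not code from the paper.

```python
import numpy as np

def label_blocks(proj, labels):
    """Number of maximal constant-label runs when the points are sorted by their
    1-D projection (distinct projection values assumed)."""
    order = np.argsort(proj)
    lab = np.asarray(labels)[order]
    return 1 + int(np.sum(lab[1:] != lab[:-1]))

def threshold_separable(proj, labels):
    # at most two runs  <=>  one class entirely left of the other (Lemma G.4 picture)
    return label_blocks(proj, labels) <= 2

def interval_separable(proj, labels):
    # at most three runs <=>  one class contained in a single interval (Lemma G.5 picture)
    return label_blocks(proj, labels) <= 3

# the forbidden +,-,+ pattern from Part 1: not threshold separable, but interval separable
proj = np.array([0.1, 0.5, 0.9])
labels = np.array([1, -1, 1])
print(threshold_separable(proj, labels), interval_separable(proj, labels))  # False True
```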
Part 2: We show that, it is impossible to find any (x1, 1), (x2, 1), (x3, 1) D such that F1(x1) < F1(x2) < F1(x3). This is similar to Part 1. By parts 1 and 2, without losing generalization, we know that for any (x1, 1), (x2, 1) D, it holds F1(x1) > F1(x2). Since F1 is a linear function composite Relu, D is linear separable. Lemma G.5. Let D = {(xi, yi)} [0, 1] { 1, 1}. Then D has a memorization with width 2 and depth 2 if and only if at least one of the following conditions is valid. (c1): There is a closed interval I such that: if (x, 1) D then x I and if (x, 1) D then x / I. (c2): There is a closed interval I such that: if (x, 1) D then x / I and if (x, 1) D then x I. Proof. Part 1: We show that if condition (c1) is valid, then D has a memorization with width 2 and depth 2. It is similar for (c2) to be valid. Let I = [a, b]. If for all (x, 1) D, we have x < a, then D is linear separable, and the result is valid. If for all (x, 1) D, we have x > b, then D is linear separable, and the result is valid. Now we consider the situation where x > a for some (x, 1) D and x < b for some (x, 1) D. Let x 1 = max(x, 1) D{x < a}. Then for F1(x) = x (x 1 + a)/2, it is easy to verify that F1(x) > a x 1 2 for all x a and F1(x) < 0 for all (x0, 1) D such that x0 < a. Let x1 = min(x, 1) D{x > b}. Then for F2(x) = x (x1 +b)/2, it is easy to verify that F2(x) < 0 for all x b and F2(x) > (x1 b)/2 for all (x0, 1) D such that x0 > b. Let the network F be defined by F = Relu(F1(x)) TRelu(F2(x)) t, where T = 8 x1 b is a positive real number, and t = a x 1 Now we prove F is what we want. It is easy to see that, F is a depth 2 width 2 network. When x [a, b], then F1(x) > a x 1 2 and F2(x) 0, so F(x) > 0. For (x, 1) D such that x < a, we have F1(x) < 0 and F2(x) < 0, so F(x) < 0; for (x, 1) D such that x > b, we have F1(x) < 1 and F2(x) > x1 b 4 , so F(x) 1 2 a x 1 4 < 0, this is what we want. Part 2: If D has a memorization with width 2 and depth 2, then we show that D satisfies conditions (c1) or (c2). If D is linear separable, (c1) and (c2) are valid. If not, without losing generality, assume that (x1, 1), (x2, 1), (x3, 1) D such that x1 < x2 < x3 (for the situation that (x1, 1), (x2, 1), (x3, 1) D such that x1 < x2 < x3, the proof is similar). Then we show that if (x, 1) D, we have x1 < x < x3. Assume (x0, 1) D such that x0 < x1, then we have that x0 < x1 < x2 < x3, then we can deduce the contradiction. Let F = a Relu(F1(x)) + b Relu(F2(x)) + c be the memorization network of D, where Fi(x) is a linear function. Let u, v R such that F1(u) = F2(v) = 0, without loss generality, let u v. Then we know that F is linear in such three regions: ( , u], [u, v] and [v, ). We call the three regions as linear regions of F. We prove the following three results at first. (1): The slope of F on ( , u] is positive. Firstly, we show that x0 ( , u]. If not, since (x0, 1), (x1, 1), (x2, 1) are not linear separable, and (x1, 1), (x2, 1), (x3, 1) are not linear separable, we have (x0, 1), (x1, 1) [u, v] and (x2, 1), (x3, 1) [v, ). Then, because x1 > x0 and F(x1) > F(x0), and F is linear in [u, v], we have that F(v) F(x1) > 0. Now we consider the points (v, 1), (x2, 1), (x3, 1). It is easy to see that F memorizes such three points and they are in the linear region of F, so (v, 1), (x2, 1), (x3, 1) is linear separable, which is impossible because v x2 x3 and resulting in contradiction, so x0 ( , u]. If the slope of F on ( , u] is not positive, since u x0, we have F(u) < 0. Now we consider the points (u, 1), (x1, 1), (x2, 1), (x3, 1). 
Just similar to above to get the contradiction. So the slope of F on ( , u] is positive. (2): The slope of F on [v, ) is positive. Similar to (1). (3): The slope of F on ( , u] is negative. If not, F must be a non-decreasing function, which is impossible. Using (1),(2),(3), we can get a contradiction, which means that there is a (x0, 1) D such that x0 < x1 is not possible. Consider that, in a linear region of F, if the activation states of F1 and F2 are both not activated, then on such linear regions, the slope of F is 0. But due to (1),(2),(3), all linear regions have non-zeros slope of F, so on each linear regions, at least one of F1 and F2 is activated. So, the activation states of F1 and F2 at ( , u], [u, v] and [v, ) is ( , +), (+, +), (+, ) (+ means activated, - means not activated). Then the slope of F on [u, v] is equal to the sum of the slopes of F on ( , u] and the slope of F on [v, ). But by (1),(2),(3), that means a negative number is equal to the sum of two positive numbers, which is impossible. So we get a contradiction. So if (x0, 1) D, we have x0 > x1. Similar to before, we have x0 < x3. So we get the result. By the above result, all the samples (x0, 1) D satisfies x (x1, x3), so there is a close interval in (x1, x3) such that: if (x0, 1) D, then x0 is in such interval, then (c2) is vald, and we prove the lemma. G.4 The algorithm is no-efficient. Now we prove (2) of theorem 6.5, that is, all such algorithm is not efficient if P = NP. We need the reversible 6-SAT problem defined in definition 6.6. Proof. We will show that, if there is an efficient memorization algorithm which satisfies the conditions of the theorem (has at most ND parameters with probability 1), then we can solve the reversible 6-SAT in polynomial time, which implies P = NP. Firstly, for the 6-SAT problem, we write it as the following form. Let φ = m i=1φi(n, m) be a 6-SAT for n variables, where φi(n, m) = 6 j=1 xi,j and xi,j is either xs or xs for some s [n] (see Definition 6.6). Then, we define some vectors in Rn based on φi(n, m). For i [m], define Qφ i Rn as follows: Qφ i [j] = 1 if xj occurs in φi(n, m); Qφ i [j] = 1 if xj occurs in φi(n, m); Qφ i [j] = 0 otherwise. Qφ i [j] is the j-th entry in Qφ i , then six entries in Qφ i are 1 or 1 and all other entries are zero. Now, we define a binary classification dataset D(φ) = {(xi, yi)}m+4n i=1 [0, 1]n [2] as follows. (1) For i [n], xi = 1i/3 + 1.11/3, yi = 1. (2) For i {n + 1, n + 2, . . . , 2n}, xi = 1.11i n/3 + 1.11/3, yi = 1. (3) For i {2n + 1, 2n + 2, . . . , 3n}, xi = 1i 2n/3 + 1.11/3, yi = 1. (4) For i {3n + 1, 3n + 2, . . . , 4n}, xi = 1.11i 3n/3 + 1.11/3, yi = 1. (5) For i {4n + 1, 4n + 2, . . . , 4n + m}, xi = 1/12Qφ i 4n + 1.11/3, yi = 1. Here, 1i is the vector whose i-th weight is 1 and other weights are 0, 1 is the vector whose weights are all 1. Let L be an efficient memorization algorithm which satisfies the condition in the theorem. Then we prove the following result: If n 4 and φ is a reversible 6-SAT problem, then para(L(D(φ))) = n+8 if and only if φ has a solution, which means P = NP and leads to that L does not exist when P = NP. The proof is divided into two parts. Part 1: If φ is a reversible 6-SAT problem that has a solution, then para(L(D(φ))) = n + 8. To prove this part, we only need to prove that para(L(D(φ))) n + 8 and para(L(D(φ))) n + 8. Part 1.1: we have para(L(D(φ))) n + 8. Firstly, we show that {(x1, 1), (xn+1, 1), (x2n+1, 1), (x3n+1, 1)} D(φ) are not linearly separable. 
This is because {x1, xn+1, x2n+1, x3n+1} is a linear transformation of {11, 1.111, 11, 1.111}, so {(x1, 1), (xn+1, 1), (x2n+1, 1), (x3n+1, 1)} D(φ) are not linearly separable if and only if {(11, 1), (1.111, 1), ( 11, 1), ( 1.111, 1)} are not linearly separable, by the definition of 11, easy to see that {(11, 1), (1.111, 1), ( 11, 1), ( 1.111, 1)} are not linearly separable, so we get the result. By the above result, a subset of D(φ) is not linearly separable, so we have that D(φ) is not linearly separable. So, by lemma G.4, L(D(φ)) must have width more than 1. For a network with width at least 2, when it has depth 2, it has at least 2n + 5 parameters; when it has depth 3, it has at least n + 8 parameters; when it has depth more than 3, it has at least n + 10 parameters. So when n 4, we have para(L(D(φ))) n + 8. Part 1.2: If φ is a reversible 6-SAT problem that has a solution, then para(L(D(φ))) n + 8. We define a distribution D at first. D is defined on D(φ), and each point has the same probability. It is easy to see that, D D(n, 1/30). Since when N m + 4n, we have PDtr DN (Dtr = D(φ)) > 0, so by the definition of ND and the fact L satisfies the conditions in the theorem, we have para(L(D(φ))) ND. Moreover, because D is defined on D(φ), we will construct a network with n + 8 parameters to memorize D(φ) to show that ND n + 8, which implies para(L(D(φ))) ND n + 8 because L satisfies the condition in the theorem. This network has three layers, the first layer has width 1; the second layer has width 2; the third output layer has width 1. Let s = (s1, s2, . . . , sn) { 1, 1}n be a solution of φ. Then the first layer is F1(x) = Relu(3s(x 1.11/3) + 3). Then we have the following results: (1): F1(x) = 4.1 or F1(x) = 1.9 for all (x, 1) D(φ); (2): 2 |F1(x)| 4 for all (x, 1) D(φ). (1) is very easy to validate. We just prove (2). For i [n] and i {2n + 1, . . . , 3n}, because s { 1, 1}n, so 3s(x 1.11/3) = 1 or 3s(x 1.11/3) = 1, which implies 2 |F1(xi)| 4. For i {4n + 1, . . . , 4n + m}, xi 1.11/3 has only six components that are not 0. Because s is the solution of φ, which indicates that at least one of the six non-zero components of xi 1.11/3 has the same positive or negative shape as the corresponding component of s. Consider that such six non-zero components of xi 1.11/3 are in { 1/12, 1/12}, so 3s(xi 1.11/3) 1/4 5/4 = 1. Moreover, because φ is a reversible problem, so φi(n, m) and φi(n, m) are both in the φ, which indicates that the positive and negative forms of the six non-zero components of xi 1.11/3 cannot be exactly the same as the positive and negative forms of the corresponding components in s, or there must be φi(n, m) = 0, which contradicts to s is the solution of φ. So, we have 3s(xi 1.11/3) 5/4 1/4 = 1. Then we have that, for i {4n + 1, . . . , 4n + m}, it holds 3s(xi 1.11/3) [ 1, 1], resulting in 2 |F1(xi)| 4. We proved (2). By (1) and (2), and using lemma G.5, there is a network F2 : R R with width 2, depth 2 and 7 parameters that can classify the {(F1(xi), yi)}4n+m i=1 , so F2 F1 is the network we want. By such a network, we have that ND n + 8, and then, we have para(L(D(φ))) ND n + 8. We proved the result. Part Two: If φ is a reversible 6-SAT problem and para(L(D(φ))) = n + 8, then φ has a solution. If L(D(φ)) has width 2 of the first layer, then para(L(D(φ))) 2n + 5 > n + 8, so when para(L(D(φ))) = n + 8, the first layer has width 1. Write L(D(φ)) = F2(F1(x)), and write F1 as F1(x) = Relu(3s(x 1.11/3) + b), and let s = (s1, s2, . . . , sn). 
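The decoding step that the rest of Part 2 justifies is simple to state concretely: read a candidate assignment off the signs of $s$ and check it clause by clause. The sketch below is ours; in particular, the signed-literal encoding of the clauses $\phi_i(n,m)$ is an assumption made only for illustration.

```python
import numpy as np

def decode_assignment(s):
    """Read a candidate Boolean assignment off the first-layer weight vector s:
    variable x_i is set to True iff s_i > 0 (this is Sgn(s) from Part 2)."""
    return np.sign(np.asarray(s)) > 0

def clause_satisfied(clause, assignment):
    """clause: list of six signed 1-based literals, e.g. [3, -7, 12, -1, 5, -9]
    (our own encoding of a 6-literal clause; not notation from the paper)."""
    return any((assignment[abs(l) - 1] if l > 0 else not assignment[abs(l) - 1])
               for l in clause)

def check_solution(s, clauses):
    a = decode_assignment(s)
    return all(clause_satisfied(c, a) for c in clauses)

# toy illustration with n = 8 variables and two clauses
clauses = [[1, -2, 3, -4, 5, -6], [-1, 2, -3, 4, 7, -8]]
s = np.array([0.9, -1.0, 1.1, -0.95, 1.0, -1.05, 0.9, -1.0])
print(check_solution(s, clauses))   # True: x1=True satisfies clause 1, x7=True satisfies clause 2
```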
We will prove that Sgn(s) = (Sgn(s1), Sgn(s2), Sgn(s3), . . . , Sgn(sn)) is a solution of φ. The proof is given in two parts. Part 2.1 we have 1.1|si| |sj| for any i, j [n]. Firstly, we have si = 0 for any i [n]. Because if si = 0, it holds F1(xi) = F1(xn+i), which implies that L(D(φ)) gives the same label to xi and xn+i, but xi and xn+i have the different labels in dataset D(φ), so it contradicts L(D(φ)) is the memorization of D(φ). Without losing generality, let |s1| |s2| |sn|. Then we just need to prove that 1.1|sn| |s1|. Because D(φ) is not linear separable, so by lemma G.4, L(D(φ)) has width more than 1. Because F1 has width 1, so F2 has width 2 and 7 parameters, resulting in that F2 is a network with width 2 and depth 2. And F2 can classify such six points: {(F1(xi), yi)}i {1,n+1,2n+1,3n+1,2n,4n}. If s1 > 0, taking the values of x1, xn+1, x2n+1, x3n+1 in F1, we have 1.1s1 + b = F1(xn+1) s1 +b = F1(x1) s1 +b = F1(x2n+1) 1.1s1 +b = F1(x3n+1), which implies F1(xn+1) F1(x1) F1(x2n+1) F1(x3n+1); if s1 < 0. Similarly as before, we have F1(xn+1) F1(x1) F1(x2n+1) F1(x3n+1). So, F1(x1) and F1(x2n+1) are always in the interval from F1(xn+1) to F1(x3n+1). Consider that xn+1 and x3n+1 have label 1, x1 and x2n+1 have label 1, so by Lemma G.5, if {(F1(xi), yi)}i {1,n+1,2n+1,3n+1,2n,4n} can be memorized by a depth 2 width 2 network, then F1(x2n) and F1(x4n) must be not in the interval from F1(x1) to F1(x2n+1), or we cannot find a interval satisfies the conditions of lemma G.5. Since max{F1(x2n), F1(x4n)} = 1.1|sn| + b, to ensure that F1(x2n) and F1(x4n) are not in the interval from F1(x1) to F1(x2n+1), we have max{F1(x2n), F1(x4n)} = 1.1|sn| + b max{F1(x1), F1(x2n+1)} = |s1| + b or max{F1(x2n), F1(x4n)} = 1.1|sn| + b min{F1(x1), F1(x2n+1)} = |s1| + b. The second case is impossible, so we have 1.1|sn| |s1|. This is what we want in this part. Part 2.2 We show that Sgn(s) is the solution of φ. Assume that Sgn(s) is not the solution of φ. There is a i {4n + 1, . . . , 4n + m}, such that the positive and negative forms of the six non-zero components of xi are exactly the same as the positive and negative forms of the corresponding components in s. Then sxi + b 6/4|sn| + b 6/4.4|s1| + b 1.1|s1| + b. So, by max{F1(x1+n), F1(x3n+1)} = 1.1|s1| + b and min{F1(x1+n), F1(x3n+1)} = 1.1|s1| + b, we know that F1(xi) is not in the interval from F1(x1+n) to F1(x3n+1). Then similar to part 2.1, consider the point {(F1(xi), yi)}i {1,n+1,2n+1,3n+1,i}, we have that F1(x1) and F1(x2n+1) are always in the interval from F1(xn+1) to F1(x3n+1), but F1(xi) is not in the interval from F1(x1+n) to F1(x3n+1). By lemma G.5 and the fact that the label of F1(xn+1) and F1(x3n+1) is different from that of other three samples, we cannot find an interval satisfying the condition in lemma G.5, so F2(x) cannot classify such five points: {(F1(xi), yi)}i=1,n+1,2n+1,3n+1,i. This is contradictory, as L(D(φ)) is the memorization of D(φ). So, the assumption is wrong, we prove the theorem. H Proof of Theorem 7.3 H.1 Proof of Proposition 7.7 Proof. It suffices to prove that we can find an Sc(D) {(x, y) (x, y) D} such that for any (x, y) D, we have x (z,w) Sc(D)B((z, w)). Let Sc = {(i1c/(6.2n), i2c/(6.2n), . . . , inc/(6.2n)) ij {0, 1, . . . , [6.2n/c] + 1}}, and define Sc(D) as: for any (i1c/(6.2n), i2c/(6.2n), . . . , inc/(6.2n)) Sc, randomly take a (x, y) D satisfying ||x (i1c/(6.2n), i2c/(6.2n), . . . , inc/(6.2n))|| c/(6.2n) (if we have such a x), and put (x, y) into Sc(D). 
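The construction of $S_c(\mathcal{D})$ can be mirrored by a simple cell-bucketing sketch on a finite sample (the code and names are ours; the proposition itself works with grid points and the support of the distribution, so this is only an illustration of the covering idea that is verified next).

```python
import numpy as np

def grid_representatives(X, y, c, n):
    """One-sample-per-cell selection mirroring the construction of S_c(D) above:
    [0,1]^n is cut into cells of side c/(6.2 n) and, for every cell containing a
    sample, one sample is kept.  Within a cell, distances are at most
    sqrt(n) * c/(6.2 n) <= c/6.2 < c/3.1 in the 2-norm."""
    h = c / (6.2 * n)
    cells = {}
    for xi, yi in zip(X, y):
        key = tuple(np.floor(xi / h).astype(int))
        cells.setdefault(key, (xi, yi))          # keep the first sample seen in each cell
    return list(cells.values())

# toy run in dimension n = 2
rng = np.random.default_rng(0)
n, c = 2, 0.3
X = rng.uniform(0, 1, size=(500, n))
y = np.where(X[:, 0] > 0.5, 1, -1)
reps = grid_representatives(X, y, c, n)
print(len(reps), "representatives; every sample is within c/3.1 of one of them")
```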
Then, for any (x, y) ∈ D, there is a point z ∈ Sc such that ||z − x|| ≤ c/(6.2n), and there is an (xz, yz) ∈ Sc(D) such that ||z − xz|| ≤ c/(6.2n); hence ||xz − x|| ≤ c/(3.1n), which implies ||x − xz|| ≤ c/3.1. Since the radius of B((z, w)) is more than c/3.1, for any (x, y) ∈ D we have x ∈ ∪_{(z,w)∈Sc(D)} B((z, w)). This proves the lemma.

H.2 Main Idea

For a given dataset Dtr ⊆ [0, 1]^n × {−1, 1}, we use the following two steps to construct a memorization network:
(c1): Find suitable convex sets {Ci} in [0, 1]^n such that each sample in Dtr lies in at least one of these convex sets, and such that if x, z ∈ Ci and (x, yx), (z, yz) ∈ Dtr, then yx = yz; define y(Ci) = yx.
(c2): Construct a network F satisfying Sgn(F(x)) = y(Ci) for any x ∈ Ci.
Such a network must be a memorization of Dtr: each sample in Dtr lies in at least one of the {Ci}, so if x ∈ Ci and (x, yx) ∈ Dtr, then Sgn(F(x)) = y(Ci) = yx, which is what we want.

H.3 Finding Convex Sets

For a given dataset Dtr = {(xi, yi)}_{i=1}^N ⊆ [0, 1]^n × {−1, 1}, the convex sets Ci, i ∈ [N], are constructed as follows:
(1): For any i, j ∈ [N], define Si,j(x) = (xi − xj)(x − (0.51xi + 0.49xj)); the zero set of Si,j is a hyperplane between xi and xj that is perpendicular to xi − xj.
(2): The convex sets are Ci = ∩_{j∈[N], yj≠yi} {x ∈ [0, 1]^n : Si,j(x) ≥ 0}.

We have the following lemma, which implies that the Ci satisfy condition (c1) above.

Lemma H.1. If the Ci are constructed as above, then (1): xi ∈ Ci; (2): if z ∈ Ci and (z, yz) ∈ Dtr, then yz = yi; (3): Ci is a convex set.

Proof. First, we show that xi ∈ Ci. For any j ∈ [N] with yj ≠ yi, substituting xi into Si,j(x) gives Si,j(xi) = 0.49||xi − xj||₂² > 0, so xi ∈ {x ∈ [0, 1]^n : Si,j(x) ≥ 0}. Thus xi ∈ ∩_{j∈[N], yj≠yi}{x ∈ [0, 1]^n : Si,j(x) ≥ 0} = Ci.

Next, we show that if yj ≠ yi, then xj ∉ Ci, which implies (2) of the lemma. For any such j, substituting xj into Si,j(x) gives Si,j(xj) = −0.51||xi − xj||₂² < 0, so xj ∉ {x ∈ [0, 1]^n : Si,j(x) ≥ 0}. Thus xj ∉ ∩_{k∈[N], yk≠yi}{x ∈ [0, 1]^n : Si,k(x) ≥ 0} = Ci.

Finally, we show that Ci is convex. Each set {x ∈ [0, 1]^n : Si,j(x) ≥ 0} is convex, and the intersection of convex sets is convex, so Ci is a convex set.
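To make the construction in Section H.3 concrete, here is a small Python sketch (our own illustration, not the paper's released code; the names S and in_cell are ours). It evaluates Si,j, tests membership in Ci, and numerically checks properties (1) and (2) of Lemma H.1 on random data.

```python
import numpy as np

def S(xi, xj, x):
    # S_{i,j}(x) = <x_i - x_j, x - (0.51*x_i + 0.49*x_j)>
    return np.dot(xi - xj, x - (0.51 * xi + 0.49 * xj))

def in_cell(i, X, y, x, tol=0.0):
    # Membership test for C_i: S_{i,j}(x) >= 0 for every j with y_j != y_i.
    return all(S(X[i], X[j], x) >= -tol for j in range(len(X)) if y[j] != y[i])

# Sanity check of Lemma H.1 on random data in [0,1]^n.
rng = np.random.default_rng(0)
n, N = 5, 40
X = rng.random((N, n))
y = rng.choice([-1, 1], size=N)

assert all(in_cell(i, X, y, X[i]) for i in range(N))           # (1): x_i lies in C_i
assert not any(in_cell(i, X, y, X[j])                          # (2): no opposite-label
               for i in range(N) for j in range(N) if y[i] != y[j])  # sample lies in C_i
```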
H.4 Constructing the Network

We show how to construct a network F such that Sgn(F(x)) = y(Ci) for any x ∈ Ci, where Ci is defined in Section H.3. For a given dataset Dtr = {(xi, yi)}_{i=1}^N, we construct a three-layer network Fmem as follows.
(1): Let r = 0.01 · min_{i,j∈[N], yi≠yj} ||xi − xj||₂². For i ∈ [N], with Si,j defined in Section H.3, let ui(x) = Σ_{j∈[N], yj≠yi} Relu(−Si,j(x)) − r. Each ui is a depth-2 network.
(2): The first two layers form F1 : R^n → R^N. Writing F1(x)[i] for the i-th output of F1(x), we set F1(x)[i] = Relu(−ui(x)). It is easy to see that F1 requires O(N²n) parameters.
(3): The third layer is F2 : R^N → R with F2(v) = Σ_{i=1}^N yi·vi, where vi is the i-th component of v.

We now prove that Sgn(Fmem(x)) = y(Ci) for any x ∈ Ci. We need the following lemma.

Lemma H.2. For any x ∈ Ci, we have ui(x) < 0, and uj(x) > 0 whenever yi ≠ yj.

Proof. Assume that x ∈ Ci. We prove the following two properties, which give the lemma.

P1: ui(x) < 0. By the definition of Ci, we have Si,j(x) ≥ 0 for all j ∈ [N] satisfying yi ≠ yj, so ui(x) = Σ_{j∈[N], yj≠yi} Relu(−Si,j(x)) − r = Σ_{j∈[N], yj≠yi} 0 − r = −r < 0.

P2: uj(x) > 0 whenever yi ≠ yj. For any j such that yi ≠ yj, we first show that Sj,i(x) ≤ −0.02||xi − xj||₂². Because x ∈ Ci, we have Si,j(x) ≥ 0, that is, (xi − xj)(x − (0.51xi + 0.49xj)) ≥ 0. Hence −(xi − xj)(x − (0.51xi + 0.49xj)) = −(xi − xj)(x − (0.49xi + 0.51xj)) + 0.02||xi − xj||₂² = Sj,i(x) + 0.02||xi − xj||₂² ≤ 0, and thus Sj,i(x) ≤ −0.02||xi − xj||₂². Then, using the definition of r, we have uj(x) ≥ Relu(−Sj,i(x)) − r ≥ 0.02||xi − xj||₂² − r > 0. This proves P2.

By the above lemma, we can prove the desired result.

Lemma H.3. We have Sgn(Fmem(x)) = yi for any x ∈ Ci.

Proof. Let x ∈ Ci. By Lemma H.2, F1(x)[i] > 0 and F1(x)[j] = 0 for every j with yj ≠ yi, so Fmem(x) = Σ_{j∈[N]} yj F1(x)[j] = yi Σ_{j∈[N], yj=yi} F1(x)[j]. Since F1(x)[i] > 0 and all terms F1(x)[j] are nonnegative, we have Sgn(Fmem(x)) = yi.
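The network of Section H.4 can be evaluated directly from its definition. The following Python sketch (again our own illustration; build_fmem and the helper names are ours, and no attempt is made to optimize the parameter count) computes Fmem = F2 ∘ F1 and checks the memorization property of Lemma H.3 on a random dataset.

```python
import numpy as np

def build_fmem(X, y):
    """Memorization network of Appendix H.4 (illustrative sketch).

    X: (N, n) training inputs in [0,1]^n; y: (N,) labels in {-1, +1}
    (both labels are assumed to occur). Returns F_mem: R^n -> R.
    """
    N = len(X)
    diff = [(i, j) for i in range(N) for j in range(N) if y[i] != y[j]]
    # r = 0.01 * min over opposite-label pairs of ||x_i - x_j||_2^2
    r = 0.01 * min(np.sum((X[i] - X[j]) ** 2) for i, j in diff)

    def S(i, j, x):  # S_{i,j}(x) = <x_i - x_j, x - (0.51 x_i + 0.49 x_j)>
        return np.dot(X[i] - X[j], x - (0.51 * X[i] + 0.49 * X[j]))

    def u(i, x):     # u_i(x) = sum_{j: y_j != y_i} ReLU(-S_{i,j}(x)) - r
        return sum(max(-S(i, j, x), 0.0) for j in range(N) if y[j] != y[i]) - r

    def F_mem(x):
        F1 = np.array([max(-u(i, x), 0.0) for i in range(N)])  # F1(x)[i] = ReLU(-u_i(x))
        return float(np.dot(y, F1))                            # F2(v) = sum_i y_i v_i
    return F_mem

# F_mem interpolates the training set: Sgn(F_mem(x_i)) = y_i for every sample.
rng = np.random.default_rng(0)
X = rng.random((30, 4)); y = rng.choice([-1, 1], size=30)
F_mem = build_fmem(X, y)
assert all(np.sign(F_mem(X[i])) == y[i] for i in range(len(X)))
```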
H.5 Effectiveness and Generalization Guarantee

In this section, we prove that the above algorithm is an effective memorization algorithm with a generalization guarantee. We first give a lemma.

Lemma H.4. For any a, b, c ∈ R^n such that ||b − a||₂ ≥ 3.1||a − c||₂, let V be the hyperplane {x : (b − c)(x − (0.51c + 0.49b)) = 0}. Then the distance from a to V is greater than ||b − a||₂/3.1.

Proof. Let ||a − b||₂ = Lab, ||a − c||₂ = Lac, ||c − b||₂ = Lbc, and let θ be the angle ∠abc. Then the distance from a to V is Lab cos θ − 0.51Lbc. By the law of cosines, cos θ = (L²bc + L²ab − L²ac)/(2LbcLab), so it suffices to prove that (L²bc + L²ab − L²ac)/(2Lbc) − 0.51Lbc > Lab/3.1, that is, 0.5L²ab − 0.5L²ac − LabLbc/3.1 − 0.01L²bc > 0. The left-hand side is decreasing in Lac and Lbc. By Lac ≤ Lab/3.1 and Lbc ≤ Lac + Lab ≤ (4.1/3.1)Lab, we have 0.5L²ab − 0.5L²ac − LabLbc/3.1 − 0.01L²bc ≥ (0.5 − 0.5/3.1² − 4.1/3.1² − 0.01(4.1/3.1)²)L²ab > 0. The lemma is proved.

We now show that the algorithm is effective and has a generalization guarantee.

Proof. Let Fmem be the memorization network of Dtr constructed by the above algorithm.

Effectiveness. We show that Fmem is a memorization of Dtr and can be constructed in polynomial time. Each ui has width at most N, and all of its parameters can be computed from Dtr in polynomial time, so the map F1 defined in (2) of Section H.4 can be computed in polynomial time; the map F2 defined in (3) of Section H.4 can clearly also be computed in polynomial time. Thus Fmem can be constructed in polynomial time.

Generalization guarantee. Let S = {(vi, yvi)}_{i=1}^{SD} be the nearby set defined in Definition 7.1. We show the result in two parts.

Part One. We show that for (xi, yi) ∈ Dtr, if xi ∈ B((vj, yvj)) for some j ∈ [SD], then Sgn(Fmem(x)) = yi for any x ∈ B((vj, yvj)).

First, B((vj, yvj)) ⊆ Ci. For any k ∈ [N] such that yk ≠ yi, we have ||vj − xk||₂ ≥ 3.1r ≥ 3.1||vj − xi||₂, where r is the radius of B((vj, yvj)); so by Lemma H.4, the distance from vj to the hyperplane Si,k(x) = 0 is greater than r, which means that all points of B((vj, yvj)) lie on the same side of this hyperplane. Since xi ∈ B((vj, yvj)) and Si,k(xi) > 0 as shown in Lemma H.1, we have Si,k(x) ≥ 0 for any x ∈ B((vj, yvj)). By Ci = ∩_{k∈[N], yk≠yi}{x ∈ [0, 1]^n : Si,k(x) ≥ 0}, we conclude that B((vj, yvj)) ⊆ Ci.

Hence if x ∈ B((vj, yvj)), then x ∈ Ci, so by Lemma H.3 we have Sgn(Fmem(x)) = yi for all x ∈ B((vj, yvj)).

Part Two. We show that if Dtr ∼ D^N and N ≥ SD/ϵ · ln(SD/δ), then P_{Dtr∼D^N}(AD(Fmem) ≥ 1 − ϵ) ≥ 1 − δ.

Let Qi = P_{(x,y)∼D}(x ∈ B((vi, yvi))). Without loss of generality, assume Q1 ≤ Q2 ≤ · · · ≤ QSD. For a dataset Dtr = {(xi, yi)}_{i=1}^N, let Z(Dtr) = {j ∈ [SD] : there exists i ∈ [N] with xi ∈ B((vj, yvj))}. The proof is given in three parts.

Part 2.1. First, we show that AD(Fmem) ≥ 1 − Σ_{i∉Z(Dtr)} Qi. If i ∈ Z(Dtr), then by the definition of Z(Dtr) there is a j ∈ [N] such that xj ∈ B((vi, yvi)), so by Part One, Sgn(Fmem(x)) = yj for any x ∈ B((vi, yvi)). Moreover, for any (x, y) ∈ D with x ∈ B((vi, yvi)), by Lemma H.1 and B((vi, yvi)) ⊆ Cj (shown in Part One), we know that y = yj. So Sgn(Fmem(x)) = yj = y for any (x, y) ∈ D with x ∈ B((vi, yvi)), which means that Fmem gives the correct label to every x ∈ B((vi, yvi)) when i ∈ Z(Dtr). Hence AD(Fmem) ≥ Σ_{i∈Z(Dtr)} Qi ≥ 1 − Σ_{i∉Z(Dtr)} Qi.

Part 2.2. We now show that P_{Dtr∼D^N}(Σ_{i∉Z(Dtr)} Qi ≤ ϵ) ≥ 1 − δ.

Let Cci = {Dtr : Dtr ∼ D^N, i ∉ Z(Dtr) and j ∈ Z(Dtr) for all j > i}. It is easy to see that Cci ∩ Ccj = ∅ when i ≠ j and Σ_{i=0}^{SD} P_{Dtr∼D^N}(Dtr ∈ Cci) = 1. It is also easy to see that P_{Dtr∼D^N}(Dtr ∈ Cci) ≤ (1 − Qi)^N when i ≥ 1.

First observe that if some i ∈ [SD] satisfies Qi < ϵ/i, then for any Dtr ∈ Ccj with j ≤ i, we have Σ_{k∉Z(Dtr)} Qk ≤ Σ_{k=1}^{j} Qk ≤ jQj ≤ iQi < ϵ. We consider two situations.

Situation 1: There is an i ∈ [SD] such that Qi < ϵ/i. Let N0 be the largest number in [SD] such that QN0 < ϵ/N0. Then

P_{Dtr∼D^N}(Σ_{i∉Z(Dtr)} Qi ≤ ϵ)
  = P(Σ_{i∉Z(Dtr)} Qi ≤ ϵ | Dtr ∈ ∪_{k=0}^{N0} Cck) · P(Dtr ∈ ∪_{k=0}^{N0} Cck)
    + P(Σ_{i∉Z(Dtr)} Qi ≤ ϵ | Dtr ∈ ∪_{k=N0+1}^{SD} Cck) · P(Dtr ∈ ∪_{k=N0+1}^{SD} Cck)
  = P(Dtr ∈ ∪_{k=0}^{N0} Cck) + P(Σ_{i∉Z(Dtr)} Qi ≤ ϵ | Dtr ∈ ∪_{k=N0+1}^{SD} Cck) · P(Dtr ∈ ∪_{k=N0+1}^{SD} Cck),   (5)

where the second equality holds because, by the observation above, Σ_{i∉Z(Dtr)} Qi < ϵ whenever Dtr ∈ ∪_{k=0}^{N0} Cck. Moreover,

P(Dtr ∈ ∪_{k=N0+1}^{SD} Cck) ≤ Σ_{i=N0+1}^{SD} P(Dtr ∈ Cci) ≤ Σ_{i=N0+1}^{SD} (1 − Qi)^N ≤ Σ_{i=N0+1}^{SD} e^{−NQi} ≤ Σ_{i=N0+1}^{SD} e^{−Nϵ/i} ≤ Σ_{i=1}^{SD} e^{−Nϵ/i} ≤ SD · e^{−Nϵ/SD} ≤ δ,

where the last step uses N ≥ SD/ϵ · ln(SD/δ). Substituting this into equation (5), we get

P(Σ_{i∉Z(Dtr)} Qi ≤ ϵ) ≥ 1 − δ + P(Σ_{i∉Z(Dtr)} Qi ≤ ϵ | Dtr ∈ ∪_{k=N0+1}^{SD} Cck) · δ ≥ 1 − δ,

which is what we want.

Situation 2: There is no i ∈ [SD] such that Qi < ϵ/i. Then

P(Dtr ∈ ∪_{k=1}^{SD} Cck) ≤ Σ_{i=1}^{SD} P(Dtr ∈ Cci) ≤ Σ_{i=1}^{SD} (1 − Qi)^N ≤ Σ_{i=1}^{SD} e^{−NQi} ≤ Σ_{i=1}^{SD} e^{−Nϵ/i} ≤ SD · e^{−Nϵ/SD} ≤ δ.

So with probability at least 1 − δ we have Dtr ∈ Cc0. When Dtr ∈ Cc0, we have Z(Dtr) = [SD], so Σ_{i∉Z(Dtr)} Qi = 0. Hence P(Σ_{i∉Z(Dtr)} Qi ≤ ϵ) ≥ 1 − δ.

Part 2.3. By Parts 2.1 and 2.2, P_{Dtr∼D^N}(AD(Fmem) ≥ 1 − ϵ) ≥ P_{Dtr∼D^N}(1 − Σ_{i∉Z(Dtr)} Qi ≥ 1 − ϵ) ≥ 1 − δ. The theorem is proved.

I Experiments

We verify Theorem 7.3 empirically on MNIST and CIFAR10 [33].

I.1 Experiment on MNIST

For MNIST, we test all binary classification problems with different label compositions. For each pair of labels, we take 500 samples of each of the two labels from the original dataset to form a new dataset Dtr, and then construct a memorization network for Dtr by Theorem 7.3. For each binary classification problem, Table 1 reports the accuracy on the test samples carrying the two corresponding labels.

Table 1: On MNIST, accuracy for all binary classification problems with different label compositions, using the memorization algorithm of Theorem 7.3. The entry in row i and column j is the accuracy for classifying classes i and j.

category 0    1    2    3    4    5    6    7    8    9
0        -    0.99 0.96 0.99 0.99 0.97 0.96 0.98 0.98 0.97
1        0.99 -    0.97 0.99 0.98 0.99 0.98 0.98 0.98 0.99
2        0.96 0.97 -    0.96 0.97 0.96 0.96 0.97 0.93 0.97
3        0.99 0.99 0.96 -    0.98 0.95 0.98 0.95 0.92 0.96
4        0.99 0.98 0.97 0.98 -    0.98 0.97 0.96 0.95 0.91
5        0.97 0.99 0.96 0.95 0.95 -    0.96 0.97 0.91 0.96
6        0.96 0.98 0.96 0.98 0.97 0.96 -    0.99 0.95 0.98
7        0.98 0.98 0.97 0.95 0.96 0.97 0.99 -    0.95 0.91
8        0.98 0.98 0.93 0.92 0.95 0.91 0.95 0.95 -    0.96
9        0.97 0.99 0.97 0.96 0.91 0.96 0.98 0.91 0.96 -

From Table 1, we can see that the memorization algorithm of Theorem 7.3 has good generalization ability on MNIST: almost all accuracies are higher than 90%.
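For reference, the following Python sketch reproduces the style of the experiment in Section I.1 for one pair of digits. It is not the released implementation: the use of torchvision for loading MNIST, the function names load_pair and fmem_predict, and the test subsampling are our own assumptions; only the forward pass of Fmem follows Appendix H.4 (written here with inner products so that no N × N × n array is formed).

```python
import numpy as np
from torchvision import datasets  # assumption: torchvision is used only to load MNIST

def load_pair(a, b, per_class, train):
    ds = datasets.MNIST("./data", train=train, download=True)
    X = ds.data.numpy().reshape(len(ds), -1) / 255.0          # flatten images to [0,1]^784
    y = ds.targets.numpy()
    Xa, Xb = X[y == a][:per_class], X[y == b][:per_class]
    return np.vstack([Xa, Xb]), np.hstack([-np.ones(len(Xa)), np.ones(len(Xb))])

def fmem_predict(Xtr, ytr, Xte):
    """Vectorized forward pass of the Appendix H.4 network (sketch, not the released code)."""
    G = Xtr @ Xtr.T                                            # Gram matrix <x_i, x_j>
    sq = np.diag(G)
    d2 = sq[:, None] + sq[None, :] - 2 * G                     # ||x_i - x_j||^2
    mask = ytr[:, None] != ytr[None, :]                        # opposite-label pairs
    r = 0.01 * d2[mask].min()
    preds = []
    for x in Xte:
        p = Xtr @ x
        # S_{i,j}(x) = <x_i - x_j, x - (0.51 x_i + 0.49 x_j)>, expanded via inner products
        S = p[:, None] - p[None, :] - 0.51 * sq[:, None] + 0.49 * sq[None, :] + 0.02 * G
        u = (np.maximum(-S, 0.0) * mask).sum(axis=1) - r       # u_i(x)
        F1 = np.maximum(-u, 0.0)                               # F1(x)[i] = ReLU(-u_i(x))
        preds.append(np.sign(ytr @ F1))                        # F2(v) = sum_i y_i v_i
    return np.array(preds)

# Example: the pair (1, 7) from Tables 1 and 4; 500 training samples per label as in I.1,
# and a 500-per-label test subset to keep the sketch fast.
Xtr, ytr = load_pair(1, 7, per_class=500, train=True)
Xte, yte = load_pair(1, 7, per_class=500, train=False)
print("test accuracy:", (fmem_predict(Xtr, ytr, Xte) == yte).mean())
```

Evaluating Fmem this way is quadratic in N for every test point, which is why the sketch restricts the test set; the accuracies in Table 1 are computed on all test samples of the two labels.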
I.2 Experiment on CIFAR10

For CIFAR10, we test all binary classification problems with different label combinations. For each pair of labels, we take 3000 samples of each of the two labels from the original dataset to form a new dataset Dtr, and then construct a memorization network for Dtr by Theorem 7.3. For each binary classification problem, Table 2 reports the accuracy on the test samples carrying the two corresponding labels. From Table 2, we can see that most of the accuracies are above 70%, but for certain pairs, such as cat and dog (categories 3 and 5), the results are poor.

Empirically, our memorization algorithm does not outperform standard training; training, as a method that has been refined for a long time, is undoubtedly effective. For each pair of labels, we take 3000 samples of each of the two labels from the original dataset to form a training set Dtr and train a Resnet18 [28] on Dtr (20 epochs, learning rate 0.1, cross-entropy loss, on an NVIDIA GeForce RTX 3090 GPU). The accuracy of the trained network is shown in Table 3.

Table 2: On CIFAR10, accuracy for all binary classification problems with different label compositions, using the memorization algorithm of Theorem 7.3. The entry in row i and column j is the accuracy for classifying classes i and j.

category 0    1    2    3    4    5    6    7    8    9
0        -    0.77 0.74 0.78 0.81 0.81 0.85 0.85 0.68 0.73
1        0.77 -    0.78 0.75 0.82 0.78 0.82 0.87 0.79 0.63
2        0.74 0.78 -    0.61 0.61 0.65 0.67 0.67 0.82 0.77
3        0.78 0.75 0.61 -    0.71 0.54 0.67 0.69 0.83 0.76
4        0.81 0.82 0.61 0.71 -    0.66 0.62 0.65 0.82 0.79
5        0.81 0.78 0.65 0.54 0.66 -    0.73 0.67 0.81 0.78
6        0.85 0.82 0.67 0.67 0.62 0.73 -    0.71 0.86 0.81
7        0.85 0.87 0.67 0.69 0.65 0.67 0.71 -    0.82 0.73
8        0.68 0.79 0.82 0.83 0.82 0.81 0.86 0.82 -    0.69
9        0.73 0.63 0.77 0.76 0.79 0.78 0.81 0.73 0.69 -

Table 3: On CIFAR10, accuracy for all binary classification problems with different label compositions, using a standard training algorithm. The entry in row i and column j is the accuracy for classifying classes i and j.

category 0    1    2    3    4    5    6    7    8    9
0        -    0.99 0.98 0.99 0.99 0.99 0.99 0.99 0.98 0.99
1        0.99 -    0.99 0.98 0.99 0.99 0.99 0.99 0.99 0.99
2        0.98 0.99 -    0.99 0.99 0.99 0.99 0.99 0.99 0.99
3        0.99 0.98 0.99 -    0.98 0.96 0.97 0.99 0.98 0.99
4        0.99 0.99 0.99 0.98 -    0.99 0.99 0.99 0.99 0.99
5        0.99 0.99 0.99 0.96 0.99 -    0.99 0.99 0.99 0.99
6        0.99 0.99 0.99 0.97 0.99 0.99 -    0.98 0.99 0.99
7        0.99 0.99 0.99 0.99 0.99 0.99 0.98 -    0.99 0.99
8        0.98 0.99 0.99 0.98 0.99 0.99 0.99 0.99 -    0.99
9        0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 -

Comparing Tables 2 and 3, it can be seen that the training results are significantly better.

I.3 Comparison with Other Memorization Algorithms

Three memorization network construction methods are considered in this section: (M1) our algorithm in Theorem 7.3; (M2) the method in [49]; (M3) the method in [55]. We run experiments on five pairs of MNIST digits, namely 1 and 7, 2 and 3, 4 and 9, 5 and 6, 8 and 9, to compare M1, M2 and M3; these pairs are chosen mainly for the visual similarity of the digits. For each pair, we label the smaller digit as −1 and the larger digit as 1. Other settings follow Section I.1, and the results are given in Table 4. Our method achieves the best accuracy in all cases. When constructing a memorization network, methods (M2) and (M3) compress the data into one dimension; this destroys the features of the images, so they cannot generalize well.

Table 4: On MNIST, accuracy of different memorization algorithms.
pair  (1,7)  (2,3)  (4,9)  (5,6)  (8,9)
M1    0.98   0.96   0.91   0.96   0.96
M2    0.51   0.50   0.45   0.59   0.41
M3    0.46   0.51   0.46   0.47   0.48

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our paper supports the claims made in the abstract and introduction.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We have discussed limitations in Section 8.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We have provided the full set of assumptions in every theorem and given complete proofs in Appendices A to H.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We have provided reproduction details in Appendix I.
5. Open Access to Data and Code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We have provided our code in the supplemental material.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We have provided experimental details in Appendix I.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: Our main contribution is in terms of theory.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We have provided them in Appendix I.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [NA]
Justification: Our main contribution is in terms of theory.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Not applicable.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our paper poses no such risks.

12. Licenses for Existing Assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We use open-source datasets and models in our paper and have cited their original papers, as presented in Appendix I.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: Our paper does not release new assets.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our paper does not involve crowdsourcing nor research with human subjects.