# nonasymptotic_length_generalization__81ab255d.pdf

Non-Asymptotic Length Generalization

Thomas Chen 1 Tengyu Ma 1 Zhiyuan Li 2

Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of ground-truth function under some given complexity measure. We refer to this minimum input length to length generalize as length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is 2n 2 where n is the number of states of the ground-truth automaton. Our main results are upper bounds of length complexity for a subset of a transformerrelated function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is O(T 2) when the ground-truth function has precision T, and that the length complexity of 2-layer C-RASP functions is O(T O(K)) when the ground-truth function has precision T and K heads.

1Department of Computer Science, Stanford University, Stanford, USA 2Toyota Technological Institute at Chicago, Chicago, USA. Correspondence to: Thomas Chen <tchen@cs.stanford.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

The generalization of a trained model from shorter inputs seen during training time to longer inputs seen at inference time is a phenomenon called length generalization. The question of when length generalization is possible is an important question in the area of language modeling and reasoning with Large Language Models (LLMs) (Brown et al., 2020). Real-world language understanding often requires handling longer contexts that exceed training-time input lengths, such as long Chain-of-Thought for reasoning problems, multi-turn conversations, lengthy documents, or complex code. Many factors limit the ability to train on long sequences directly, such as the increasing computational cost and memory requirement of training on longer sequences and the fact that long sequences are rare. Successful length generalization does not only allow efficient training for good long-context performance, but also is a natural test on whether the model has robustly learned the underlying language or reasoning patterns.

Prior works study length generalization in a controlled setting where Transformers (Vaswani et al., 2017) are trained on algorithmic tasks from scratch. In this context, length generalization has been empirically observed for certain algorithmic tasks but not others, and it appears that transformers cannot length generalize for most tasks. Tasks that have been tested are arithmetic tasks like integer addition (Nogueira et al., 2021; Nye et al., 2021; Anil et al., 2022; Zhou et al., 2024), formal-language tasks like PARITY (Anil et al., 2022), algorithmic tasks in the Chomsky Hierarchy (Shaw et al., 2021; Del etang et al., 2023; Ruoss et al., 2023), copying tasks, tasks involving a sequence of integers such as sorting or finding the mode (Zhou et al., 2023), and deducing the end-assignment of a variable in a block of code (Anil et al., 2022). Some of these tasks do not exhibit length generalization in their vanilla form, but do exhibit length generalization if certain modifications are made to the learning setup. These modifications include modifying the input and output format (Zhou et al., 2023; 2024), adding positional embeddings (Jelassi et al., 2023), and adding access to a scratchpad (Nye et al., 2021).

On the theoretical side, (Gold, 1967) provides a foundational result in the theory of length generalization, which is originally stated in a more broad context of language iden-

Non-Asymptotic Length Generalization

Function Class Complexity of Ground-Truth Function Length of Training Data Sufficient to Generalize DFAs number of states, c 2c 2 (Proposition 2.7) CFGs description length, c no computable bound in c exists (Proposition 3.5) C-RASP1 precision, T O(T 2) (Theorem 5.5) C-RASP2 precision, T, and number of heads, K O(T O(K)) (Theorem 5.6)

Table 1. Summary of results: upper bounds on minimum length of binary strings in training data which suffices for length generalization.

tification. Let Σ be a finite alphabet, let a hypothesis be a mapping from all strings Σ to {0, 1}, and let F be the hypothesis class. Gold (1967) roughly says that there is a learning algorithm A such that for all hypotheses in F, A will eventually learn the ground-truth hypothesis, when the training set inputted to A contains (string, label) pairs for all the strings up to a sufficiently large length. Gold s result holds for all hypothesis classes F satisfying the mild assumption that F can be enumerated by a Turing machine, including Regular languages, Context-Free languages, decision problems that can be solved by finite-precision transformers with or without Chain-of-Thought, etc. The learning algorithm A simply returns the first hypothesis in the enumeration of F that correctly labels all training examples. When this enumeration lists the functions in F in order of increasing complexity for some complexity measure, then we call this learning algorithm the Minimum-Complexity Interpolator, denoted by Amci.

The gap between the wide range of function classes which Gold s theory predicts length generalization for and the limited classes of functions that transformers empirically length generalize on presents an opportunity to develop a more predictive theory of length generalization. There are many potential reasons for this gap, such as the discrepancy between the learning algorithm Amci and the gradient-based learning algorithms used in practice. However, a more fundamental issue of Gold s result is that the guarantee is inherently asymptotic it does not provide any information on the minimum input length required to achieve length generalization. It is completely possible that even if gradient-based training methods like SGD are able to find the minimumcomplexity interpolating hypothesis, length generalization still only happens when the training set contains impractically long length inputs. In this case, it may appear empirically that length generalization does not occur at all, despite the asymptotic theoretical guarantee.

Intuitively, the minimum input length required to achieve length generalization should be a function of the complexity of the ground truth hypothesis, and better length generalization is expected for simpler hypotheses. To both get a more useful length generalization guarantee and better understand the limit of length generalization, the goal of this paper is to answer the following question:

What is the minimum input length required to

achieve length generalization as a function of complexity of ground-truth hypothesis, assuming we have infinite computational resources?

Our Contributions: This paper provides a more finegrained analysis of length generalization, by providing a non-asymptotic guarantee on the minimum input length required to achieve length generalization, as a function of complexity of ground-truth. More specifically, we make the following contributions:

In Section 2, we introduce the framework of nonasymptotic length generalization (Definition 2.6), which requires a non-asymptotic upper bound for the minimum input length that guarantees length generalization, given the complexity of ground-truth function under some given complexity measure C. The latter we call the length complexity under C (Definition 3.1). We show that the Minimum-Complexity Interpolator learning algorithm Amci, instantiated with complexity measure C, is the optimal with respect to the length complexity under C. As a concrete example, we show that Amci only needs inputs up to length 2c 2 to learn any ground-truth Deterministic Finite Automata (DFA) of c states. In Section 3, we show whether a hypothesis class admits non-asymptotic length generalization is equivalent to whether the problem of deciding equivalence of two finite descriptions of hypotheses is decidable. As a consequence, Context-Free Grammars (CFGs) do not admit non-asymptotic length-generalization, though they admit (asymptotic) length-generalization in the limit. In Section 5, we prove non-asymptotic length generalization for a subset of the transformer-related hypothesis class, C-RASP (Yang & Chiang, 2024). C-RASP is a superset of functions expressible by transformers. Variants of the RASP function class, RASP-L and C-RASP, have been shown to have a good predictive power on the length generalization performance of transformers (Zhou et al., 2023; Huang et al., 2024). We study two subclasses of C-RASP: C-RASP1 (Theorem 5.5) and C-RASP2 (Theorem 5.6), which are of depth 1 and 2, respectively.

2. Non-Asymptotic Length Generalization

Notation. Fix the alphabet Σ = {0, 1} throughout the paper. Define computable (recursive) functions as functions which can be computed by a Turing Machine (TM),

Non-Asymptotic Length Generalization

which halts on all inputs. A function f : N N is computably bounded if there exists a computable function g : N N such that f(n) g(n) for all n N. Let M {0, 1} denote the binary string encoding of M. Let [N] := {1, 2, 3, . . . , N 1, N}. Denote {0, 1}n as the set of binary strings of length n, {0, 1} n := Sn j=0{0, 1}j, and {0, 1} := S

j 0{0, 1}j. 1[ ] denotes the indicator function and cl(A) as the closure of set A Rd.

Representation of Functions. We are interested in learning subsets of computable functions mapping {0, 1} to {0, 1}, denoted by F. Because a learning algorithm can only return a finite description of the hypothesis it selects, we need to define the representation of the hypothesis and the corresponding encoding system below in Definition 2.1.

Definition 2.1 (Encoding System). An encoding system is a TM R which on input of a finite string p (which can be thought as the code or description of a function), outputs the TM description of the computable function f : {0, 1} {0, 1} represented by p. Here we denote the description of the TM as R(p) and the function f as R(p). We use FR to denote the function class implicitly defined by the encoding system R, i.e. FR = {R(p) : p {0, 1} }.

Often the standard encoding system is only defined on valid inputs. For convenience we define the encoding system R for all inputs in {0, 1} and map invalid inputs to the empty language. As examples of encoding systems, DFAs and CFGs are two encoding systems for regular languages and context-free languages, respectively.

Definition 2.2. An n-state Deterministic Finite Automaton (DFA) is a tuple M = (Q = [n], Σ = {0, 1}, δ, q0, F) where Q = [n] is the set of states, Σ = {0, 1} is the input alphabet, δ : Q Σ Q is the transition function, q0 Q is the start state, and F Q is the set of accepting states. The encoding system RDFA is a TM that reads M and outputs the description of a TM that simulates M. For any input string x {0, 1} , the language characterized by M is where RDFA( M )(x) = 1 if and only if δ (q0, x) F, where δ : Q {0, 1} Q is the natural extension of δ to strings defined recursively as δ (q, ϵ) = q and δ (q, sa) = δ(δ (q, s), a) for any q Q, s {0, 1} , and a {0, 1}.

Definition 2.3. A Context-Free Grammar (CFG) is a tuple G = (N, T = {0, 1}, P, S) where N is a finite set of nonterminal symbols, T = {0, 1} is the terminal alphabet, P is a finite set of production rules of the form A α where A N and α (N T) , and S N is the start symbol. Let G {0, 1} denote the binary string encoding of G. The encoding system RCFG is a TM that reads G and outputs the description of a TM that simulates G. For any input string x {0, 1} , the language characterized by G is RCFG( G )(x) = 1 if and only if x can be derived from S by applying a finite sequence of production rules from P.

Finally, a Linear CFG is a CFG G = (N, T = {0, 1}, P, S) where each production rule in P has at most one nonterminal symbol on its right-hand side. The encoding system for Linear CFGs is RL-CFG, where RL-CFG(p) = RCFG(p) for p representing Linear CFGs, and otherwise RL-CFG(p) returns the TM with the empty language.

Learning Setup. With the ground truth function being denoted f or R(p), we define the labeled training dataset consisting all data up to length N as below, for all N N: DN(f ) := {(x, f (x)) : x {0, 1} , |x| N}.

With R as the encoding system, an adversary picks any ground-truth function f FR. A learning algorithm is a TM A which takes as input a training set DN(f ) and outputs some ˆp {0, 1} . We say a learning algorithm A length-generalizably learns a function f at input length N w.r.t. encoding system R iff R(A(DN(f ))) = f .

To give context to our notion of non-asymptotic length generalization, we describe Gold (1967) s asymptotic notion of learnability in Appendix B.

Complexity Measure. We are interested in understanding the minimum length of training data which suffices for length generalization. To this end, a complexity measure for the functions in F is necessary. We assume a complexity measure C : {0, 1} N, which assigns a complexity to each representation p, and define the complexity of the function f as the minimum complexity of any representation of f in R, namely CR(f) := minp {0,1} ,R(p)=f C(p). We may drop the superscript R and just use C for function complexity when R is clear from the context. We will make the following mild assumption that C is reasonably simple throughout the paper, unless otherwise stated.

Assumption 2.4. C is computable and there exists a TM E which enumerates programs p {0, 1} in non-decreasing order of C(p). In particular, the range of E is {0, 1} and i i , C(E(i)) C(E(i )). In addition, C is such that for each c N, |{p {0, 1} : C(p) c}| < .

This assumption is easily satisfied by a standard choice of C, namely where C(p) returns the length of p {0, 1} . There is little loss in generality in just thinking of C as this standard complexity measure. Having a general C provides some extra flexibility and makes the results easier to understand.

Definition 2.5 (Complexity Measures for DFAs and CFGs). The complexity measure CDFA for DFAs maps a DFA M = (Q = [n], Σ, δ, q0, F) (represented by M ) to n, the number of states in Q. The number of states of any DFA is within a logarithmic factor of the length of its representation in bits (n log n). For CFGs, given a CFG G = (N, T, P, S), let |P| be the total length of the production rules in P. The complexity measure of CFG G is CCFG( G ) = |P| + |N| + |T|, which is within a constant

Non-Asymptotic Length Generalization

factor of the length of G in bits.

Now we define non-asymptotic length generalization.

Definition 2.6 (Non-Asymptotic Length Generalization). A function class F FR admits non-asymptotic length generalization w.r.t. encoding system R and complexity measure C if there exists a learning algorithm A and a computable function ˆN R,F A : N N such that for all f F and for all N ˆN R,F A (CR(f )), A length-generalizably learns f at input length N .

Length Complexity. For notation simplicity, we define the length complexity of a function f for a learning algorithm A w.r.t. encoding system R, N R A (f ), as the minimum length of training data which suffices for the learning algorithm A to length generalize on f : N R A (f ) := min{N 0 : n N, R(A(Dn(f ))) = f }

This quantity is if there is no such N where n N, R(A(Dn(f ))) = f . We also define the length complexity of functions in F FR up to complexity c for a learning algorithm A w.r.t. encoding system R, N R,F A (c), as the maximum length complexity of any function in F FR with complexity at most c, given as N R,F A (c) := maxf F s.t. CR(f ) c N R A (f ) = maxp {0,1} s.t. C(p) c R(p) F N R A (R(p)). We denote

A (c) by N R A (c) for convenience.

As a concrete example, Proposition 2.7 shows that DFAs admits non-asymptotic length generalization w.r.t. the standard encoding system RDFA and complexity measure CDFA in Definition 2.5. Its proof is in Appendix D.

Proposition 2.7 (Non-Asymptotic Length Generalization for DFAs). Let RDFA be the DFA encoding system defined in Definition 2.2, and let CDFA be the number of states in DFA. Regular languages FR admits non-asymptotic length generalization w.r.t. encoding system RDFA and complexity measure CDFA. More specifically, there exists a learning algorithm A such that N RDFA A (c) 2c 2 for all c N.

3. Characterization of Definition 2.6

In Section 3, we characterize the conditions under which a function class admits non-asymptotic length generalization w.r.t. an encoding system R and complexity measure C satisfying Assumption 2.4. First, we introduce Definition 3.1, an algorithm-independent version of length complexity, which coincides with the optimal algorithm-dependent length complexity, over all learning algorithms A.

Definition 3.1 (Length Complexity of Function Class). Given a function class F, we define the length complexity of F as the minimum input length that can distinguish any two functions in F: N(F) := min{n N : f = f F, x {0, 1} n s.t. f(x) = f (x)}.

Algorithm 1 Minimum-Complexity Interpolator (AR,C mci ) Hyperparameters: Complexity Measure C, encoding system R Input: Finite Training Dataset S {(x, y) | x {0, 1} , y {0, 1}}. Output: arg minp {0,1} : (x,y) DN(f ),y=R(p)(x) C(p)

3.1. Optimality of Minimum Complexity Interpolator

The Minimum-Complexity Interpolator (Algorithm 1) is the main learning algorithm which we study in this work. Although it would be ideal to study a learning algorithm closer to what is used empirically to train transformers, we study the Minimum-Complexity Interpolator to abstract away the complex training dynamics of gradient-based methods like SGD, which are very non-trivial even for training 2 layer neural networks (Mahankali et al., 2023). We also acknowledge the limitation that the Minimum-Complexity Interpolator is computationally intractable for general function classes and encoding systems. For instance, it was shown that the problem of finding the minimum-state DFA for a regular language is NP-hard (Pitt & Warmuth, 1993).

We denote the Minimum-Complexity Interpolator learning algorithm by AR,C mci for short when the encoding system is R and complexity measure is C. When the complexity measure C is the length of the program, AR,C mci is just the famous Minimum Description Length (MDL) algorithm (Rissanen, 1978). AR,C mci has the nice property that it is the best possible algorithm over all learning algorithms in minimizing N R,F A (c), in the following sense.

Theorem 3.2 (Optimality of Minimum-Complexity Interpolator). Given any encoding system R and complexity measure C, for all c N, it holds that N R AR,C mci (c) =

min A N R A (c) = N(FR c ). As a consequence, the following three statements are equivalent:

Function class FR admits non-asymptotic length generalization; Function class FR admits non-asymptotic length generalization, via learning algorithm AR,C mci ; For all c N, length complexity of FR c , N(FR c ), is computably bounded in c.

The proof of Theorem 3.2 is in Appendix C. Though the above optimality result of AR,C mci holds for all complexity measures C, AR,C mci may not be computable without restrictions on the complexity measure. For example, it is wellknown that Kolmogorov complexity is not computable (Li & Vit anyi, 2008). Lemma 3.3 shows that AR,C mci is indeed computable under Assumption 2.4.

Lemma 3.3. Under Assumption 2.4, Algorithm 1 is computable and thus a valid learning algorithm.

Non-Asymptotic Length Generalization

The proof of Lemma 3.3 is in Appendix C. In the remaining sections, we may omit R and C in the superscripts of AR,C mci when they are clear from context for convenience.

3.2. Equivalence to Decidability of Language Equivalence Problem

The main goal of this paper is to seek concrete upper bounds for the length complexity N R A (c) of various function classes. Surprisingly, such upper bounds are not always computable. In this section, we show that there exist computable upper bounds on the length complexity if and only if the Language Equivalence Problem for encoding system R is decidable, where the Language Equivalence Problem for encoding system R is the computational problem where given any p, q {0, 1} , determine whether R(p) = R(q).

Lemma 3.4. For any encoding system R and complexity measure C satisfying Assumption 2.4, the Language Equivalence problem for R is decidable if and only if length complexity of FR c , N(FR c ), is computably bounded in c. Thus it is also equivalent to the property that FR admits non-asymptotic length generalization.

As a consequence, we have the following impossibility result for non-asymptotic length generalization for a special case of CFGs: Linear CFGs. Comparing Proposition B.2 and Proposition 3.5, we see that CFG serves as a concrete example for the separation between length generalization in the limit and non-asymptotic length generalization.

Proposition 3.5 ((Linear) CFGs only admit length generalization in the limit). Recall RL-CFG is the encoding system for Linear CFGs defined in Definition 2.3 and CCFG( G ) is the complexity measure that maps CFG G = (N, T, P, S = {0, 1}) to |N| + |T| + |P|. Then for any learning algorithm A, the length complexity, N RL-CFG A , is not computably bounded. That is, Linear CFGs do not admit non-asymptotic length generalization (w.r.t. standard CFG encoding system RCFG), and neither does the set of all CFGs.

The proofs of Lemma 3.4 and Proposition 3.5 are in Appendix C and Appendix D, respectively. In Appendix C.3, we show equivalence between non-asymptotic length generalization to a variant of finite identification proposed in Gold (1967). The results are summarized in Figure 1.

4. Related Work

Solomonoff (1964) proposed Bayesian-inference-based algorithms which when given a sequence of symbols, predict the next symbol according to the posterior distribution computed from the Solomonoff-Levin prior distribution. Gold (1967) introduced Identification-in-the-Limit as an asymptotic notion of learnability and proved that many classes of functions can be learned in this sense.

Zhou et al. (2023) propose the RASP-L Conjecture, which says that whether a transformer length generalizes on a particular ground-truth function f is well predicted by whether f has a short RASP-L description. Huang et al. (2024) formulate the problem of length generalization and the RASPL Conjecture formally as a version of Identification-in-the Limit and prove asymptotic results for identifying languages expressible by Limit Transformers. None of the works above provide non-asymptotic guarantees. We distinguish this work from previous work by providing non-asymptotic bounds on the length of the training data required in order to guarantee that the learner outputs a single hypothesis which exhibits perfect length generalization.

(Weiss et al., 2021) (Zhou et al., 2023), (Yang & Chiang, 2024), and (Shaw et al., 2024) study programming languages which capture the set of functions which transformers can express (like RASP).

Regarding theoretical works for length generalization not related to transformers, Marsden et al. (2024) prove length generalization for learning linear dynamical systems with SGD. Abbe et al. (2024) prove out-of-domain generalization for boolean functions of a fixed input size.

5. Main Results: Non-Asymptotic Length Generalization of C-RASP

5.1. Recap: Definition of C-RASP

Our main results pertain to a class of functions called CRASP. C-RASP is a variant of RASP that, with alphabet Σ = {0, 1}, defines a class of functions from {0, 1} to {0, 1}, where only certain sequence-to-sequence operations are permitted. C-RASP was shown to be a subset to the class of functions expressible by transformers with infiniteprecision activations and a superset to that expressible by finite-precision transformers (Yang & Chiang, 2024).

Definition 5.1 (C-RASP, (Yang & Chiang, 2024)). A sequence is an element of N . A C-RASP program over alphabet Σ = {0, 1} is defined as a series of n < C-RASP operations. We denote the h(1), . . . , h(n) as the output sequences of each C-RASP operation. Denote x {0, 1} as the input sequence to the C-RASP program. Denote h(i) j as the jth element of the sequence h(i), where i [n] and j [|x|]. Table 2 shows the list of allowed operations in C-RASP. The partial-sum operator ps : {0, 1} N is defined as: h {0, 1} , j [|x|], ps(h)j = P

l [j] hl. The entire C-RASP program is a mapping {0, 1} {0, 1} . We take the last bit of the output sequence as the output bit, yielding a function {0, 1} to {0, 1}.

There are several natural parameters that contribute to the complexity of C-RASP functions. First, we study the impact of the following notion of precision of the parameters of

Non-Asymptotic Length Generalization

C-RASP functions on the difficulty of length generalization. Definition 5.2 (p-precision). An integer of absolute value at most p is of p-precision. A rational number between [0, 1] is of p-precision if in simplest form, where the numerator and denominator are relatively prime, its denominator is at most p in magnitude. A tuple of rational numbers in [0, 1] is precision p if the least common denominator of its entries is at most p in magnitude.

This notion of precision makes sense for C-RASP since C-RASP only allows integer combinations of previously computed variables instead of arbitrary linear combinations.

We also study the impact of depth of C-RASP programs on the difficulty of length generalization where depth is, loosely, the maximum number of sequentially-applied ps operations along any part of the C-RASP program. We study depth-1 and depth-2 programs, which we will also refer to as 1-layer and 2-layer programs. We will first prove length generalization results pertaining to the following subset of 1-layer C-RASP, which we call C-RASP1. Note that for convenience, we use functions f FR and descriptions p {0, 1} interchangeably in the following definitions. Definition 5.3 (C-RASP1). With integer T, let C-RASP1,T

denote the set of programs of the following form. Each program f has parameters a, b, d [ T, T], a > 0. For any n > 0, on input x {0, 1}n, f computes: f(x) = 1[a ps(x)n b n d > 0]. Then, C-RASP1 = S

T 1 C-RASP1,T . Given a function f C-RASP1, the complexity measure C(f) = max(|a|, |b|, |d|), the precision of f s parameters in the sense of Definition 5.2.

Here, the encoding system R maps a string p to a C-RASP1

function. The string p is interpreted by R to encode 3 parameters, each taking Θ(log T) bits to encode. Invalid encodings are mapped to a default C-RASP1 function. Thus, the complexity measure proposed above, which takes in p and returns the maximum precision of its parameters, returns an integer which is roughly exponential in the length of p.

Our main length generalization result will apply to a subset of 2-layer C-RASP, which we call C-RASP2. For 2 layer programs, the width K of the first layer becomes a natural parameter to study. Definition 5.4 (C-RASP2). With integers T and 1 K T 2, let C-RASP2,K,T be the set of programs of the following form. Each program f has parameters 0 < z T, i [K], a(i), b(i), λi { T, . . . , T}, with a(i) > 0. We require that for all i [K], b(i)

a(i) (0, 1) and is distinct from

a(i ) for i = i. We also require P

i [K] λi > z.

For any n > 0, on input x {0, 1}n, the first layer computes the values of K heads, {h(i)}i [K], on the n prefixes of x: {{x1}, {x1, x2}, . . . , {x1, . . . , xn}} as follows: j [n], i [K], h(i) j = 1[ps(x)j > b(i)

a(i) j]. Subscript j

indicates the value of a quantity on the jth prefix of x.

The second layer computes the output, which is the nth bit of the final sequence: f(x) = 1[P

i [K] λips(h(i))n > z n].

Then, C-RASP2 = S

1 T,1 K T 2 C-RASP2,K,T . Given a function f C-RASP2, let K(f) be the number of heads, h(i), in the first layer of f and let T(f) := max(maxi [K(f)] |a(i)|, maxi [K(f)] |b(i)|, maxi [K(f)] |λi|, |z|). The complexity measure is C(f) = T(f)K(f), the precision of the function s parameters to the power of the number of heads.

We refer to functions in C-RASP2,K,T as 2 layer, K-head, T-precision programs, where each intermediate variable h(i), i [K] is called a head.

Here, the encoding system RC-RASP2 maps a string p to a C-RASP2 function. The string p is interpreted by RC-RASP2 to encode Θ(K) parameters, each taking Θ(log T) bits to encode, so that the entire encoding of a C-RASP2,K,T function is roughly Θ(K log T) bits long. Invalid encodings are mapped to a default C-RASP2 function. Thus, the complexity measure C(f) = T(f)K(f) = exp(K(f) log T(f)) returns an integer which is roughly exponential in |p|.

5.2. Non-Asymptotic Length Generalization of C-RASP1 and C-RASP2

The following result states that in order to identify the ground-truth function from the set C-RASP1 where the precision of parameters is at most T, it suffices for the Minimum-Complexity Interpolator to receive (string, label) pairs for all strings of length at most O(T 2). Theorem 5.5 (C-RASP1 Length Generalization). Let F = C-RASP1 and C(f) = max(|a|, |b|, |d|), defined in Definition 5.3. Then T N, we have NAmci(T) O(T 2). That is, the Minimum-Complexity Interpolator, with complexity C and function class F, can length generalize given inputs of length O(T 2) when the ground-truth has complexity T.

The following is our main length generalization result. It states that in order to identify the ground-truth function from the set C-RASP2 where the precision of parameters is at most T and number of heads is at most K, it suffices for the Minimum-Complexity Interpolator to receive (string, label) pairs for all strings of length at most O(T O(K)). Theorem 5.6 (C-RASP2 Length Generalization, Main Result). Let F = C-RASP2 and C(f) = T(f)K(f), defined in Definition 5.4. If the ground-truth function f has T(f ) T and K(f ) K, then the Minimum-Complexity Interpolator, with complexity C and function class F, can length generalize given inputs of length O(T O(K)).

By Theorem 5.6, C-RASP2,K,T can be learned in a length generalizable way with inputs of length O(T O(K)). By

Non-Asymptotic Length Generalization

Theorem 5.5, C-RASP1,T can be learned in a length generalizable way with inputs of length O(T 2). Our upper bound on the minimum length required to learn C-RASP2,K,T is much larger than that of C-RASP1,T . We don t have a lower bound on the length of inputs required to identify C-RASP2,K,T , but our best guess is that one of the form Ω(T K) exists in the regime K Θ(log T).

The proof of Theorem 5.5 can be found in Appendix E.1. Below we sketch the proof of Theorem 5.6.

6. Proof Sketch of Theorem 5.6.

Theorem 5.6 follows as a Corollary to the following, stronger Theorem 6.1, which says that the Minimum Complexity Interpolator with complexity measure C(f) = T(f)K(f) will length generalize if it receives inputs of length NAmci(α) O(αO(1)), when the ground-truth function has complexity at most α. Theorem 6.1. Let F = C-RASP2 and C(f) = T(f)K(f), as in Definition 5.4. Then α N, NAmci(α) O(αO(1)).

Theorem 6.1 is proved in Appendix E.2. The proof that Theorem 5.6 follows as a Corollary to Theorem 6.1 is simple, and deferred to Appendix A for brevity. Below, we sketch the proof Lemma 6.2, a weaker version of Theorem 6.1. Although weaker, our proof sketch of Lemma 6.2 will still illustrate the key ideas of the proof of Theorem 6.1. Lemma 6.2. Let F = C-RASP2 and C(f) = T(f)K(f), as in Definition 5.4. Then α N, NAmci(α) O(αO(log2 α)).

To prove Lemma 6.2, it suffices to prove Lemma 6.3. We will upper bound, for any T and K, the minimum length of inputs required to distinguish any two unequal C-RASP2

functions that have at most K heads and T precision. Lemma 6.3 (Length Bound on C-RASP2,K,T ). For any 1 T, 1 K T 2, for all f, f C-RASP2,K,T such that f = f , there exists a string x {0, 1} of length at most O(T O(K2)) such that f(x ) = f (x ).

The proof that Lemma 6.2 is a Corollary to Lemma 6.3 is simple and is deferred to Appendix A. The rest of the proof sketch describes how to prove Lemma 6.3. To do this, given two arbitrary f, f C-RASP2,K,T that are not equal, we show the existence of a short string x which distinguishes f and f , in the sense that f(x ) = f (x ).

6.1. Key Definitions.

Suppose f, f C-RASP2,K,T , with parameters (a(i))i [K], (b(i))i [K], (λi)i [K], z and ((a(i)) )i [K], ((b(i)) )i [K], (λ i)i [K], z , respectively.

Suppose the set of unique numbers R := { b(i)

a(i) }i [K]

(a(i)) }i [K] (0, 1) between the first layer of f and f

has size k, where K k 2K. We will refer to these numbers as slopes. We will denote R = {sj}j [k] (0, 1), where s1 is the largest slope and sk is the smallest slope, and the slopes are sorted in descending order so that s1 > . . . > sk. For i [K], let ord(1, i) : [K] [k] be the index within R of the ith slope of f, b(i)

a(i) . Let ord(2, i) : [K] [k] be the index within R of the ith slope of f , (b(i))

(a(i)) . In the following exposition, we will refer to line j as the homogeneous, 2D line y = sjx, with slope sj, j [k]. We will denote line j by the symbol lj. We will call the set of k 2K unique slopes {si}i [k] a configuration.

Definition 6.4. A (k, T)-configuration is a set of k distinct T-precision rational numbers {si}i [k] (0, 1).

We will refer to strings in {0, 1} synonymously as discrete test-functions, because there is a one-to-one correspondence between x and the sequence of 2D points {(j, ps(x)j)}j [|x|] R2. The latter set of points acts as a test of whether two programs f, f C-RASP2,K,T

are different or not.

Definition 6.5 (Discrete Test-Function). Given a (k, T)- configuration {si}i [k], a discrete test-function X, with respect to {si}i [k] and of length n < , is a function {0, 1, . . . , n} {0, 1, . . . , n} where X(0) = 0 and j [n], X(j) = X(j 1) or X(j) = X(j 1) + 1. The induced activations (B1(X), . . . , Bk(X)) of X with respect to the (k, T)-configuration are defined as: i [k], Bi(X) := 1

n Pn j=1 1[X(j) > si j]

In our proof sketch, we will need to analyze properties of a continuous analog to discrete test-functions, which can be thought of as corresponding to infinite-length strings.

Definition 6.6 (Continuous Test-Function). Given a (k, T)- configuration {si}i [k], a continuous test-function Y, with respect to {si}i [k], is a 1-Lipschitz, monotone nondecreasing continuous function [0, 1] [0, 1], with Y(0) = 0. Continuous test-functions can only intersect the k lines {li}i [k] of slopes given by {si}i [k] at finitely many points. The induced activations (B1(Y), . . . , Bk(Y)) of Y w.r.t. {si}i [k] are: i [k], Bi(Y) := R 1 0 1[Y(j) > si j]dj.

We study continuous test-functions as a proxy for discrete test-functions. As each C-RASP2 function can be thought of as a linear threshold function over the induced activations of the input string (test-function) with respect to the (k, T)-configuration given by the parameters of its first layer, it will be important to study the activations which can be induced by continuous test-functions: A({si}i [k]) := {(B1(Y), . . . , Bk(Y)) : Y continuous test-function w.r.t {si}i [k]}.

The importance of the activations induced by continuous test-functions motivates the segmentation of continuous testfunctions at the points where the continuous test-function

Non-Asymptotic Length Generalization

intersects the k lines, {y = six : i [k]}. Definition 6.7 (Segment). Given any (k, T)-configuration {si}i [k] (0, 1), a segment is a restricted test-function S : [a, b] [0, 1] where [a, b] [0, 1] which maps a continuous subset [a, b] to [0, 1]. S is 1-Lipschitz and monotone non-decreasing. The segment s start-point (a, S(a)) and the end-point (b, S(b)) each lie on one of the k lines, in the sense that there exists some i, j [k] where S(a) = si a and S(b) = sj b, where i = j or |i j| = 1. No other points (x, S(x)), x (a, b) can lie on a line l1, . . . , lk.

Figure 2 depicts a continuous test-function. Figure 7 shows four generic types of segments. Two different test-functions can share the same schema: roughly, the order which the test-functions cross the k lines. Definition 6.8 (Schema). Given any (k, T)-configuration {si}i [k] (0, 1), a schema Y is a blueprint for a continuous test-function, specifying a sequence of lines {li}i [k] that any test-function of the schema must cross. It consists of an integer 0 < M < and two tuples {idx(i)}i [M] [k]M, {seci}i [M] [k+1]M, where |idx(i) idx(i+1)| 1 for all i [M 1]. If |idx(i) idx(i + 1)| = 1, then seci+1 is unique and must be max(idx(i), idx(i + 1)). If idx(i) = idx(i + 1), then seci+1 can be either idx(i + 1) or idx(i + 1) + 1. sec1 can be either idx(1) or idx(1) + 1.

Any continuous test-function of schema Y = ({idx(i)}i [M], {seci}i [M]) consists of exactly M segments S1, S2, . . . , SM whose domains are a partition of [0, 1]. For each i [M], the ith segment Si s end-point lies on lidx(i). For i > 1, Si s start-point lies on line lidx(i 1), and S1 s start-point is the origin, (0, 0). In addition, the ith segment must be contained in Sectorseci, where Sector1 is the subset of the positive quadrant of the 2D plane which lies above y = s1x, Sectork+1 is the subset of the positive quadrant which lies below y = skx, and for i {2, . . . , k}, Sectori is the subset of the positive quadrant which lies below y = si 1x and above y = six.

There is a useful approximation property between discrete and continuous test-functions which we use later. Lemma 6.9 (Discrete Approximation to Continuous Test Function, weaker version of Lemma E.24). Suppose we are given a (k, T)-configuration {si}i [k]. For any schema Y of M segments, suppose Y is any continuous test-function of schema Y , with segment lengths (n1(Y), . . . , n M(Y)) [0, 1]M (where we assumed WLOG that P

j nj(Y) = 1). Suppose every nj(Y) is a rational number and that the common denominator of all (nj(Y))j [M] is p. Then there exists an n0 O(p T k) so that for any positive integer multiple n of n0, there exists a discrete test-function X of length n so that i [k], |Bi(Y) Bi(X)| T 2+M

Definition 6.10 (Margin). Given a linear inequality L over M variables and a point x RM, define L(x) as the dif-

ference between the left-hand-side and right-hand-side of the inequality when the coordinates of x are plugged into L. Let L(x) = 0 x satisfies the inequalities with equality. We say L(x) is the margin of x for L.

6.2. Proof Sketch of Lemma 6.3.

Proof Plan. We want to show that any two C-RASP programs f, f C-RASP2,K,T that are not equal must be distinguished by some string x of length at most O(T O(K2)). Consider any two C-RASP programs f, f C-RASP2,K,T , with parameters (a(i))i [K], (b(i))i [K], (λi)i [K], z and ((a(i)) )i [K], ((b(i)) )i [K], (λ i)i [K], z , respectively.

For any n > 0, consider the induced activations by an arbitrary string x {0, 1}n, defined as { ps(h(i)(x))n

{ ps((h(i)) (x))n

n }i [K], per Definition 6.5. With k 2K being the number of unique slopes (i.e. unique values in { b(i)

a(i) }i [K] { (b(i))

(a(i)) }i [K]) among the heads in the first layers of f and f , denote (B1(x), B2(x), . . . , Bk(x)) :=

{ ps(h(i)(x))n

n }i [K] { ps((h(i)) (x))n

To argue that the existence of distinguisher x0 for f and f implies existence of a short distinguisher for for f and f , it suffices to find an x {0, 1}n, n O((KT)O(K2)) that induces activations (Bi(x))i [k] where either P

i [K] λi Bord(1,i)(x) > z and P i [K] λ i Bord(2,i)(x) z or P i [K] λi Bord(1,i)(x) z and P

i [K] λ i Bord(2,i)(x) > z . In particular, define the following halfspaces, H+ 1 , H 1 , H+ 2 , H 2 . j {1, 2}, H+ j := {(B1, . . . , Bk) : P i [K] λi Bord(j,i) > z} and H j := {(B1, . . . , Bk) : P

i [K] λi Bord(j,i) < z}. With this goal in mind, we use the following proof plan.

1. First, we argue that if f = f , then there exists a continuous test-function Y0 which induces activations (B1(Y0), . . . , Bk(Y0)) that is either contained in H+ 1 H+ 2 or H 1 H 2 . This is a non-trivial step which uses a technical Lemma E.30, whose proof is deferred to Appendix E.6. WLOG, suppose that (B1(Y0), . . . , Bk(Y0)) H+ 1 H+ 2 .

2. Next, we will show that because the set A({si}i [k]) H+ 1 H+ 2 is non-empty, then A({si}i [k]) H+ 1 H+ 2 must be at least a minimum size. We accomplish this via the following steps.

With Corollary E.14, we first characterize the set A({si}i [k]) as the union of a finite number of polytopes, where each polytope is equal to the set of activations which can be induced by test-functions of a particular schema. In this way, each polytope corresponds to a unique schema, and the finite set of schemas corresponding to the finite set of polytopes

Non-Asymptotic Length Generalization

serves as a basis of all test-functions. We convert Y0 into a new test-function which induces the same activations as Y0, but which follows one of the basis schema, Y . This new test-function, which we call Y1, is specified by a tuple of M := M(Y ) k2 numbers (n1, n2, . . . , n M) [0, 1]M, the lengths of the segments of Y1 in schema Y . Let A(M)(Y ) [0, 1]M denote the polytope of valid settings of the lengths of segments of schema Y , as explained in Lemma E.6. Then, there are two halfspaces H(M) 1 and H(M) 2 over M variables such that the polytope P := cl(A(M)(Y ) H(M) 1 H(M) 2 ) is the set of settings of the lengths of segments of schema Y , which correspond to test-functions whose induced activations are contained in cl(H+ 1 ) cl(H+ 2 ). By the existence of Y1, P = . Moreover, P is a polytope whose faces are low-precision, in the sense of Definition 5.2. Denote the set of vertices of P as V . Let c := 1 |V | P

v V v P be the average of the vertices of P. We apply Lemma E.18 to P to derive a lowerbound of the margin of c to the faces of P, of γ 1 |V | 1 (poly(K,T )

M)M . Since c P [0, 1]M is a valid setting of the lengths of the segments of schema Y , there is a continuous test-function Y2 of schema Y with segment lengths given by c.

3. It follows from steps 1 and 2 of the proof plan that if f, f are not equal, then there exists a point c P whose margin to the faces of P is at least γ 1 3M 2 1 (poly(K,T )

M)M , where we used the fact that

|V | 3M 2, from Lemma E.22.

Using Lemma E.29, we perturb the coordinates of c [0, 1]M slightly to attain a point that still has γ 2 margin to the faces of P, but whose coordinates have precision at most poly(K,T )

γ , in the sense of Definition 5.2. We will call this perturbed point c := (n(c ) 1 , n(c ) 2 , . . . , n(c ) M ) [0, 1]M. Let (B 1, . . . , B k) be the activations induced by a testfunction of schema Y with segment lengths set according to (n(c ) 1 , n(c ) 2 , . . . , n(c ) M ).

4. Finally, we apply Lemma 6.9 to the test-function of schema Y , with segment lengths set according to (n(c ) 1 , n(c ) 2 , . . . , n(c ) M ), to attain a discrete test-function X of length n whose induced activations are such that ||(B1(X ), . . . , Bk(X )) (B 1, . . . , B k)|| O( poly(K) poly(T )

n ). Note that there is a minimal value n0 poly(K, T) T O(K) 1

γ , which n must be a multiple of, as a requirement of Lemma 6.9.

Now, we discuss how large n needs to be so that discrete test-function X distinguishes f and f . First,

n must be larger than n0, which was a prerequisite of using Lemma 6.9. There is a second, important requirement. WLOG, suppose that the continuous test-function Y0 outputted in step 1 of the proof plan is such that: (B1(Y0), . . . , Bk(Y0)) H+ 1 H+ 2 . Then, the procedure in steps 2 and 3 are such that (B1(Y1), . . . , Bk(Y1)) H+ 1 H+ 2 , (B1(Y2), . . . , Bk(Y2)) H+ 1 H+ 2 , and (B 1, . . . , B k) H+ 1 H+ 2 . Now, we would like n to be large enough so that (B1(X ), . . . , Bk(X )) is also in H+ 1 H+ 2 , so that X would distinguish f and f . To do this, we need to ensure the coordinate-wise distance between (B1(X ), . . . , Bk(X )) and (B 1, . . . , B k) is smaller than the margin γ of c on the faces of H(M) 1 H(M) 2 divided by the maximum L1 norm of the linear inequalities defining the faces of A({si}i [k]) H+ 1 H+ 2 . It is important that the margin of c is large, as the smallest value of n which ensures that (B1(X ), . . . , Bk(X )) H+ 1 H+ 2 is proportional to 1

In short, we need n Θ(max(n0, poly(K,T )

γ )) for the resulting discrete test-function X to distinguish f and f . Since M k2, k 2K, and K T 2, it suffices for n = (poly(T))M = O(T O(K2)). X corresponds to a

string x {0, 1}O(T O(K2)) that distinguishes f and f .

Extra materials for the proof sketch are in Appendix A.2.

7. Conclusion

We prove guarantees of length generalization for various function classes in an idealized setting. We formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of CFGs. On the positive side, we show that the length complexity of DFAs is 2n 2 where n is the number of states of the groundtruth automaton. We show that the length complexity of 1-layer C-RASP functions is O(T 2) when the ground-truth function has precision T, and that the length complexity of 2-layer C-RASP functions is O(T O(K)) when the groundtruth function has precision T and K heads.

It is open whether the proof techniques can be extended to 3-Layer C-RASP programs, or to C-RASP programs whose layers contain bias terms. It is open how to formalize a weaker notion of partial length generalization which does not entail length generalization to arbitrary length inputs.

Non-Asymptotic Length Generalization

Acknowledgements

This work was supported by NSF IIS 2211780 and the Stanford HAI Google Cloud Credits Program. TC acknowledges funding from an NSF Graduate Research Fellowship. ZL acknowledges funding from an Open AI Superalignment Grant.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. Generalization on the unseen, logic reasoning and degree curriculum, 2024. URL https://proceedings.mlr.press/ v202/abbe23a/abbe23a.pdf.

Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models, 2022. URL https:// openreview.net/forum?id=z Sk YVe X7b C4.

Baker, B. S. and Book, R. V. Reversal-bounded multipushdown machines, 1974. URL https://doi.org/10. 1016/S0022-0000(74)80027-9.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., Mc Candlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https: //papers.nips.cc/paper/2020/hash/ 1457c0d6bfcb4967418bfb8ac142f64a-Abstract. html.

Del etang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. Neural networks and the chomsky hierarchy, 2023. URL https://openreview. net/pdf?id=Wbx HAzke Qcn.

Freiwald, R. C. An introduction to set theory and topology, 2014. URL https://openscholarship.wustl. edu/books/20/.

Gold, E. M. Language identification in the limit, 1967. URL https://www.sciencedirect.com/ science/article/pii/S0019995867911655.

Hopcroft, J. and Ullman, J. Introduction to automata theory, languages, and compuation, 1979.

Huang, X., Yang, A., Bhattamishra, S., Sarrof, Y., Krebs, A., Zhou, H., Nakkiran, P., and Hahn, M. A formal framework for understanding length generalization in transformers, 2024. URL https://openreview.net/ forum?id=U49N5V51r U.

Jelassi, S., d Ascoli, S., Domingo-Enrich, C., Wu, Y., Li, Y., and Charton, F. Length generalization in arithmetic transformers, 2023. URL https://arxiv.org/abs/ 2306.15400.

Li, M. and Vit anyi, P. An introduction to kolmogorov complexity and its applications, 2008. URL https://link.springer.com/book/10. 1007/978-0-387-49820-1.

Mahankali, A., Haochen, J. Z., Dong, K., Glasgow, M., and Ma, T. Beyond ntk with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time, 2023. URL https://openreview.net/forum? id=Y2hn MZv VDm&note Id=Szifox R0by.

Marsden, A., Dogariu, E., Agarwal, N., Chen, X., Suo, D., and Hazan, E. Provable length generalization in sequence prediction via spectral filtering, 2024. URL https: //arxiv.org/abs/2411.01035.

Nogueira, R., Jiang, Z., and Lin, J. Investigating the limitations of transformers with simple arithmetic tasks, 2021. URL https://arxiv.org/abs/2102.13019.

Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. Show your work: Scratchpads for intermediate computation with language models, 2021. URL https://arxiv.org/ abs/2112.00114.

Pitt, L. and Warmuth, M. K. The minimum consistent df a problem cannot be approximated within any polynomial, 1993. URL https://dl.acm.org/doi/pdf/10. 1145/138027.138042.

Rissanen, J. Modeling by shortest data description*. Autom., 14:465 471, 1978. URL https://api. semanticscholar.org/Corpus ID:30140639.

Rockafellar, R. T. Convex analysis, 1970.

Ruoss, A., Del etang, G., Genewein, T., Grau-Moya, J., Csord as, R., Bennani, M., Legg, S., and Veness, J. Randomized positional encodings boost length generalization of transformers, 2023. URL https:// aclanthology.org/2023.acl-short.161/.

Non-Asymptotic Length Generalization

Shaw, P., Chang, M.-W., Pasupat, P., and Toutanova, K. Compositional generalization and natural language variation: Can a semantic parsing approach handle both?, 2021. URL https://aclanthology.org/2021. acl-long.75/.

Shaw, P., Cohan, J., Eisenstein, J., Lee, K., Berant, J., and Toutanova, K. Alta: Compiler-based analysis of transformers, 2024. URL https://openreview.net/ forum?id=h751wl9xi R.

Solomonoff, R. J. A formal theory of inductive inference. part i, 1964. URL https: //www.sciencedirect.com/science/ article/pii/S0019995867911655.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2017. URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper. pdf.

Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers, 2021. URL https://proceedings.mlr. press/v139/weiss21a/weiss21a.pdf.

Yang, A. and Chiang, D. Counting like transformers: Compiling temporal counting logic into softmax transformers, 2024. URL https://openreview.net/forum? id=Fmh Pg4UJ9K#discussion.

Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J., Bengio, S., and Nakkiran, P. What algorithms can transformers learn? a study in length generalization, 2023. URL https://openreview.net/ forum?id=Ass Iu Hnm HX.

Zhou, Y., Alon, U., Chen, X., Wang, X., Agarwal, R., and Zhou, D. Transformers can achieve length generalization but not robustly, 2024. URL https://openreview. net/pdf?id=DWk WIh3v FJ.

Non-Asymptotic Length Generalization

Organization of Appendices.

Section A contains supplementary material to the main paper, particularly the proof sketch. Section B contains a discussion of Gold (1967) s asymptotic notions of learnability and the proof of Proposition B.2. Section C contains the proofs of Theorem 3.2, Lemma 3.4, and Lemma C.6. Section D contains the proof of Propositions 2.7 and 3.5. Section E.1 contains the proof of Theorem 5.5. Section E.2 contains the full proof of Theorem 6.1, which uses Definitions and Lemmas defined in the later sections E.3 to E.6. Section E.3 contains proofs of Lemmas relating to completeness of basis schema. Section E.4 contains proofs of Lemmas relating to lower bounding the margin of the average-of-vertices of a polytope with low-precision faces. Section E.5 contains proofs of Lemmas for approximating a continuous test-function with a discrete test-function. Section E.6 contains auxiliary Lemmas.

A. Missing Parts from Main Paper

Here, we put some supporting content which were referred to in the main paper, but omitted due to space constraints.

A.1. List of Allowed C-RASP Operations in (Yang & Chiang, 2024), in the Context of Definition 5.1.

Table 2. Booleanand count-valued operations allowed in a C-RASP program.

Boolean-Valued Operations

Initial h(i) j := 1[xj = a] for a {0, 1}

Boolean h(i) j := h(i ) j h(i) j := h(i ) j h(i ) j

Sign h(i) j := 1[h(i ) j > 0]

Constant h(i) j := 1

Count-Valued Operations

Partial Sum h(i) j := ps(h(i ))j

Conditional h(i) j := h(i ) j ? h(i ) j : h(i ) j

Addition h(i) j := h(i ) j + h(i ) j

Subtraction h(i) j := h(i ) j h(i ) j

Min/Max h(i) j := min(h(i ) j , h(i ) j )

h(i) j := max(h(i ) j , h(i ) j )

Constant h(i) j := 1

A.2. Supporting Material for the Proof Sketch in Section 6.

Proof that Theorem 5.6 follows from Theorem 6.1. Proof of Theorem 5.6. Every function f in C-RASP2,K,T has at most K heads and T precision, and so C(f) T K =: α . Thus, Theorem 6.1 implies that inputs of length at most NAmci(α ) O(αO(1) ) = O(T O(K)) suffices to learn C-RASP2,K,T .

Proof that Lemma 6.2 follows from Lemma 6.3. Proof of Lemma 6.2. By Theorem 3.2, it suffices to show that for any two un-equal functions f, f C-RASP2 such that T(f)K(f) α and T(f )K(f ) α, there is a string of length at most O(αO(log2 α)) which distinguishes them. If T(f)K(f) α and T(f )K(f ) α, then T(f), T(f ) are upper bounded as T(f), T(f ) α, maximized when K(f), K(f ) = 1. Meanwhile, K(f), K(f ) are upper bounded as K(f), K(f ) log α, maximized when K(f) = T(f)2, K(f ) = T(f )2. Thus, f, f C-RASP2,log α,α. Applying Lemma 6.3 implies that there is a string of length O(αO(log2 α)) distinguishing f and f .

Formal Summary of Each Step of Proof Plan in Section 6.2. Below, we formalize the conclusions drawn from each step of the proof plan in the proof sketch of Lemma 6.3. Putting these Lemmas together yields Lemma 6.3.

Lemma A.1 (Analysis of Step 1). If f = f C-RASP2,K,T , then there exists a continuous test-function Y0 such that (Bi(Y0))i [k] (H+ 1 H+ 2 ) (H 1 H 2 ). WLOG, suppose (Bi(Y0))i [k] H+ 1 H+ 2 .

Lemma A.2 (Analysis of Step 2). Given continuous test-function Y0 where (Bi(Y0))i [k] H+ 1 H+ 2 , then there exists a continuous test function Y2 of a schema Y of M k2 segments, where lengths of the segments of Y2 is given by

Non-Asymptotic Length Generalization

(n(c) i )i [M] [0, 1]M. In addition, the M-dimensional point (n(c) i )i [M] has margin at least γ 1 3M 2 1 (poly(K,T )

the faces of polytope P := cl(A(M)(Y ) H(M) 1 H(M) 2 ).

Lemma A.3 (Analysis of Step 3). Given any point c := (n(c) i )i [M] [0, 1]M which are segment lengths of a particular test-function of schema Y , suppose c has margin at least γ > 0 to faces of polytope P = cl(A(M)(Y ) H(M) 1 H(M) 2 ). Then, one can perturb the coordinates of c to get a point c [0, 1]M such that (1) the margin of c to the faces of P is at least γ

2 and (2) the coordinates of c have precision at most poly(K,T )

Lemma A.4 (Analysis of Step 4). If there exists γ > 0, c P such that the margin of c to the faces of P is at least γ

2 and the coordinates of c have precision at most poly(K,T )

γ , then there exists discrete test-function X such that

(Bi(X ))i [k] H+ 1 H+ 2 and the length of X is at most O( poly(K,T )

With M k2, k 2K, and K T 2, we conclude that if f = f , then there exists a string x , |x | O(T O(K2)) such that f(x ) = f (x ).

Details of Key Steps of Proof Plan of Section 6.2. We now elaborate on key steps of the proof plan. We will focus on Step 2, because Step 1 just applies Lemma E.30, Step 3 just applies Lemma E.29, and Step 4 applies Lemma 6.9 and analyzes how large n needs to be so that the final discrete test-function distinguishes f, f .

Details of Step 2. First, we characterize A({si}i [k]) as the union of a finite number of polytopes.

Lemma A.5 (Characterization of A({si}i [k]), combination of Corollary E.14 and Lemma E.16). For any (k, T)- configuration {si}i [k], there are a finite number of k-dimensional convex polytopes {Aj}j [Nk], Nk < , such that

A({si}i [k]) = [

Figure 9 pictorally depicts the completeness Lemma for k = 2, which shows that A({si}i [2]) is the union of two triangles.

The way Lemma A.5 is proved is by showing that for any continuous test-function Y, there exists another continuous test-function of one of a finite number of possible schemas which induces the same activations {Bi(Y)}i [k]. We call these Nk schemas basis schema; they are a basis in the sense that any continuous test-function is equivalent to a continuous test-function which follows some basis schema. Figure 8 depicts a basis schema. We defer the details to Appendix E.3. The main take-away is that each polytope Aj [0, 1]k corresponds to a schema Yj, and Aj is the set of activations which can be induced by a test-function of schema Yj.

j [Nk], Aj := {(B1(Y), . . . , Bk(Y)) : Y valid test-function of schema Yj } [0, 1]k

Thus, given Y0 from Step 1 such that, WLOG, (Bi(Y0))i [k] H+ 1 H+ 2 , then there is an equivalent test-function Y1 of one of the basis schema, Y {Yj}j [Nk], such that (Bi(Y1))i [k] = (Bi(Y0))i [k] H+ 1 H+ 2 . Denote the set A(Y ) as the particular Aj which corresponds to schema Y .

A(Y ) := {(B1(Y), . . . , Bk(Y)) : Y valid test-function of schema Y } [0, 1]k

Now, suppose that schema Y consists of M segments, whose lengths we denote (n1, n2, . . . , n M) [0, 1]M with P

i [M] ni = 1. Note that not all settings of (n1, n2, . . . , n M) are valid, because there are linear constraints which must be met by (n1, n2, . . . , n M), which are described in Lemma E.6. Define A(M)(Y ) [0, 1]M as the set of all valid settings of (n1, n2, . . . , n M), per Lemma E.6. A(M)(Y ) is an (M 1)-dimensional polytope, whose faces are parameterized by linear inequalities whose coefficients are at most poly(K, T) in magnitude (i.e. poly(K, T)-precision).

Non-Asymptotic Length Generalization

A(M)(Y ) := {(n1, . . . , n M) : valid segment lengths of schema Y and X

i [M] ni = 1} [0, 1]M

Further, define H(M) 1 and H(M) 2 as the analogous halfspaces to H+ 1 and H+ 2 but in the space of segment lengths as follows. There exists a linear map L : RM Rk which maps points in A(M)(Y ) to points in A(Y ). L {0, 1}k M is such that Lij = 1 segment j in schema Y lies above line i (that is, for every x in the domain of segment j, the y-value of the segment at x is at least si x) and hence contributes to the ith activation Bi(Y) of any test-function Y of schema Y . Using L, we can rewrite the inequalities which characterize H1, H2 in terms of (n1, . . . , n M).

H(M) 1 := {(n1, . . . , n M) :

i=1 λi Bord(1,i) > z} = {(n1, . . . , n M) :

j s.t. Lord(1,i),j=1 nj > z}

H(M) 2 := {(n1, . . . , n M) :

i=1 λ i Bord(2,i) < z } = {(n1, . . . , n M) :

j s.t. Lord(2,i),j=1 nj < z }

With P := cl(A(M)(Y ) H(M) 1 H(M) 2 ), we use the fact that (Bi(Y1)i [k] H+ 1 H+ 2 to deduce that A(M)(Y )

H(M) 1 H(M) 2 = . Let V be the set of vertices of P. This next Lemma is for lower-bounding the margin of point c = 1 |V | P

v V v [0, 1]M on the faces of P, when the latter is non-empty.

Lemma A.6 (Margin lower bound, slightly informal version of Lemma E.18). Consider a nonempty d-dimensional polytope P Rd with vertices V and N faces. Suppose the faces of P are each defined by a linear inequality over variables {xi}i [d], with integer coefficients of magnitude at most pface, where points on the face satisfy the linear inequality with equality. For j [N], define Lj as the linear inequality for the jth face of P. Then, for any j [N], for any vertex x V which does not lie on the jth face of P, we have the following lower bound on the margin of x on the jth face of P.

Lj(x) 1 (pface

With d = M and pface = poly(K, T) and using the linearity of the margin, it follows that the margin of c = 1 |V | P

v V v on any face of P is at least 1 |V | 1 (poly(K,T )

M)M . c is a point in [0, 1]M, and it is a valid setting of segment lengths for a

test-function of schema Y , since c has positive margin to the faces of P, so it is contained in P A(M)(Y ). The activations of the test-function with segment lengths given by c will be contained in H+ 1 H+ 2 since c P H(M) 1 H(M) 2 . This concludes Step 2.

B. Asymptotic Length Generalization, Adapted from (Gold, 1967)

Define the following asymptotic notion of learnability.

Definition B.1 (Length Generalization in the Limit, adapted from Gold (1967)). A function class F FR admits length generalization in the limit w.r.t. encoding system R if there exists a learning algorithm A such that for all f F, there exists a natural number N such that for all N N, A length-generalizably learns f at input length N .

The above definition of length generalization in the limit is a special case of the so-called identification in the limit in the informant model in (Gold, 1967). The major difference is that (Gold, 1967) requires the function class F to be learnable in arbitrary order of data presentation, while in our case, we are only interested in a particular order of data presentation, namely the order of increasing input length.

Proposition B.2 is a result about which function classes are learnable in the sense of Definition B.1. We provide a proof here, which is similar to that of Theorem I.4 of (Gold, 1967)

Proposition B.2 (Adapted from Theorem I.4 of Gold (1967)). For all encoding systems R, the function class FR admits length generalization in the limit, and thus so does any function class F FR.

Non-Asymptotic Length Generalization

Proof. Consider a learning algorithm which enumerates functions in FR. It maintains a current hypothesis, which is initialized to the first element of the enumeration. Given an input training set, the learning algorithm checks whether the current hypothesis interpolates the training data. If it does not, then the current hypothesis is updated to the next function in the enumeration until the current hypothesis interpolates the training data, at which point the current hypothesis is outputted.

First, because R maps {0, 1} to FR, then we can enumerate FR simply by enumerating all strings p {0, 1} , say in the order of increasing binary-representations of natural numbers, and then computing R(p) for each string p. The computation of R(p) will always halt by Definition 2.1. Every function in FR appears at some point in this enumeration by the definition of FR. Suppose that we denote this enumeration by E = {f0, f1, f2, . . .}, with Ec := {f0, f1, f2, . . . , fc} being the first c + 1 elements of the enumeration, for any c N.

Second, because the output of R is always a computable function which halts on every input, the learning algorithm described in the first paragraph is computable.

By Theorem 3.2 with C( ) being the mapping from a binary string to the integer it represents in binary (so that each finite representation is mapped to its order in the aforementioned enumeration), it suffices to show that for any c N, N(Ec) := min{n : f = f Ec, x {0, 1} n s.t. f(x) = f (x)} < . This follows directly from the fact that |Ec| = c < and that for any f = f Ec, there exists some string x, |x| < where f(x) = f (x). For each pair of unequal functions in Ec, pick the shortest such string x which distinguishes them, and take N(Ec) to be the maximum of the lengths of these distinguishers. This must be finite.

C. Proofs on Equivalence of Different Non-Asymptotic Length Generalization Definitions

C.1. Properties of Amci

We will prove that Amci is optimal with respect to the quantity NAmci(c). Note we will drop the superscripts for F, C, R from the notation for convenience.

Theorem 3.2 (Optimality of Minimum-Complexity Interpolator). Given any encoding system R and complexity measure C, for all c N, it holds that N R AR,C mci (c) = min A N R A (c) = N(FR c ). As a consequence, the following three statements are

equivalent:

Function class FR admits non-asymptotic length generalization; Function class FR admits non-asymptotic length generalization, via learning algorithm AR,C mci ; For all c N, length complexity of FR c , N(FR c ), is computably bounded in c.

Proof of Theorem 3.2. For any c N, we have

N(FR c ) := min{n N : f = f F with CR(f ), CR(f) c,

there exists x {0, 1} n s.t. f(x) = f (x)}

We will show separately that NAmci(c) min A NA(c) N(FR c ) and NAmci(c) N(FR c ).

(1). min A NA(c) N(FR c ) follows from the fact that for any N < N(FR c ), there exists f = f with CR(f ), CR(f) c that agree on all inputs x {0, 1} N. Given training set {(x, f(x)) : x {0, 1} N} = {(x, f (x)) : x {0, 1} N}, any algorithm A which outputs an encoding of a function which is not equal to f will be wrong in the case the ground truth is f. Meanwhile, any algorithm A which outputs an encoding of a function which is not equal to f will be wrong in the case the ground truth is f . Thus, NA(f) > N or NA(f ) > N, so NA(c) > N.

(2). NAmci(c) N(FR c ). For f with CR(f) c, suppose that on input of {(x, f(x)) : x {0, 1} N(FR c )}, Amci outputs p , where R(p ) = f. This implies that C(R(p )) C(p ) CR(f) c and that R(p ) is consistent with {(x, f(x)) : x {0, 1} N(FR c )}. This is a contradiction of the definition of N(FR c ), which says that all pairs of unequal functions f, f with CR(f), CR(f ) c can be distinguished from each other by some input of length at most N(FR c ).

Denote p p if R(p) = R(p ). The following Lemma is also useful for showing that the same bounds on the length complexity hold whether we think of Amci as operating over finite descriptions p {0, 1} (which is the actual implementation) or as operating over functions f F (which is a more abstract way to think of Amci).

Non-Asymptotic Length Generalization

Lemma C.1. Given (F, R, C), with CR(f) := minp {0,1} ,R(p)=f C(p), then for all c N:

min{n : f = f F, CR(f), CR(f ) c, x, |x| n, f(x) = f (x)}

= min{n : p p {0, 1} , C(p), C(p ) c, x, |x| n, R(p)(x) = R(p )(x)}

Proof. Every function f F with CR(f) c has some finite description p with R(p) = f and C(p) c. No function f F with CR(f) > c has any finite description p with R(p) = f and C(p) c.

We also have the following result which says that Amci is computable if C satisfies Assumption 2.4.

Lemma 3.3. Under Assumption 2.4, Algorithm 1 is computable and thus a valid learning algorithm.

Proof of Lemma 3.3. By Assumption 2.4, we use TM E to enumerate all programs p {0, 1} in non-decreasing order of C(p) and at each iteration check if R(p) is consistent with the training data. If it is, we output p and stop. Otherwise, we continue to the next program. Both the enumeration and checking-consistency procedures are computable, as assumed by Assumption 2.4 and Definition 2.1.

C.2. Equivalence of Non-Asymptotic Length Generalization with Decidability of Language Equivalence Problem

Now, we will prove that Definition 2.6 is equivalent to a few other definitions. First, recall the definition of non-asymptotic length generalization and the language equivalence problem.

Definition 2.6 (Non-Asymptotic Length Generalization). A function class F FR admits non-asymptotic length generalization w.r.t. encoding system R and complexity measure C if there exists a learning algorithm A and a computable function ˆN R,F A : N N such that for all f F and for all N ˆN R,F A (CR(f )), A length-generalizably learns f at input length N .

Definition C.2 (Language Equivalence Problem). The Language Equivalence Problem for encoding system R is the computational problem where given any p, q {0, 1} , determine whether R(p) = R(q).

We now prove the following equivalences.

Lemma 3.4. For any encoding system R and complexity measure C satisfying Assumption 2.4, the Language Equivalence problem for R is decidable if and only if length complexity of FR c , N(FR c ), is computably bounded in c. Thus it is also equivalent to the property that FR admits non-asymptotic length generalization.

Proof. Suppose TM E enumerates elements in {0, 1} in non-decreasing order of their complexity according to C.

N(FR c ) computably bounded in c = Language equivalence problem for R decidable.

Suppose there is a computable procedure F that, for any c, receives as input c and outputs an upper bound on N(FR c ). We will describe an algorithm that is given any two finite representations, p, q {0, 1} , and uses F to determine if R(p) = R(q).

Given p, q {0, 1} , compute c = max(C(p), C(q)) and generate the training dataset DF (c)(R(q)). Now, check whether (x, R(q)(x)) DF (c)(R(q)), that R(p)(x) = R(q)(x). If this is the case, return equivalent. Otherwise, return non-equivalent .

To argue correctness, if there is some (x, R(q)(x)) DF (c)(R(q)) where R(p)(x) = R(q)(x), then clearly these two functions are not equal. If (x, R(q)(x)) DF (c)(R(q)), R(p)(x) = R(q)(x), then since C(R(p)) C(p) c and C(R(q)) C(q) c, then it must be that R(p) = R(q), or else F(c) is not an upper bound of N(FR c ).

Language equivalence problem for R decidable = N(FR c ) computably bounded in c. Suppose TM M solves the Language Equivalence Problem for R. By Lemma C.1, it suffices to show that the following quantity is computably bounded in c N.

min{n : p p {0, 1} , C(p), C(p ) c, x, |x| n, R(p)(x) = R(p )(x)}

Algorithm 2 computes an upper bound of this quantity.

Non-Asymptotic Length Generalization

Algorithm 2 Computation of N(FR c ) Require: Integer c N Ensure: N(FR c ) 1: N 0 2: for all pairs of programs (p, q) (enumerated by E) with C(p), C(q) c do 3: if M(p, q) = non-equivalent then 4: for all strings x {0, 1} with non-decreasing length do 5: if R(p)(x) = R(q)(x) then 6: N max(N, |x|) 7: break 8: end if 9: end for 10: end if 11: end for 12: return N

Algorithm 2 always terminates, since the for-loop in line 4 only is performed on p, q {0, 1} which are not equivalent, which are distinguished by a finite string. It returns the smallest length required to distinguish every two non-equivalent p, q {0, 1} .

Finally, it follows from Theorem 3.2 that the language equivalence problem for R being decidable is equivalent to non-asymptotic length generalization of FR w.r.t. R, C when C satisfies Assumption 2.4.

C.3. Connection to Finite Identification, (Gold, 1967)

Our notion of non-asymptotic length generalization is related to Gold s notion of Finite Identification (Gold, 1967), which is restated as follows in the context of length generalization. Definition C.3 (Finite Length Generalization ( Identification , (Gold, 1967))). A function class F FR admits Finite Length Generalization w.r.t. encoding system R if there exists a Turing Machine (TM) A, where for any f F, on input of a training set DN(f ) for some N N, A satisfies:

1. A can only output either pass or some program ˆp {0, 1} . A outputs ˆp at least for one N N.

2. Whenever A outputs some program ˆp {0, 1} , it must be correct in the sense that R(ˆp) = f .

It is clear that Finite Length Generalization implies Length Generalization in the limit one can replace pass by arbitrary program ˆp and sticks to the same (correct) output ˆp once A outputs any program ˆp. Intuitively, a function class admits Finite Length Generalization if there exists a learning algorithm which can perfectly length generalize at some finite input length (and where the learning algorithm knows the length at which it length generalizes), rather than only generalizing in the limit. Thus, Finite Length Generalization is a very desirable property. However, it is in general too good to be true. We give a characterization of Finite Length Generalization in Lemma C.4. As a consequence of Lemma C.4, functions classes as simple as the set of all languages that are finite do not admit Finite Length Generalization. Lemma C.4 (Characterization of Finite Length Generalization). A function class FR admits Finite Length Generalization w.r.t. encoding system R if and only if for any f FR, there exists a natural number N such that f is the only function that is consistent with the training set DN(f ).

The proof of Lemma C.4 is straightforward and omitted.

Since Finite Length Generalization is in general too restrictive, we relax the definition to allow the learning algorithm to output a program ˆp with some information on the complexity measure of ground truth f , namely some c CR(f ). This leads to the definition of Finite Length Generalization with Complexity Information in Definition C.5. Interestingly, Definition C.5 is equivalent to non-asymptotic length generalization (Definition C.3). Definition C.5 (Finite Length Generalization with Complexity Information). A function class F FR admits Finite Length Generalization w.r.t. encoding system R and complexity measure C if there exists a Turing Machine (TM) A, which for any f F, on input of a training set DN(f ) for some N N and a natural number c CR(f ), it satisfies that:

Non-Asymptotic Length Generalization

1. A can only output either pass or some program ˆp {0, 1} . A outputs ˆp at least for one N N.

2. Whenever A outputs some program ˆp {0, 1} , it must be correct in the sense that R(ˆp) = f .

Lemma C.6 (Equivalence of Finite Length Generalization Definitions). For any encoding system R and complexity measure C satisfying Assumption 2.4, function class FR admits Finite Length Generalization with Complexity Information w.r.t. R and C if and only if length complexity of FR c , N(FR c ), is computably bounded in c. Thus it is also equivalent to FR admits non-asymptotic length generalization.

Proof of Lemma C.6. Suppose TM E enumerates elements in {0, 1} in non-decreasing order of their complexity according to C.

N(FR c ) computably bounded in c = FR admits Finite Length Generalization with Complexity Information w.r.t. R and C. Let F : N N be a computable upper bound on N(FR c ). Let A be given by Algorithm 3, on input DN(f ) and c CR(f ) N.

Algorithm 3 Learning Algorithm A Require: Dataset DN(f ); c N with c CR(f ) Ensure: Either pass or an program ˆp

1: if N < F(c) then 2: return pass 3: else 4: ˆp Amci DN(f )

5: return ˆp 6: end if

First, A only ever returns pass or some ˆp {0, 1} since Amci only ever returns elements of {0, 1} . Second, by Theorem 3.2, for c CR(f ) and N F(c), then R(Amci(DN(f ))) = f . Thus, whenever A returns an element ˆp {0, 1} , it is such that R(ˆp) = f , and there is at least one N where R(Amci(DN(f ))) = f .

FR admits Finite Length Generalization with Complexity Information w.r.t. R and C = N(FR c ) computably bounded in c. We need to prove that there exists some computable F which upper bounds N(FR c ). Suppose A satisfies Definition C.5. Let F be given by Algorithm 4.

Algorithm 4 Algorithm for F Require: Integer c N Ensure: Nmax

1: Nmax 0 2: for all p {0, 1} with C(p) c (Enumerate with E) do 3: N 0 4: while A(DN(R(p)), C(p)) = pass do 5: N N + 1 6: end while 7: Nmax max(Nmax, N) 8: end for 9: return Nmax

Algorithm 4 is computable, due to the guarantee of A that for any f , given c CR(f ), there exists some N where R(A(DN(f ))) = f , which ensures termination of Algorithm 4.

Regarding correctness, if there exists some c N where F(c ) < min{n : p p {0, 1} , C(p), C(p ) c , x, |x| n, R(p)(x) = R(p )(x)}, then there must exist two p, q {0, 1} with C(p), C(q) c , which are not equivalent, but which agree on all inputs of length at most F(c ). WLOG, suppose that C(p) C(q).

There exists Nq F(c ) where A(DNq(R(q)), C(q)) = pass . In particular, A(DNq(R(q)), C(q)) must return a ˆp which is equivalent to q, where q p. On the other hand, since Nq F(c ), we have DNq(R(q)) = DNq(R(p)). Thus, we

Non-Asymptotic Length Generalization

have shown that A(DNq(R(p)), C(q)) = pass and is a finite representation which is not equivalent to the ground-truth p. Since C(p) C(q), this yields a contradiction with the fact that whenever A is given an upper bound on the ground-truth complexity and A does not return pass , it must return a finite description ˆp which is equivalent to the ground-truth.

Figure C.3 sums up all these equivalences.

Non-Asymptotic Length Generalization (Definition 2.6)

Non-Asymptotic Length Generalization by Minimum Complexity Interpolator

Decidability of Equivalence Problem (Definition C.2)

Finite Length Generalization with Complexity Info (Definition C.3)

Length complexity N(FR c ) computably bounded in c Theorem 3.2

Figure 1. Summary of equivalence results between different characterizations of length generalization in Section 3 under some mild simplicity assumptions on the complexity measure C (Assumption 2.4). Each arrow represents an implication proven by the corresponding theorem.

D. Proofs on Length Generalization for DFAs and CFGs.

Proposition 2.7 (Non-Asymptotic Length Generalization for DFAs). Let RDFA be the DFA encoding system defined in Definition 2.2, and let CDFA be the number of states in DFA. Regular languages FR admits non-asymptotic length generalization w.r.t. encoding system RDFA and complexity measure CDFA. More specifically, there exists a learning algorithm A such that N RDFA A (c) 2c 2 for all c N.

In the following proof, we essentially describe and analyze the State Minimization Algorithm (Hopcroft & Ullman, 1979). Before proving the proposition, we will need a few concepts and Lemma D.1.

Suppose there is some ground-truth minimal DFA, D, of n states, with language L := L(D). Define ϵ as the empty string, of length 0. For two strings u, v, denote uv as their concatenation. For any x {0, 1} , denote f (x) = 1 x L. Denote Q as the set of n states for D and δ( , ) : Q {0, 1} Q as the state transition function for D, with start state q0. Let F Q be the set of accepting states. We say that states q, q Q are distinguished by string v {0, 1} if δ(q, v) F and δ(q , v) / F, or if δ(q, v) / F and δ(q , v) F.

In general, applying a string from the start state q0 in the DFA will cause the DFA to end up in some state q Q, and an identification can be made between the input string and the state it causes the DFA to end up in. Thus, to learn the DFA, we would like to find a mapping from {0, 1} to [n], which tells us which strings correspond to which states in the ground-truth DFA. To do this, we claim it is sufficient to consider the equivalence class given by the sets {E(u)}u {0,1} n, where:

u {0, 1} n, E(u) := {v {0, 1} n 2 : uv L}

If for u, u {0, 1} n, E(u) = E(u ), then we will claim that these two strings correspond to the same state. Otherwise, u, u correspond to different states. This is what is meant by an equivalence class over {E(u)}u {0,1} n.

Once a learning algorithm can make this identification between strings and the states they correspond to, then the learning algorithm can infer the transition function of the DFA by considering, for each u {0, 1} n 1, the states E(u), E(u0), and E(u1), via the sets {E(u)}u {0,1} n 1 and {E(u)}u {0,1} n. This gives the transition function. E(ϵ) corresponds to the start state. Strings u corresponding to accept states will be such that ϵ E(u). Note that we just need to consider

Non-Asymptotic Length Generalization

{E(u)}u {0,1} n instead of {E(u)}u {ϵ} {0,1} to characterize the states of D, since any u of length larger than n will cause the DFA to reach some state that is also reached by a string of length at most n.

We claim that proving the following property about our sets E(u) suffices to prove that there is a learning algorithm which identifies the ground-truth DFA. Lemma D.1. For any minimal DFA D with n 2 states and transition function δ, then with E(u) defined as above for each u {0, 1} n, we have u, u {0, 1} n, δ(q0, u) = δ(q0, u ) E(u) = E(u ).

First, we will show how to prove Proposition 2.7 with this Lemma.

Proof of Proposition 2.7. Suppose a DFA learning algorithm is given inputs of length 2n 2. For n 2, Lemma D.1 implies that the learning algorithm can identify the states of D by constructing the sets {E(u)}u {0,1} n, using its training data {(x, f (x)) : |x| 2n 2}. Each state is identified with an equivalence class of {E(u)}u {0,1} n 1, since each state in an n state DFA can be reached by a string of length at most n 1. The state transitions can be identified from {E(u)}u {0,1} n by considering, for each u {0, 1} n 1, the states E(u), E(u0), and E(u1). E(ϵ) corresponds to the start state. Strings u corresponding to accept states will be exactly those such that ϵ E(u). In the case that the learning algorithm only receives inputs of length at most 0 (i.e. it only receives (ϵ, f (ϵ))) or 1, it will simply output the DFA with language if the labels of all inputs received are 0 and it will output the DFA with language {0, 1} if the labels of all inputs received are 1. This handles the case where n = 1.

The learning algorithm above identifies the ground-truth DFA, requiring inputs of length at most 2n 2 when the groundtruth DFA has n states. By Theorem 3.2, the Minimum Complexity Interpolator Amci will also require inputs of length at most 2n 2 to identify the ground-truth DFA. This proves the Proposition.

Now we prove Lemma D.1.

Proof of Lemma D.1. Showing δ(q0, u) = δ(q0, u ) = E(u) = E(u ) is easy, since if δ(q0, u) = δ(q0, u ), then the states reached by u, u from q0 are the same, so no string of any length can distinguish them δ(q0, u), δ(q0, u ), so E(u) = E(u ).

We now show δ(q0, u) = δ(q0, u ) = E(u) = E(u ). We are given that the ground truth DFA has n states and is the minimal DFA for its language, so that there are no DFAs of fewer states with the same language. Given this, we claim that there are exactly n equivalence classes among {E(u)}u {0,1} n. Each equivalence class must correspond to a unique state, so each of n unique states can be identified with a unique E(u), finishing the proof of Lemma D.1.

We will now show that there are exactly n equivalence classes among {E(u)}u {0,1} n. For 0 i n 2, u {0, 1} n, define Ei(u) := {v {0, 1} i : uv L}. To do this, we claim that for any i < n 2, if the number of equivalence classes in {Ei(u)}u {0,1} n is less than n, then there must be some u, u {0, 1} n and vi+1 {0, 1}i+1 such that

Ei(u) = Ei(u )

and f (uvi+1) = f (u vi+1)

Suppose this were not the case. That is, suppose that for some i < n 2, (1) the number of equivalence classes in {Ei(u)}u {0,1} n is less than n. (2) Yet, u, u {0, 1} n, vi+1 {0, 1}i+1, Ei(u) = Ei(u ) = f (uvi+1) = f (u vi+1). We want to show that together, (1) and (2) contradict the minimality of the ground-truth DFA.

We claim that, by induction, (2) implies that for every k i, for every u, u {0, 1} n and vk+1 {0, 1}k+1, Ek(u) = Ek(u ) = f (uvk+1) = f (u vk+1). This is true for base case k = i by (2).

For the inductive step, suppose the claim is true for k > i. Suppose u, u {0, 1} n are such that Ek(u) = Ek(u ). Then vk {0, 1} k, f (uvk) = f (u vk). For any such string vk with length k, we can break vk into a prefix of k i bits and a suffix of i bits. Thus,

vk i {0, 1}k i, vi {0, 1}i, f (uvk ivi) = f (u vk ivi)

Each string uvk i and u vk i will correspond to some state reachable by some strings u , u {0, 1} n, in the sense

Non-Asymptotic Length Generalization

that δ(q0, uvk i) = δ(q0, u ) and δ(q0, u vk i) = δ(q0, u ), as each state in the DFA can be reached by a string of length at most n 1. Thus, Ei(u ) = Ei(u ), so that by (2), vi+1 {0, 1}i+1, f (u vi+1) = f (u vi+1). Since δ(q0, uvk i) = δ(q0, u ) and δ(q0, u vk i) = δ(q0, u ) imply that for any such vi+1, f (uvk ivi+1) = f (u vi+1) and f (u vk ivi+1) = f (u vi+1), then vi+1 {0, 1}i+1, f (uvk ivi+1) = f (u vk ivi+1). In short,

vk i {0, 1}k i, vi+1 {0, 1}i+1, f (uvk ivi+1) = f (u vk ivi+1)

Thus, vk+1 {0, 1}k+1, f (uvk+1) = f (u vk+1), completing the induction. We have proved the claim for all k i.

However, this will lead to a contradiction if we take k , since applying the guarantee for all k i, we have that

u, u {0, 1} n, Ei(u) = Ei(u ) = E (u) = E (u )

Where E (u) := {v {ϵ} {0, 1} : uv L}. We know that any two distinct states in a minimal DFA will be distinguished by some finite string (See Theorem 4.24 of (Hopcroft & Ullman, 1979)). Thus, for any u, u {0, 1} n,

Ei(u) = Ei(u )

= E (u) = E (u )

= δ(q0, u) = δ(q0, u )

So n = #(Equivalence classes in {E (u)}u {0,1} n) = #(Equivalence classes in {Ei(u)}u {0,1} n) < n, where the last inequality is due to (1). This is a contradiction.

Thus, while the number of equivalence classes in {Ei(u)}u {0,1} n is less than n, there must be some u, u {0, 1} n

and vi+1 {0, 1}i+1 such that

Ei(u) = Ei(u )

and f (uvi+1) = f (u vi+1)

This implies that the number of equivalence classes in {Ei+1(u)}u {0,1} n is at least 1 greater than the number of equivalence classes in {Ei(u)}u {0,1} n while the latter is less than n.

Finally, {E0(u)}u {0,1} n has 2 equivalence classes for any non-trivial DFA that whose language isn t or {0, 1} , and any n 2 state minimal DFA must be non-trivial, as the minimal trivial DFAs have at most one state. The two equivalence classes of {E0(u)}u {0,1} n are given by the states that are accepting and those that are not. For all i < n 2, the number of equivalence classes in {Ei+1(u)}u {0,1} n grows by at least 1 from that of {Ei(u)}u {0,1} n. We can have no more equivalence classes than n, and the i where {Ei(u)}u {0,1} n has n equivalence classes will be at most n 2. This implies Lemma D.1 is true if we let E(u) = En 2(u), u {0, 1} n.

Now, we will prove an impossibility result for linear CFGs. Proposition 3.5 ((Linear) CFGs only admit length generalization in the limit). Recall RL-CFG is the encoding system for Linear CFGs defined in Definition 2.3 and CCFG( G ) is the complexity measure that maps CFG G = (N, T, P, S = {0, 1}) to |N| + |T| + |P|. Then for any learning algorithm A, the length complexity, N RL-CFG A , is not computably bounded. That is, Linear CFGs do not admit non-asymptotic length generalization (w.r.t. standard CFG encoding system RCFG), and neither does the set of all CFGs.

Proof of Proposition 3.5. This follows directly from Lemma 3.4 and Theorem 3.2, with R being an encoding system which maps string encodings of linear CFGs to the corresponding language, F = FR being the languages recognized by linear CFGs, and C being the complexity measure described in Definition 2.3. Because RL-CFG satisfies Assumption 2.4, Lemma 3.4 and Theorem 3.2 imply that encoding system FR admits non-asymptotic length generalization w.r.t. R, C iff the language equivalence problem for R is decidable. However, by (Baker & Book, 1974), the latter is undecidable.

Non-Asymptotic Length Generalization

E. Proofs for Length Generalization of C-RASP

E.1. Proof of Theorem 5.5

Theorem 5.5 (C-RASP1 Length Generalization). Let F = C-RASP1 and C(f) = max(|a|, |b|, |d|), defined in Definition 5.3. Then T N, we have NAmci(T) O(T 2). That is, the Minimum-Complexity Interpolator, with complexity C and function class F, can length generalize given inputs of length O(T 2) when the ground-truth has complexity T.

Proof. With integer T and integers a, b, d [ T, T], a > 0, recall C-RASP1,T is the set of functions of the form:

fa,b,d(x) = 1[a ps(x) b n d > 0]

By Theorem 3.2, it is sufficient to show that for any tuples of integers (a, b, d), (a , b , d ) [ T, T]3 with a, a > 0, if there exists an x {0, 1} with fa,b,d(x) = fa ,b ,d (x), then there exists an x {0, 1} with |x | O(T 2) such that fa,b,d(x ) = fa ,b ,d (x ).

Suppose there is a string x that distinguishes fa,b,d and fa ,b ,d . We can rewrite f and f as follows.

f(x) := fa,b,d(x) = 1[ps(x) b

f (x) := fa ,b ,d (x) = 1[ps(x) b

Since f, f differ on x, either b

There are 3 categories for the value of a slope b

Type (i): b

a 1 (resp. b

Type (ii): b

a 0 (resp. b

Type (iii): b

a (0, 1) (resp. b

Below we will consider how to distinguish two functions f, f whose slopes b

a are in each of the categories.

b a is type (iii); b

a is either type (iii), (ii), or (i)

For any two 2D lines y = b

a and y = b

a , there is a lattice point of x coordinate x := |aa |(2 max(| d

a |)+ 1) that lies strictly above the line with smaller slope and un-strictly below the line with greater slope. This follows from the fact that when b

a, then the smallest that | b

a| can be is 1 |aa |. Thus, | b

a| |aa |(2 max(| d

a |) + 1 so that | b

a )| 1. Because the vertical gap between the two lines at horizontal coordinate x is at least 1, there is a lattice point (x , y ) Z2 with y (min( b

a ), max( b

ax + d a, b

a )]. In fact, for any x x , one can find such a lattice point with horizontal coordinate x, since the gap between the two lines will continue to grow as x increases for x x , and so the gap will always be at least 1.

We want to pick such a lattice point ( x, y) subject to four constraints:

1. y max( b

2. y > min( b

a ) 3. y x 4. y 0

Non-Asymptotic Length Generalization

a (0, 1), then it suffices to find such a lattice point either between y = b

a and y = x, between y = b

a and y = 0, or between y = b

a and y = b

a . For each of the 3 cases, any lattice point between the two lines specified in that case will satisfy all four constraints.

By the argument in the previous paragraph, for each of the three cases, we can construct such a lattice point with horizontal coordinate x , which will be at most the following, across the three cases.

x max h |aa |(2 max(|d

a |) + 1), |a 1|(2 max(|d

a|, 0) + 1), |a 1|(2 max(|d

a|, 0) + 1) i

2 max(|da , |ad |) + aa

Denote the vertical coordinate of this lattice point as 0 y x. Finally, any string x {0, 1} with |x | = x, consisting of y 0 ones and x y 0 zeros will distinguish f and f . This string will have length at most 3T 2.

a are type (ii)

a , say WLOG that b

a < 0, then after x coordinate at most T + 1, then lines {y = b

a } go below y = 0 and no lattice point of nonnegative y coordinate can be below one but above the other (i.e. both f, f become the all-ones function on strings of length at least T + 1). Thus, any distinguisher of f, f (in particular, the x presumed in the beginning of the proof of this Theorem) must have length at most T, and we take x = x.

a = 0, then in the case d

a < 0, the same argument implies that |x| T.

In the case that d

a 0, then a lattice point which lies between the two lines is (T + 1, 0). Thus, the string x = 0T +1 distinguishes f, f and has length T + 1.

b a is type (ii) and b

a is type (i)

The analysis of this case is similar to the Types (iii) versus (iii), (ii), (i) case. One can take the line of slope 1 and modify just its slope to be between (0, 1), apply the analysis in the Types (iii) versus (iii), (ii), (i) case to attain a lattice point which lies below the original line of slope 1 and above the line of slope 0. Such a lattice point corresponds to a string of length at most 2T which distinguishes f, f .

a are type (i)

Analogous to the argument about Type (ii) versus (ii). There will be a string of length at most T +1 which distinguishes f, f .

a If the slopes are at least 1 or at most 0, an analogous argument as Case 1, Type (ii) versus (ii) or (i) versus (i) will suffice, where it must be the case that the string x which was presumed to distinguish f, f has length at most T + 1.

Otherwise, if both slopes are in (0, 1), then we can take a lattice point (x , y ) which lies on the one of larger intercept. Such lattice points occur periodically with spacing a T, and we need only take one such lattice point where x y 0. We can find one such lattice point where x 3T.

Notation and Conventions. For the remaining sections, we will use lower-case x to denote bit-strings in {0, 1} .

We define [k] := {1, 2, . . . , k 1, k} as the first k positive integers.

We will use symbol f to denote functions in the relevant hypothesis class, like C-RASP or DFAs. These will be mappings from {0, 1} to {0, 1}. We will sometimes refer to functions in C-RASP1 and C-RASP2 as programs, which is synonymous with functions. We will say that two functions are equal if they agree on all strings {0, 1} , and that they are unequal if they are distinguished by at least one string x {0, 1} . We will call such an x a distinguisher of the two functions. To be clear, the word function here has a distinct meaning and type from test-functions.

Non-Asymptotic Length Generalization

We will use X to denote discrete test-functions and Y to denote continuous test-functions. When we say test-function without specifying whether it is discrete or continuous, assume we mean a continuous test-function. We will use (Bi(Y))i [k] (or (Bi)i [k] when clear from context) to represent the activations induced by a continuous test-function (see Definition 6.6). We will use symbol Y to denote a schema of continuous test-functions (see Definition E.5). When we talk about a 2D coordinate system, in the context of test-functions, we will use symbol x to denote the horizontal axis of this 2D coordinate system and y to denote the vertical axis. To be clear, x is distinct from the symbol x, which we use to denote bit-strings.

Regarding geometric objects, denote cl(S) as the closure of a set S. Denote Sc as the complement of set S. For a convex set S, the affine hull of S is the set of linear combinations of points in S. Denote dim(S) as the dimension of the affine hull of S. Denote the interior of a convex set S as int(S) and the relative interior of convex set S as ri(S), which is the set of points in S such that there is some non-zero radius such that the Euclidean ball centered at that point with that radius, intersected with the affine hull of S, is contained in S. Finally, sometimes we will talk about a geometrical set in Rk and its analog in RM, where M > k (what analog means, we won t go into here). Notationally, if we use symbol A Rk for the set in Rk, we will use symbol A(M) to denote a set that s analogous to A, but in RM.

A halfspace of Rd is a subset of Rd which satisfy a linear inequality over the d coordinates of Rd. A polytope is the intersection of a finite number of halfspaces. The polytopes we will be working with will be restricted to [0, 1]d for some dimension d. The faces of a polytope P [0, 1]d refer to the d 1 dimensional polytopes which form the boundary P. Each face is associated with a linear inequality (halfspace) which defines the face in the sense that points on the face satisfy the linear inequality with equality. A d-dimensional simplex is a d-dimensional polytope with d + 1 vertices and d + 1 faces.

E.2. Proof of Theorem 6.1.

We reiterate the definition of C-RASP2 here.

Definition 5.4 (C-RASP2). With integers T and 1 K T 2, let C-RASP2,K,T be the set of programs of the following form. Each program f has parameters 0 < z T, i [K], a(i), b(i), λi { T, . . . , T}, with a(i) > 0. We require that

for all i [K], b(i)

a(i) (0, 1) and is distinct from b(i )

a(i ) for i = i. We also require P

i [K] λi > z.

For any n > 0, on input x {0, 1}n, the first layer computes the values of K heads, {h(i)}i [K], on the n prefixes of x:

{{x1}, {x1, x2}, . . . , {x1, . . . , xn}} as follows: j [n], i [K], h(i) j = 1[ps(x)j > b(i)

a(i) j]. Subscript j indicates the value of a quantity on the jth prefix of x.

The second layer computes the output, which is the nth bit of the final sequence: f(x) = 1[P

i [K] λips(h(i))n > z n].

Then, C-RASP2 = S

1 T,1 K T 2 C-RASP2,K,T . Given a function f C-RASP2, let K(f) be the number of heads, h(i), in the first layer of f and let T(f) := max(maxi [K(f)] |a(i)|, maxi [K(f)] |b(i)|, maxi [K(f)] |λi|, |z|). The complexity measure is C(f) = T(f)K(f), the precision of the function s parameters to the power of the number of heads.

We will prove the following length generalization guarantee.

Theorem 6.1. Let F = C-RASP2 and C(f) = T(f)K(f), as in Definition 5.4. Then α N, NAmci(α) O(αO(1)).

Note that Theorem 6.1 is actually stronger than a result which says that we can learn C-RASP2,K,T with length O(T O(K)). This is because for a fixed T and K, C-RASP2,K,T only contains functions of precision at most T and at most K heads, whereas Theorem 6.1 also provides length generalization guarantees for C-RASP2 functions f with T(f) > T, K(f) < K or T(f) < T, K(f) > K such that T(f)K(f) T K.

Proof of Theorem 6.1. By Theorem 3.2, to upper bound NAmci(α), it suffices to bound:

min{n N : f = f C-RASP2 with C(f), C(f ) α, exists x {0, 1} n s.t. f(x) = f (x)}

Suppose f, f C-RASP2 are not equal and differ on x0 {0, 1} . For any n > 0, suppose f and f have the following 2-layer form on an arbitrary input x {0, 1}n.

Non-Asymptotic Length Generalization

j [n], i [K], h(i) j (x) = 1[ps(x)j > b(i)

f(x) = 1[ X

i [K] λips(h(i)(x))n > z n]

j [n], i [K ], (h(i)) j(x) = 1[ps(x)j > (b(i))

f (x) = 1[ X

i [K ] λ ips((h(i)) (x))n > z n]

Where f has K first-layer heads and integer parameters of precision T and f has K first-layer heads and integer parameters of precision T , where C(f) = T K α and C(f ) = (T )K α, and K T 2 and K (T )2. WLOG, suppose that T T . We will find a short string x {0, 1}O(αO(1)) such that f(x ) = f (x ).

Suppose the set of unique slopes R := { b(i)

a(i) }i [K] { (b(i))

(a(i)) }i [K ] (0, 1) between the first layer of f and f has size k, where max(K, K ) k K + K . We will denote R = {sj}j [k] (0, 1), where s1 is the largest slope and sk is the smallest slope, and the slopes are sorted in descending order so that s1 > . . . > sk. Let ord(1, i) : [K] [k] be the index within R of the ith slope of f, b(i)

a(i) . Let ord(2, i) : [K ] [k] be the index within R of the ith slope of f , (b(i))

(a(i)) . In the following exposition, we will refer to line i as the homogeneous, 2D line y = six, with slope si, i [k].

Definition 5.4 requires that z, z > 0, and that P

i [K] λi > z and P

i [K ] λ i > z . Since f differs from f on discrete test-function given by x0, then the non-trivial Lemma E.37 implies that there exists a continuous test-function Y1 that induces activations (B1(Y1), . . . , Bk(Y1)) which satisfies either Case I or Case II.

i [K] λi Bord(1,i)(Y1) > z and X

i [K ] λ i Bord(2,i)(Y1) < z

i [K] λi Bord(1,i)(Y1) < z and X

i [K ] λ i Bord(2,i)(Y1) > z

For the remainder of the proof, we will suppose Case I is true. The proof for Case II is entirely analogous to what we will present below, since we will not use the direction of the signs of the two halfspaces in the following proof. Denote the two halfspaces induced by the second-layer of f and f by H1 and H2 respectively.

H1 := {B Rk : X

i [K] λi Bord(1,i) > z}

H2 := {B Rk : X

i [K ] λ i Bord(2,i) < z }

By Corollary E.14 (Completeness of Basis Schema), there exists a continuous test-function Y2 of a basis test-function schema, Y , specified in Corollary E.14, such that (B1(Y2), . . . , Bk(Y2)) = (B1(Y1), . . . , Bk(Y1)). Thus,

(B1(Y2), . . . , Bk(Y2)) H1 H2

Non-Asymptotic Length Generalization

Let M be the number of segments in the basis schema Y of Y2. Note that our previous application of Lemma E.37 guarantees that Y is a basis schema of either one or two monotone curves (see Definition E.9), ensuring that

as opposed to the naive bound of M k2 via Corollary E.15. This fact will be useful at the end for achieving a better final bound.

Denote the lengths of the M segments of schema Y as (n1, . . . , n M) [0, 1]M. Let A(M)(Y ) be the set of valid segment lengths (n1, . . . , n M) for a continuous test-functions of schema Y , where a particular setting of (n1, . . . , n M) is valid if it obeys the constraints described by Lemma E.6 for schema Y and P

i [M] ni = 1.

A(M)(Y ) := {(n1, . . . , n M) : valid segment lengths of schema Y and X

i [M] ni = 1} [0, 1]M

Additionally, define A(Y ) as the analogous set to A(M)(Y ) in the space of activations rather than the space of segment lengths.

A(Y ) := {(B1(Y), . . . , Bk(Y)) : Y valid test-function of schema Y } [0, 1]k

There exists a linear map L : RM Rk which maps points in A(M)(Y ) to points in A(Y ). L {0, 1}k M is such that Lij = 1 segment j in schema Y lies above line i (that is, for every x in the domain of segment j, the y-value of the segment at x is at least si x) and hence contributes to the ith activation Bi(Y) of any test-function Y of schema Y . Using L, we can rewrite the inequalities which characterize H1, H2 in terms of (n1, . . . , n M).

H(M) 1 := {(n1, . . . , n M) :

i=1 λi Bord(1,i) > z} = {(n1, . . . , n M) :

j s.t. Lord(1,i),j=1 nj > z}

H(M) 2 := {(n1, . . . , n M) :

i=1 λ i Bord(2,i) < z } = {(n1, . . . , n M) :

j s.t. Lord(2,i),j=1 nj < z }

Since (λi)i [K], z are integers at most T in magnitude and (λ i)i [K ], z are integers at most T in magnitude, the coefficients

of the linear inequality for H(M) 1 (resp. H(M) 2 ), PK i=1 λi P

j s.t. Lord(1,i),j=1 nj > z (resp. PK

j s.t. Lord(2,i),j=1 nj < z ) are integers of at most KT T 3 (resp. K T (T )3) in magnitude, since for each j [M], at most every i [K] (resp. i [K ]) can contribute to the jth coefficient.

Now, we describe a few more properties of A(M)(Y ). Suppose the segment lengths of Y2 in schema Y is (n1(Y2), . . . , n M(Y2)) [0, 1]M. Then, we have:

(n1(Y2), . . . , n M(Y2)) A(M)(Y ) H(M) 1 H(M) 2 =

By Lemma E.16 A(M)(Y ) is a polytope: the intersection of a finite number of halfspaces. It follows that A(M)(Y ) is convex. Moreover, by Lemma E.16, we have dim(A(M)(Y )) = M 1. Because A(M)(Y ) H(M) 1 H(M) 2 = and H(M) 1 , H(M) 2 are open sets of dimension M, then by Lemma E.27, dim(A(M)(Y ) H(M) 1 H(M) 2 ) = dim(A(M)(Y )) = M 1.

We have that cl(A(M)(Y ) H(M) 1 H(M) 2 ) is the intersection of a finite number of halfspaces. Each face of the polytope cl(A(M)(Y ) H(M) 1 H(M) 2 ) is defined by one of these halfspaces. We now discuss the precision of the linear inequalities

Non-Asymptotic Length Generalization

which define the faces of cl(A(M)(Y ) H(M) 1 H(M) 2 ), where precision of an linear inequality with integer coefficients is the maximum magnitude of the integer coefficients, per Definition 5.2. The linear inequalities of the halfspaces which form the faces of A(M)(Y ) are such that there is a subset of at most 6K of them with precision at most T 2, while the remaining faces of A(M)(Y ) have precision at most (T )2. This is due to the following argument. First, because the (k, T)-configuration {si}i [k] is such that there is a subset of at most K of the k elements of {si}i [k] which are precision at most T, while the rest of {si}i [k] are precision at most T . Next, referring to Lemma E.6, the only faces of A(M)(Y ) with T 2 precision are ones that correspond to a segment of Y whose start-point or end-point is on one of the K lines of slope whose precision is T 2. Third, by Corollary E.14, any basis schema of one or two monotone curves will cross each of the k lines at most three times. Since each of the (at most) K slopes of precision T correspond to at most 3 crossing-points of Y with that the line of that slope, and each crossing point is adjacent to at most two segments of Y , then there are at most 2 3K segments of Y such that the start-point or end-point is on a line of slope that is precision T. Thus, at most 6K of the faces of A(M)(Y ) are defined by linear inequalities of precision at most T 2, while the remaining faces of A(M)(Y ) are defined by linear inequalities of precision at most (T )2. Note that the square (i.e. in T 2 and (T )2) comes from the form of the inequalities defining the faces of A(M)(Y ), stated in Lemma E.6. Finally, as argued in an earlier paragraph, the face given by H(M) 1 has precision at most T 3 while that given by H(M) 2 has precision at most (T )3. In summary, there is a subset of at most 7K faces of cl(A(M)(Y ) H(M) 1 H(M) 2 ) such that each face of the subset has precision at most T 3 while the faces not in the subset all have precision at most (T )3. In short, we ve shown that polytope cl(A(M)(Y ) H(M) 1 H(M) 2 ) satisfies the pre-conditions of Corollary E.21, which we will apply in the next step.

Now, we return to the process of converting Y2 into a short distinguisher of f, f . Let V denote the set of vertices of the polytope cl(A(M)(Y ) H(M) 1 H(M) 2 ). Let c [0, 1]M be the average of the vertices in V .

Label the coordinates of c as c := (n(c) 1 , . . . , n(c) M ). c is in the relative interior of cl(A(M)(Y ) H(M) 1 H(M) 2 ) (which is non-empty by Lemma E.25) so that c A(M)(Y ) H(M) 1 H(M) 2 . In particular,

j s.t. Lord(1,i),j=1 n(c) j > z

j s.t. Lord(2,i),j=1 n(c) j < z

We will now lower bound the margin (see Definition 6.10) of c to the faces of cl(A(M)(Y ) H(M) 1 H(M) 2 ), which includes the faces given by H(M) 1 , H(M) 2 . Suppose cl(A(M)(Y ) H(M) 1 H(M) 2 ) has N faces, and let {Li}i [N] denote the linear

inequalities which define each face of cl(A(M)(Y ) H(M) 1 H(M) 2 ), so that Li(c) R is the non-negative margin of c on the ith face. Noting that M 2k 2(K + K ), we apply Corollary E.21 on polytope cl(A(M)(Y ) H(M) 1 H(M) 2 ) to get the following lower bound on the margin of c on the faces of cl(A(M)(Y ) H(M) 1 H(M) 2 ).

i [N], Li(c) Ω( 1

|V | 1 αO(1) )

Then by Lemma E.22, we have |V | 3M 2, so we deduce that:

i [N], Li(c) Ω( 1

M 2 1 αO(1) )

Non-Asymptotic Length Generalization

In particular, the margins, γ1, γ2 of c on the two inequalities defining H(M) 1 and H(M) 2 will be at least:

j s.t. Lord(1,i),j=1 n(c) j z

M 2 1 αO(1) )

j s.t. Lord(2,i),j=1 n(c) j

M 2 1 αO(1) )

Now that we have shown that c has large margin, we need to augment it one final time before converting it into a discrete test-function. This process will find a nearby point c Ball r (c) {P

i [M] ni = 1} := {(n1, . . . , n M) RM :

||(n1, . . . , n M) c|| r} {P

i [M] ni = 1} such that c has both large margin to the faces of cl(A(M)(Y ) H(M) 1

H(M) 2 ) and low-precision coordinates. With ||Li||1 denoting the sum of the magnitudes of the coefficients of the linear inequality which defines the ith face of cl(A(M)(Y ) H(M) 1 H(M) 2 ), let:

γLB := 1 O(M 2αO(1)) γ1, γ2

r := γLB 2 maxi [N] ||Li||1

By linearity of the margin of a point in its coordinates (n1, . . . , n M), each point c Ball r (c) {P

i [M] ni = 1} will

have margin at least γLB r maxi [N] ||Li||1 γLB

2 to each face of cl(A(M)(Y ) H(M) 1 H(M) 2 ). This also means

that every such c is contained in A(M)(Y ) H(M) 1 H(M) 2 as γLB > 0.

By Lemma E.29, there exists a low-precision point c Ball r (c) {P

i [M] ni = 1}, denoted c := (n(c ) 1 , . . . , n(c ) M ),

such that for all i [M] n(c ) i is a rational number, and the least common denominator of all elements in the tuple (n(c ) 1 , . . . , n(c ) M ) is pc , where:

O(M max i [N] ||Li||1 3M 2αO(1) )

O(M 3 (M T 3) αO(1))

Let the tuple c = (n(c ) 1 , . . . , n(c ) M ) be the segment lengths of the continuous test-function Y of schema Y . Note that c has positive margin γLB

2 to all the faces of cl(A(M)(Y ) H(M) 1 H(M) 2 ) and is contained in A(M)(Y ) H(M) 1 H(M) 2 , so c is a valid setting of segment lengths of schema Y , respecting the constraints of Lemma E.6.

Finally, applying Lemma E.24 on the continuous test-function of schema Y with segment lengths c = (n(c ) j )j [M], we deduce that there exists a n0 O(pc αO(1)) so that for any integer multiple n of n0, there exists a discrete test-function X : {0, . . . , n} {0, . . . , n}, of length n, corresponding to string x {0, 1}n of length n, such that

i [k], |Bi(Y ) Bi(X )| T 2 + M

Non-Asymptotic Length Generalization

Where (Bi(Y ))i [k] are the activations induced by Y , a test-function of schema Y with segment lengths (n(c ) i )i [M], and (Bi(X ))i [k] the activations induced by X . Because i, |λi| T, |λ i| T T, then the difference between the margin of (Bi(Y ))i [k] and (Bi(X ))i [k] on H1 and H2 can be bounded by a term proportional to 1

i=1 λi Bord(1,i)(Y )

i=1 λi Bord(1,i)(X )| (2)

(max i [k] |Bi(Y ) Bi(X )|) ( max i [k],|λi| T

i=1 λi) (3)

(T 2 + M)KT

Using an analogous argument, the difference between PK

i=1 λ i Bord(2,i)(Y ) and PK

i=1 λ i Bord(2,i)(X ) is also bounded by an analogous expression.

i=1 λ i Bord(2,i)(Y )

i=1 λ i Bord(2,i)(X )| O(K MT 3

Together, these imply that for sufficiently large n, the difference in the margin of (Bi(X ))i [k] and (Bi(Y ))i [k] on H1 and H2, caused by the discrete test-function approximation, will be smaller than the margin of (Bi(Y ))i [k] on H1 and H2.

The latter equals the margin of c on H(M) 1 and H(M) 2 , by the definition of H(M) 1 and H(M) 2 as the analogous halfspaces to H1 and H2, which is lower bounded by γLB

2 . More precisely,

i=1 λi Bi(X ) z + γLB

i=1 λ i Bi(X ) z γLB

2 + O(K MT 3

To this end, since γLB = Ω( 1 M 2αO(1) ), it suffices for n to be the following value in order for activations (Bi(X ))i [k] induced by X to be in H1 H2 (and therefore to cause f and f to differ, as Y1 and Y do).

Ω( 1 M 2αO(1) ) O(max(K, K )MT 3

n ) > 0 n > O(M 3 max(K, K )T 3αO(1)) (6)

Let x be the string of length n corresponding to X , which is uniquely determined by X . For n = max(1 + O(M 3 max(K, K )T 3αO(1)), O(pc αO(1))), n will be sufficiently large to make the approximation error smaller than the margin of c per Equation 6, and also satisfy the condition required to apply Lemma E.24 in the previous part of this proof. Thus, with this value of n, x will cause f(x ) = 1, f (x ) = 0.

We noted previously in Equation (1) that M 2k 2(K + K ) for the basis schema Y as a result of Lemma E.37. Plugging this in for M, and noting that α max(T K, (T )K ), we conclude that the length of such an x distinguishing f, f need only be at most

Non-Asymptotic Length Generalization

n = max(1 + O(M 3 max(K, K )T 3αO(1)), O(pc αO(1)))

max(1 + O(M 3 max(K, K )T 3αO(1)), O(M 3 (MT 3) αO(1) αO(1))))

E.3. Lemmas for Completeness of Basis Schema

Goal. In the main proof, we will fix two arbitrary unequal f, f C-RASP2 and prove they have a short distinguisher. The goal of this section is to prove that the set of realizable activations A({si}i [k]) equals the union of the set of activations of a small number of basis test-function schema. This culminates in Corollary E.14.

Basic Definitions. Recall the definition of precision.

Definition 5.2 (p-precision). An integer of absolute value at most p is of p-precision. A rational number between [0, 1] is of p-precision if in simplest form, where the numerator and denominator are relatively prime, its denominator is at most p in magnitude. A tuple of rational numbers in [0, 1] is precision p if the least common denominator of its entries is at most p in magnitude.

Note we will say a halfspace defined via a linear inequality with integer coefficients has p-precision if each coefficient is at most p in magnitude.

Suppose f and f have K and K heads, respectively, and consider the set of max(K, K ) k K + K distinct slopes

from the parameters of the first layer of f and f : { b(i)

a(i) }i [K] { (b(i))

(a(i)) }i [K ]. Disregard for now which slope belongs to f or to f , and denote these k slopes as {si}i [k] (0, 1), sorted descending so that s1 is largest and sk is smallest. We will refer to {si}i [k] (0, 1) as a (k, T)-configuration.

Definition 6.4. A (k, T)-configuration is a set of k distinct T-precision rational numbers {si}i [k] (0, 1).

These k slopes {si}i [k] specify k homogeneous, 2D lines, of the form y = si x. Denote these k lines as l1, . . . , lk with line li having slope si for all i [k].

Recall the definition of Discrete Test-Functions.

Definition 6.5 (Discrete Test-Function). Given a (k, T)-configuration {si}i [k], a discrete test-function X, with respect to {si}i [k] and of length n < , is a function {0, 1, . . . , n} {0, 1, . . . , n} where X(0) = 0 and j [n], X(j) = X(j 1) or X(j) = X(j 1) + 1. The induced activations (B1(X), . . . , Bk(X)) of X with respect to the (k, T)-configuration are defined as: i [k], Bi(X) := 1

n Pn j=1 1[X(j) > si j]

For a string x {0, 1} , the discrete test-function induced by x is the set of 2D points {(j, ps(x)j)}j [|x|] where we will associate the y-axis for ps(x)j and the x-axis for j.

Recall two central objects: continuous test-functions and A({si}i [k]).

Definition 6.6 (Continuous Test-Function). Given a (k, T)-configuration {si}i [k], a continuous test-function Y, with respect to {si}i [k], is a 1-Lipschitz, monotone non-decreasing continuous function [0, 1] [0, 1], with Y(0) = 0. Continuous test-functions can only intersect the k lines {li}i [k] of slopes given by {si}i [k] at finitely many points. The induced activations (B1(Y), . . . , Bk(Y)) of Y w.r.t. {si}i [k] are: i [k], Bi(Y) := R 1 0 1[Y(j) > si j]dj.

Definition E.1. (A({si}i [k])) Given (k, T)-configuration {si}i [k], define A({si}i [k]) as the set of activations induced by continuous test-functions with respect to {si}i [k].

A({si}i [k]) := {(B1(Y), . . . , Bk(Y)) : Y continuous test-function w.r.t. {si}i [k]}

Regarding properties of continuous test-functions, first note that the scaling of Y can be set WLOG because of the homogeneity of the k lines. Thus, we let their domain be [0, 1].

Non-Asymptotic Length Generalization

Second, for any continuous test-function Y, we can let the end point of Y be on one of the lines {li}i [k]. Suppose the last line crossed by Y is li. Then we can adjust the segment of Y between its last crossing point at li and its endpoint so that the endpoint is also on line li. We can make this tweak so that no other lines {li}i [k] are crossed and so that (B1(Y), . . . , Bk(Y)) remains unchanged by this tweak. In short, the endpoint of any continuous test-function Y is, WLOG, (1, si) where li is the last line crossed by Y.

Lemma E.2. For any configuration {si}i [k], for any continuous test-function Y w.r.t. {si}i [k], suppose the last line {li}i [k] crossed by Y is li, for i [k]. Then, WLOG, we can let Y s end point be (1, si) without changing the activations induced by Y.

Proof. Suppose that the last line Y crosses is li at the point (x, si x). Then, the portion of Y on the interval [x, 1] is wedged between either the two lines li and li+1 or the two lines li and li 1, since Y will not cross any other line on the interval. The quantity of interest are the activations with respect to the k lines induced by Y:

i [k], Bi(Y) := Z 1

0 1[Y(j) > si j]dj

Suppose that j (x, 1], sij > Y(j) > si+1j. The k quantities {Bi(Y)}i [k] will be unchanged if we adjust the values of Y(j) for j (x, 1] so long as we retain that j (x, 1], sij > Y(j) > si+1j except on a set of measure 0. With Y allowed to be any continuous function with slopes in [0, 1] and with si (0, 1), we can adjust Y(j) to stay between lines li and li+1 but closely follow the line li in the sense that |Y(j) sij| > 0 can be made arbitrarily small at all points j (x, 1), and Y(1) = si (note, this end-point (1, si) violates the condition that sij > Y(j) > si+1j but only at a single point). This ensures that the modified test-function, call it Y , is such that i [k], Bi(Y) = Bi(Y ).

Suppose that j (x, 1], si 1j > Y(j) > sij. Then an analogous adjustment to Y(j) on the interval j (x, 1] can be made so that the endpoint is (1, si).

From now on, assume that each test-function will have starting-point at the origin (0, 0) and have end point on some line li {l1, . . . , lk}, at point (1, si). This will make the following definitions about segments and schema cleaner.

Define the span of a continuous test-function as the set of lines in {li}i [k] which the continuous test-function intersects at some point.

Definition E.3. The span of a test-function is the set of lines {l1, . . . , lk} that the test-function crosses at least once. That is, Y crosses li if there exists x where Y(x) = si x. Note that the span must be a contiguous subset of [k].

We ll also give names to the regions of the positive quadrant of the 2D plane between consecutive lines in {li}i [k].

Definition E.4. (Sectors) Given a (k, T)-configuration {si}i [k], a sector is a region of the 2D space between two consecutive lines. Define Sector1 as the sector strictly above line l1, and Sectork+1 as the sector below line lk. For i {2, . . . , k} define Sectori as the sector below line li 1 and strictly above line li.

Sectors are depicted in Figure 7. We ll now define segments and schema.

Definition E.5. (Segments and Schema) Given any (k, T)-configuration {si}i [k] (0, 1), define the following.

1. A segment is a restricted test-function S : [a, b] [0, 1] where [a, b] [0, 1] which maps a continuous subset [a, b] to [0, 1]. S is 1-Lipschitz and monotone non-decreasing. The segment s start-point (a, S(a)) and the end-point (b, S(b)) each lie on one of the k lines, in the sense that there exists some i, j [k] where S(a) = si a and S(b) = sj b, where i = j or |i j| = 1. No other points (x, S(x)), x (a, b) can lie on a line l1, . . . , lk.

2. A schema Y is a blueprint for a continuous test-function, specifying a sequence of lines {li}i [k] that any test-function of the schema must cross. It consists of an integer 0 < M < and two tuples {idx(i)}i [M] [k]M, {seci}i [M] [k + 1]M, where |idx(i) idx(i + 1)| 1 for all i [M 1]. If |idx(i) idx(i + 1)| = 1, then seci+1 is unique and must be max(idx(i), idx(i + 1)). If idx(i) = idx(i + 1), then seci+1 can be either idx(i + 1) or idx(i + 1) + 1. sec1 can be either idx(1) or idx(1) + 1.

Any continuous test-function of schema Y = ({idx(i)}i [M], {seci}i [M]) consists of exactly M segments S1, S2, . . . , SM whose domains are a partition of [0, 1]. For each i [M], the ith segment Si s end-point lies

Non-Asymptotic Length Generalization

Figure 2. Depiction of a Test-Function consisting of 4 segments. y-axis shows the prefix sum of input string x. x-axis shows the length of the prefix of input string x. The five lines have slopes {s1, . . . , s5}. Sector1 is the portion of the quadrant which is above l1. Sector6 is the portion of the quadrant below l5. For 2 i 5, Sectori is the portion of the quadrant below li 1 and above li.

on lidx(i). For i > 1, Si s start-point lies on line lidx(i 1), and S1 s start-point is the origin, (0, 0). In addition, the ith segment must be contained in Sectorseci.

Notationally, we denote ni [0, 1] as the length of the ith segment Si, but note that different test-functions of the same schema may have different segment lengths {ni}i [M], subject to some constraints we detail below. For a schema Y , denote

A(Y ) := {(B1(Y), . . . , Bk(Y)) : Y continuous test-function of schema Y w.r.t. {si}i [k]}

Figure 2 shows a test-function of a schema with 4 segments. Figure 7 shows four generic types of segments.

We will think of a schema as a list of M segments, for some M > 0. We will denote the length of the ith segment as ni. Let idx( ) be the mapping from [M] to [k] that gives the index of the line that segment i s endpoint is on.

For continuous test-functions, the first segment of length n1 has start-point at the origin, (0, 0). The last segment of length n M has end-point which lies on the line lidx(M). The ith segment has start-point (Pi 1 j=1 nj, sidx(i 1) Pi 1 j=1 nj) on line

lidx(i 1) (as long as i 2) and end-point (Pi j=1 nj, sidx(i) Pi j=1 nj) on line lidx(i) and has length ni. There is freedom in choosing {ni}i [M], so long as they satisfy the following constraints.

Lemma E.6. For any segment in any schema, the constraints given in Definition E.5 exactly characterize the range of values allowed for that segment. For all i [M], these constraints are:

(Segment Starts and Ends on Same Line) If idx(i 1) = idx(i) or i = 1, the only constraint on the segment s length is

Non-Asymptotic Length Generalization

Figure 3. * Start-point on l1, End-point on l2.

Figure 4. * Start-point on l2, End-point on l1.

Figure 5. * Start-point and End-point on l2, and in Sector2.

Figure 6. * Start-point and End-point on l2, and in Sector3.

Figure 7. Four types of Segments, based on which lines their start-point and end-point lie on, and the sector they are in.

(Segment Crosses Down) If idx(i 1) = idx(i) 1 and i 2, then ni ( sidx(i 1)

sidx(i) 1) Pi 1 j=1 nj

(Segment Crosses Up) If idx(i 1) = idx(i) + 1 and i 2, then ni ( 1 sidx(i 1)

1 sidx(i) 1) Pi 1 j=1 nj

Proof. Case: Segment Starts and Ends on Same Line. For segment i with length ni, if idx(i 1) = idx(i), then ni can be any nonnegative number. This is because for each n1, . . . , ni 1 and for all ni 0, there exists a segment of length ni that crosses line idx(i) at the start-point (P

j i 1 nj, sidx(i) P

j i 1 nj) and end-point (P

j i nj, sidx(i) P

j i nj) and nowhere in between. An example of such a segment is one which stays arbitrarily close to the line lidx(i) but doesn t cross it until the end-point (P

j i nj, sidx(i) P

j i nj), which is possible since each line has slope in the range (0, 1) while the test-function can have slopes at each point be in the range [0, 1].

Case: Segment Crosses Down. If idx(i 1) = idx(i) 1, then for any n1, . . . , ni 1, the minimum value of ni in terms of n1, . . . , ni 1 is achieved if the segment has slope of 0 at all points, the minimal segment , which will let the segment cross lines lidx(i 1) and lidx(i) most efficiently. Such a minimal segment will cross lines lidx(i 1) and lidx(i) at start-point (P j i 1 nj, sidx(i 1) P j i 1 nj) and end-point (P j i nj, sidx(i) P j i nj), respectively. The value of ni required for this minimum traversal can be calculated by:

Non-Asymptotic Length Generalization

j=1 nj = sidx(i 1)

= ni = (sidx(i 1)

The first equation follows from the fact that the zero-slope trajectory will enforce that the y-coordinate of the start-point of the segment sidx(i 1) Pi 1 j=1 nj equals that of the end-point sidx(i) Pi j=1 nj. Having ni be any smaller will not suffice to make the crossing, under the trajectory of the minimal segment (where the slope is 0 everywhere) and certainly under any other trajectories.

Now, having ni > ( sidx(i 1)

sidx(i) 1) Pi 1 j=1 nj is always possible, since the segment can always follow the trajectory of the

minimal segment to cross to line idx(i), and then use the extra slack, ni ( sidx(i 1)

sidx(i) 1) Pi 1 j=1 nj > 0 to closely follow line idx(i) until it crosses it at the designated end-point, (P

j i nj, sidx(i) P

j i nj). The latter phase reduces to the case where

idx(i 1) = idx(i). Thus, the allowed values for ni are ni ( sidx(i 1)

sidx(i) 1) Pi 1 j=1 nj.

Case: Segment Crosses Up. The setup of lines and [0, 1]-slope test-functions has a reflection symmetry about the line y = 1

2x. Thus, the argument for the constraint on values of ni for the idx(i 1) = idx(i) + 1 case reduces to that of the case idx(i 1) = idx(i) 1, except with reflected slopes 1 sidx(i 1) and 1 sidx(i). Plugging in these reflected slopes into the constraint for the Cross Down case yields ni ( 1 sidx(i 1)

1 sidx(i) 1) Pi 1 j=1 nj.

Another way to see how to derive the constraint by considering the minimal trajectory as one where the segment has slope 1 everywhere; and arguing that ni can be anything larger than the length required by the minimal trajectory.

(1 sidx(i))ni = (sidx(i) sidx(i 1))

= ni = (1 sidx(i 1)

1 sidx(i) 1)

The first equation above is derived as Object 1 of relative speed of (1 sidx(i)) to Object 2 takes ni time to close the initial gap of (sidx(i) sidx(i 1)) Pi 1 j=1 nj.

Partial Test-functions and Monotone Curves. We ll now define partial test-functions, which is a slight generalization of continuous test-functions where the start-point does not need to be (0, 0), which will be later in the later proofs. Definition E.7. (Partial Continuous Test-Function) Given a (k, T)-configuration {si}i [k], a continuous test-function Y is partial if it is allowed to have a start-point at (n1, si, n1), n1 [0, 1), i [k], instead of (0, 0). A partial test-function is undefined on [0, n1). The induced activations (B1(Y), . . . , Bk(Y)) of partial continuous test-function Y given k slopes {si}i [k] (0, 1) are defined as:

i [k], Bi(Y) := Z 1

n1 1[Y(j) > si j]dj

Schemas of partial test-functions can be thought of as a schema of a continuous test-function. We will still denote the lengths of the segments of the schema as {ni}i [M] for some M > 1, except that the test-function is undefined on the first segment s domain [0, n1).

Note that by a re-scaling argument, due to the homogeneity of the k lines {li}i [k], this definition also captures continuous test-functions where the start-point does not need to be at (0, 0) and the end-point need not have x-coordinate of 1.

We ll define type (I, k) and (II, k) partial test-functions, which are particular partial test-functions whose start-point is on a line whose index is either the smallest or largest element in the span of the test-function.

Non-Asymptotic Length Generalization

Definition E.8. (Partial test-functions of type (I, k), (II, k)) Given a (k, T)-configuration {si}i [k],

Define a type (I, k) partial test-function as a test-function [0, 1] [0, 1] that is undefined on [0, n1) for some 0 n1 < 1. It may span any consecutive subset of lines {a, . . . , b} {1, 2, . . . , k 1, k}, a b. With this span, we require that its start-point is at (n1, san1), that it is 1-Lipschitz and monotone non-decreasing, and that it can only intersect the k lines at finitely many points.

Define a type (II, k) partial test-function similarly as a test-function [0, 1] [0, 1] undefined on [0, n1). It may span any consecutive subset of lines {a, . . . , b} {1, 2, . . . , k 1, k}, a b. With this span, we require that its start-point is (n1, sbn1), that it is 1-Lipschitz and monotone non-decreasing, and that it can only intersect the k lines at finitely many points.

Note that for both type (I, k) partial test-functions, if the start-point is on la with a > 1, then its span cannot contain l1 and it cannot intersect l1 at any point. An analogous observation holds for type (II, k) partial test-functions.

We ll describe one primitive (partial test-function) schema that will be important for our basis test-function schema later: monotone curves.

Definition E.9. (Monotone Curve) Given a (k, T)-configuration {si}i [k], we define two schemas: Monnotone Up Curve and Monotone Down Curve. Both schema have k + 2 segments.

(Monotone Down) Begins at line 1 at (n1, s1 n1), goes above and recrosses line 1 at (n1 + n2, s1 [n1 + n2]), crosses each intermediate line {2, . . . , k 1} once, then crosses line k twice at (Pk+1 i=1 ni, sk Pk+1 i=1 ni) and (Pk+2 i=1 ni, sk Pk+2 i=1 ni).

(Monotone Up) Begins at line k at (n1, sk n1), goes below and recrosses line k at (n1 + n2, sk [n1 + n2]), crosses each intermediate line once, then crosses line 1 twice at (Pk+1 i=1 ni, s1 Pk+1 i=1 ni) and (Pk+2 i=1 ni, s1 Pk+2 i=1 ni).

Note n1 [0, 1) denotes the length of the first, empty segment on which the test-function is undefined; there are k + 1 nonempty segments. WLOG by homogeneity of the k lines that Pk+2 i=2 ni = 1.

Remark E.10. Test-functions of the monotone curve schema are monotone in the sense that they don t re-cross lines which they previously crossed, except for line 1 and line k. Remark E.11. Type (I, k), (II, k) test-functions are strictly more general than test-function since we can just set n1 = 0 to recover the usual test-function definition. Type (I, k) test-functions start on the top-most line while (II, k) test-functions start on the bottom-most line. Monotone Down Curves are Type (I, k) while Monotone Up Curves are Type (II, k)

Rearrangement Lemma for Monotone Curve.

Definition E.12. (Equivalence) Define two partial test-functions as equivalent if both test-functions have the same start-point, end-point, length, and induced activations over the k lines.

We will now introduce a central Lemma in proving that a certain set of schema is complete. This Lemma is about rearranging a test-function into an equivalent test-function with a simpler schema.

Lemma E.13. Given a (k, T)-configuration {si}i [k], suppose a type (I, k) test-function (resp. a type (II, k)) has its start point on line 1 and end point on line k (resp. start point on line k and end point on line 1) and spans lines {1, . . . , k}. Then there is an equivalent, monotone curve of the same length, start point, end point, and that induces the same (B1, . . . , Bk).

Proof. We will only prove the conversion of a type (I, k) test-function to a monotone (down) curve. The conversion of a type (II, k) test-function to a monotone (up) curve is an analogous argument with effective slopes s i = 1 si, i [k].

Suppose the type (I, k) test-function, Y, has M segments of length n1, . . . , n M, where n1 is the x-coordinate of the starting point of Y. Consider a monotone line Y(M) that, like Y, starts at (n1, s1 n1) and ends at (PM i=1 ni, sk PM i=1 ni). Y(M) is comprised of k + 2 segments of length n 1, n 2, . . . , n k, n k+1, n k+2 (where n 1 is an empty segment, denoting the coordinate of the start point), calculated from n1, . . . , n M as follows.

Non-Asymptotic Length Generalization

i [k + 1], n i+1 = X

j {2,3,...,M}:nj Sectori nj

First, we must show that this rearrangement forms a valid test-function that meets the constraints that would be enforced on n 1, n 2, . . . , n k, n k+1, n k+2 described in Lemma E.6.

First, the only constraint on n 2 and n k+2 is that they must be nonnegative.

Second, i [3, k + 1], the ith segment traverses sector i 1 and must cross from li 2 to li 1. By Lemma E.6, the following constraint is satisfied iff this crossing is possible:

We would like to show that (n i)i [k+2] meets these constraints. First, i [3, k +1], define ji := max{j {2, 3, . . . , M} : nj crosses Sectori 1} as the index of the last segment where Y crosses Sectori 1, starting on line li 2 and ending on line li 1. Then because the curve Y is continuous and its endpoint is on line k, then ji is monotone in i: j3 < j4 < . . . < jk+1. Then,

j {2,3,...,M}:nj Sectori 1 nj

nji just the last segment crossing Sectori 1 suffices

j=1 nj monotonicity of ji

j ji 1 s.t. nj Si 2 m=1 Sectorm nj

Thus, the monotone curve Y(M) is a valid monotone test-function. Y(M) has the same length as Y since it just rearranged the segments while preserving their length. Y(M) starts at line 1 and ends at line k, and it starts at (n1, s1n1) just like Y, so it must also end at the same point as Y(M) on line k. Y(M) induces the same activations (B1, . . . , Bk) as Y since the rearranged segment lengths stay in their original sectors in Y(M).

The proof for type (II, k) test-function is analogous.

Basis Schema. The following is the main result of this section. It says that the following finite set of basis schema is complete, in the sense that any continuous test-function is equivalent to some test-function whose schema is one of the basis schema.

Each basis schema described below is indexed by integer m [k] and a set of m tuples {(yi 1, yi 2)}i [m] {1, . . . , k}m. These m tuples parameterize m 1 monotone-curves that, when concatenated, yield the schema. For i [m 1], the ith tuple (yi 1, yi 2) [k]2 indicates that the ith monotone curve in the schema will have start-point on line yi 1 and end-point on line yi 2. The concatenation of all m 1 monotone curves yield the basis schema.

Non-Asymptotic Length Generalization

Corollary E.14. (Completeness of Basis Schema) Given a (k, T)-configuration {si}i [k], for any 1 m k, say that the list of tuples {(yi 1, yi 2)}i [m] {1, . . . , k}m is valid if they satisfy the following.

ym 1 = ym 2 i [m 1], yi 1 = yi 2 i [m 2], yi 1 < yi 2 = yi 2 = yi+1 1 > yi+1 2 > yi 1 i [m 2], yi 1 > yi 2 = yi 2 = yi+1 1 < yi+1 2 < yi 1 (y1 1, y1 2) = (1, k) or (y1 1, y1 2) = (k, 1)

For any m [k] and valid {(yi 1, yi 2)}i [m], define the basis schema, Y{(yi 1,yi 2)}i [m] as the concatenation of m 1 monotone curves, where for i [m 1], the ith monotone curve has start-point on line yi 1 [k] and end-point on line yi 2 [k]. The 1st monotone curve has start-point at the origin, (0, 0).

Then, the set of basis schemas over all m and valid {(yi 1, yi 2)}i [m] satisfying the above is complete in the following sense.

A({si}i [k]) = [

m [k],valid {(yi 1,yi 2)}i [m] A(Y{(yi 1,yi 2)}i [m])

Where A(Y{(yi 1,yi 2)}i [m]) [0, 1]k denotes the set of (B1, . . . , Bk) induced by any test-function of schema Y{(yi 1,yi 2)}i [m]. Finally, note that the number of basis schema with respect to {si}i [k] is Nk = 2k 1.

Refer to Figure 8 to get a sense for what the basis schema look like. Note that the y axis of the figure shows the normalized prefix sum. Also, Figure 9 is a depiction of the Completeness result for the case where k = 2.

Proof of Corollary E.14. Recall that A({si}i [k]) and A(Y{(yi 1,yi 2)}i [m]) for schema Y{(yi 1,yi 2)}i [m] are defined follows.

A(Y{(yi 1,yi 2)}i [m]) := {(B1(Y), . . . , Bk(Y)) : Y continuous test-function of schema Y{(yi 1,yi 2)}i [m] w.r.t. {si}i [k]}

A({si}i [k]) := {(B1(Y), . . . , Bk(Y)) : Y continuous test-function w.r.t. {si}i [k]}

First, because every test-function of a basis schema is a test-function,

A({si}i [k]) [

m [k],valid {(yi 1,yi 2)}i [m] A(Y{(yi 1,yi 2)}i [m])

It suffices to show that the converse inclusion holds.

A({si}i [k]) [

m [k],valid {(yi 1,yi 2)}i [m] A(Y{(yi 1,yi 2)}i [m])

To do this, we will prove that any arbitrary continuous test-function can be converted into an equivalent test-function (in the sense of Definition E.12), but which follows one of the basis schema. This conversion process is done via Algorithm 5, which partitions the input test-function into pieces, and then uses Lemma E.13 to convert each piece into a monotone curve, yielding a test-function of a basis schema.

More precisely, given an arbitrary test-function Y, the main idea is to partition Y into pieces, such that each piece is either a partial Type (I, β) or Type (II, β) test-function for some 1 β k. Each piece can then be rearranged into a monotone curve via an application of Lemma E.14. We use Algorithm 5 to attain {(yα 1 , yα 2 )}α [m 1] which characterize the appropriate basis-schema for which there exists an equivalent test-function to Y.

Recall that the span of a partial test-function is the set of lines {l1, . . . , lk} that it crosses at some point in its domain (where the lines are indexed from 1, . . . , k). The span of a test-function is a continuous subset of [k] since we assume that

Non-Asymptotic Length Generalization

Figure 8. Depiction of a Basis Schema consisting of 4 monotone curves. y-axis shows the normalized prefix sum of input string x. x-axis shows the length of the prefix of input string x. The five horizontal lines correspond to five lines with slopes {s1, . . . , s5}. Curve corresponds to Basis Schema with m = 5, (y1 1, y1 2) = (1, 5), (y2 1, y2 2) = (5, 2), (y3 1, y3 2) = (2, 4), (y4 1, y4 2) = (4, 3). Finally, note that each monotone curve consists of multiple segments. The first one has 5 segments, the second one has 4 segments, the third one has 3 segments, and the fourth one has 3 segments.

li has the highest slope si and {si}i [k] is sorted descending. Regarding notation in Algorithm 5, Spani is a tuple of two integers in [k]2, which represent the smallest and largest index of lines in the span of some partial continuous test-function. min(Spani) is the smaller element in the tuple, and max(Spani) is the larger element. {si}i [k] are the k slopes we fixed at the beginning, and Y(j) = sij indicates that test-function Y intersects the line y = six at (j, sij). Also, when we say that Y intersects the line li on [t, 1] to mean that there is some j [t, 1] where Y(j) = sij, for t 1.

Towards proving completeness of the basis schema, we will prove the following three claims. At the end, we will use these claims to argue that A({si}i [k]) S

m [k],valid {(yi 1,yi 2)}i [m] A(Y{(yi 1,yi 2)}i [m]).

1. Algorithm 5 terminates.

2. Algorithm 5 returns valid YPairs := {(yi 1, yi 2)i [m 1]} where m := |YPairs| + 1 with m k, and where the valid predicate is defined in the statement of Corollary E.14.

3. Max TB := {Ti}0 i m 1 [0, 1] is a set of real numbers where for each i {0, 1, , . . . , m 2}, Y restricted to the interval [Ti, Ti+1] can be rearranged into an equivalent monotone curve by Lemma E.13.

Proving Termination. At each iteration α, Algorithm 5 maintains variables Spanα, tα 1, and bα 1, which we claim satisfy the property that Spanα holds the smallest and largest index of lines {1, . . . , k} which Y spans when Y is restricted to interval [max(tα 1, bα 1), 1].

Non-Asymptotic Length Generalization

Figure 9. Depiction of A({si}i [k]) for k = 2 with two slopes s1 > s2. B1 is on the horizontal axis while B2 is on the vertical axis. When k = 2, there are only two basis schema: a single monotone (up) curve and a single monotone (down) curve. The dark blue triangle with vertices {(0, 0), (0, 1), ( s2

s1 , 1)} is the set of activations induced by test-functions of the monotone (down) curve basis schema. The light blue triangle with vertices {(1, 1), (0, 1), (0, s1 s2

1 s2 )} is the set of activations induced by test-functions of the monotone (up) curve. The completeness result says that the union of these two triangles equals A({s1, s2}).

First, suppose the span of Y is a single line (recall that the span of Y is always at least one line, since by Lemma E.2, we let the endpoint of any test-function be on some line). Then, Algorithm 5 initializes Span1 to be the span of Y, when restricted to interval [0, 1]. Algorithm 5 computes y1 1, y1 2, then terminates. Our claim holds.

For Y that span 2 or more lines, we will prove via induction that for each iteration α, Spanα, tα 1, and bα 1 satisfy our claim. As a base case, Span1 = Span(Y) satisfies our claim. As the inductive step, suppose Spanα holds the smallest and largest index of lines {1, . . . , k} which Y spans, when restricted to interval [max(tα 1, bα 1), 1]. Then, in iteration α, the variables tα and bα exist (recall that a continuous test-function only intersects the lines at finitely many points, so the max is well defined). We can never have that tα = bα as long as min(Spanα 1) < max(Spanα 1), which must be true at the start of iteration α, otherwise the algorithm would have terminated in iteration α 1. Thus, either tα > bα or tα < bα. If tα > bα, then the last point where Y intersects line min(Spanα) is later than the last point where Y intersects line max(Spanα). On the interval [max(tα, bα), 1], Y never intersects line max(Spanα) again. Thus, the smallest index of any line which Y intersects on the interval [max(tα, bα), 1] is min(Spanα). The largest index of any line which Y intersects on the interval [max(tα, bα), 1] must be less than max(Spanα) as tα > bα. Thus, the largest index line which Y intersects on the interval [max(tα, bα), 1] is given by the expression, max{i < max(Spanα) : i [k], Y intersects line li on [max(tα, bα), 1]}. This shows that line 9 correctly sets Spanα+1, completing the induction. A similar correctness argument can be made for the case where tα < bα. This proves that for all α, Spanα correctly holds the smallest and largest index of lines {1, . . . , k} which Y spans when it is restricted to interval [max(tα 1, bα 1), 1].

Non-Asymptotic Length Generalization

Algorithm 5 Decompose-Into-Monotone-Curves

1: Initialize: Span1 Span(Y), t0 0, b0 0 2: YPairs [ ] % List of all (yα 1 , yα 2 ) pairs 3: Max TB [0 ] % List of all max(tα, bα) values 4: for α {1, 2, . . . , k 1} do

5: tα max n j max(tα 1, bα 1) : Y(j) = smin(Spanα) j o

6: bα max n j max(tα 1, bα 1) : Y(j) = smax(Spanα) j o

7: if tα > bα then 8: yα 1 max(Spanα) and yα 2 min(Spanα)

9: Spanα+1 min(Spanα), max{ i < max(Spanα) : i [k], Y intersects line li on [max(tα, bα), 1]}

10: else 11: yα 1 min(Spanα) and yα 2 max(Spanα)

12: Spanα+1 min{ i > min(Spanα) : i [k], Y intersects line li on [max(tα, bα), 1]}, max(Spanα)

13: end if 14: YPairs YPairs {(yα 1 , yα 2 )} 15: Max TB Max TB {max(tα, bα)} 16: if max(Spanα) = min(Spanα) then 17: break 18: end if 19: end for 20: return (YPairs, Max TB)

For all α, we have that max(Spanα) min(Spanα) max(Spanα 1) min(Spanα 1) 1. Thus, the Algorithm terminates after at most |Span(Y)| 1 k 1 iterations since the largest that |Span(Y)| can be is k, when Y spans all k lines (by |Span(Y)|, we mean the number of lines which Y spans). In particular, if we let m := |YPairs| + 1 returned by Algorithm 5, then m |Span(Y)| k.

Proving Validity of YPairs. Towards the second claim, first note that if tα > bα in iteration α of Algorithm 5 and the minimum and maximum of Spanα+1 are not equal, then in the next iteration α + 1, it will be the case that bα+1 > tα+1. This is because the last point where Y crosses line min(Spanα) has x-coordinate tα, by line 5 in iteration α. However, by lines 8 and 9 in iteration α, it must also be true that in the next iteration α + 1, tα+1, defined in line 5 of iteration α + 1, equals max(tα, bα) = tα. Since we cannot have tα+1 = bα+1 (argued previously) and tα+1, bα+1 are both at least max(tα, bα), we have that bα+1 > tα+1 = tα. By lines 11 and 12 in iteration α + 1, Algorithm 5 will set yα 1 > yα+1 2 > yα+1 1 = yα 2 . Analogously, if tα < bα and Spanα+1 has a distinct minimum and maximum element, then bα+1 < tα+1, and yα 1 < yα+1 2 < yα+1 1 = yα 2 . This proves that the algorithm returns YPairs which is a set of at most m 1 |Span(Y)| 1 k 1 pairs {(yα 1 , yα 2 )}α [m 1] where each pair has a distinct minimum and maximum and satisfies the properties in Corollary E.14, except potentially for the property that (y1 1, y1 2) = (1, k) or (y1 1, y1 2) = (k, 1).

The last issue about ensuring YPairs satisfies the property that (y1 1, y1 2) = (1, k) or (y1 1, y1 2) = (k, 1) is simple to deal with. The issue arises when Span(Y) [k], so that Y does not span all k lines. However, we can note that any test-function which does not span all k lines can be thought of as part of a schema which does span all k lines, except that the segments of the schema which cross lines in [k] Span(Y) are the first segments of the schema, and their lengths are set equal to 0 for the particular case of Y (which is a valid setting of the lengths of the first segments, per Lemma E.6). Thus, though Y does not span k lines, it can be thought of as belonging to a schema which does. We will discuss this more at the end.

Converting Each Part into a Monotone Curve, to Convert Y into a Continuous Test-function of a Basis Schema. Third, with Max TB := {Ti}0 i m 1, for each α [m 1], the pair (yα 1 , yα 2 ) generated by an iteration of Algorithm 5 corresponds to the interval [Tα 1, Tα], in the sense that

(Tα 1, Y(Tα 1)) = (Tα 1, Tα 1 syα 1 )

(Tα, Y(Tα)) = (Tα, Tα syα 2 )

Non-Asymptotic Length Generalization

We have already argued that the span of the partial test-function given by restricting Y to [Tα, 1] is given exactly by Spanα, where either yα 1 = min(Spanα) and yα 2 = max(Spanα); or yα 1 = max(Spanα) and yα 2 = min(Spanα). Thus, the restriction of Y on interval [Tα 1, Tα] is a Type (I, max(Spanα) min(Spanα) + 1) or Type (II, max(Spanα) min(Spanα) + 1) partial test-function, for which Lemma E.13 applies.

By Lemma E.13, the restriction of Y on interval [Tα 1, Tα] is equivalent to a monotone curve of span Spanα, which has start-point (Tα 1, syα 1 Tα 1) and end-point (Tα, syα 2 Tα). The monotone curve will have the same start-point, end-point, length, and induced activations as Y restricted1 to [Tα 1, Tα]. Thus, we can construct such a monotone curve for each α [m 1] and concatenate these monotone curves together. The resulting test-function is equivalent to Y, as each individual monotone curve is equivalent to the corresponding restriction of Y.

We claim the resulting test-function is of one of the basis schema described in Corollary E.14. This essentially follows from the validity of YPairs outputted by Algorithm 5, which we justified in the previous section. To reiterate a key point, if Span(Y) = [k], then YPairs satisfies the conditions of Corollary E.14 exactly. On the other hand, if Span(Y) [k] is a strict (continuous) subset of [k], then still we can view the concatenation of monotone curves above as a test-function of a basis schema in Corollary E.14. To see this, we observe that for any such Y, there is a basis schema Y such that if we set the lengths of the segments of the monotone curves in schema Y whose span is a strict superset of Span(Y) to 0, then the remaining monotone curves are given by YPairs outputted by Algorithm 5 on input Y. This basis schema Y would be given as follows. Given YPairs outputted by Algorithm 5 on input Y, we pre-pend either one or two pairs to YPairs. If min(Span(Y)) = 1 or max(Span(Y)) = k, we pre-pend a single pair to YPairs: (k, 1) in the first case and (1, k) in the second case. If min(Span(Y)) > 1 and max(Span(Y)) < k, we pre-pend two pairs (1, k), (k, min(Span(Y))) to YPairs. The resulting list of pairs with one or two pairs pre-pended, which we call L, will have at most k 1 pairs total. L will satisfy all requirements in Corollary E.14. As we argued that we can simply set the lengths of the monotone curves corresponding to the pre-pended pairs in L to 0, the final test-function we attained by concatenating all the monotone curves together in the previous paragraph is of the basis schema corresponding to the list of pairs L.

Thus, the concatenation of monotone curves described above, based on the output by Algorithm 5, is of some basis schema specified in Corollary E.14. This proves that any continuous test-function Y is equivalent (in particular, induces the same activations) to some test-function of a basis schema specified in Corollary E.14. This implies that A({si}i [k]) S

m [k],valid {(yi 1,yi 2)}i [m] A(Y{(yi 1,yi 2)}i [m]), as desired.

Finally, regarding the number of basis schemas, we can first partition up the schema based on which of the k lines the end-point is on. Let f(k, i) be the number of basis schema w.r.t. a configuration of k slopes whose end-point is on line i [k]. We claim that f(k, i) = k 1 i 1 , from which it follows that the total number of basis schema is P

i [k] k 1 i 1 = 2k 1.

For i [k], to see that f(k, i) = k 1 i 1 , let Ii := [k + 1] {i, i + 1}. Consider all permutations σ(Ii) which contain {1, 2, . . . , i 1} and {k + 1, . . . , i + 2} as a subsequence. We claim there is a one-to-one correspondence between any such σ(Ii) and a basis schema Y{(yi 1,yi 2)}i [m]. Given Y{(yi 1,yi 2)}i [m], let g({(yi 1, yi 2)}i [m]) be a permutation of Ii where the ordering of element j Ii in the permutation is the relative ordering of sector j based on the last time where schema Y{(yi 1,yi 2)}i [m] passed through sector j. The mapping is injective because for any two {(yi 1, yi 2)}i [m] = {((yi 1) , (yi 2) )}i [m ], the first pair such that (yi 1, yi 2) = ((yi 1) , (yi 2) ) will be such that yi 2 = (yi 2) , and one of these schema will have visited Sectoryi 2 or Sector(yi 2) for the last time while the other will return to it later. The surjectivity of the mapping can be checked easily.

There are i 1 elements in {1, 2, . . . , i 1} and k 1 elements in Ii, so there are k 1 i 1 such permutations.

Corollary E.15. Given a (k, T)-configuration {si}i [k], each basis test-function schema specified in Corollary E.14 crosses the k lines of slopes {si}i [k] a total of at most k2 times.

Proof. By Corollary E.14, each basis schema consists of at most k monotone curves. The ith monotone curve has a span at most the span of that of the previous monotone curve minus 1, so since the first monotone curve spans k lines, there will be at most k monotone curves. Each monotone curve intersects the k lines each once, except for the two lines at the top and bottom of its span. However, these can be reduced to one intersection when you concatenate alternating monotone lines

1Here we are applying Lemma E.13 on restrictions of test-functions where the end-point is not at 1, but we can do this due to the homogeneity of the setup

Non-Asymptotic Length Generalization

together as such.

Thus, the number of times the test-functions of any basis schema crosses the k lines is Pk i=1 i = k(k+1)

Convexity of Schema.

Lemma E.16. (Convexity of activation set of test-functions of a schema) Given a (k, T)-configuration, {si}i [k], consider any schema continuous test-functions. Suppose the schema specifies M segments. Let A(M) be the set of valid (n1, . . . , n M) of segment lengths with PM i=1 ni = 1, where a valid setting of (n1, . . . , n M) satisfies the constraints described in Lemma E.6. Then A(M) is a convex set of dimension M 1. Moreover, A(M) is a simplex.

Proof. A(M) is the intersection of the linear subspace over (n1, . . . , n M) given by PM i=1 ni = 1 along with M halfspaces of the form in Lemma E.6:

i [M], ni pi

Each pi, qi will be determined from the slopes of the lines that segment i first intersects and last intersects. By Lemma E.6, there are three cases:

Case 1: Segment i Crosses From Line j to Line j + 1. Then, ni ( sj

sj+1 1) Pi 1 j=1 nj.

Case 2: Segment i Crosses From Line j + 1 to Line j. Then, ni ( 1 sj+1

1 sj 1) Pi 1 j=1 nj.

Case 3: Segment i Crosses From Line j to Line j. Then, pi

First, since A(M) is the intersection of convex sets, it is convex.

Second, regarding dimension of A(M), since all elements {si}i [k] of the configuration are distinct and in (0, 1), each of the M halfspaces acts on a different subset of the variables {ni}i [M]:

{n1}, {n1, n2}, {n1, n2, n3}, . . . , {n1, . . . , n M 1}, {n1, . . . , n M}

In particular, no combination of inequality constraint can form an equality constraint which implies that the intersection of the M halfspaces has dimension M. Thus, A(M) has dimension M 1 due to the additional constraint PM i=1 ni = 1.

Third, A(M) can be thought of as a polytope over M 1 variables, once we substitute in n M = 1 PM 1 i=1 ni, while it has exactly M faces given by the M halfspaces, with n M = 1 PM 1 i=1 ni substituted into the inequalities. The M faces of A(M) remain distinct even after the substitution since the only face whose linear inequality included n M will have a bias term of 1 after the substitution, while the other faces linear inequality remain homogeneous. Since A(M) has M faces and has dimension M 1, it is a simplex.

E.4. Lemmas for Margin of Point in a Polytope.

Average of Vertices of a Polytope. As a motivation for the following definition, recall that the centroid of a d-dimensional simplex is the average of its d + 1 vertices. Informally speaking, the centroid has the nice property that it is far from each face of the simplex. This property is useful for our proof, though we will need to extend this average of vertices notion to general convex polytopes.

Definition E.17. (Average of Vertices) Given a convex polytope P of N vertices v1, . . . , v N, we will define the average-ofvertices of P as c = 1

i [N] vi P.

Note that for a general convex polytope P, the average of its vertices will not, in general, be its centroid.

Margin of Average of Vertices of Polytope with poly(T) poly(K) Precision Faces.

Non-Asymptotic Length Generalization

Definition 6.10 (Margin). Given a linear inequality L over M variables and a point x RM, define L(x) as the difference between the left-hand-side and right-hand-side of the inequality when the coordinates of x are plugged into L. Let L(x) = 0 x satisfies the inequalities with equality. We say L(x) is the margin of x for L.

Often, we will mention the notion of margin in the context of a polytope P, where we are interested in the margin of a point x P onto a face of P. For any face F of P, F will be defined as the boundary of the halfspace given by some linear inequality L. In this case, WLOG, we will define the margin of any point x in P to F to be L(x). Moreover, we will define L so that L(x) is non-negative for x P. As such, we will sometimes refer to the margin of a point x P onto a face of P as the positive margin to emphasize this point.

We start with the following Lemma.

Lemma E.18. Consider a nonempty (M 1)-dimensional convex polytope P RM 1. Suppose the faces of P consists of N halfspaces over variables {ni}i [M 1], where each halfspace is given by a linear inequality with integer coefficients of magnitude at most p. For j [N], define Lj as the linear inequality for the jth face.

Then, for any j [N], for any vertex x of P which does not lie on the jth face of P, then the positive margin Lj(x) is lower bounded as follows.

Proof. For each j [N], represent the linear inequality Lj as a vector in vj [ p, p]M such that x RM 1, v j

is the margin of x on the constraint given by Lj. In addition, in our definition of the vector representations vj, we want to

ensure that the vectors vj are such that j [N], v j

> 0 for all vertices x which do not lie on the jth face of P.

This is always possible due to the convexity of the polytope, which is contained in the intersection of the halfspaces defined by each of its faces.

As an example of vector representations of inequalities,

n1 + 2n2 + 3n3 > 4 (1, 2, 3, 4)

2n1 3n2 + 10n3 9 ( 2, 3, 10, 9)

WLOG, we will prove the statement for the 1st face of P. Pick any vertex x which does not lie on the 1st face of P (i.e. |L1(x)| > 0). x is the intersection of M 1 distinct faces of P, whose indices we denote as {ji}i [M 1] [N] {1}. Let A [ p, p]M M be a matrix such that for all i [M 1], the ith row of A is vji the vector representation for Lji. Let the Mth row be v1, the vector representation of L1. Write A in block form as:

A = A b c d

where A [ p, p](M 1) (M 1) and b, c [ p, p]M 1 and d [ p, p], so that c d

is the vector representation of L1.

We have that i [M 1], Lji(x) = (Ai) x 1

, where Ai denotes the ith row of A. Since x lies on all the faces whose

indices are in {ji}i [M 1],

Non-Asymptotic Length Generalization

i [M 1], Lji(x) = 0

L1(x) = (AM) A 1b 1

Next, we note that

|A| = | A b 0 d c A 1b

= |A|(d c A 1b)

L1(x) = ||A|

where the margin is strictly positive since x1 does not lie on face 1. Its margin has a minimum value of | 1

the numerator |A| is an integer while the denominator is upper bounded by (

Mp)M. The upper bound on the determinant of A is by the Geometric-Mean-Quadratic-Mean inequality on the eigenvalues of A, where |A| [ 1 M 1||A||2 F ] M 1

Finally, this lower bound holds for any other face of P and any vertex of P which does not lie on that face by an analogous argument.

We now provide a Lemma which can improve the margin lower bound in the case that we know some additional information about the aforementioned matrix A defined in the setting of Lemma E.18.

Lemma E.19. Suppose (T, K), (T , K ) N2 are such that K T 2 and K (T )2. WLOG, suppose T T . Suppose M N such that max(K, K ) M 2(K + K ), and suppose constants c, d = O(1). Suppose matrix A ZM 1 M 1

1. 1 i min(c K, M 1), Ai { O(T d), . . . , O(T d)}M 1

2. min(c K, M 1) i M 1, Ai { O((T )d), . . . , O((T )d)}M 1

Then |A| O(T O(K) (T )O(K )).

Proof. If T, T 2, by the homogeneity of the determinant, we can factor out a factor of O(T d) from each of the first min(c K, M 1) rows of A and a factor of O((T )d) from the remaining rows of A.

|A| (O(T d))min(c K,M 1) (O((T )d))M 1 min(c K,M 1)| A|

Where A is an (M 1) (M 1) matrix such that the largest magnitude of any entry in A is 1. Since T, T 2, then there is a universal constant 1 < ϵ = O(1) such that:

Non-Asymptotic Length Generalization

|A| (T ϵ d)min(c K,M 1) ((T )ϵ d)M 1 min(c K,M 1)| A|

By the same Quadratic-Mean-Geometric-Mean argument in Lemma E.18,

| A| 1 M 1|| A||2 F M 1

Suppose that K K . Then M 4K so that (T ϵd)min(c K,M 1) ((T )ϵd)M 1 min(c K,M 1) T ϵcd K (T )4ϵd K . Meanwhile, since M 4K and K (T )2, then M M

2 (4(T )2)2K , so that |A| (T ϵd)min(c K,M 1) ((T )ϵd)M 1 min(c K,M 1)M M

2 T ϵcd K (T )4ϵd K (4(T )2)2K O(T O(K) (T )O(K )).

Now suppose that K K . Then M 4K so that (T ϵd)min(c K,M 1) ((T )ϵd)M 1 min(c K,M 1) T ϵcd K (T )4ϵd K T ϵcd K T 4ϵd K. Meanwhile, since M 4K and K T 2, then M M

2 (4T 2)2K, so that |A| (T ϵd)min(c K,M 1) ((T )ϵd)M 1 min(c K,M 1)M M

2 T ϵcd K T 4ϵd K (4T 2)2K O(T O(K) (T )O(K )).

In the case T = 1 and T = 1, then M = O(1), so |A| O(1). In the case T 2 but T = 1, then K = (T )2 = 1, and following a similar argument as the case where T, T 2 and K K yields the desired bound.

As a consequence, we have the following finer-grained version of Lemma E.18.

Lemma E.20. Suppose α N, (T, K), (T , K ) N2 are such that T K α and (T )K α and K T 2 and K (T )2. WLOG, suppose T T . Suppose M N such that max(K, K ) M 2(K + K ). Suppose d1, d2 = O(1).

Consider a nonempty (M 1)-dimensional convex polytope P RM 1. Suppose the faces of P consists of N halfspaces over variables {ni}i [M 1]. Suppose that there is a subset of at most d1 K faces of P such that each face of the subset is given by a linear inequality whose coefficients have precision at most O(T d2) while the faces not in the subset all have precision at most O((T )d2). For j [N], define Lj as the linear constraint for the jth face.

Then, for any j [N], for any vertex x of P which does not lie on the jth face of P, then the positive margin Lj(x) is lower bounded as follows.

Lj(x) Ω( 1 αO(1) )

Proof. We follow the proof of Lemma E.18 up to the step where we ve defined A and A and where we deduce that for any vertex x of P which does not lie on the first face of P,

L1(x) = ||A|

Because A satisfies the pre-conditions Lemma E.19 once we swap of rows appropriately, then |A| O(T O(K)(T )O(K )) O(αO(1)). Thus, L1(x) Ω( 1 αO(1) ). An analogous argument holds for all the other faces of P.

The following corollary applies Lemma E.20 to our setting of interest.

Corollary E.21. Suppose α N, (T, K), (T , K ) N2 are such that T K α and (T )K α and K T 2 and K (T )2. WLOG, suppose T T . Suppose M N such that max(K, K ) M 2(K + K ). Suppose d1, d2 = O(1).

Denote the coordinates in an M-dimensional Euclidean space E as (n1, . . . , n M). Consider a nonempty (M 1)- dimensional polytope P E, such that cl(P) has vertices V , with |V | M, and N faces. Suppose P is contained in the

Non-Asymptotic Length Generalization

(M 1)-dimensional subspace given by P

i [M] ni = 1. Suppose there is a subset of at most d1 K faces of P such that each face of the subset is given by a linear inequality whose coefficients have precision at most O(T d2) while the faces not in the subset all have precision at most O((T )d2). For j [N], define Lj as the linear constraint for the jth face.

Then the average of the |V | vertices of P will have margin at least Ω( 1

|V | 1 αO(1) ) for all (M 2)-dimensional faces of P.

Proof. To reduce this to the setting of Lemma E.20, first substitute n M = 1 PM 1 i=1 ni in the inequalities for all N faces of P so that each inequality is an equivalent inequality over (n1, . . . , n M 1) with integer coefficients increased by a factor of at most 2 (which will ultimately be absorbed by the Big-O notation).

Applying Lemma E.20 on P, where n M = 1 PM 1 i=1 ni was substituted into the inequalities defining faces of P, implies that for every (M 2)-dimensional face F of cl(P) and every vertex x of cl(P) which does not lie on F, then the margin of x on F is at least Ω( 1 αO(1) ). All vertices which lie on F have margin 0.

Consider c, the average of vertices {x1, . . . , x|V |} of P.

For any face F of P with constraint given by LF ( ), at least one of the |V | vertices {x1, . . . , x|V |} does not lie on F, as F is (M 2)-dimensional while P is (M 1)-dimensional. By linearity of the margin,

LF (c) = 1 |V |

i=1 LF (xi)

|V | 1 αO(1) )

Finally, we need one more Lemma to bound the number of vertices of the polytope of interest in the main proof.

Lemma E.22. Denote the coordinates in an M-dimensional Euclidean space E as (n1, . . . , n M). Consider a nonempty (M 1)-dimensional polytope P E, such that cl(P) has vertices V , with |V | M, and N faces. Suppose P is contained in the (M 1)-dimensional subspace given by P

i [M] ni = 1, and P is the intersection of an (M 1)-dimensional simplex and two halfspaces. Then, the number of vertices of P is at most 3M 2.

Proof. Let P = A H1 H2, where A is the (M 1)-dimensional simplex, and H1, H2 are the halfspaces. First, A will have M vertices and M faces.

The new vertices formed by intersecting H1 with A can be upper bounded by counting the number of tuples of M 1 faces of A, as each vertex is formed by the intersection of M faces. H1, when intersecting A will be able to form at most M M 1 = M new vertices. H2, when intersecting A H1, will be able to form at most M+1 M 1 = M(M+1)

2 new vertices.

The total vertices of A H1 H2 is at most 2M + M(M+1)

E.5. Lemmas for Discrete Approximation.

In this section, we are interested in showing that an activation (B1, . . . , Bk) realized by some continuous test-function Y can also be approximately realized by some discrete test-function X. The difficulty is that discrete test-functions consist of a sequence of lattice points {(j, ps(x)j)}j [|x|], where successive lattice points differ by a step of (+1, 0) (corresponding to xj = 0) or (+1, +1) (corresponding to xj = 1).

First, here is an auxiliary Lemma that provides a strategy to construct a trajectory of lattice points where successive lattice points differ by a step of (+1, 0) or (+1, +1) and where the trajectory stays between two lines y = b

ax and y = d

cx. The strategy offers a way to iteratively find the next lattice point in the trajectory.

Non-Asymptotic Length Generalization

Lemma E.23. For any two 2D lines with slopes 1 > b

c > 0 where a, b, c, d T, there exists an infinite sequence of bits x {0, 1} such that for all n T 2 we have that b

an ps(x)n > d

cn. That is, the discrete test-function given by x is above the lower line and below the upper line.

Proof. min0 a,b,c,d T, b

c) 1 T 2 , so for n T 2, ( b

c)n 1. Thus, for every n T 2, there is at least one lattice point (n, yn) Z2 with b

Now let s consider a strategy to jump from (n, yn) to (n + 1, yn+1). For all n T 2, from any such lattice point, (n, yn), either (n + 1, yn) or (n + 1, yn + 1) will be in ( d

c(n + 1), b

a(n + 1)]. Thus, there is always a continuation of the trajectory that remains between the two lines; one either takes a step along (+1, 0) (corresponding to xn+1 = 0) or along (+1, +1) (corresponding to xn+1 = 1) to reach (n + 1, yn) or (n + 1, yn + 1), respectively.

Finally, the trajectory from (0, 0) to (T 2, y T 2) can always be made by some trajectory using increments (+1, +1) and (+1, 0) since y T 2 T 2 b

The following Lemma says that we can approximate any continuous test-function Y with a length n discrete test-function, for n above some threshold.

Lemma E.24. (Discrete Approximation to Continuous Test-Function) Suppose N, (T, K), (T , K ) N2 are such that T K α and (T )K α and K T 2 and K (T )2. WLOG, suppose T T . Suppose we are given a (k, T)- configuration {si}i [k], where a subset of {si}i [k] of size at most K have precision at most T, while the remaining entries of {si}i [k] have precision at most T . For any schema Y of M segments, suppose Y is any continuous test-function of schema Y , with segment lengths (n1(Y), . . . , n M(Y)) [0, 1]M (where we assumed WLOG that P

j nj(Y) = 1). Suppose every nj(Y) is a rational number and that the common denominator of all (nj(Y))j [M] is p.

Then there exists an n0 O(p α2) so that for any positive integer multiple n of n0, there exists a discrete test-function X of length n so that

i [k], |Bi(Y) Bi(X)| T 2 + M

Proof. Denote idx(i) [k] as the line which the end-point of segment i of the schema Y (the schema of Y) lies on.

First, given the k slopes s1, s2, . . . sk (0, 1), suppose that the denominators of these slopes are a(1), . . . , a(k), where i [k], si = b(i)

a(i) . There is a subset U [k] of size at most K such that {a(i) : i U} are all positive integers of precision at most T. Further, {a(i) : i / U} is a set of at most K positive integers of precision at most T . Let d be the least common multiple of {a(1), . . . , a(k)}. Note that d T K (T )K α2.

Now, set n0 = d p, and consider any multiple n of n0. Consider the quantities, {ni(Y)}i [M] := {n ni(Y)}i [M], which are the segment lengths Y, rescaled by a factor of n. We will refer to this rescaled Y instead of Y for convenience of notation in the proof; like a discrete test-function of length n, the rescaled Y will be a map from [0, n] [0, n]. Denote δi := Pi j=1 nj(Y) for all i [M] and δ0 = 0.

We will construct a discrete test-function which follows the same schema as the rescaled Y on the interval [T 2, n], though on the interval [0, T 2], X will have no guarantees.

First, we will define M + 1 lattice points where the discrete test-function will cross over the k lines ( crossing points ). Second, we show how to connect those crossing points with a sequence of lattice points, where consecutive lattice points differ by increments of the 2D vectors, (+1, +1) and (+1, 0) (so that the sequence of lattice points is like a discrete test-function). At the end, we will argue that the constructed discrete test-function X and the rescaled Y will satisfy the guarantee that i [k], |Bi(Y) Bi(X)| M+T 2

Step 1: Defining the Crossing Points of the Discrete Test-Function on the k Lines. The rescaled continuous test-function Y, where the input domain and the output is scaled by n, will cross the k lines l1, . . . , lk at the M + 1 points,

{(δi, sidx(i)δi)}i [M] {(0, 0)}

Non-Asymptotic Length Generalization

Due to the way we have set n, these are all lattice points. Since n = d p, and all ni(Y) have common denominator p, then ni(Y) is an integer for all i [M], so δi is an integer for all i [M]. Second, because of the factor of d, sidx(i)δi is an integer. They are therefore valid points for the discrete test-function to go through, and we will design a discrete test-function that crosses the k lines at {(δi, sidx(i)δi)}i [M] {(0, 0)} for all δi T 2.

Step 2: Connecting the Crossing Points. We ll describe a strategy for constructing the discrete test-function that connects these crossing points. For any segment in the continuous schema, it will either cross between two consecutive lines, or it will cross a line and then recross the same line. We need only show how to do these types of crossings in a way that only crosses the appropriate lines at the start and endpoint of the segment {(δi, sidx(i)δi)}i [M] and at no point in between. Consider the (i + 1)th segment of Y, which has start-point (δi, sidx(i)δi) and end-point (δi+1, sidx(i+1)δi+1). There are three cases:

The segment crosses down (segment start-point on line lj and end-point on line lj+1 for some j [k 1])

The segment crosses up (segment start-point on line lj+1 and end-point on lj for some j [k 1])

The segment crosses and re-crosses the same line (segment start and end-point on lj for some j [k])

The strategies below will apply when the segment s start-point s x-coordinate is at least T 2 (i.e. δi T 2). After describing these, we will handle the corner case where a segment s start-point s x-coordinate is less than T 2 but its end-point s x-coordinate is larger than T 2.

Case 1: Segment Crosses Down. Suppose that we must connect the following start and end-point with a sequence of lattice points which lie between lines lj, lj+1:

(δi, sj δi), (δi+1, sj+1 δi+1) Z2

The corresponding segment of the rescaled continuous test-function is the (i + 1)th segment with length ni+1(Y)) Z.

First, there exists a sequence (trajectory) of lattice points {(xt, yt)}0 t ni+1(Y) where (x0, y0) = (δi, sj δi), (xni+1(Y), yni+1(Y)) = (δi+1, sj+1 δi+1) where for all 0 t < ni+1(Y), xt+1 = xt + 1; and yt+1 = yt + 1 or yt+1 = yt. This is because the rescaled Y is a continuous test-function which connects these two lattice points and which is 1-Lipschitz and non-decreasing. We want to further show that there exists such a trajectory of lattice points {(xt, yt)} such that 1 t ni+1(Y) 1, sj xt yt > sj+1 xt.

The strategy to do so constructs a trajectory from the start-point (δi, sj δi) to the end-point (δi+1, sj+1 δi+1) and consists of two phases:

(Phase 1) First, the following iterative procedure is applied, starting with (x0, y0) = (δi, sj δi) and t = 0:

From the current lattice point (xt, yt), choose (xt+1, yt+1) = (xt + 1, yt) or (xt + 1, yt + 1) depending on which one satisfies sj xt+1 yt+1 > sj+1 xt+1. At least one of the two will always satisfy this condition if x0 T 2, as argued in Lemma E.23. If both are possible, then choose either.

This is repeated until, yt, the y-coordinate of the current lattice point equals sj+1δi+1.

(Phase 2) The trajectory only takes steps along the direction (+1, 0) so that (xt+1, yt+1) = (xt + 1, yt) until the end-point is reached.

In Phase 1, we will always be able to find a next point (xt+1, yt+1) between lj, lj+1 from a current point (xt, yt) between lj, lj+1, if x0 T 2 as argued in Lemma E.23. By induction, all such lattice points will lie between the two lines lj, lj+1. Phase 1 will terminate because yt is non-decreasing in t and cannot remain bounded by any constant as t increases to , in order to stay between lj, lj+1, since sj+1 > 0. Thus, there exists some minimal t such that yt = sj+1δi+1 for the first time. Because the lattice points always stay between lj, lj+1, then at time t , we must have that xt < δi+1. Thus, Phase 1 terminates and Phase 2 can commence.

It is clear that Phase 2 will yield a set of lattice points that are between lj, lj+1 with the last lattice point being (δi+1, sj+1 δi+1) as desired.

Non-Asymptotic Length Generalization

This connects the two points with a trajectory that does not intersect lj and lj+1 at any points besides the start and endpoint, as neither phase crosses the two lines.

Case 2: Segment Crosses Up.

Suppose that we must connect the following start and end-point with a sequence of lattice points which lie between lines lj, lj+1:

(δi, sj+1 δi), (δi+1, sj δi+1)

The argument is analogous. First, there exists a sequence of lattice points {(xt, yt)} where (x0, y0) = (δi, sj+1 δi), (xni+1(Y), yni+1(Y)) = (δi+1, sj δi+1) where for all 0 t < ni+1(Y), xt+1 = xt + 1 and yt+1 = yt + 1 or yt+1 = yt. This is because the rescaled Y is a continuous test-function which connects these two lattice points and which is 1-Lipschitz and non-decreasing. We want to further show that there exists such a trajectory of lattice points {(xt, yt)} such that 1 t ni+1(Y) 1, sj xt yt > sj+1 xt.

The strategy to do so consists of two phases:

(Phase 1) First, the following iterative procedure is applied, starting with (x0, y0) = (δi, sj+1 δi) and t = 0:

From the current lattice point (xt, yt), choose (xt+1, yt+1) = (xt + 1, yt) or (xt + 1, yt + 1) depending on which one satisfies sj xt+1 yt+1 > sj+1 xt+1. At least one of the two will always satisfy this condition if x0 T 2, as argued in Lemma E.23. If both are possible, then chose either.

This is repeated until the current lattice point (xt, yt) and the end-point (δi+1, sj δi+1) can be connected with a line of slope 1. That is, sj δi+1 yt = δi+1 xt.

(Phase 2) The trajectory only takes steps along the direction (+1, +1) so that (xt+1, yt+1) = (xt + 1, yt + 1) until the end-point is reached.

The analysis of this strategy is analogous to that in Case 1.

Case 3: Segment Crosses and Recrosses the Same Line, lj.

Suppose that we must connect the following start and end-point with a sequence of lattice points.

(δi, sj δi), (δi+1, sj δi+1)

This can always be accomplished by a discrete test-function if δi T 2. There are two subcases: (1) if the trajectory needs to cross above line j then recross line j, and (2) if the trajectory needs to cross below line j and then recross line j. In case (1), we require the sequence of lattice points to lie between lines lj 1, lj while in case (2), we require the sequence of lattice points to lie between lines lj, lj+1.

A simple strategy for subcase (1) is to first set (x1, y1) = (δi + 1, sjδi + 1) (i.e. move one step along to (+1, +1) from the start-point). Note that (δi + 1, sjδi + 1) will not lie above line j 1 since δi T 2, so (sj 1 sj)δi 1 and sjδi + 1 sj 1(δi + 1). The trajectory will now be above line j and below line j 1. Next, use Lemma E.23 to iteratively find the next lattice point (xt+1, yt+1) from the current lattice point (xt, yt) by taking steps that are either (+1, +1) or (+1, 0) while staying between lj, lj 1 until yt = sj δi+1. Then, take the remaining steps along (+1, 0) until the end-point is reached.

An analogous strategy exists for subcase (2). Move one step according to (+1, 0) so that (x1, y1) is below line j and above line j+1. Use Lemma E.23 to move to the next lattice point strictly between line j and j+1 until (δi+1 xt = sj δi+1 yt).

Notice that this strategy is similar to those in Cases 1 and 2, and the analysis of them is the same.

Non-Asymptotic Length Generalization

Step 3: Constructing the Final Discrete Test-Function. We have proved that for each i [M], the two crossing points (δi 1, sidx(i 1)δi 1) and (δi, sidx(i)δi) can be connected by some sequence of lattice points that only crosses the k lines {l1, . . . , lk} at the crossing points, as long as δi 1 T 2. The lattice points differ from each other by increments of (+1, 0) and (+1, +1), like a discrete test-function. Each sequence of lattice points can be identified with one segment of the rescaled Y so that the number of lattice points, excluding (δi 1, sidx(i 1)δi 1), equals ni(Y).

Before constructing the final discrete test-function, we will need to construct the rest of the lattice points of the discrete test-function corresponding to segments of Y whose start-point is not larger than T 2.

Suppose that the M0th segment of Y is such that 0 δM0 1 T 2 1 < δM0, where M0 [M]. We split the M0th segment up into two pieces. Consider the restriction of the M0th segment to the following domains: [δM0 1, T 2] and [T 2, δM0]. We will construct a sequence of lattice points for the restriction of the M0th segment on [T 2, δM0] using similar techniques as in Step 2. We will then construct a sequence of lattice points from (0, 0) to (T 2, y ) for some y , to complete the discrete test-function. There are three cases to consider.

Case 1: idx(M0 1) = j, idx(M0) = j + 1 for some j [k 1].

We have that (sj sj+1)(T 2) 1, so there exists a lattice point (T 2, y ) such that y (sj+1T 2, sj T 2], with y Z. We then construct a sequence of lattice points from (T 2, y ) to (δM0, sidx(M0)δM0) using the same strategy as in Step 2, Case 1. This sequence of lattice points will be such that consecutive lattice points differ by an increment of (+1, +1) or (+1, 0) and all lattice points are between lj and lj+1.

We also construct a sequence of lattice points from (0, 0) to (T 2, y ). The only requirement of this sequence of lattice points is that consecutive lattice points differ by an increment of (+1, +1) or (+1, 0). Since y sj T 2 < T 2, this will be possible.

Case 2: idx(M0 1) = j + 1, idx(M0) = j for some j [k 1].

Similar to Case 1. We have that (sj sj+1)(T 2) 1, so there exists a lattice point (T 2, y ) such that y (sj+1T 2, sj T 2], with y Z. We then construct a sequence of lattice points from (T 2, y ) to (δM0, sidx(M0)δM0) using the same strategy as in Step 2, Case 2. This sequence of lattice points will be such that consecutive lattice points differ by an increment of (+1, +1) or (+1, 0) and all lattice points are between lj and lj+1.

We also construct a sequence of lattice points from (0, 0) to (T 2, y ). The only requirement of this sequence of lattice points is that consecutive lattice points differ by an increment of (+1, +1) or (+1, 0). Since y sj T 2 < T 2, this will be possible.

Case 3: idx(M0 1) = j, idx(M0) = j for some j [k].

There are two subcases: either segment M0 is contained in Sectorj or in Sectorj+1 (Sectors are defined in Definition E.4). We will just consider the first subcase, as the second is analogous. Define s0 = 1 and l0 as the homogeneous, 2D line with slope s0.

We have that (sj 1 sj)(T 2) 1, so there exists a lattice point (T 2, y ) such that y (sj T 2, sj 1T 2], with y Z. We then construct a sequence of lattice points from (T 2, y ) to (δM0, sidx(M0)δM0) using the same strategy as in Step 2, Case 1. This sequence of lattice points will be such that consecutive lattice points differ by an increment of (+1, +1) or (+1, 0) and all lattice points are between lj 1 and lj.

We also construct a sequence of lattice points from (0, 0) to (T 2, y ). The only requirement of this sequence of lattice points is that consecutive lattice points differ by an increment of (+1, +1) or (+1, 0). Since y sj 1T 2 T 2, this will be possible.

We now construct the full discrete test-function by concatenating all these sequences of lattice points into a single sequence of lattice points. For each i [M0, M], use the aforementioned constructions to construct a sequence of lattice points with start-point (max(δi 1, T 2), sidx(i 1) max(δi 1, T 2)) and end-point (δi, sidx(i)δi). Remove the start-point from each of these sequences of lattice points, and concatenate all the sequences of lattice points for i [M]. Finally, concatenate the sequence of lattice points from (0, 0) to (T 2, y ), for y as described in Case 3 above. This sequence of lattice points is a valid discrete test-function since consecutive lattice points differ by an increment of (+1, +1) or (+1, 0). We will refer to the constructed discrete test-function as X.

Non-Asymptotic Length Generalization

Properties of the Final Construction. For i [M0, M], there is a correspondence between the ith segment of Y and the sequence of lattice points with start-point (max(δi 1, T 2), sidx(i 1) max(δi 1, T 2)) and end-point (δi, sidx(i)δi). The difference between the number of lattice points in the latter and the length of the former is at most T 2. In addition, all except possibly one of those lattice points (the end-point) lie in the sector in which the ith segment of Y lies in.

Define (Bi(Y))i [k] and (Bi(X))i [k] as follows.

i [k], Bi(Y) := Z 1

0 1[Y(j) > si j]dj

i [k], Bi(X) := 1

j=1 1[X(j) > si j]

For any i [k], Bi(X) depends on the total number of lattice points (out of n) which are strictly above line i, while Bi(Y) depends on the total length of segments which lie above line i. The maximum deviation between Bi(X) and Bi(Y) is upper bounded by 1

n multiplied by the number of lattice points (j, X(j)), j [n], which are not in the Sector which their corresponding segment is in, which is upper bounded by T 2

n + # times Y crosses line i

n . More precisely, for any i [k],

|Bi(X) Bi(Y)| 1

δm 1<j δm 1[(j, X(j)) is not in the same Sector that segment m of Y is in]

n[T 2 + #(j [0, 1] : Y(j) = si j)]

n term arises from the lattice points (j, X(j)) for j [T 2], which have no guarantee of being in the Sector of the corresponding segment of Y. The # times Y crosses line i

n term arises from the fact that for j > T 2, only (j, X(j)) which lie on line i may potentially lie in a different Sector than that of the corresponding segment of Y. This is upper bounded by M since Y has only M segments and so can cross line i at most M times.

E.6. Auxiliary Lemmas

Convex Geometry Lemmas. Lemma E.25. (Theorem 6.2 of (Rockafellar, 1970)) Let C be a nonempty convex set. Then its relative interior, ri(C), is also nonempty. Lemma E.26. (Theorem 6.1 of (Rockafellar, 1970)) Let C be a convex set in Rn. Let x ri(C) and z cl(C). Then for all 0 λ < 1, (1 λ)x + λz ri(C).

Lemma E.27. Let C be a nonempty convex set in Rn. Consider an open halfspace H := {x Rn : P

i [n] λixi > α}. If C H = , then dim(C H) = dim(C).

Proof. Suppose that p C H. If p ri(C), then since H is an open set, we are done.

If not, p C := cl(C) ri(C). By Lemma E.25, since C = , there exists x ri(C). By Lemma E.26, 0 λ < 1, pλ := λp + (1 λ)x ri(C).

We claim there exists some setting of λ 1 such that pλ H. Since H is open, there exists δ > 0 such that Ballδ(p) := {x Rn : ||x p||2 < δ} H. Taking λ > 1 δ ||x p||2 ensures that ||pλ p||2 < δ = pλ H. Thus, pλ H ri(C).

Lemmas Regarding Existence of Low Precision Element in L Ball.

Lemma E.28. Let N > 0 be an integer. For any x R, the interval [x, x + 1

N ] contains at least one element of the form m

N for integer m.

Non-Asymptotic Length Generalization

Proof. For N > 0, the set S = { m

N : m Z} is such that the minimum distance between any two consecutive elements is 1 N . Since the length of the interval [x, x + 1

N , it must contain some element of S, or else this would imply that there is some pair of consecutive elements in S such that the distance between them is greater than 1

N . Lemma E.29. Let N > 0 be an integer. For any x = (x1, . . . , xd) Rd with P

i [d] xi = 1, the L ball {x Rd : ||x x || 1 N } {x : P i [d] x i = 1} contains a point x Qd whose fractional coordinates have a least common denominator of at most d N.

Proof. Consider Ball 1 d N (x) := {x Rd : ||x x || 1 d N } {x Rd : P

i [d] x i = 1}.

Along each dimension i {1, 2, . . . , d 1}, there must exist a number in the interval [xi 1 d N , xi + 1 d N ] of the form mi d N for integer mi by Lemma E.28. We can set x i = mi

d N for all i [d 1]. The last coordinate x d will need to be set to 1 P

i [d 1] mi d N so that x {x : P

i [d] x i = 1}. Finally, |x d xd| P

i [d 1] |x i xi| d 1

d N 1 N . Hence, we have that x Ball 1 N (x) and that each coordinate is of the form of mi

d N for some integer mi.

Relating to Technical Lemma E.30. Lemma E.30. If for every f C-RASP2 with K(f) heads, we have that z > 0 and P i [K(f)] λi > z, then for any f, f C-RASP2 that are not equal, with K and K heads respectively, then f and f will differ on some continuous test-function, Y, which strictly satisfies the second layer inequalities. That is, with max(K, K ) k K + K distinct heads between f and f , let (B1(Y), . . . , Bk(Y)) be the activations induced by Y. Then, either:

i=1 λi Bord(1,i)(Y) > z

i=1 λ i Bord(2,i)(Y) < z

i=1 λi Bord(1,i)(Y) < z

i=1 λ i Bord(2,i)(Y) > z

To prove this, we will need Lemma E.33 and E.34. First, define the following notion of connectedness for subsets of Euclidean space.

Definition E.31. (Definition 2.4 of (Freiwald, 2014)) Two subsets A and B of a metric space X are said to be separated if both cl(A) B = and A cl(B) = . A set E X is connected if E is not the union of two nonempty, separated sets.

We will utilize the following Lemma (implicitly) in the proof below to show that a union of sets S j [N] Aj is connected, by iteratively arguing that for each j [N] {1}, Aj is not separated from S

k<j Aj (i.e. they share an edge ).

Lemma E.32. (Theorem 2.9 in (Freiwald, 2014)) Suppose C and {Cα}α I are connected subsets of X and that for each α, Cα and C are not separated. Then C S

α I Cα is connected

Next, fix f, f and the k distinct slopes {si}i [k] across the first layers of both programs. Let s consider the k-dimensional set A({si}i [k]) [0, 1]k. A({si}i) is equal to the union of a finite number of convex polytopes by Corollary E.14 and Lemma E.16.

A({si}i [k]) = [

m [k],valid {(yi 1,yi 2)}i [m] {1,...,k}m A(Y{(yi 1,yi 2)}i [m])

Here are two key Lemmas we will need.

Non-Asymptotic Length Generalization

Lemma E.33. (Dimension of Polytopes) Given a (k, T)-configuration {si}i [k], then for each basis schema Y{(yi 1,yi 2)}i [m], the set of activations A(Y{(yi 1,yi 2)}i [m]) has dimension k.

Proof. Consider any basis schema Y that spans k lines with M segments, as described in Corollary E.14. It suffices to show k independent degrees of freedom in A(Y ).

Given (B1, . . . Bk) consider the full-rank linear map, Q : Rk Rk

Q(B1, . . . , Bk) = (B1, B2 B1, B3 B2, . . . , Bk Bk 1)

Denote Q A(Y ) as the set consisting of points attained by applying Q on all points in A(Y ). We have dim(Q A(Y )) = dim(A(Y )) dim(ker(Q) A(Y )) so since Q has rank k, dim(ker(Q) A(Y )) dim(ker(Q)) = 0. So to prove that dim(A(Y )) = k, it suffices to prove dim(Q A(Y )) is k.

By Lemma E.16, schema Y will span all k lines and will have at least one segment in each of the k + 1 sectors. First, set the lengths of all the segments that are in sector k + 1 to 0. This can always be done because by Lemma E.6, there is no constraint on any segment in Sectork+1 since such a segment would not be crossing two distinct lines. Say that there are M non-zero-length segments left that are each in the sectors {1, 2, . . . , k}. Let (n1, . . . , n M) be the lengths of these segments.

There is a linear mapping QL : RM Rk that maps (n1, . . . , n M) to a vector v Rk, where for all i [k], vi is the sum of the length of segments that are contained in in Sectori. This map is the multiplication of Q and L, where L is a linear map L : RM Rk which maps segment lengths of Y , (n1, . . . , n M), to activations (B1, . . . , Bk). In matrix form, L {0, 1}k M is such that Lij = 1 segment j in schema Y lies above line i and hence contributes to activation Bi.

Now, for each sector i [k], let:

ji := max{j [M] : nj Sectori}

be the index of the last segment in the schema that is in Sector i.

Now, for each i [k], consider a setting of (n1, . . . , n M) where for all j < ji, nj = 0, but nji > 0. Denote the setting for i [k] as Ni := (n(i) 1 , . . . , n(i) M ). We claim that the k vectors, {QL(Ni)}i [k] are linearly independent. Since {QL(Ni)}i [k] Q A(Y ), this would imply the first point of the Lemma. The claim that {QL(Ni)}i [k] are linearly independent can be argued by induction.

As a base case, suppose that i0 [k] is the sector where the Mth segment of the schema lies in. Then QL(Ni0) = ei0. The singleton set is linearly independent.

For each 1 r k, denote Ir := {i [k] : ji is one of the r-largest of the set {ji}i [k]}, so I1 = {i0}. Suppose by the inductive hypothesis that for 1 r k, the r vectors {QL(Ni) : i Ir} are linearly independent. Now, let i be the sector such that ji is the (r + 1)th largest of the set {ji}i [k]. (n(i ) 1 , . . . , n(i ) M ) will be such that n(i ) ji is nonzero, which implies that the i th entry of QL(Ni ) is nonzero. In contrast, for all i Ir, the i th entry of QL(Ni) is zero, because for all j > ji , there is no segment nj that lies in sector i , by definition of ji , so by the construction of Ni where all segments of index less than ji are set to 0, then the i th entry of QL(Ni) is 0 for all i Ir. Thus, the r +1 vectors {QL(Ni) : i Ir+1} are linearly independent. It follows that Ik is a set of k linearly independent vectors in Q A(Y ), so dim(Q A(Y )) = k.

Lemma E.34. (Connectedness of Interiors of Polytopes) Given a (k, T)-configuration {si}i [k], the set S m [k],valid {(yi 1,yi 2)}i [m] {1,...,k}m int(A(Y{(yi 1,yi 2)}i [m])) is connected.

Proof. Let us first recall the set of basis schemas in Corollary E.14. For any 1 m k and {(yi 1, yi 2)}i [m] {1, . . . , k}m

Non-Asymptotic Length Generalization

ym 1 = ym 2 i [m 1], yi 1 = yi 2 i [m 2], yi 1 < yi 2 = yi 2 = yi+1 1 > yi+1 2 > yi 1 i [m 2], yi 1 > yi 2 = yi 2 = yi+1 1 < yi+1 2 < yi 1 (y1 1, y1 2) = (1, k) or (y1 1, y1 2) = (k, 1)

We can the test-function schema, Y{(yi 1,yi 2)}i [m] as the concatenation of m 1 monotone curves, where the ith monotone curve goes from lines yi 1 to yi 2 for i [m 1]. The set of these test-function schemas over all valid {(yi 1, yi 2)}i [m] satisfying the above is complete by Corollary E.14.

Let A(Y{(yi 1,yi 2)}i [m]) denote the set of activations induced by test-functions of schema Y{(yi 1,yi 2)}i [m]. A(Y{(yi 1,yi 2)}i [m]) is a convex set by Lemma E.16. Thus, each set int(A(Y{(yi 1,yi 2)}i [m])), by itself, is connected.

Now, imagine a graph where each basis schema Y{(yi 1,yi 2)}i [m] is a node. Create an edge between basis schema Y and Y if int(A(Y )) int(A(Y )) = . By Lemma E.32, to prove our desired conclusion, it is sufficient to show that this graph is connected. For this, we will need to prove which edges exist and that these edges suffice to connect the graph.

Categorize the basis schema into levels, where schema of the same levels have the same number of monotone curves (i.e. the parameter m [k]).

Edges Within Same Level. Consider any two basis test-function schemas, Y{(yi 1,yi 2)}i [m] and Y{((yi 1) ,(yi 2) )}i [m], with m monotone curves. Suppose the two schemas 1st through (m 2)-th monotone curves are the same, but the (m 1)th (i.e. their last) monotone curve s endpoint is different by 1 index. That is,

i [m 2], yi 1 = (yi 1) and yi 2 = (yi 2)

ym 1 1 = (ym 1 1 )

ym 1 2 = (ym 1 2 ) 1

We claim that the set of activations of these two adjacent schema must share an interior point:

int(A(Y{(yi 1,yi 2)}i [m])) int(A(Y{((yi 1) ,(yi 2) )}i [m])) =

Suppose Y{(yi 1,yi 2)}i [m] has M segments and Y{((yi 1) ,(yi 2) )}i [m] has M segments2. Because the two schema have the same monotone curves except the last monotone curve, then there are two possibilities:

Either ym 1 1 < ym 1 2 and (ym 1 1 ) < (ym 1 2 ) . One can interpret these monotone curves as moving downward , since the (m 1)th monotone curve of Y{(yi 1,yi 2)}i [m] (resp. Y{((yi 1) ,(yi 2) )}i [m] has M ) has start-point on line ym 1 1 (resp. (ym 1 1 ) ) and end-point on line ym 1 2 (resp. (ym 1 2 ) ). Note the lines with lower index have higher slope by convention (i.e. s1 is largest).

Or ym 1 1 > ym 1 2 and (ym 1 1 ) > (ym 1 2 ) . One can interpret these monotone curves as moving upward .

WLOG, consider just the first case (the second one utilizes an analogous argument). Then, since ym 1 1 < ym 1 2 , the last monotone curve is moving downward. Suppose WLOG that the monotone curve which starts at line (ym 1 1 ) and ends at line (ym 1 2 ) moves down one more line than the monotone curve which starts at line ym 1 1 and ends at line ym 1 2 .

2Note: the two schema have the same number of monotone curves, but different numbers of segments

Non-Asymptotic Length Generalization

Thus, M = M +1. Moreover, the set of segments which make up the two schema are similar: for all j [M], segment j of Y{(yi 1,yi 2)}i [m] lies in the same sector as segment j of Y{((yi 1) ,(yi 2) )}i [m]. Meanwhile, segment M +1 of Y{((yi 1) ,(yi 2) )}i [m] has no counterpart in Y{(yi 1,yi 2)}i [m].

To show connectedness, we need to exhibit a point (B1, . . . , Bk) [0, 1]k that is in the interior of A(Y{(yi 1,yi 2)}i [m]) and A(Y{((yi 1) ,(yi 2) )}i [m]).

For schema Y{(yi 1,yi 2)}i [m] (resp. Y{((yi 1) ,(yi 2) )}i [m]), there is a linear transformation L (resp. L ) that maps the set of valid

segment lengths A(M)(Y{(yi 1,yi 2)}i [m]) [0, 1]M (resp. A(M )(Y{((yi 1) ,(yi 2) )}i [m]) [0, 1]M ) of schema Y{(yi 1,yi 2)}i [m] (resp. Y{((yi 1) ,(yi 2) )}i [m]) to the set of activations A(Y{(yi 1,yi 2)}i [m]) [0, 1]k (resp. A(Y{((yi 1) ,(yi 2) )}i [m]) [0, 1]k). Being rank k (see bullet point 1 of Lemma E.33), L (resp. L ) will map interior points in A(M)(Y{(yi 1,yi 2)}i [m]) (resp.

A(M )(Y{((yi 1) ,(yi 2) )}i [m])) to interior points of A(Y{(yi 1,yi 2)}i [m]) (resp. A(Y{((yi 1) ,(yi 2) )}i [m])). Thus, it suffices to exhibit two settings of the lengths of the segments of the two schema (a1, . . . , a M) and (b1, . . . , b M ) = (b1, . . . , b M, b M+1), that are interior points in their respective polytopes A(M)(Y{(yi 1,yi 2)}i [m]) (resp. A(M )(Y{((yi 1) ,(yi 2) )}i [m])), that give rise to the same (B1, . . . , Bk) [0, 1]k.

Towards this goal, we are interested in the regime where b M+1 = δ 1 and j [M], aj bj, and where each segment length has slack ϵ with respect to its constraint defined in Lemma E.6. Suppose by Lemma E.6 that the constraints for the first M segments for schema Y{((yi 1) ,(yi 2) )}i [m] are:

i=1 bi, j [M]

We ll now describe how to set {aj}j [M] and {bj}j [M ]. We will largely ignore the normalization conditions that P j [M] aj = 1 and P j [M ] bj = 1, but only require that the total length is the same: P j [M] aj = P j [M ] bj > 0.

For ϵ, δ > 0, set b1 = 1 and for j {2, . . . , M}, sequentially assign bj = pj

qj Pj 1 i=1 bi + ϵ. Set b M+1 = δ (where the only constraint for b M+1 is that it is nonnegative, being the last segment). {bj}j [M+1] is a valid set of segment lengths respecting the constraints imposed by schema Y{((yi 1) ,(yi 2) )}i [m] as per Lemma E.6. {bj}j [M+1] is also an interior point

of A(M )(Y{((yi 1) ,(yi 2) )}i [m]), since each constraint has slack ϵ > 0 or δ > 0.

Now, suppose that the (M + 1)th segment of Y{((yi 1) ,(yi 2) )}i [m] lies in sector i . Because each basis schema spans all k lines, then there exists some j [M] such that segment j in Y{(yi 1,yi 2)}i [m] is also in sector i . Set {ai} as follows.

j [M] {j }, set aj = bj Finally, set aj = bj + δ

We now need to argue this is a valid assignment of {ai}, respecting the constraints on segments imposed by the schema Y{(yi 1,yi 2)}i [m]. Because the first M segments of Y{((yi 1) ,(yi 2) )}i [m] are the same as the M segments of Y{(yi 1,yi 2)}i [m], then these segments have the same constraints.

It follows that for all j j , aj meets its constraint. For subsequent segments, due to the extra δ contribution on aj , we will need to argue that their constraint is still met with some nonzero slack. The tth constraint after j will have its slack lessened by at most δ [maxp,q Z,|p|,|q| T ( p

q )t] δT M δT k2, where we used that t M k2 by Corollary E.15.

Thus, as long as δ < ϵ T k2 , then this setting of {aj}j [M] will still satisfy each constraint of Y{(yi 1,yi 2)}i [m] with nonzero slack.

i=1 ai, j [M]

Non-Asymptotic Length Generalization

Thus, {aj}j [M] is an interior point of A(M)(Y{(yi 1,yi 2)}i [m]).

Finally, the segments {aj}j [M] and {bj}j [M+1] induce the same {Bi}i [k] since the sum of the segment lengths in each sector is the same. This proves that:

int(A(Y{(yi 1,yi 2)}i [m])) int(A(Y{((yi 1) ,(yi 2) )}i [m])) =

Edges Between Levels L and L + 1. Consider any two schema Y{(yi 1,yi 2)}i [m 1], Y{((yi 1) ,(yi 2) )}i [m] with m 2 and m 1 monotone curves, respectively, where Y{((yi 1) ,(yi 2) )}i [m] s first m 2 monotone curves are the same as Y{(yi 1,yi 2)}i [m 1] s m 2 monotone curves. Moreover, let schema Y{((yi 1) ,(yi 2) )}i [m] be such that:

In the case that (ym 1 1 ) < (ym 1 2 ) , then (ym 1 1 ) + 1 = (ym 1 2 ) .

In the case that (ym 1 1 ) > (ym 1 2 ) , then (ym 1 1 ) 1 = (ym 1 2 )

Instead of a schema of m 2 monotone curves, schema Y{(yi 1,yi 2)}i [m] can also be thought of as a schema of m 1 monotone curves, where the last monotone curves will only cross the line with index ym 2 1 = ym 2 2 . Thus, these two schema can be thought of as having their final (m 1)-th monotone curves endpoints differ by one line, which reduces to the case above with two schema in the same level. Using exactly the same argument as in the previous section with two schema in the same level, we can conclude that

int(A(Y{(yi 1,yi 2)}i [m 1])) int(A(Y{((yi 1) ,(yi 2) )}i [m])) =

Edges Between Two Connected Components. In our graph where basis schema are nodes, we now have two connected components:

Schemas where the first monotone curve is a monotone (up) curve from lines k to 1

Schemas where the first monotone curve is a monotone (down) curve from lines 1 to k.

Within each of these connected components, we have proved connectedness with the previous two cases, by connecting schemas of the same prefix (of monotone curves) and the same level, and schemas of the same prefix and different levels.

The nodes in the first connected component will all connect in a tree to the schema consisting of a single monotone curve from lines k to 1. The nodes in the second connected component will all connect in a tree to the schema consisting of a single monotone curve from lines 1 to k.

Let Y be the schema for m = 2 with a single monotone (up) curve from line k to 1, and Z be the schema for m = 3 with two monotone curves: one monotone (down) curve from line 1 to k and then a monotone (up) curve from k to 2. These are two representatives from the two aforementioned connected components. We claim that:

int(A(Y )) int(A(Z)) =

We will start with an interior point for Z. Suppose it has segment lengths z1, . . . , zk+1, . . . , z2k, where z1 is above line 1, zk+1 is below line k, and segment z2k lies above line 2 and below line 1, with end-point on line 2.

Again, ignore the normalization condition that P

i [2k] zi = 1. For ϵ > 0, we assign values to the segments as follows.

2 i 2k 1, zi = pi

Non-Asymptotic Length Generalization

Note that since z2k is the last segment of the schema, it need not cross a sector and need only cross and re-cross line 2. Therefore, it has no constraints under schema Z except that z2k 0. However, the key idea is that we can choose for z2k to be sufficiently large so as to satisfy the constraint needed to cross to line 1 from line 2.

Such a {zi}i [2k], after normalization, exists in the (relative) interior of A(M)(Z).

We claim that rearranging the segments {zi}i [2k] into a monotone (up) curve (y1, . . . , yk+1) with slack ϵ will always be possible for any ϵ > 0. This essentially follows from Lemma E.13 as we have intentionally set z2k = p2k

q2k P2k 1 j=1 zj + ϵ large enough to cross from line 2 to line 1 (and have its end-point on line 1). More precisely, the following constraints are met by {zi}i [2k].

k + 2 i 2k, zi = pi

j i 1:zj in sector (2k+3 i) zj + ϵ

Then, we obtain a valid monotone curve with segment lengths (y1, . . . , yk+1) by setting:

y1 = zk+1 yk+1 = z1 yi = zi+k + zk+2 i, 2 i k

Thus, the large value of z2k enables the re-arranged test-function of monotone schema to cross from line 2 to line 1. Finally, there is no constraint on yk+1, so all constraints of the schema Y are satisfied. This re-arrangement will have slack min(ϵ, y1, yk+1) ϵ because each of the original constraints in Z on the segments z1, . . . , z2k 1 as well as the pseudo-constraint we imposed on z2k only got looser in this rearrangement, and there is no constraint on yk+1. Thus, the rearranged monotone curve, after normalization of the segment lengths, is in the interior of A(M)(Y ).

{yi} and {zi} are both interior in their respective polytopes and induce the same activations (B1, . . . , Bk) since the segments of {yi} are just a re-arrangement of the segments of {zi}, and P

i [2k] zi = P

i [k+1] yi. Thus, they give rise to the same interior point (B1, . . . , Bk) int(A(Y )), (B1, . . . , Bk) int(A(Z)), so int(A(Y )) int(A(Z)) = .

These edges demonstrate that the graph is connected, proving the Lemma.

Proof of Lemma E.30.

Lemma E.35. (Lemma E.30, restated) If for every f C-RASP2 with K(f) heads, we have that z > 0 and P

i [K(f)] λi > z, then for any f, f C-RASP2 that are not equal, with K and K heads respectively, then f and f will differ on some continuous test-function, Y, which strictly satisfies the second layer inequalities. That is, with max(K, K ) k K + K

distinct heads between f and f , let (B1(Y), . . . , Bk(Y)) be the activations induced by Y. Then, either:

i=1 λi Bord(1,i)(Y) > z

i=1 λ i Bord(2,i)(Y) < z

Non-Asymptotic Length Generalization

i=1 λi Bord(1,i)(Y) < z

i=1 λ i Bord(2,i)(Y) > z

Proof of Lemma E.30. Fix f, f and their k distinct slopes {si}i [k]. Suppose f, f differ on some discrete test-function X. We want to show that there must be a continuous test-function that strictly satisfy the two inequalities strictly and distinguishes f and f .

Let s consider the k-dimensional set A({si}i) [0, 1]k. A({si}i) is equal to the union of a finite number of convex sets by Corollary E.14 and Lemma E.16.

A({si}i [k]) = [

m [k],valid {(yi 1,yi 2)}i [m] {1,...,k}m A(Y{(yi 1,yi 2)}i [m])

Now consider two halfspaces H1, H2, induced by the second layers of f and f .

H1 := {B Rk :

i=1 λi Bord(1,i) > z}

H2 := {B Rk :

i=1 λ i Bord(2,i) > z }

First, we cannot have that H1 = H2, as then f and f must share the same heads in their first layer and also have the same coefficients in the second layer. Thus, f = f , contradicting the original assumption that f and f are not equal (as witnessed by X).

Thus, H1 = H2. Denote

L1 := {B Rk :

i=1 λi Bord(1,i) = z}

L2 := {B Rk :

i=1 λ i Bord(2,i) = z }

H1 = H2 implies L1 = L2. Now, there are two cases, given by whether L1 and L2 have a nonzero intersection or not.

Case 1: L1 L2 =

H1, H2 divide Rk into 4 quadrants. Denote the (+, +) quadrant as H1 H2, (+, ) quadrant as H1 Hc 2, ( , +) quadrant as Hc 1 H2 and ( , ) quadrant as Hc 1 Hc 2.

Towards contradiction, let us consider all possible A({si}i) such that the following Condition I holds.

(Condition I) there is no B A({si}i) in the ( , +) and (+, ) quadrants such that:

Non-Asymptotic Length Generalization

i=1 λi Bord(1,i) > z

i=1 λ i Bord(2,i) < z

i=1 λi Bord(1,i) < z

i=1 λ i Bord(2,i) > z

For ease of notation, index each basis schema of Corollary E.14 with j [Nk], where Nk (= 2k 1) is the total number of basis schema, so that A({si}i) = SNk j=1 Aj. Thus, Condition I must also apply to each Aj, j [Nk].

In order that Aj satisfies Condition I, there is only three convex possibilities:

1. (Type I) Aj is a convex subset of a (k 1)-dimensional plane that has nonzero intersection with the (k 2)-dimensional intersection of O = L1 L2 and that is a subset of cl(H1) cl(H2) Hc 1 Hc 2

2. (Type II) Aj is a convex subset of the quadrant cl(H1) cl(H2), where:

i=1 λi Bord(1,i) z

i=1 λ i Bord(2,i) z

3. (Type III) Aj is a convex subset of the quadrant, Hc 1 Hc 2, where:

i=1 λi Bord(1,i) z

i=1 λ i Bord(2,i) z

Lemma E.33 implies that it is impossible for any Aj to be Type I since the dimension of Type I sets is k 1 but Aj is dimension k.

Lemma E.34 implies that either all Aj are type II or all Aj are type III, which can be seen as follows. Suppose towards contradiction that there was an Aj of type II and a separate Aj of type III. First, no point O = L1 L2 can be an interior point of any Aj, because O is a k 2 dimensional subspace, and it is impossible for an L2-ball centered at any point in O of any nonzero radius and dimension k to be completely contained in [cl(H1) cl(H2)] [Hc 1 Hc 2] SNk j=1 Aj, as such a ball would always have a non-empty intersection with H1 int(Hc 2) and H2 int(Hc 1). Thus, O is disjoint from S

j [Nk] int(Aj). Because there is an Aj of type II and a separate Aj of type III, part of SNk j=1 int(Aj) is contained

in cl(H1) cl(H2) and part of it is contained in Hc 1 Hc 2. However, since no point in O can be in SNk j=1 int(Aj), then SNk j=1 int(Aj) is not connected (i.e. it is the union of two non-empty, separated sets). This contradicts Lemma E.34. Thus, all Aj must be type II or they must all be type III.

Non-Asymptotic Length Generalization

Finally, observe that (0, . . . , 0) A({si}i) = SNk j=1 Aj, realized by a test-function which has slope 0 everywhere. Also,

(1, . . . , 1) A({si}i) = SNk j=1 Aj, realized by a test-function which has slope 1 everywhere.

Suppose all Aj were type II. At least one Aj must contain (0, . . . , 0), but if (0, . . . , 0) is in the quadrant, cl(H1) cl(H2), this implies:

i=1 (λ i 0) z

i=1 (λi 0) z

This contradicts the assumption that z, z > 0.

Now, suppose all Aj were type III. At least one Aj must contain (1, . . . , 1), but if (1, . . . , 1) is in the quadrant, Hc 1 Hc 2, this implies:

i=1 (λ i 1) z

i=1 (λi 1) z

This contradicts the assumption that PK i=1 λi > z and PK

i=1 λ i > z .

In short, z > 0 and P

i [K(f)] λi > z for all functions in f C-RASP2 = A({si}i) does not satisfy Condition I.

Thus, there exists B A({si}i), realized by a continuous test-function, in the ( , +) and (+, ) quadrants such that:

i=1 λi Bord(1,i) > z

i=1 λ i Bord(2,i) < z

i=1 λi Bord(1,i) < z

i=1 λ i Bord(2,i) > z

Case 2: L1 L2 = .

The argument here is similar. We know L1, L2 are parallel hyperplanes which never intersect. We will argue that it is still true that {Aj}j [Nk] are either all type II or all type III, and then reach a contradiction with the assumption that z > 0 and P

i [K(f)] λi > z for all functions in C-RASP2. Given that L1, L2 are parallel, consider two subcases.

Let (+, ), (+, ) denote the subcase where halfspace H1, H2 have the same parity, but their intercept term is different. An example of this are two halfspaces: x + y > 2, x + y > 1 .

Non-Asymptotic Length Generalization

Let ( , +), (+, ) be the subcase where H1, H2 have opposite parities. An example of this are the two halfspaces: x + y < 2, x + y > 1 .

Subcase 1: (+, ), (+, ) In this case, in order that the planes are not identical, there is a nonzero difference in the intercept term between L1, L2. By Lemma E.34, since S

j [Nk] int(Aj) is connected, then {Aj}j [Nk] are either all type II or all type

III, via a similar argument as Case 1, which contradicts that z, z > 0, and PK i=1 λi > z and PK

i=1 λ i > z respectively. Thus, Condition I cannot hold.

Subcase 2: (+, ), ( , +) In this case, either cl(H1) cl(H2) = or int(Hc 1) int(Hc 2) = . Once again, {Aj}j [Nk] are either all type II or all type III, contradicting that z, z > 0, and PK i=1 λi > z and PK

i=1 λ i > z respectively. Thus, Condition I cannot hold.

Here is a Corollary of Lemma E.34 which will be useful in strengthening Lemma E.30, which will let us improve the final bound from O(T O(K2)) (αO(log α)) to O(T O(K)) (αO(1)). Corollary E.36. (Connectedness of int(A2,3({si}i [k]))) Define A2,3({si}i [k]) as the set of activations by basis schema where m {2, 3}. That is, there is only either one or two monotone curves in the schema.

A2,3({si}i [k]) := [

m {2,3},valid {(yi 1,yi 2)}i [m] {1,...,k}m A(Y{(yi 1,yi 2)}i [m])

Then, we claim that S

m {2,3},valid {(yi 1,yi 2)}i [m] {1,...,k}m int(A(Y{(yi 1,yi 2)}i [m])) is connected.

Proof. We have that A2,3({si}i [k]) := A(Y{(1,k),(k,k)}) S

j {2,...,k 1} A(Y{(1,k),(k,j),(j,j)}) A(Y{(k,1),(1,1)}) S j {2,...,k 1} A(Y{(k,1),(1,j),(j,j)}). The connectedness of int(A(Y{(1,k),(k,k)})) S

j {2,...,k 1} int(A(Y{(1,k),(k,j),(j,j)})) int(A(Y{(k,1),(1,1)})) S

j {2,...,k 1} int(A(Y{(k,1),(1,j),(j,j)})) follows from inspecting the edges in the graph construction in Lemma E.34.

In the sections where we analyzed edges between nodes in the same level and between consecutive levels, it follows that:

j {2, . . . , k 1}, node Y{(1,k),(k,k)} is connected to node Y{(1,k),(k,j),(j,j)}

j {2, . . . , k 1}, node Y{(k,1),(1,1)} is connected to node Y{(k,1),(1,j),(j,j)}

In the final section of the proof of Lemma E.34, we showed that

node Y{(k,1),(1,1)} is connected to node Y{(1,k),(k,2),(2,2)}

As a corollary, we can strengthen Lemma E.30 to the following. Lemma E.37. (Stronger version of Lemma E.30) If for every f C-RASP2 with K(f) heads, we have that z > 0 and P

i [K(f)] λi > z, then for any f, f C-RASP2 that are not equal, with K and K heads respectively, then f and f will differ on some continuous test-function, Y, which strictly satisfies the second layer inequalities. That is, with max(K, K ) k K + K distinct heads between f and f , let (B1(Y), . . . , Bk(Y)) be the activations induced by Y. Then, either:

i=1 λi Bord(1,i)(Y) > z

i=1 λ i Bord(2,i)(Y) < z

Non-Asymptotic Length Generalization

i=1 λi Bord(1,i)(Y) < z

i=1 λ i Bord(2,i)(Y) > z

In addition, Y will be of a basis schema Y with either one or two monotone curves: Y {Y{(yi 1,yi 2)}i [m] : m {2, 3}, valid {(yi 1, yi 2)}i [m] {1, . . . , k}m}. In particular, the number of segments M in the schema of Y will be at most 2 k instead of k2 for general basis schema.

Proof. We repeat the proof of Lemma E.30, but with int(A2,3({si}i [k])) substituted for A({si}i [k]) and Corollary E.36 substituted for Lemma E.34. Note that int(A2,3({si}i [k])) still contains the points (0, . . . , 0) [0, 1]k and (1, . . . , 1) [0, 1]k as well, as both of these are realized by a test-function of where the slope everywhere is equal to 0 everywhere and 1 everywhere, respectively, which are both of schemas with just one monotone curve.