# Spatio-Causal Patterns of Sample Growth

ANDRE F. RIBEIRO, Harvard University, USA and University of Sao Paulo, Brazil

Different statistical samples (e.g., from different locations) offer populations and learning systems observations with distinct statistical properties. Samples under (1) Unconfounded growth preserve systems' ability to determine their variables' effects on outcomes-of-interest (and lead, therefore, to interpretable black-box predictions). Samples under (2) Externally-Valid growth preserve their ability to make predictions that generalize across out-of-sample variation. The first generates predictions that generalize over sample populations, the second over their common unobserved factors. We illustrate these theoretic patterns in the full American census from 1840 to 1940, and in samples ranging from the street level all the way to the national. This reveals new conditions for the generalizability of samples over space and time, and connections among the Shapley value, counterfactual statistics, and hyperbolic geometry.

JAIR Associate Editor: Quanquan Gu

JAIR Reference Format: Andre F. Ribeiro. 2025. Spatio-Causal Patterns of Sample Growth. Journal of Artificial Intelligence Research 83, Article 22 (July 2025), 19 pages. doi: 10.1613/jair.1.15675

1 Introduction

Large-scale and high-dimensional geospatial datasets currently offer rich opportunities for predictive and Geo-AI applications [32, 15, 19, 33] (e.g., disease incidence, ecological behavior, electoral results, crime occurrence, economic growth, recommendation systems). While it is common practice to train regression and classification models on data collected across distinct locations, little is known about how their out-of-sample accuracy (predictiveness) and biasedness (e.g., black-box fairness) [7, 34] are expected to change across spatial extensions.
The first indicates whether predictions derived from the sample will be close to their true values for a population in conditions different from those at data collection (i.e., whether they will generalize), and the latter whether they will systematically favor individual populations (e.g., as a result of insufficient sample sizes, unobserved variables, or other failures in sample selection). Understanding these issues is important because they allow us to answer crucial questions for learning systems: Can predictions made for a given population, with data from one location, be used in others? Does collecting larger samples improve prediction accuracy for that first population? We first formulate functions describing fairness-generalizability tradeoffs across space, revealing their connections to hyperbolic geometry and theoretic experimental designs. We then consider 100 years of the American census (and all variables in the census) as a case study. For each cross-section (decade), we consider the important task of predicting economic growth for over 60K individual locations under increasing spatial samples. We demonstrate how (1) generalizability tradeoffs evolve across spatial levels, and (2) repeat the validation of generalizability limits derived in [27] for the spatial domain, and with the current census micro-data.

Corresponding Author's Contact Information: Andre F. Ribeiro, ribeiro@alum.mit.edu, Harvard University, Cambridge, Massachusetts, USA and University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil. This work is licensed under a Creative Commons Attribution 4.0 International License. © 2025 Copyright held by the owner/author(s). doi: 10.1613/jair.1.15675

Let S : X^m → [0, 1] describe any learning system or agent using an input sample X^m with m variables to derive a classification decision, S(X). Our central goal is to formulate how the generalizability of these systems changes
across space (i.e., to what extent a model assembled in one location will hold for others), and, thus, to identify a parametric functional form F that can describe accuracy bounds across possible S,

F(d_x0) = max_S { ACC( S; X[x0, d_x0] ) },   (1)

where ACC indicates the accuracy of models trained on samples X[x0, d_x0] encompassing all observations at distances less than or equal to d_x0 from x0. The specific way in which F changes across space offers limits and opportunities to algorithmic and agent learning systems and their performance. Strict bounds on the uncertainty of predictions afforded to algorithms or agents using local data can have a profound impact on the usefulness, scope, and quality of their strategic decisions or the recommendations they offer. A non-decreasing F, with increasing d_x0, indicates that S is able to learn a model that is accurate across locations, while a decreasing function indicates that learning fails to generalize across locations. A complementary issue is to what extent S would be able to identify the effect (or importance) of individual variables to prediction using the same sample. To illustrate these two fundamental sample characteristics, consider a set of binary attributes X_4 = {a, b, c, d} observed across US locations, such as, respectively, the recorded presence of crime, police stations, banks, and ice-cream shops. The behavior of these entities is possibly interconnected, which implies that any calculated statistic y ∈ R over a is in fact a statistic over y(a | b, c, d). Because of this, we say that a local sample has low External Validity (EV), since any changes in factors {b, c, d} (or the many other factors that can conceivably affect crime) can invalidate the statistic.
At the same time, because banks often appear together with ice-cream shops in commercial and affluent neighborhoods, we are not sure whether their presence plays any essential role when predicting crime incidence (i.e., whether they have only a spurious relationship to crime). Because of this, we say that such samples are confounded (CF). How can these two statistical issues be addressed and quantified? Comparison of crime incidence between this location and a second with banks but no ice-cream shops, everything else constant, would lend evidence to the claim that ice-cream shops are not driving crime up. These types of ideal what-if statements, where the effect on an outcome is observed under a single difference (while holding other factors constant), are often called counterfactual statements [23, 29]. In the hypothetical case where all such conditions can be observed, the problem of determining whether a factor is relevant to prediction is fairly easy to solve. These conditions have been formulated mathematically in the study of experimental designs [22]. The more challenging, and practical, aspect of this problem is, however, the case of unobserved conditions: often the relevant factors that change across locations are neither observed nor held constant. The issue of unobserved confounding is particularly serious for effect estimation, as effect statistics calculated from the sample might then also reflect the effect of exogenous or unobserved variation that cannot be easily discounted by typical regression and effect estimation methods [31, 23]. Although, under these conditions, effect observations are flawed or noisy counterfactual observations, we still refer to them as counterfactual observations for short.
We address these problems by studying spatial sample growth: we start with a sample with a single unit x0 (all conditions unobserved), and then progressively add other units to the sample at increasing distances d_x0 from x0, progressively decreasing its number of unobserved conditions. We consider ACC( S; X[x0, d_x0] ) across these samples, and, in particular, how EV and CF change across scales. The two previous problems reflect two key, but distinct, learning problems [18]: supervised out-of-sample prediction and factor effect (or importance) estimation. Supervised prediction focuses on making accurate predictions of an outcome in unseen data using a training sample, while factor effect estimation aims to measure the relevance of specific factors for prediction and model selection. These problems are, however, closely connected [18, 7], as complete and correctly specified models lead to accurate predictions. We will demonstrate that combinatorial properties of samples impact these two traditional problems differently and

Fig. 1. Can counterfactual effect observations made in one location be used in another (are they externally valid, EV)? Can their independent effects or relevance to prediction be distinguished from others (are they unconfounded, CF)?
(a) a π‘š-dimensional hypercube and Pascal triangle with π‘šrows portrays the full set of counterfactual effect observations with π‘šfactors in a sample π‘‹π‘š(π‘š= 3), a π‘š π‘šLatin-Square ( square ) portrays all effect observation backgrounds, more counterfactual effect observations increase guarantees over the generalizability and bias of samples effect observations, (b) serial and parallel interleaving of in and out of sample factors during sample growth and their expected sample sizes and growth rates, (c-d) sample sizes under EV-CF growth follow hyperbolic forms, with a half-golden rate of effect-to-background rate for the high-dimensional case, (e) illustration of Gilbert Shannon Reeds (GSR) shuffling as a mechanism to increase generalizability (EV) in partially-observed samples, and a shuffle-and-cycle strategy as an alternative which also guarantees decreases in factor effect confoundness (CF), (f) hyperbola for samples with distinct 𝑁 π‘Ž/𝑁+π‘Ž(left) and expected accuracy for samples with small or large numbers of unobserved factors (right). reveal tradeoffs across samples and locations. Fig.1(d) outlines the two central contributions of the proposed framework, which establish hyperbolic forms for F (𝑑π‘₯0) and consequent asymptotic limits to black-box accuracy. We first briefly formulatethese contributions. After this summary, we relate the approach to known black-box importance estimation and causal effect estimators, and, finally, formulate the proposed combinatorial-geometric relationships in detail. 2 Sample Counterfactual Observation Limits Let 𝑋= {π‘Ž,𝑏,𝑐, ..., [π‘š]} be a set of (observed or unobserved) binary factors1 characterizing a population π‘₯, π‘₯ [0, 1]π‘š, and 𝑦(π‘₯) be a measurement over the population, 𝑦(π‘₯) R. Behind many effect or importance estimation procedures is an experimental procedure (Sect. 
Related Work): add a to every variation of other populations; with each insertion, observe before-after outcome differences, Δy(a). Consider a sample growth process where we start with a fixed sample unit x0 and, as we observe each new unit, we also observe their differences from x0, with respect to both x and y. Starting with a population x0, each possible sample growth trajectory (e.g., x0, x1, x2, ...) is a temporally or spatially ordered observation of the impact of gaining, or losing, a set of factors on y. In other words, each step is a counterfactual effect observation for sample unit x0, Δy(x0 → x1) = y(x0) − y(x1) (where the former difference is over sets and the latter over scalars). Each effect observation is thus defined by: (1) a difference in factors (what changed, x0 − x1 ⊆ X^m), (2) an intersection of factors (what remained the same, x0 ∩ x1 ⊆ X^m), and, finally, (3) a difference in outcomes (an observed scalar effect, Δy(x0 → x1) ∈ R). For each trajectory and time, we can consider the biasedness and generalizability afforded by the accumulated samples to, for example, black-box predictors of y, and whether their performance is related to the increasing set of accumulated counterfactuals.

¹ where [m] is the m-th factor in a sample with variables X^m.

2.1 Sampling Effect-to-Background Ratios (ω)

The growth space for any collected sample is then the imaginary space that contains all of the conceivable ways in which we could have assembled X from any one of its individual sample units x0 (Sect. Combinations, Permutations and Partial Permutations). It is often only partially observed. This is a problem in samples where factors cannot be assumed to be (1) independent or (2) in-sample.
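The triple defining an effect observation can be made concrete with ordinary set operations. A minimal Python sketch, where the units, factor names, and outcome values are all hypothetical:

```python
# Each sample unit is a set of binary factors; y maps a unit to an outcome.
# Hypothetical toy units observed during sample growth from x0.
x0 = {"a", "b", "c"}   # e.g., crime, police stations, banks
x1 = {"b", "c", "d"}   # a nearby unit: no crime, but ice-cream shops

# Hypothetical outcome measurements y(x) for each unit.
y = {frozenset(x0): 0.8, frozenset(x1): 0.5}

def effect_observation(x0, x1, y):
    """A counterfactual effect observation for x0, as defined in the text:
    (1) what changed, (2) what remained the same, (3) the scalar effect."""
    difference = x0 - x1                    # factors lost going x0 -> x1
    intersection = x0 & x1                  # shared background factors
    delta_y = y[frozenset(x0)] - y[frozenset(x1)]
    return difference, intersection, delta_y

diff, inter, dy = effect_observation(x0, x1, y)
print(sorted(diff), sorted(inter), round(dy, 2))   # ['a'] ['b', 'c'] 0.3
```

Each growth step x0 → x1 thus yields one such triple; a trajectory accumulates them.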
That's because (effect) observations can then be contingent on (1) the order or history of the growth process, or (2) out-of-sample factors. In practice, two timescales determine the statistical properties of the sample growth process: the rate N+a at which individual factor differences are observed (i.e., when their counterfactual effects can be observed), and the rate N−a at which the backgrounds in which they are observed change. If the relationship between these two timescales is such that effects are observed under a large number of backgrounds, then we can be more confident about their generalizability (i.e., that effect observations will likely be reproducible in future backgrounds). On the other hand, if observations are such that different populations are observed under the same backgrounds, we can be more confident about their unconfoundedness (i.e., that observed effects reflect the same uncontrolled out-of-sample variation across all sample populations). More precisely, the background D−a ∈ D(X − {a}) of an effect observation of factor a is the instantaneous condition in which the effect is observed, Δy[+a | D−a]. We will consider definitions of D(X − {a}) where it corresponds to the set of all possible values over the set of other factors, D(X − {a}) = X − {a}, or their permutations², D(X − {a}) = Π(X − {a}), and where X contains both observed and unobserved factors. For illustration, imagine all the ways we can observe backgrounds across growth trajectories of the example in Sect. Introduction, X = {a, b, c, d}. In order to consider effect observations for a, we must insert this individual factor into all its possible 3! = 6 backgrounds observable during growth. After each insertion, we can then observe changes Δy(a) in the outcome-of-interest, y, to understand a's effect on y. Another way of saying this is that we need to keep a constant while cyclically permuting all other sample factors, Fig.1(a).
That is, we are called to observe the effect of a under the cyclic permutations of the d ≤ m − 1 factors present in the sample. For a factor a and a sample with m factors, this defines a map

τ(x_m) : {b, c, ..., [m−1]} → {c, ..., [m−1], b},   (2)

whose iterates τ(x_m), τ²(x_m), ..., τ^m(x_m) have a shifting action on the original permutation and generate the cyclic group³ of order m − 1 [11]. Whether we require only a to be inserted into all background values with factors other than a (a single cyclic background permutation τ(x)), or repeat this for all m sample factors, will change the guarantees we can make with respect to the EV and CF of effect observations. At the limit, these two cases correspond to 1 or m − 1 iterations of the recursive definition of a factorial, m! = m · (m − 1)!.

² where Π(X) is the set of all permutations of the set X.
³ The cyclic group generated by τ is related in an obvious way to the set of all permutations of m elements (i.e., to the S_m symmetric group), where the τ^i(x), i < m, correspond to one partial permutation, τ^i(x) ∈ S_m, for all⁴ x ∈ P(X) (Sect. Combinations, Permutations and Partial Permutations).
⁴ where P(X) is the power-set of X; we use capital letters to indicate counts over unique values.

Define, therefore, the variable N+x0 = |P(x0)| indicating the number of unique in-sample effect observations in x0, and D−x0(X) = |D(X − x0)| their out-of-sample backgrounds. The total number of counterfactual effect observations (i.e., all effects under different backgrounds) for the factors x0 is then given by

0 < N+x0 · D−x0(X) ≤ 2^|x0| · (m − |x0|)!,   (3)

(Sect. A Representation for Sample Effect Observations), which is large under the factorial-based definition of a "background", D−x0(X) ≫ N+x0(X).
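As an illustration (the factor names and toy dimensions are assumptions), the cyclic map τ of Eq.(2) generates only m − 1 of the (m − 1)! backgrounds for a single factor, while the bound of Eq.(3) counts all effect-background pairs:

```python
from itertools import permutations
from math import factorial

def tau(seq):
    """Cyclic shift of Eq.(2): (b, c, ..., [m-1]) -> (c, ..., [m-1], b)."""
    return seq[1:] + seq[:1]

# Backgrounds for factor 'a' in X = {a, b, c, d}: orderings of the others.
others = ("b", "c", "d")
cycles = set()
seq = others
for _ in range(len(others)):
    seq = tau(seq)
    cycles.add(seq)
print(len(cycles))                      # 3: one cyclic orbit of backgrounds
print(len(set(permutations(others))))   # 6: all 3! backgrounds

# Eq.(3): N+ * D- <= 2^|x0| * (m - |x0|)! for x0 = {a}, m = 4.
m, x0_size = 4, 1
n_plus = 2 ** x0_size                   # unique in-sample effect observations
d_minus = factorial(m - x0_size)        # factorial-based background count
assert 0 < n_plus * d_minus <= 2 ** x0_size * factorial(m - x0_size)
print(n_plus * d_minus)                 # 12
```

The gap between one cyclic orbit (m − 1 backgrounds) and the full factorial set is exactly what distinguishes the two limiting cases discussed above.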
A key question then becomes: as sample dimensions grow, what is the asymptotic number of backgrounds that effects will typically be observed under?

2.2 Unobserved-Observed Combinatorial Shuffling (Accuracy)

To illustrate how models trained on samples with the previous characteristics can have their accuracy constrained, we can relate the previous growth process to a traditional combinatorial randomization scheme such as Gilbert-Shannon-Reeds (GSR) shuffling [10, 11]. As before, start with a sample containing only one population x0 ∈ X^m (a set of binary attributes). Let a population with the same attributes as x0 be represented by the string containing only 1 values (for factors a, b, c, ...). A string containing differences can then be written as σ(x0) = (σ1, σ2, σ3, ...), where σt ∈ {0, 1}. An in-sample counterfactual effect observation corresponds to measuring outcomes before and after applying the operation in Eq.(2), τ(σt, σt+1, σt+2, ...) = (σt−1, σt, σt+1, ...). We can use the same representation for out-of-sample, or yet unobserved, factors, leading to a second sequence σ−1, σ−2, σ−3, ... until we observe all factors. Sample growth can then be represented as the bi-directional string with positive indices in-sample and negative indices out-of-sample, σ = (..., σ−2, σ−1, σ0, σ1, σ2, ...). In an increasing spatial sample, the current scale corresponds to the zero-index string position. As in the GSR scheme, this bi-directional string may be represented, in turn, by two real numbers 0 ≤ x, z ≤ 1,

x(σ) = Σ_{t=0}^{∞} σ−t 2^−(t+1),  z(σ) = Σ_{t=0}^{∞} σt+1 2^−(t+1).   (4)

The shifting action τ on the two strings is known as a dyadic transformation, which can be thought of as a folding or shuffling operation between them - mapping each x to a distinct z in each iteration. In this case, the transformation is between a set of factors, x0, and their possible backgrounds, D(X − x0).
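The encoding of Eq.(4) and the dyadic action of the shift can be sketched with finite truncations of the two sums (the string values below are hypothetical, and the shift is modeled as moving the boundary factor from the unobserved half to the observed half):

```python
def encode(sigma_out, sigma_in):
    """Eq.(4): map the out-of-sample and in-sample halves of the string
    sigma to reals x, z in [0, 1] via binary expansions."""
    x = sum(bit * 2 ** -(t + 1) for t, bit in enumerate(sigma_out))
    z = sum(bit * 2 ** -(t + 1) for t, bit in enumerate(sigma_in))
    return x, z

def shift(sigma_out, sigma_in):
    """One shifting action: the factor at the boundary crosses from the
    out-of-sample half into the in-sample half."""
    return sigma_out[1:], (sigma_out[0],) + sigma_in

# Hypothetical bi-directional string: 3 unobserved and 3 observed factors.
out_half, in_half = (1, 0, 1), (1, 1, 0)
x, z = encode(out_half, in_half)
print(x, z)     # 0.625 0.75

out_half, in_half = shift(out_half, in_half)
x2, z2 = encode(out_half, in_half)
print(x2, z2)   # 0.25 0.875 -- x is doubled mod 1, z halved plus the new bit
```

In this truncated sketch the shift acts exactly as a dyadic (doubling) map on x, x ↦ 2x mod 1, while the carried bit is folded into z, which is the shuffle-like behavior described above.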
The right diagram in Fig.1(e) illustrates the result of a GSR shuffle for a 6-letter example with half the variables observed. With each shuffle, the effect of each in-sample factor {a, b, c} (colored bars) is observed under a different background (e.g., the effect of a is observed under the background of d, and that of b under e). We would therefore expect the generalizability (EV) of effect observations to increase with each operation. There are, however, two practical problems with this scheme. First, since each factor effect is observed in a different background, each effect observation is confounded by a different factor (e.g., the observed effect of a reflects the influence of the unobserved factor d). Second, GSR shuffling only shuffles factors under a 1-step Markovian assumption, disregarding higher-order and non-additive factor interactions. Fig.1(e) illustrates an alternative shuffle-and-cycle scheme, where each shuffle step is followed by m cyclic permutations τ, τ², ..., τ^m of in-sample factors. This alternative has three advantages: (1) each factor is observed under every background, and thus effects are confounded in equal proportion across sample populations (allowing us, for example, to factor out these effects more easily [27]), (2) every effect is now observed for the same permutation of unobserved factors ({d, c, e} in the figure), and thus generalizability increases at a common rate for all factors and populations, and without Markovian assumptions, (3) this is a limiting process for every sample (Sect. Combinations, Permutations and Partial Permutations).

2.3 Hyperbolic Geometry (Sample Sizes)

To better understand the relationship between sample sizes and out-of-sample performance, we need to consider what sample sizes are required to generate all ways of interleaving a sample's in and out factors.
Fig.1(b) outlines two equivalent sample growth patterns that achieve this ("serial", "parallel"). On the left, each new background is interleaved with all previous effects serially, analogous to the previous factorial scheme. A first out-of-sample factor, d ∈ X − x0, requires observation of N+x0 = 3 effects, x0 = {a, b, c}; a second requires N+x0 + 1 effect observations, x1 = {a, b, c, d}; etc. In the example on the right, several out-of-sample factors are interleaved with several in-sample factors simultaneously, analogous to a GSR shuffle. Effects for x0 are observed, at first, under backgrounds {d, e, f}, then {f, d, e}, etc. Both strategies lead to a geometric series of distinct backgrounds for individual effects, but at different rates. Sample sizes in both cases can be described by the hyperbola of Eq.(5), which expresses that, for each new in-sample factor, N+x0, we are able to re-measure their effects in each new unobserved background, D−x0, thus increasing their generalizability (EV). The quantity ω describes, in turn, the speed at which EV is expected to increase for individual in-sample populations, Fig.1(c). Sample sizes (as opposed to effect background counts) follow known exponential, (1 − 1/m)^d → e^−1, and binomial, 2^−1, growth rates in these two cases, 0 < d ≤ m(m − 1), Fig.2(b). When all m variables are observed across the same number of backgrounds (i.e., in balanced samples⁵), samples following Eq.(5) have sizes, n_x0, increasing according to a Fibonacci sequence, n_x0 = N+x0 + D−x0, and thus asymptotically assume half-golden background-to-effect rates involving the golden ratio φ, Eq.(6). In conclusion, these equations describe systems that permute effect observations, but whose number of effect observations is limited to under-factorial sample sizes.
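Under the balancedness assumption above, the sum recursion n = N+ + D− is Fibonacci-like, and the limiting ratio between the two counts is a golden-ratio constant. This limiting behavior can be checked numerically (the seed values below are arbitrary):

```python
# Balanced growth: with each step the previous background count becomes
# the new effect count, and the new background count is their sum,
# n = N+ + D-  (a Fibonacci recursion).
n_plus, d_minus = 1, 1          # arbitrary seed counts
for _ in range(40):
    n_plus, d_minus = d_minus, n_plus + d_minus

phi = (1 + 5 ** 0.5) / 2        # golden ratio
print(round(n_plus / d_minus, 6))       # effect-to-background -> 1/phi ≈ 0.618
print(round(d_minus / n_plus - phi, 6)) # background-to-effect -> phi (diff -> 0)
```

Whatever the seeds, the recursion drives the two counts toward a fixed golden-ratio proportion, which is the asymptotic regime the half-golden rate of Eq.(6) refers to.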
Samples following half-golden background-to-effect ratios, Eq.(6), observe effects across approximately the same number of backgrounds across all their populations, leading to effect estimates that have, simultaneously, increasing EV and small CF sustained throughout sample growth. Fig.1(f) illustrates the expected relationship between the maximal accuracy of samples, ACC( S; X[x0, d_x0] ), and sample sizes. The diagram shows sample size divided by square size vs. ACC (Sect. A Representation for Sample Effect Observations). In complete systems (top curve), the observation of one square (the full set of sample effect observations) is enough to generate accurate effect observations. In incomplete systems (bottom curve), the number of effect observations necessary for accuracy can grow factorially with the number of unobserved factors - requiring very large samples to achieve similar levels of accuracy. This is a conservative estimate, which can be abated by increases in periodicity and independence in out-of-sample factors. In real systems, where factors are typically highly correlated, it seems to be a reasonable upper bound for accuracy. We demonstrate these alternative patterns of sample growth in Sect. Results (EV vs. EV-CF) using descriptive statistics (sample factor distributions, extreme-value distributions, rankings, and autocorrelation) and large-scale black-box prediction of economic growth (multiple state-of-the-art supervised methods), calculated across samples ranging from the street to the national spatial level, and over 100 years of the US census.

⁵ This is analogous to notions of balancedness in experimental designs [22] but requires milder conditions than equal-size populations, being observable in multi-frequency and multi-scale processes, and being observed in real-world systems, as demonstrated in Sect. Results.
3 Related Work

The Shapley value [28] has become an essential tool across disciplines to estimate the importance of variables from the output of black-box systems (i.e., whose inputs can be manipulated exhaustively at will) [21, 3, 7]. The value can be interpreted as the enumeration of all counterfactual effect observations in a fully-observed system. This makes the Shapley value an instance of a U-statistic and a permutation-based statistic [20, 13]. The value φ(a) was devised first to quantify the importance of a given player a in an m-player game, and it can be written as

φ(a) = (1 / (m − 1)!) Σ_{π ∈ Π(X−{a})} [ y(P^π_a ∪ {a}) − y(P^π_a) ],   (7)

where Π(X − {a}) enumerates all permutations π of a set of size m − 1, y is a game utility measure, and P^π_a is a possible coalition⁶ among players (not including a), formed in the order π. Each bracketed quantity is a counterfactual observation of the effect of a (i.e., under all distinct backgrounds and their orderings). Eq.(3) counted the number of such observations for each population in a sample. The value is an ideal, as its calculation is NP-complete [5] and, when quantifying variable importance, it assumes there are no unobserved causes in the sample. Due to sample correlations, this equation cannot be used with random sampling either. Despite these shortcomings, the Shapley value formulates crucial relationships among permutations of inputs when calculating sample statistics and their unbiasedness or fairness [7, 3]. Calculating the expected number of permutations that can be enumerated in samples, or locations, as proposed here, should thus set strict bounds for their unbiasedness, and offer a finer-grained illustration of these relationships. While the relationship between the Shapley value and the fairness of black-box predictors is known [28, 21], its relation to generalization is perhaps more surprising [27].
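The permutation form of the value can be computed by brute force for small m (its NP-completeness makes this infeasible at scale, as noted above). A sketch with a hypothetical 3-player utility:

```python
from itertools import combinations, permutations

def shapley(players, y):
    """Exact Shapley value: average each player's marginal contribution
    over all orderings in which coalitions can form."""
    phi = dict.fromkeys(players, 0.0)
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for a in order:
            before = y[frozenset(coalition)]
            coalition.add(a)
            phi[a] += y[frozenset(coalition)] - before  # counterfactual step
    return {a: v / len(orders) for a, v in phi.items()}

# Hypothetical utility: a alone adds 1, b alone adds 2, and the pair
# {a, b} adds a further 1 (an interaction); c adds nothing.
def utility(s):
    return 1 * ("a" in s) + 2 * ("b" in s) + 1 * ("a" in s and "b" in s)

y = {frozenset(s): utility(s) for r in range(4) for s in combinations("abc", r)}
phi = shapley("abc", y)
print(phi)   # {'a': 1.5, 'b': 2.5, 'c': 0.0}
```

Note that the interaction term is split evenly between a and b (each receives 0.5 on top of its solo contribution), and the values sum to the grand-coalition utility, 4.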
A quantity that becomes central to the formulation of accuracy bounds, Eq.(1), and of tradeoffs between the two previous learning problems, is the growth rate, ω, in enumerable permutations across a system's spatial levels. Because the Shapley value cannot be calculated in practice, random sampling is often employed as an approximation in Shapley-based importance ranking [21]. The notion of sample squares (Sect. A Representation for Sample Effect Observations) was devised from the sample's set of observed permutations, with which effects can be calculated without assumptions of independent and identically distributed factors. Using square sampling is advantageous not only for samples with factor correlations, but particularly for incomplete samples, where, as formulated, assumptions of large random sampling become unrealistic due to their factorial requirements on effect observations. These gains were demonstrated in [27] and are revisited in Sect. Results. Furthermore, the theoretic relationships here elucidate sampling size requirements and EV-CF tradeoffs and limits for non-parametric variable importance and effect estimation. A key element of the previous solution is that sample units in the same location share a large number of unobserved ("external") and uncontrolled factors. Studying samples with increasingly inclusive and distinct counterfactual effect observations can reveal conditions for effect generalization across samples and space. Consider the single-factor case. Let X − {a} be the set of external factors for population {a}. In a random sample with a single treatment indicator a, it follows that IE{ p[ a = 1 | D(X − {a}) ] } = 2^−1, as, every 2 time intervals, we are expected to rebalance (random variables are bold). This is the rationale underlying, for example, Randomized Control Trials [22]. A location with this property has a single balanced population, {a}, and common external factors, X − {a}.
We can alternatively say that p[ a = 1 | D(X − {a}) ] = 0.5, or, a ⊥ (X − {a}) | D(X − {a}), which are typical non-confounding conditions [29, 26]. If units in the location share the same uncontrolled factors and have the same number of members with a as without a, then expected outcome differences between them correspond to a's effect, conditional on the common variation, IE[ Δy(a) | x0 = D(X − {a}) ] = y(x0 + {a}) − y(x0 − {a}). Learning systems and agents in such locations operate with fair estimates of a's impact (albeit with low EV). In a square, in contrast, all m sample factors are balanced simultaneously (m > 1).

⁶ a player set describing a possible cooperation structure in the game, with value y(P^π) when formed.

Fig. 2. (a) an m × m Latin square ("square") for a sample unit x0 (m = 4) and the combinatorial relationships among the sample units placed across its different cells (Venn diagram; factor intersections are shown in grey and singleton differences in color), (b) Binomial (1/2), Fibonacci (1/φ), and Exponential (1/e) rates across squares lead to hyperbolic relations among population sizes; each square's triangle altitude (dashed) is related to the sample's effect-to-background rate, ω, (c) a sample population sweep for factor a, where the rate of insertion of a into populations is held constant; the figure illustrates population sizes as dots and three phases of sample growth: initial (no population has a), balanced (the same number of populations have and don't have factor a), and possibly selected (where all populations have a).

While single-factor balance requires a binomial series, balancing several requires Fibonacci - i.e., square altitude expansion (Sect.
Sample EV-CF Growth Patterns). Each population, in this case, asymptotically follows the half-golden sample size rates of Eq.(6). Square accumulation thus increases the EV of all its populations simultaneously [27] - making squares useful, for example, to understand sample accuracy limits across scales, Eq.(1).

4 The Combinatorial-Hyperbolic Relationship in Sample Growth

4.1 A Representation for Sample Effect Observations

The set of all counterfactuals accumulated by sample growth at one instant can be visualized with a Latin square ("square"), Fig.2(a). Fig.1(a) illustrates two standard ways of visualizing the complete set of 2^|x0| in-sample effects: as the number of edges of a hypercube of dimension |x0|, or as the sums of the |x0|-th row of Pascal's triangle. The square will serve, in addition, as the basis for non-parametric effect estimates across sample factors. For a fixed unit or population x0, it represents a stratification, or placement, of all other populations, x_i, across square cells, with repetition. The completeness or incompleteness of squares, for each x0, will have a stipulated impact on the EV or CF of their effect observations. In particular, for m factors (a, b, c, ..., [m]) the square enumerates all singleton effect observations possible from the sample's m-way effect differences. Its first column contains counterfactual effect observations for {a, b, c, ..., [m]} (i.e., conditioned on all other m − 1 factors being the same as, or "overlapping with", x0). The second column contains singleton effect observations possible from the previous observations (with size-1 difference and m − 2 intersection with x0). These effect observations are thus conditioned on one further factor observation (i.e., on the factor in the preceding column). The third column contains singleton effect observations possible from the previous observations (size-1 difference and m − 3 intersection), etc.
Fig.2(a) illustrates these combinatorial patterns with Venn diagrams for each cell, where a cell's pairwise intersecting factors are grey and singleton differences are colored. This iterative procedure enumerates all possible singleton effect observations in a sample. The square diagram shows only the singleton effects (cells), with their conditioning factors implied. Fig.1(a) illustrates that the square contains all in-sample backgrounds (each a Pascal's triangle) for a fixed factor a, and thus 2·2^{m−1} unique effect observations of a. Each of its diagonals contains all observations for the effect of a given sample factor. Taking columns to mark time progression, the main diagonal thus marks the point of insertion of factor a (e.g., insert a in populations {d}, {d,c}, {d,c,b}). Its lower triangle records the populations without a (with size N_{−a}) and the upper triangle those with a (with size N_{+a}). The square of order m×m, as a whole, contains effect observations where all factors are observed under all m-cycles of a fixed permutation (e.g., {a,b,c,d} in Fig.2(a)). Squares of increasing orders thus capture effect observations under increasing Markovian orders (i.e., conditioned across larger times or backgrounds). The relationship of sample permutations to measurement unbiasedness is a cornerstone of the most widely accepted theory of non-parametric statistics, U-Statistics [20, 13], and of Shapley-value-based estimates of variable importance (Sect. Related Work). The relationship to generalizability has been discussed in [27], and is reviewed, and expanded, below. The full set of Σ_{d=0}^{m} C(m,d) = 2^m effect observations observable in a sample of dimension m collects one square for each of its sample populations, x_i ∈ X^m.
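As a sanity check on these counts, the singleton effect observations of one factor can be enumerated directly: one counterfactual pair per "background" subset of the remaining m−1 factors. This is a minimal sketch under that reading, not the paper's square-placement procedure, and the factor names are illustrative:

```python
from itertools import combinations

def effect_observations(factors, a):
    """Enumerate singleton effect observations of factor `a`: one pair
    (background with a, background without a) per subset of the other
    m-1 factors (the in-sample 'backgrounds')."""
    rest = [f for f in factors if f != a]
    obs = []
    for d in range(len(rest) + 1):
        for background in combinations(rest, d):
            obs.append((frozenset(background) | {a}, frozenset(background)))
    return obs

factors = ["a", "b", "c", "d"]              # m = 4
obs = effect_observations(factors, "a")
m = len(factors)
assert len(obs) == 2 ** (m - 1)             # one effect pair per background
# total populations (cells of the Venn structure): sum_d C(m,d) = 2^m
assert sum(1 for d in range(m + 1)
           for _ in combinations(factors, d)) == 2 ** m
```

With m = 4 this yields 8 effect pairs for a, i.e. 2·2^{m−1} = 16 unique observations of a, matching the count in the text.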
It suggests, then, a natural sample limit for the generalization of effects. Fig.1(c) illustrates the resulting phases for samples under growth with a factor a: no unit includes a (initial), as many units include a as not (balanced), and all include a (selected). Eq.(5) should hold across all such scenarios (N_{+a}, N_{−a} > 0).

4.2 Combinations, Permutations and Partial Permutations

The statistical concept of a population is often associated with combinatorial combinations, as a set of sample units with a given attribute combination (e.g., high-income white males). There are thus C(m,d) = m!/(d!(m−d)!) populations of size d. A problem with this definition is that it leaves all non-population factors unspecified. To define a population we imagine, instead, that we fix the d population factors and vary (i.e., permute) all m−d non-population (i.e., external) factors. This leads to a combinatorial structure known as a partial permutation. The number of partial permutations for a population of size d is C(m,d)·D_{m−d}, where D_{m−d} is the number of derangements (permutations without overlaps). The full set of m! permutations of size m, and all sample growth trajectories, can then be formulated as sets of partial permutations, using a well-known definition for factorials,

m! = [cosh(m−1) + sinh(m−1)] [cosh(m−1) − sinh(m−1)] · m (m−1)!,   (10)

where the bracketed product, e^{m−1} e^{−(m−1)}, equals one. The term C_m = Σ_{d=0}^{m} C(m,d) in Eq.(9) corresponds to a single Pascal's triangle and half-square (i.e., one set of all differences) for each sample population, and Eq.(9) to all squares. The number of observed permutations in a sample can thus be specified succinctly by its number of squares and their derangements. Samples with no missing variables require the observation of few derangements (no relevant exogenous variation) for accurate effect observations, while incomplete samples require the observation of many derangements [27].
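The partial-permutation accounting can be checked numerically: every permutation fixes some set of points and deranges the rest, so the counts C(m,d)·D_{m−d}, summed over population sizes d, must recover all m! permutations. A small sketch (helper names are our own):

```python
from math import comb, factorial

def derangements(n):
    # D_0 = 1, D_1 = 0, D_n = (n-1)(D_{n-1} + D_{n-2})
    if n == 0:
        return 1
    if n == 1:
        return 0
    d0, d1 = 1, 0
    for k in range(2, n + 1):
        d0, d1 = d1, (k - 1) * (d0 + d1)
    return d1

def partial_permutations(m, d):
    """Fix d 'population' factors, derange the m-d external ones."""
    return comb(m, d) * derangements(m - d)

assert [derangements(n) for n in range(6)] == [1, 0, 1, 2, 9, 44]
m = 6
assert sum(partial_permutations(m, d) for d in range(m + 1)) == factorial(m)
```

The final assertion is the identity m! = Σ_d C(m,d)·D_{m−d}, i.e., the full permutation set decomposed into partial permutations.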
The odd and even parts of the Taylor expansion of Eq.(9) lead to the hyperbolic trigonometric functions of Eq.(10) (proof in the Supplementary Material). They indicate the period in which full sets of permutations are collected. The parametric equations for the hyperbola's right branch in (x,y) Cartesian coordinates, Eq.(5), are x = ω cosh(N_{+a}) and y = ω sinh(N_{−a}), Fig.1(f). We will see that these quantities are related to in-sample effect-to-background sampling rates, ω. This quantity will be essential to describe statistical tradeoffs across growing samples.

4.3 Sample EV-CF Growth Patterns

According to the previous, an unbiased effect estimate for a is an average across possible effect backgrounds, and constitutes a U-Statistic [20]. There are F_{m,n} = Σ_{d=0}^{m−1} C(m−d, d) such sequential observations⁷. To generate all of them, we need to fix each effect observation's first, second, third, etc. factors in order. F_{m,n} corresponds to the sum of the number of observations necessary to fix any first factor, C(m−1, 1) = m−1, then C(m−2, 2) to fix a second from the remaining, etc., until all m−1 factors are used. The relationship in Eq.(5) corresponds to the Cartesian equation of a rectangular hyperbola, N_{+x0} · N_{−x0} = c, where c is constant (although well known, this is formulated in the Supplementary Material for completeness). According to the previous, these two quantities have different limits, however: N_{+x0} ∈ [1, C_m] and N_{−x0} ∈ [1, F_{m,n}]. The relationship can thus describe sample limits in large populations by substituting N_{+x0} = C_m and N_{−x0} = F_{m,n} in Eq.(5). As formulated next, the same result can be established from known rates across Pascal's triangle. The two previous quantities, C_m and F_{m,n}, appear in Pascal's triangle (adjacent side and altitude), Fig.2(b).
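The two Pascal's-triangle quantities can be illustrated directly. In the sketch below, C is the row sum (binomial rate) and F is taken, as a reading of the altitude description above, to be the shallow-diagonal sums of Pascal's triangle, which produce Fibonacci numbers; the constant-product property of the rectangular hyperbola is also checked:

```python
from math import comb

def C(m):
    """Row sums of Pascal's triangle: C_m = 2^m."""
    return sum(comb(m, d) for d in range(m + 1))

def F(m):
    """Shallow-diagonal sums of Pascal's triangle (Fibonacci numbers);
    comb(m - d, d) is 0 when d > m - d."""
    return sum(comb(m - d, d) for d in range(m + 1))

fib = [1, 1]
for _ in range(12):
    fib.append(fib[-1] + fib[-2])

assert all(C(m) == 2 ** m for m in range(12))    # binomial rate
assert all(F(m) == fib[m] for m in range(12))    # Fibonaccian rate
# rectangular hyperbola: points (x, y) with x * y = c, a constant
c = 24
pts = [(x, c / x) for x in (1, 2, 3, 4, 6)]
assert all(abs(x * y - c) < 1e-9 for x, y in pts)
```

The exact form of F_{m,n} in the paper depends on its Eq.(9), which is not reproduced here; the sketch only verifies the binomial and Fibonaccian rates the text attributes to the triangle's side and altitude.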
Since the main diagonal marks a's possible "times-of-insertion", the square's upper triangle contains the set of all counterfactuals with a, and the lower, without a, Fig.2(b). With respect to effect observations, we say that each individual effect observation is observed under F_{m,n} in-sample backgrounds for each derangement D_n (or twice this value in balanced samples). The sample effect-to-background enumeration rate ω, at time t, is thus defined as ω = F_{m,n}(t)/D_n(t), or, the number of in-sample background observations, F_{m,n}(t), per derangement, D_n(t), across all populations in the sample. The growth of C_m, D_n and F_{m,n} for N_{+a} or N_{−a} assumes Pythagorean relations⁸, Fig.2(b); Eq.(11) suggests the visualization of sample growth as hyperbolae⁹ with increasing radius D_n, Fig.1(f). In this limiting expression of Eq.(5), C_m corresponds to all possible individual sample populations, N_{+x0} and N_{−x0}, and F_{m,n} to in-sample effect-observation backgrounds. Fig.1(f) shows the hyperbolic asymptotes C_m = F_{m,n} and C_m = −F_{m,n} (dashed). They represent growth with constant EV, D_n = 0. The figure also shows the asymptotic sample (vertical black line) where exactly all observations have factor a, F_{m,n} = 0. Under this condition, no estimator, algorithm, or agent is able to estimate a's effect non-parametrically. It represents the sample with minimum EV, while outward hyperbolae represent samples with increasing EV. Unique background-count growth in this direction follows a Fibonacci series, whose rate is the Golden number. It is well known that the rows, columns and diagonals of Pascal's triangle are associated with binomial, exponential and Fibonaccian rates. Notice then that D_n/C_m ∈ [1/e, 1/2], as growth can range between C_m^{1/m} = 2 and D_n/n! = 1/e, Fig.2(b). The first is due to C_m = 2^m, and the second was famously established by Euler [30].
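The three limiting rates invoked here, binomial, exponential (Euler's derangement limit), and Fibonaccian (golden ratio), can each be confirmed numerically with a few lines:

```python
from math import e, factorial, isclose

def derangements(n):
    # D_0 = 1, D_1 = 0, D_n = (n-1)(D_{n-1} + D_{n-2})
    if n < 2:
        return 1 - n
    d0, d1 = 1, 0
    for k in range(2, n + 1):
        d0, d1 = d1, (k - 1) * (d0 + d1)
    return d1

# binomial rate: row sums C_m = 2^m double at each step
assert 2 ** 21 / 2 ** 20 == 2.0
# exponential rate (Euler): D_n / n! -> 1/e
assert isclose(derangements(15) / factorial(15), 1 / e, rel_tol=1e-9)
# Fibonaccian rate: successive diagonal sums grow by the golden ratio
fib = [1, 1]
for _ in range(40):
    fib.append(fib[-1] + fib[-2])
phi = (1 + 5 ** 0.5) / 2
assert isclose(fib[-1] / fib[-2], phi, rel_tol=1e-9)
```

These are standard facts about Pascal's triangle and derangements; the code is illustrative of the rates only, not of the sample-growth process itself.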
In the previous nomenclature, the first is associated with balanced or Unconfounded (CF) growth, and the second with EV sample growth. The golden ratio is associated, in contrast, with high-dimensional balanced growth of samples and populations, EV-CF growth, and with squares, Eq.(11).

⁷ C(m−d, d) = 0 when d > m−d.
⁸ The equation uses the Pythagorean theorem in its reciprocal form, as it includes the triangle's altitude.
⁹ The equation for a hyperbola is (x/a)² − (y/b)² = r, with a and b its vertices and r its radius.

More specifically, squares are associated with the assumption that C_m/F_{m,n} is constant across factors (i.e., hyperbolae with constant radius), Eq.(11). It indicates that factors' diagonals are the same size, and the population structure is, overall, a "square". Finally, the known hyperbolic relationship

tanh(n) = sinh(n)/cosh(n) = ω⁻¹,   (12)

suggests expressing sample effect-to-background enumeration rates ω in terms of tanh(n)¹⁰. Also note that this definition of ω coincides with that of the Lorentz factor γ [8, 14], best known as a time correction between frames-of-reference in the physical sciences. Here, it preserves frequency relations among factors, N_{+a}/N_{−a}, under changes of basis of the type x = x0+{a} and x = x0−{a}. As suggested by Borel [4], it is natural to think of the transformation as a hyperbolic rotation (analogously to the usual trigonometric one). We will illustrate many of these mathematical abstractions using real-world spatial data in Sect. Results.

5 Results

We will now illustrate the formulated combinatorial and statistical generalizability limits in an important real-world problem: out-of-sample economic growth prediction across increasing spatial extensions (i.e., samples with increasing census individuals).
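Before turning to the data, the hyperbolic rotation of the previous section can be sketched as a boost that preserves the quadratic form x² − y², the analogue of preserving the frequency relation N_{+a}/N_{−a} under the change of basis. This is illustrative code for the standard hyperbolic rotation, not the paper's construction:

```python
from math import cosh, sinh, isclose

def hyperbolic_rotation(x, y, n):
    """Rotate (x, y) by hyperbolic angle n (a 'boost'); the Lorentz-like
    factor is cosh(n). Since cosh^2 - sinh^2 = 1, the quadratic form
    x^2 - y^2 is invariant under the rotation."""
    return (x * cosh(n) + y * sinh(n),
            x * sinh(n) + y * cosh(n))

x, y = 5.0, 3.0
for n in (0.1, 0.5, 1.3):
    xr, yr = hyperbolic_rotation(x, y, n)
    assert isclose(xr * xr - yr * yr, x * x - y * y, rel_tol=1e-9)
```

Just as an ordinary rotation preserves x² + y², the hyperbolic one preserves x² − y², which is why it is the natural transformation for quantities on a fixed hyperbola.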
The data used encompass microdata of the American decennial censuses from 1840 to 1940, and approximately 65 billion individual-level records. This time range corresponds to the decades of American urbanization. We consider the economic and demographic changes as we go, spatially, from the household spatial level, d0 in lat-lon distances, all the way to the national level, for each studied year. We thus create samples with units at arithmetically increasing levels, d_{t+1} = d_t + Δd (starting from d0). We repeat this for approximately 60K American locations, x0. Each full spatial analysis is then reproduced independently across years (thus avoiding issues related to extended longitudinal data). Fig.3(b, top-right) shows two locations in New York City, which share a large amount of external variation (i.e., economic and demographic variation across the rest of the country). The resulting nation-wide transversal captures combinatorial patterns of population differences and overlaps in samples, for all x0, as we increase scale. Our main goal is to illustrate how, consequently, generalizability changes across spatial scales, according to the stipulated model and limits. We first study sample correlations and sizes, demonstrating that they follow the previous hyperbolic relationships. We then repeat previous out-of-sample prediction tasks with this new census data and increasing spatial levels, thus adding to previous evidence presented for a combinatorial counterfactual model of sample generalizability [27].

5.1 Descriptive Statistics (Sample Sizes and Correlations)

We illustrate the consequences of Eq.(12) for sample properties using autocorrelation functions (correlations) and hyperbolic co-tangent (coth) regressions (sample sizes) in large-scale census data. These considerations will be key to solving our main problem, Eq.(1), as the accuracy of agents and algorithms operating in samples is directly determined by sample sizes and their combinatorial patterns [27].
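The arithmetically growing samples d_{t+1} = d_t + Δd can be sketched as nested distance filters around a focal location. This is a toy planar version (plain Euclidean distance on made-up coordinates); the actual pipeline's geodesic distance computation and data structures are not specified here:

```python
from math import hypot

def growing_samples(points, x0, d0, delta, levels):
    """Nested samples at arithmetically increasing radii d_t = d0 + t*delta
    around a focal location x0 (toy lat-lon treated as planar)."""
    samples = []
    for t in range(levels):
        d = d0 + t * delta
        samples.append([p for p in points
                        if hypot(p[0] - x0[0], p[1] - x0[1]) <= d])
    return samples

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.3), (0.5, 0.5), (2.0, 2.0)]
samples = growing_samples(points, (0.0, 0.0), 0.05, 0.25, 4)
sizes = [len(s) for s in samples]
assert sizes == sorted(sizes)            # samples are nested: sizes grow
assert sizes[0] == 1 and sizes[-1] == 4  # household level .. wider levels
```

Each level's sample contains all previous ones, which is what lets the same location x0 be compared across its own growing spatial extensions.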
Economic distribution across space can be described by the primary occupation and industry of all census individuals [1, 16] (e.g., "carpenter" or "executive assistant"). We start with this set of variables, and discuss the full set of variables, including non-economic ones, in the next section. Fig.3(c) illustrates empirical frequencies for all occupations (each a curve) at 4 different spatial levels in Massachusetts (MA) and New York (NY), 1880. They were the country's economic centers until the 19th century. The distribution has the familiar shape of a wave that moves to the left. New York reaches a stationary shape at a lower level d_sq. We demonstrate that these correspond to levels where squares are completed across factors.

¹⁰ With n = arctanh(ω⁻¹) = arctanh(D_n/F_{m,n}), which lets n be the number of accumulated derangements per fixed F_{m,n} (i.e., per square), as expected.

Figure 3. (a) sinh and cosh functions, (b) increasing spatial levels at two example locations (national and city levels), (rightmost panel) finest spatial level for New York City, (c) occupation frequency ranks vs. location across 4 example scales, each curve an occupation, (d) enumerated Latin-Square histograms for Massachusetts and New York, the latter has a square with almost all occupations, (e) periodogram of cosh(100)+sinh(100), sinh(100) and cosh(100), (f) per-occupation periodogram and series example from (c), (g) auto-correlation vs.
spatial-level curves trace catenaries (free-hanging ropes) for each occupation (1880); the probability distribution of their slack (red, side panel) indicates a fixed ω per factor at 0.81 (red horizontal line), Eq.(13), (h) standardized catenaries across all years; boxplots (red, side panel) show slack invariance and a constant ratio between sinh and cosh growth for all locations, years and occupations; m(1 − 1/e) (red vertical line) is a fixed point in binomial-exponential (EV-CF) to exponential (EV) rate transitions.

All squares in a location can be enumerated through an expensive computational procedure [27]. Fig.3(d) shows histograms, where each color corresponds to one of 220 occupations. NY has a spatial square that extends to almost all occupations, while MA has missing occupations (horizontal gaps) in comparison.

5.1.1 A Hanging-Rope Model for Unbiased Sample Growth. The catenary is a hyperbolic curve with a long scientific history, describing a free-hanging rope [9]. Unlike circles and geodesics, catenaries are sums of exponentials. Their equation in (x,y) Cartesian coordinates is y = cosh(x), and their length is l = sinh(x), making them useful to demonstrate the previous model, Eq.(12), and increases in enumerable permutations across spatial levels. We demonstrate that both the observed shape, Fig.3(h), and parameters, Fig.3(h, boxplots), of spatial correlations follow predictions from the previous model. Before considering catenaries, however, Fig.3(a,e) illustrates the overall shape of the previous hyperbolic functions and their frequency-based representations. Fig.3(a) depicts cosh(n) and sinh(n), and Fig.3(e) the periodogram of cosh(n)+sinh(n), sinh(n) and cosh(n), where n = 100. Fig.3(f) illustrates the empirical periodogram of the curves in Fig.3(c), which resembles the simulated one.
Fig.3(g) shows the auto-correlation function (ACF) for all locations across 5K spatial levels (such as those illustrated in Fig.3(c)), up to the state level (Sect. Methods, Supplementary Material). They trace catenaries. The horizontal line y = 1.0, of unitary correlation, is associated with the limit F_{m,n} = 0 where, despite the increasing scale, no population differences are added to the increasing samples. Each single catenary is a set of samples with constant C_m/F_{m,n}, which is a defining property of squares, Eq.(11). Fig.3(g) illustrates 4 typical cases among states. Plots for all states are available in the Supplementary Material. Maryland has linear decreases in auto-correlation. From 1840, the USA economy and cities became increasingly interdependent. After 1900, no state any longer had such linear correlation signatures. Periodic and linear (zig-zag) auto-correlations, with period m/2, are related to non-increasing EV, Fig.1(f, black vertical line). Periodic and exponential correlations, without growth, correspond to catenaries with h = 0 (where a system returns to its original state after a lag). The defining characteristic of the catenary is Δy/Δx = l/h, where l is its length and h its "slack", or difference in height, y, between its two hanging points. Standardizing catenaries [9] (i.e., making l unitary and h constant)¹¹ thus makes the slack h indicate ω during sample growth, which, according to Eq.(6), should assume half-golden values for small D_{x0}. Fig.3(h) shows standardized catenaries for all years and locations. It indicates that tanh per factor remains constant across a range of levels, up to d_sq, starting at the local. This was anticipated by Eq.(12). The rate, up to d_sq, is 81% of correlation. Fig.3(h, side panel, red) shows box plots for h, across all levels, years, occupations, and locations.
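The catenary quantities used above, height cosh(x), arc length sinh(x), and their tanh ratio, which is what survives standardization, can be verified directly:

```python
from math import cosh, sinh, tanh, isclose

# a catenary y = cosh(x): its arc length from 0 to x is sinh(x),
# so the length-to-height ratio is governed by tanh
for x in (0.25, 1.0, 2.0):
    height = cosh(x)
    length = sinh(x)
    assert isclose(length / height, tanh(x), rel_tol=1e-12)

# standardizing (rescaling to unit length) leaves the sinh/cosh
# growth ratio, i.e. tanh, intact
x = 1.5
scale = 1.0 / sinh(x)
assert isclose((scale * sinh(x)) / (scale * cosh(x)), tanh(x), rel_tol=1e-12)
```

This is only the geometric identity behind the plots; the slack statistic h computed from the empirical ACF curves involves fitting steps that the paper leaves to its Methods section.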
For all spatial levels below d_sq, factors remain balanced, with binomial-exponential rates (i.e., hyperbolic functions with period m/2). Levels above d_sq revert to exponential growth. We called this a transition between EV-CF and EV growth. It is indicated in the plots by the dislocation of the catenary center from m/2 to m(1 − 1/e) (red vertical lines). We reproduce the same results, Fig.3(g,h), with standard Pareto regressions in Sect. Methods of the Supplementary Material, as an alternative to these graphical depictions. Fig.4(d) shows estimated levels d_sq for all states, across years. Levels d_sq converge across years for all states. New York has 2-level squares. Fig.3(h, upright panel) shows catenaries for its lower-level square, and Fig.4(b) illustrates the levels cartographically. The two squares' factors are disjoint (gray, lower factors taking exponential rates in higher ones). American states have had, through their histories, very different work forces and regional distributions. While catenary lengths are different across occupations, Fig.3(g), their slack (and cosh-sinh growth ratio) remains invariant across all locations, years and occupations, Fig.3(h). These plots thus graphically illustrate the previous combinatorial constraints on the unbiasedness and predictiveness of learning systems across space. We return to this discussion in Sect. Predictive Statistics.

5.1.2 Permutations in Heterogeneous Samples. Zipf's law and distribution are central to the study of city-size distributions [24, 12, 2]. The law is based on a frequency ranking of the studied factors, and thus on one of their

¹¹ Δy/Δx = sinh(x)/cosh(x) = tanh(x).
Figure 4. (a) coth and tanh functions; colored arrows illustrate square factor-frequency increases with scale, (b) New York (NY) state population density (left); NY has squares at two levels d_sq, at 0.018 and 0.65 lat-lon distances (red circles), (c) schematic depiction of the frequency-rank vs. spatial-scale plots in (h), (d) d_sq (spatial level of the first square) for states and years, (e) BIC likelihood of coth over a Zipf model, (f) state-of-the-art growth-prediction accuracy with increasing spatial level d; d_sq is the diagonal (dashed), (g) minimum (green) and maximum (red) frequency ranks across locations; each curve is a scale; blue curves indicate square size, which follows a tanh function, (h) the coth model, as illustrated in (a), with empirical data.

permutations. It is, here, associated with homogeneous samples (i.e., samples with little across-factor variation). Fig.4(a) depicts the overall shapes of the tanh(n) and coth(n) functions. Fig.4(h) shows occupations' minimum frequency rank, r0 (green), across all locations at increasing spatial levels, as well as their maximum rank, r_ω (red).
The former is the minimum frequency-ranking order of one given occupation across all of the level's locations. The latter is the maximum (these are formulated explicitly in Sect. Methods). The latter is related to Zipf's frequency rankings and the Pareto distribution (Sect. Methods), as the three are power laws. Each curve in the figure corresponds to one spatial level and occupation. With a homogeneous sample, we expect one highest-rank industry across all locations, and thus r_ω = r_0 = 1. What we observe, however, is that factors are ranked in constant-sized ranges, as visualized in squares. Each factor is the highest ranked in some location, the second in another, etc. These rankings define an arithmetic series r_0, r_0+1, r_0+2, ..., r_ω for each factor. The series has mean r̄ = (r_0 + r_ω)/2, which is also shown (blue). The previous model predicts both that r_ω − r_0 is constant, and that it reflects the enumeration rate ω. Fig.4(h) shows that empirical rankings have constant r_ω − r_0, with increasing r_0. A closer examination of both branches (red and green) reveals that they correspond to the positive and negative sections of the coth(n) = 1/tanh(n) function, Fig.4(i,a). Imagine the following process: pick a location x0, and its most and least frequent factors (i.e., with ranks 1 and m). Label them, respectively, a and z. Balance z to match a's frequency. Move one spatial level up, pick another z, balance, and repeat. This is the process described by Eq.(5). Each square row corresponds to a single derangement and background, where n_{+a} = ω·D_a is the number of units in cell a. The cost to balance each z is thus n_{+a}/ω per row. For all locations x0, and levels d0 ≤ d ≤ d_sq,

ω − n_{+a}/n_{+z|X∖{a}} = 0,  or,  n_{+a} − n_{+z|X∖{a}} coth(n) = 0.
(14)

The coth function has the interesting property of separating, by sign, each location's background and effect phases, and describes more directly how squares are completed. This is illustrated in Fig.4(a) as one hyperbolic rotation, with subsequent square derangements leading to others. Methodologically, this suggests fitting a coth function to observed frequency ranks. A Zipf distribution can be fit by power-law or Pareto distribution regressions (Sect. Methods). Enumeration-rate increases imply increasing permutations, and thus differences between minimum and maximum frequency ranks. This predicts that Zipf-Pareto regressions will become increasingly inaccurate (compared to coth) as cities become more heterogeneous. Fig.4(e) shows an increase of up to 18 times in fit likelihood favoring the coth model throughout the studied period, according to the Bayesian Information Criterion.

5.2 Predictive Statistics

What impact does the presence of squares in samples have statistically (with respect to bounds on their predictiveness and biasedness)? Fig.4(f) demonstrates a result, using census microdata, with an accuracy vs. spatial-level plot (see [27] for others). Samples in the previous section contained sample units' primary occupations [6, 25]. This led to binary samples of dimension m = 543 (each unit seen as a binary vector of length 543). For Fig.4(f), all variables in the American micro census were, instead, used [17]. Each census binary variable led to one field, each categorical variable to as many binary variables as the size of its domain (as defined by the census), and continuous variables to 8-bit vectors (corresponding to their 8 quantiles). The final sample had 10,055 variables, including information on a broad range of population characteristics: fertility, nuptiality, life-course transitions, immigration, internal migration, labor-force participation, occupational structure, education, ethnicity, and household composition.
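The variable encoding just described can be sketched as follows. The schema and field names are purely illustrative, not the IPUMS layout: a binary variable yields one bit, a categorical variable one bit per category in its domain, and a continuous variable 8 bits marking its octile against precomputed cut points:

```python
def encode_unit(record, schema):
    """Toy version of the encoding described in the text. `schema` is a
    list of (name, spec) pairs with spec being ('binary',),
    ('categorical', domain) or ('continuous', octile_cuts)."""
    bits = []
    for name, spec in schema:
        v = record[name]
        if spec[0] == "binary":
            bits.append(int(bool(v)))
        elif spec[0] == "categorical":
            bits.extend(int(v == c) for c in spec[1])    # one-hot
        else:                       # continuous -> 8 quantile bits
            cuts = spec[1]          # 7 cut points define 8 octiles
            q = sum(v > c for c in cuts)
            bits.extend(int(q == i) for i in range(8))
    return bits

schema = [("employed", ("binary",)),
          ("occ", ("categorical", ["carpenter", "clerk", "farmer"])),
          ("age", ("continuous", [10, 20, 30, 40, 50, 60, 70]))]
x = encode_unit({"employed": 1, "occ": "clerk", "age": 34}, schema)
assert len(x) == 1 + 3 + 8      # one field, a one-hot, an octile vector
assert sum(x[1:4]) == 1 and sum(x[4:]) == 1
```

Applied to the full census schema, an encoding of this kind is what produces the 10,055-dimensional binary units the classifiers consume.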
These variables can be correlated, collinear and spurious. Each of the multiple state-of-the-art classifiers employed next deals with these statistical pitfalls in its own proposed way. The classification task in Fig.4(f) is to predict whether a given occupation will grow (enlist further members) in the next time interval (10 years ahead), for each location x0. A detailed description of the algorithms used, and of their hyperparameter optimization, can be found in [27]. They include neural network models, generalized linear models, boosting models, generalized additive models, random trees, LASSO and Ridge regressions, ANOVA, support vector machines, and stacked meta-learners over all the previous algorithms. Spatial levels (and aggregated data) ranged from the local to the national. One million location-year pairs were chosen randomly, each leading to a full set of spatially growing samples. The figure thus shows the maximum accuracy of 24 state-of-the-art supervised algorithms at predicting whether a given occupation will grow, or not, in a location, as we use data from increasing spatial levels (starting with the local and reaching the national). Accuracy is defined as the number of accurately classified observations in the held-out sample. Spatial levels d_sq for each state are mapped to the diagonal (dashed) in the figure, and each state is a curve. The way accuracy changes across locations largely follows the shape expected from Fig.1(f). Accuracy was averaged across same-state locations to generate these curves. Bootstrap accuracy-variation bands (across states' locations) are shown for the two most accurate states, New York and Illinois. We observe that New York gains little from external data above d_sq, as it already contains, within its boundaries, high levels of variation.
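Held-out accuracy of this kind can be illustrated with a deliberately minimal classifier, a stand-in for the 24 tuned algorithms rather than any of them, on a toy version of the growth-prediction label (all names and thresholds below are invented for the sketch):

```python
from math import exp
import random

def train_logistic(data, lr=1.0, epochs=400):
    """Tiny batch-gradient logistic regression (illustrative only)."""
    w0 = w1 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x1, x2), label in data:
            p = 1.0 / (1.0 + exp(-(w0 * x1 + w1 * x2 + b)))
            g = (p - label) / n
            g0 += g * x1; g1 += g * x2; gb += g
        w0 -= lr * g0; w1 -= lr * g1; b -= lr * gb
    return w0, w1, b

def accuracy(model, held_out):
    """Share of correctly classified observations in the held-out sample."""
    w0, w1, b = model
    hits = sum((w0 * x1 + w1 * x2 + b > 0) == bool(label)
               for (x1, x2), label in held_out)
    return hits / len(held_out)

random.seed(0)
# toy stand-in for "occupation grows next decade": label = 1 iff x1 + x2 > 1
pts = [(random.random(), random.random()) for _ in range(300)]
data = [(p, int(p[0] + p[1] > 1.0)) for p in pts]
train, held_out = data[:200], data[200:]
acc = accuracy(train_logistic(train), held_out)
assert acc > 0.85       # linearly separable toy task: high held-out accuracy
```

The held-out split is the essential part: the paper's accuracy curves are this statistic, computed per location and spatial level, then averaged within states.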
This also implies that, without unobservables, 81% of the sample is sufficient for prediction. Homogeneous locations, in contrast, have incomplete squares, and observed predictions are susceptible to external and unobserved variation. The right panels in Fig.4(f) show the increase in accuracy, ∂ACC/∂d, of samples encompassing increasing distances (at 0.05 lat-lon intervals, normalized over their spatial and accuracy ranges) for New York and Illinois. It is estimated by the differences in accuracy of models trained at a level d and at its predecessor. The top panel shows the accuracy of algorithms on samples with observations d > d_sq, and the bottom panel d ≤ d_sq, across all locations in those states (gray ribbons show their standard deviation). These patterned changes in the accuracy of supervised black-box algorithms mirror the shape of the functions in Fig.3(g,h). This is expected, as the accuracy of systems with the characteristics described above increases with pairwise correlations, Fig.2(a). The functional form for accuracy, F(d, μ) = μ sinh(d), therefore takes hyperbolic forms with distinct parameters¹², μ = 0.5m and μ = (1 − 1/e)m, and constant tanh, 0 ≤ F(d, μ) ≤ 1. The 80-20 ratios observed for correlations in Sect. A Hanging-Rope Model for Unbiased Sample Growth are thus also observed in the outcome of black-box predictors, Fig.4(f). As before, the functional F(d, μ) is an apt description of these systems only because tanh remains constant throughout them (i.e., in squares and balanced samples), Eq.(11). Given the balance of samples with d < d_sq, effect estimation is expected to be accurate in these samples, in contrast to samples with d > d_sq. In [27], we use multiple simulated scenarios to show that samples having combinatorial conditions like those with d < d_sq facilitate causal effect estimation.
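Effect estimates of the Shapley kind, evaluated next, average a factor's marginal contribution over permuted backgrounds, the same permutation-averaging the text relates to U-statistics. A toy exact computation (invented three-factor outcome) shows how an interaction, the analogue of an incomplete background set, spreads those marginals, while a context-free factor shows no spread:

```python
from itertools import permutations
from statistics import pvariance

def shapley(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings of the player set."""
    marginals = {p: [] for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            marginals[p].append(value(frozenset(coalition)) - before)
    return {p: sum(ms) / len(ms) for p, ms in marginals.items()}, marginals

# toy outcome: 'a' adds 2, 'b' adds 1, 'c' adds 1 only alongside 'a'
def y(s):
    return 2 * ('a' in s) + 1 * ('b' in s) + 1 * ('a' in s and 'c' in s)

phi, marginals = shapley(('a', 'b', 'c'), y)
assert abs(sum(phi.values()) - y(frozenset('abc'))) < 1e-9  # efficiency
assert pvariance(marginals['b']) == 0    # context-free factor: no spread
assert pvariance(marginals['a']) > 0     # interacting factor: spread
```

The variance of the marginals, zero for b, positive for a and c, is the toy counterpart of the uncertainty over effect estimates plotted in Fig.4(g).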
This is possible because, there, we have ground-truth information for the effect of variables. We do not have ground truth in this large real-world system, but the central claim here is that in samples with d > d_sq the problem becomes precipitously harder. Fig.4(g) shows the variance of the popular Shapley-based effect estimator [21] across all locations x0 and distances d, in the previous samples. There is a discontinuity, and a significant increase in uncertainty over effect estimates above d_sq (and very little below), across all locations. Together, the previous considerations suggest constraints, described by F(d, μ), for supervised prediction and effect estimation in spatial systems, Eq.(1). The functions F(d, (1 − 1/e)m) and F(d, 0.5m) describe, respectively, spatial growth patterns of externally-valid (EV) and unconfounded (EV-CF) samples.

¹² The same is expected from d/dx sinh(x) = cosh(x) and the results in Sect. A Hanging-Rope Model for Unbiased Sample Growth.

6 Conclusion

We studied applications of concepts from non-parametric and counterfactual statistics to sample growth processes, common, for example, in the study of spatial systems. We highlighted sample conditions where m populations can have their effect observations remain unbiased while, at the same time, increasing in generalizability. Hyperbolic functions offered a natural description and visualization of these alternative growth patterns. We used them to formulate, in particular, the asymptotic number of backgrounds that effects are observed under in large-scale data, and thus limits to effect generalizability in these data. Increases in generalizability require exponential sample-size growth. Increases with unbiasedness require Fibonaccian growth, with a half-golden growth ratio.
Future work will describe whether the in- and out-of-sample variable-shuffling models described here can also lead to finer-grained and non-asymptotic limits on effect generalization based on background randomization. We demonstrated the model empirically (functional form F(d_{x0}), enumerative and combinatorial properties, 3 predicted rates), and connected sample growth to the statistical environment (biases and predictability) it creates for its populations.

Acknowledgments

Funding provided by the Sao Paulo Research Foundation (FAPESP). The datasets analyzed are available in the IPUMS repository [17].

References

[1] Pierre-Alexandre Balland, Cristian Jara-Figueroa, Sergio G. Petralia, Mathieu P. A. Steijn, David L. Rigby, and César A. Hidalgo. "Complex economic activities concentrate in large cities". In: Nature Human Behaviour 4.3 (2020), pp. 248-254. doi: 10.1038/s41562-019-0803-3.
[2] Marcus Berliant and Axel H. Watanabe. "A scale-free transportation network explains the city-size distribution". In: Quantitative Economics 9.3 (2018), pp. 1419-1451. doi: 10.3982/QE619.
[3] Paul de Boer and João F. D. Rodrigues. "Decomposition analysis: when to use which method?" In: Economic Systems Research 32.1 (2020), pp. 1-28. doi: 10.1080/09535314.2019.1652571.
[4] Émile Borel. Introduction géométrique à quelques théories physiques. Paris, 1914.
[5] Guy Van den Broeck, Anton Lykov, Maximilian Schleich, and Dan Suciu. "On the Tractability of SHAP Explanations". In: Journal of Artificial Intelligence Research (JAIR) (2020). doi: 10.1613/jair.1.13283.
[6] Census Bureau. Alphabetical index of industries and occupations. 1950 census of population. Washington, 1951.
[7] Nadia Burkart and Marco F. Huber. "A Survey on the Explainability of Supervised Machine Learning". In: Journal of Artificial Intelligence Research 70 (2021), pp. 245-317. doi: 10.1613/jair.1.12228.
url: https://doi.org/10.1613/jair.1.12228.
[8] Sean Carroll. Spacetime and Geometry: An Introduction to General Relativity. Benjamin Cummings, 2003. isbn: 0805387323.
[9] Paul Cella. "Reexamining the Catenary". In: The College Mathematics Journal 30.5 (1999), pp. 391–393. doi: 10.1080/07468342.1999.11974093.
[10] Persi Diaconis. Group Representations in Probability and Statistics. Institute of Mathematical Statistics Lecture Notes–Monograph Series, 11. Hayward, CA: Institute of Mathematical Statistics, 1988, pp. vi+198. isbn: 0-940600-14-5. url: http://projecteuclid.org/euclid.lnms/1215467407.
[11] Persi Diaconis, R. L. Graham, and William M. Kantor. "The mathematics of perfect shuffles". In: Advances in Applied Mathematics 4.2 (1983), pp. 175–196. doi: 10.1016/0196-8858(83)90009-X.
[12] X. Gabaix. "Zipf's Law for Cities: An Explanation". In: The Quarterly Journal of Economics 114.3 (1999), pp. 739–767. doi: 10.1162/003355399556133.
[13] Wassily Hoeffding. "A Class of Statistics with Asymptotically Normal Distribution". In: The Annals of Mathematical Statistics 19.3 (1948), pp. 293–325. doi: 10.1214/aoms/1177730196.
[14] Joseph Hucks. "Hyperbolic complex structures in physics". In: Journal of Mathematical Physics 34.12 (1993), pp. 5986–6008. doi: 10.1063/1.530244.
[15] Daniela Inclezan and Luis I. PrΓ‘danos. "Viewpoint: A Critical View on Smart Cities and AI". In: Journal of Artificial Intelligence Research 60.1 (2017), pp. 681–686. issn: 1076-9757.
[16] Inho Hong, Morgan R. Frank, Iyad Rahwan, Woo-Sung Jung, and Hyejin Youn. "The universal pathway to innovative urban economies". In: Science Advances 6.34 (), eaba4934. doi: 10.1126/sciadv.aba4934.
[17] IPUMS. U.S. Individual-level Census (United States Bureau of the Census). 2022.
url: https://usa.ipums.org/usa/complete_count.shtml.
[18] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. "Prediction Policy Problems". In: The American Economic Review 105.5 (2015), pp. 491–495. doi: 10.1257/aer.p20151023.
[19] Christopher Krapu, Robert Stewart, and Amy Rose. "A Review of Bayesian Networks for Spatial Data". In: ACM Trans. Spatial Algorithms Syst. (2022). Just Accepted. issn: 2374-0353. doi: 10.1145/3516523.
[20] A. J. Lee. U-Statistics: Theory and Practice. New York: M. Dekker, 1990. isbn: 0824782534.
[21] Scott M. Lundberg and Su-In Lee. "A Unified Approach to Interpreting Model Predictions". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS '17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 4768–4777. isbn: 9781510860964.
[22] Douglas C. Montgomery. Design and Analysis of Experiments. New York: John Wiley, 2001. isbn: 0471316490.
[23] Stephen L. Morgan and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge: Cambridge University Press, 2007. isbn: 9780521856157. doi: 10.1017/CBO9780511804564.
[24] M. E. J. Newman. "Power laws, Pareto distributions and Zipf's law". In: Contemporary Physics 46.5 (Sept. 2005), pp. 323–351. doi: 10.1080/00107510500052444.
[25] Anastasiya M. Osborne and Peter B. Meyer. Proposed Category System for 1960-2000 Census Occupations. [Washington, D.C.]: U.S. Dept. of Labor, Bureau of Labor Statistics, Office of Productivity and Technology, 2005.
[26] Hans Reichenbach. The Direction of Time. Berkeley: University of California Press, 1956.
[27] Andre F. Ribeiro. "Sample observed effects: enumeration, randomization and generalization". In: Scientific Reports 15.1 (2025), p. 8423. doi: 10.1038/s41598-024-80839-8.
url: https://doi.org/10.1038/s41598-024-80839-8.
[28] Alvin E. Roth. The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge: Cambridge University Press, 1988. isbn: 9780521361774. doi: 10.1017/CBO9780511528446.
[29] Donald B. Rubin. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions". In: Journal of the American Statistical Association 100.469 (2005), pp. 322–331. doi: 10.1198/016214504000001880.
[30] Charles Edward Sandifer. How Euler Did It (Chapter 17, p. 103). The MAA Tercentenary Euler Celebration, v. 3. [Washington, DC]: Mathematical Association of America, 2007. isbn: 9780883855638.
[31] Bernhard SchΓΆlkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. "Toward Causal Representation Learning". In: Proceedings of the IEEE 109.5 (2021), pp. 612–634. doi: 10.1109/JPROC.2021.3058954.
[32] Shashi Shekhar. "What is Special about Spatial Data Science and Geo-AI?" In: 33rd International Conference on Scientific and Statistical Database Management. SSDBM 2021. Tampa, FL, USA: Association for Computing Machinery, 2021, p. 271. isbn: 9781450384131. doi: 10.1145/3468791.3472263.
[33] Shashi Shekhar and Pamela Vold. Spatial Computing. MIT Press Essential Knowledge series. Cambridge: The MIT Press, 2020. isbn: 0-262-35681-3.
[34] Ilaria Tiddi and Stefan Schlobach. "Knowledge graphs as tools for explainable machine learning: A survey". In: Artificial Intelligence 302 (2022), p. 103627. doi: 10.1016/j.artint.2021.103627.

Received 14 November 2023; accepted 26 July 2024