# Spatio-Causal Patterns of Sample Growth

ANDRE F. RIBEIRO, Harvard University, USA and University of Sao Paulo, Brazil

Different statistical samples (e.g., from different locations) offer populations and learning systems observations with distinct statistical properties. Samples under (1) Unconfounded growth preserve systems' ability to determine their variables' effects on outcomes-of-interest (and lead, therefore, to interpretable black-box predictions). Samples under (2) Externally-Valid growth preserve their ability to make predictions that generalize across out-of-sample variation. The first generates predictions that generalize over sample populations, the second over their common unobserved factors. We illustrate these theoretic patterns in the full American census from 1840 to 1940, and in samples ranging from the street level all the way to the national. This reveals new conditions for the generalizability of samples over space and time, and connections among the Shapley value, counterfactual statistics, and hyperbolic geometry.

JAIR Associate Editor: Quanquan Gu

JAIR Reference Format: Andre F. Ribeiro. 2025. Spatio-Causal Patterns of Sample Growth. Journal of Artificial Intelligence Research 83, Article 22 (July 2025), 19 pages. doi: 10.1613/jair.1.15675

1 Introduction

Large-scale and high-dimensional geospatial datasets currently offer rich opportunities for predictive and Geo-AI applications [32, 15, 19, 33] (e.g., disease incidence, ecological behavior, electoral results, crime occurrence, economic growth, recommendation systems). While it is common practice to train regression and classification models on data collected across distinct locations, little is known about how their out-of-sample accuracy (predictiveness) and biasedness (e.g., black-box fairness) [7, 34] are expected to change across spatial extensions.
The first indicates whether predictions derived from the sample will be close to their true values for a population in conditions different from those at data collection (i.e., whether they will generalize), and the latter whether they will systematically favor individual populations (e.g., as a result of insufficient sample sizes, unobserved variables, or other failures in sample selection). Understanding these issues is important because they allow us to answer crucial questions for learning systems: Can predictions made for a given population, with data from one location, be used in others? Does collecting larger samples improve prediction accuracy for that first population? We first formulate functions describing fairness-generalizability tradeoffs across space, revealing their connections to hyperbolic geometry and theoretic experimental designs. We then consider 100 years of the American census (and all variables in the census) as a case study. For each cross-section (decade), we consider the important task of predicting economic growth for over 60K individual locations under increasing spatial samples. We demonstrate how (1) generalizability tradeoffs evolve across spatial levels, and (2) repeat the validation of generalizability limits derived in [27] for the spatial domain, and with the current census micro-data.

Corresponding Author's Contact Information: Andre F. Ribeiro, ribeiro@alum.mit.edu, Harvard University, Cambridge, Massachusetts, USA and University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil. This work is licensed under a Creative Commons Attribution 4.0 International License. © 2025 Copyright held by the owner/author(s). doi: 10.1613/jair.1.15675

Let S : X^m → [0, 1] describe any learning system or agent using an input sample X^m with m variables to derive a classification decision, S(X). Our central goal is to formulate how the generalizability of these systems changes
across space (i.e., to what extent a model assembled in one location will hold for others), and, thus, to identify a parametric functional form F that can describe accuracy bounds across possible S,

F(d_x0) = max_S { ACC( S; X[x0, d_x0] ) },   (1)

where ACC indicates the accuracy of models trained on samples X[x0, d_x0] encompassing all observations at distances less than or equal to d_x0 from x0. The specific way in which F changes across space offers limits and opportunities to algorithmic and agent learning systems and their performance. Strict bounds on the uncertainty of predictions afforded to algorithms or agents using local data can have a profound impact on the usefulness, scope, and quality of their strategic decisions or the recommendations they offer. A non-decreasing F, with increasing d_x0, indicates that S is able to learn a model that is accurate across locations, while a decreasing function indicates that learning fails to generalize across locations. A complementary issue is to what extent S would be able to identify the effect (or importance) of individual variables to prediction using the same sample. To illustrate these two fundamental sample characteristics, consider a set of binary attributes X_4 = {a, b, c, d} observed across US locations, such as, respectively, the recorded presence of crime, police stations, banks, and ice-cream shops. The behavior of these entities is possibly interconnected, which implies that any calculated statistic y ∈ R over a is in fact a statistic over y(a | b, c, d). Because of this, we say that a local sample has low External Validity (EV), since any changes in factors {b, c, d} (or the many other factors that can conceivably affect crime) can invalidate the statistic.
At the same time, because banks often appear together with ice-cream shops in commercial and affluent neighborhoods, we are not sure whether their presence plays any essential role when predicting crime incidence (i.e., whether they have only a spurious relationship to crime). Because of this, we say that such samples are confounded (CF). How can these two statistical issues be addressed and quantified? Comparison of crime incidence between this location and a second with banks but no ice-cream shops, everything else constant, would lend evidence to the claim that ice-cream shops are not driving crime up. These types of ideal what-if statements, where the effect on an outcome is observed under a single difference (while holding other factors constant), are often called counterfactual statements [23, 29]. In the hypothetical case where all such conditions can be observed, the problem of determining whether a factor is relevant to prediction is fairly easy to solve. These conditions have been formulated mathematically in the study of experimental designs [22]. The more challenging, and practical, aspect of this problem is, however, the case of unobserved conditions: often the relevant factors that change across locations are neither observed nor held constant. The issue of unobserved confounding is particularly serious for effect estimation, as effect statistics calculated from the sample might then also reflect the effect of exogenous or unobserved variation that cannot be easily discounted by typical regression and effect estimation methods [31, 23]. Although, under these conditions, effect observations are flawed or noisy counterfactual observations, we still refer to them as counterfactual observations for short.
We address these problems by studying spatial sample growth: we start with a sample with a single unit x0 (all conditions unobserved), and then progressively add other units to the sample at increasing distances d_x0 from x0, progressively decreasing its number of unobserved conditions. We consider ACC( S; X[x0, d_x0] ) across these samples, and, in particular, how EV and CF change across scales. The two previous problems reflect two key, but distinct, learning problems [18]: supervised out-of-sample prediction and factor effect (or importance) estimation. Supervised prediction focuses on making accurate predictions of an outcome in unseen data using a training sample, while factor effect estimation aims to measure the relevance of specific factors for prediction and model selection. These problems are, however, closely connected [18, 7], as complete and correctly specified models lead to accurate predictions. We will demonstrate that combinatorial properties of samples impact these two traditional problems differently and

Fig. 1. Can counterfactual effect observations made in one location be used in another (are they externally valid, EV)? Can their independent effects or relevance to prediction be distinguished from others (are they unconfounded, CF)?
(a) a π‘š-dimensional hypercube and Pascal triangle with π‘šrows portrays the full set of counterfactual effect observations with π‘šfactors in a sample π‘‹π‘š(π‘š= 3), a π‘š π‘šLatin-Square ( square ) portrays all effect observation backgrounds, more counterfactual effect observations increase guarantees over the generalizability and bias of samples effect observations, (b) serial and parallel interleaving of in and out of sample factors during sample growth and their expected sample sizes and growth rates, (c-d) sample sizes under EV-CF growth follow hyperbolic forms, with a half-golden rate of effect-to-background rate for the high-dimensional case, (e) illustration of Gilbert Shannon Reeds (GSR) shuffling as a mechanism to increase generalizability (EV) in partially-observed samples, and a shuffle-and-cycle strategy as an alternative which also guarantees decreases in factor effect confoundness (CF), (f) hyperbola for samples with distinct 𝑁 π‘Ž/𝑁+π‘Ž(left) and expected accuracy for samples with small or large numbers of unobserved factors (right). reveal tradeoffs across samples and locations. Fig.1(d) outlines the two central contributions of the proposed framework, which establish hyperbolic forms for F (𝑑π‘₯0) and consequent asymptotic limits to black-box accuracy. We first briefly formulatethese contributions. After this summary, we relate the approach to known black-box importance estimation and causal effect estimators, and, finally, formulate the proposed combinatorial-geometric relationships in detail. 2 Sample Counterfactual Observation Limits Let 𝑋= {π‘Ž,𝑏,𝑐, ..., [π‘š]} be a set of (observed or unobserved) binary factors1 characterizing a population π‘₯, π‘₯ [0, 1]π‘š, and 𝑦(π‘₯) be a measurement over the population, 𝑦(π‘₯) R. Behind many effect or importance estimation procedures is an experimental procedure (Sect. 
Related Work): add a to every variation of other populations; with each insertion, observe before-after outcome differences, Δy(a). Consider a sample growth process where we start with a fixed sample unit x0 and, as we observe each new unit, we also observe their differences from x0, with respect to both x and y. Starting with a population x0, each possible sample growth trajectory (e.g., x0, x1, x2, ...) is a temporally or spatially ordered observation of the impact of gaining, or losing, a set of factors on y. In other words, each step is a counterfactual effect observation for sample unit x0, Δy(x0 → x1) = y(x0) − y(x1) (where the former difference is over sets and the latter over scalars). Each effect observation is thus defined by: (1) a difference in factors (what changed, x0 − x1 ⊆ X^m), (2) an intersection of factors (what remained the same, x0 ∩ x1 ⊆ X^m), and, finally, (3) a difference in outcomes (an observed scalar effect, Δy(x0 → x1) ∈ R). For each trajectory and time, we can consider the biasedness and generalizability afforded by the accumulated samples to, for example, black-box predictors of y, and whether their performance is related to the increasing set of accumulated counterfactuals.

¹ where [m] is the m-th factor in a sample with variables X^m.

2.1 Sampling Effect-to-Background Ratios (ω)

The growth space for any collected sample is then the imaginary space that contains all of the conceivable ways in which we could have assembled X from any one of its individual sample units x0 (Sect. Combinations, Permutations and Partial Permutations). It is often only partially observed. This is a problem in samples where factors cannot be assumed to be (1) independent or (2) in-sample.
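The triple defining an effect observation can be made concrete with ordinary set operations. A minimal Python sketch, where the units, factor names, and outcome values are all hypothetical:

```python
# Each sample unit is a set of binary factors; y maps a unit to an outcome.
# Hypothetical toy units observed during sample growth from x0.
x0 = {"a", "b", "c"}   # e.g., crime, police stations, banks
x1 = {"b", "c", "d"}   # a nearby unit: no crime, but ice-cream shops

# Hypothetical outcome measurements y(x) for each unit.
y = {frozenset(x0): 0.8, frozenset(x1): 0.5}

def effect_observation(x0, x1, y):
    """A counterfactual effect observation for x0, as defined in the text:
    (1) what changed, (2) what remained the same, (3) the scalar effect."""
    difference = x0 - x1                    # factors lost going x0 -> x1
    intersection = x0 & x1                  # shared background factors
    delta_y = y[frozenset(x0)] - y[frozenset(x1)]
    return difference, intersection, delta_y

diff, inter, dy = effect_observation(x0, x1, y)
print(sorted(diff), sorted(inter), round(dy, 2))   # ['a'] ['b', 'c'] 0.3
```

Each growth step x0 → x1 thus yields one such triple; a trajectory accumulates them.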
That's because (effect) observations can then be contingent on (1) the order or history of the growth process, or (2) out-of-sample factors. In practice, two timescales determine the statistical properties of the sample growth process: the rate N+a at which individual factor differences are observed (i.e., when their counterfactual effects can be observed), and the rate N−a at which the backgrounds in which they are observed change. If the relationship between these two timescales is such that effects are observed under a large number of backgrounds, then we can be more confident about their generalizability (i.e., that effect observations will likely be reproducible in future backgrounds). On the other hand, if observations are such that different populations are observed under the same backgrounds, we can be more confident about their unconfoundedness (i.e., that observed effects reflect the same uncontrolled out-of-sample variation across all sample populations). More precisely, the background D−a ∈ D(X − {a}) of an effect observation of factor a is the instantaneous condition in which the effect is observed, Δy[+a | D−a]. We will consider definitions of D(X − {a}) where it corresponds to the set of all possible values over the set of other factors, D(X − {a}) = X − {a}, or their permutations², D(X − {a}) = Π(X − {a}), and where X contains both observed and unobserved factors. For illustration, imagine all the ways we can observe backgrounds across growth trajectories of the example in Sect. Introduction, X = {a, b, c, d}. In order to consider effect observations for a, we must insert this individual factor into all its possible 3! = 6 backgrounds observable during growth. After each insertion, we can then observe changes Δy(a) in the outcome-of-interest, y, to understand a's effect on y. Another way of saying this is that we need to keep a constant while cyclically permuting all other sample factors, Fig.1(a).
That is, we are called to observe the effect of a under the cyclic permutations of the d ≤ m − 1 factors present in the sample. For a factor a and a sample with m factors, this defines a map

τ(x_m) : {b, c, ..., [m−1]} → {c, ..., [m−1], b},   (2)

whose iterates τ(x_m), τ²(x_m), ..., τ^m(x_m) have a shifting action on the original permutation and generate the cyclic group³ of order m − 1 [11]. Whether we require only a to be inserted into all background values with factors other than a (a single cyclic background permutation τ(x)), or repeat this for all m sample factors, will change the guarantees we can make with respect to the EV and CF of effect observations. At the limit, these two cases correspond to 1 or m − 1 iterations of the recursive definition of a factorial, m! = m · (m − 1)!.

² where Π(X) is the set of all permutations of the set X.
³ The cyclic group generated by τ is related in an obvious way to the set of all permutations of m elements (i.e., to the S_m symmetric group), where the τ^i(x), i < m, correspond to one partial permutation, τ^i(x) ∈ S_m, for all⁴ x ∈ P(X) (Sect. Combinations, Permutations and Partial Permutations).
⁴ where P(X) is the power-set of X; we use capital letters to indicate counts over unique values.

Define, therefore, the variable N+x0 = |P(x0)| indicating the number of unique in-sample effect observations in x0, and D−x0(X) = |D(X − x0)| their out-of-sample backgrounds. The total number of counterfactual effect observations (i.e., all effects under different backgrounds) for the factors x0 is then given by

0 < N+x0 · D−x0(X) ≤ 2^|x0| · (m − |x0|)!,   (3)

(Sect. A Representation for Sample Effect Observations), which is large under the factorial-based definition of a "background", D−x0(X) ≫ N+x0(X).
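As an illustration (the factor names and toy dimensions are assumptions), the cyclic map τ of Eq.(2) generates only m − 1 of the (m − 1)! backgrounds for a single factor, while the bound of Eq.(3) counts all effect-background pairs:

```python
from itertools import permutations
from math import factorial

def tau(seq):
    """Cyclic shift of Eq.(2): (b, c, ..., [m-1]) -> (c, ..., [m-1], b)."""
    return seq[1:] + seq[:1]

# Backgrounds for factor 'a' in X = {a, b, c, d}: orderings of the others.
others = ("b", "c", "d")
cycles = set()
seq = others
for _ in range(len(others)):
    seq = tau(seq)
    cycles.add(seq)
print(len(cycles))                      # 3: one cyclic orbit of backgrounds
print(len(set(permutations(others))))   # 6: all 3! backgrounds

# Eq.(3): N+ * D- <= 2^|x0| * (m - |x0|)! for x0 = {a}, m = 4.
m, x0_size = 4, 1
n_plus = 2 ** x0_size                   # unique in-sample effect observations
d_minus = factorial(m - x0_size)        # factorial-based background count
assert 0 < n_plus * d_minus <= 2 ** x0_size * factorial(m - x0_size)
print(n_plus * d_minus)                 # 12
```

The gap between one cyclic orbit (m − 1 backgrounds) and the full factorial set is exactly what distinguishes the two limiting cases discussed above.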
A key question then becomes: as sample dimensions grow, what is the asymptotic number of backgrounds that effects will typically be observed under?

2.2 Unobserved-Observed Combinatorial Shuffling (Accuracy)

To illustrate how models trained on samples with the previous characteristics can have their accuracy constrained, we can relate the previous growth process to a traditional combinatorial randomization scheme such as Gilbert-Shannon-Reeds (GSR) shuffling [10, 11]. As before, start with a sample containing only one population x0 ∈ X^m (a set of binary attributes). Let a population with the same attributes as x0 be represented by the string containing only 1 values (for factors a, b, c, ...). A string containing differences can then be written as σ(x0) = (σ1, σ2, σ3, ...), where σt ∈ {0, 1}. An in-sample counterfactual effect observation corresponds to measuring outcomes before and after applying the operation in Eq.(2), τ(σt, σt+1, σt+2, ...) = (σt−1, σt, σt+1, ...). We can use the same representation for out-of-sample, or yet unobserved, factors, leading to a second sequence σ−1, σ−2, σ−3, ... until we observe all factors. Sample growth can then be represented as the bi-directional string with positive indices in-sample and negative indices out-of-sample, σ = (..., σ−2, σ−1, σ0, σ1, σ2, ...). In an increasing spatial sample, the current scale corresponds to the zero-index string position. As in the GSR scheme, this bi-directional string may be represented, in turn, by two real numbers 0 ≤ x, z ≤ 1,

x(σ) = Σ_{t=0}^{∞} σ−t 2^−(t+1),  z(σ) = Σ_{t=0}^{∞} σt+1 2^−(t+1).   (4)

The shifting action τ on the two strings is known as a dyadic transformation, which can be thought of as a folding or shuffling operation between them - mapping each x to a distinct z in each iteration. In this case, the transformation is between a set of factors, x0, and their possible backgrounds, D(X − x0).
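The encoding of Eq.(4) and the dyadic action of the shift can be sketched with finite truncations of the two sums (the string values below are hypothetical, and the shift is modeled as moving the boundary factor from the unobserved half to the observed half):

```python
def encode(sigma_out, sigma_in):
    """Eq.(4): map the out-of-sample and in-sample halves of the string
    sigma to reals x, z in [0, 1] via binary expansions."""
    x = sum(bit * 2 ** -(t + 1) for t, bit in enumerate(sigma_out))
    z = sum(bit * 2 ** -(t + 1) for t, bit in enumerate(sigma_in))
    return x, z

def shift(sigma_out, sigma_in):
    """One shifting action: the factor at the boundary crosses from the
    out-of-sample half into the in-sample half."""
    return sigma_out[1:], (sigma_out[0],) + sigma_in

# Hypothetical bi-directional string: 3 unobserved and 3 observed factors.
out_half, in_half = (1, 0, 1), (1, 1, 0)
x, z = encode(out_half, in_half)
print(x, z)     # 0.625 0.75

out_half, in_half = shift(out_half, in_half)
x2, z2 = encode(out_half, in_half)
print(x2, z2)   # 0.25 0.875 -- x is doubled mod 1, z halved plus the new bit
```

In this truncated sketch the shift acts exactly as a dyadic (doubling) map on x, x ↦ 2x mod 1, while the carried bit is folded into z, which is the shuffle-like behavior described above.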
The right diagram in Fig.1(e) illustrates the result of a GSR shuffle for a 6-letter example with half the variables observed. With each shuffle, the effect of each in-sample factor {a, b, c} (colored bars) is observed under a different background (e.g., the effect of a is observed under the background of d, and that of b under e). We would therefore expect the generalizability (EV) of effect observations to increase with each operation. There are, however, two practical problems with this scheme. First, since each factor effect is observed in a different background, each effect observation is confounded by a different factor (e.g., the observed effect of a reflects the influence of the unobserved factor d). Second, GSR shuffling only shuffles factors under a 1-step Markovian assumption, disregarding higher-order and non-additive factor interactions. Fig.1(e) illustrates an alternative shuffle-and-cycle scheme, where each shuffle step is followed by m cyclic permutations τ, τ², ..., τ^m of in-sample factors. This alternative has three advantages: (1) each factor is observed under every background, and thus effects are confounded in equal proportion across sample populations (allowing us, for example, to factor out these effects more easily [27]), (2) every effect is now observed for the same permutation of unobserved factors ({d, c, e} in the figure), and thus generalizability increases at a common rate for all factors and populations, and without Markovian assumptions, (3) this is a limiting process for every sample (Sect. Combinations, Permutations and Partial Permutations).

2.3 Hyperbolic Geometry (Sample Sizes)

To better understand the relationship between sample sizes and out-of-sample performance, we need to consider what sample sizes are required to generate all ways of interleaving a sample's in and out factors.
Fig.1(b) outlines two equivalent sample growth patterns that achieve this ("serial", "parallel"). On the left, each new background is interleaved with all previous effects serially, analogous to the previous factorial scheme. A first out-of-sample factor, d ∈ X − x0, requires observation of N+x0 = 3 effects, x0 = {a, b, c}; a second requires N+x0 + 1 effect observations, x1 = {a, b, c, d}; etc. In the example on the right, several out-of-sample factors are interleaved with several in-sample factors simultaneously, analogous to a GSR shuffle. Effects for x0 are observed, at first, under backgrounds {d, e, f}, then {f, d, e}, etc. Both strategies lead to a geometric series of distinct backgrounds for individual effects, but at different rates. Sample sizes in both cases can be described by the hyperbola of Eq.(5), which expresses that, for each new in-sample factor, N+x0, we are able to re-measure their effects in each new unobserved background, D−x0, thus increasing their generalizability (EV). The quantity ω describes, in turn, the speed at which EV is expected to increase for individual in-sample populations, Fig.1(c). Sample sizes (as opposed to effect background counts) follow known exponential, (1 − 1/m)^d → e^−1, and binomial, 2^−1, growth rates in these two cases, 0 < d ≤ m(m − 1), Fig.2(b). When all m variables are observed across the same number of backgrounds (i.e., in balanced samples⁵), samples following Eq.(5) have sizes, n_x0, increasing according to a Fibonacci sequence, n_x0 = N+x0 + D−x0, and thus asymptotically assume half-golden background-to-effect rates involving the golden ratio φ, Eq.(6). In conclusion, these equations describe systems that permute effect observations, but whose number of effect observations is limited to under-factorial sample sizes.
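Under the balancedness assumption above, the sum recursion n = N+ + D− is Fibonacci-like, and the limiting ratio between the two counts is a golden-ratio constant. This limiting behavior can be checked numerically (the seed values below are arbitrary):

```python
# Balanced growth: with each step the previous background count becomes
# the new effect count, and the new background count is their sum,
# n = N+ + D-  (a Fibonacci recursion).
n_plus, d_minus = 1, 1          # arbitrary seed counts
for _ in range(40):
    n_plus, d_minus = d_minus, n_plus + d_minus

phi = (1 + 5 ** 0.5) / 2        # golden ratio
print(round(n_plus / d_minus, 6))       # effect-to-background -> 1/phi ≈ 0.618
print(round(d_minus / n_plus - phi, 6)) # background-to-effect -> phi (diff -> 0)
```

Whatever the seeds, the recursion drives the two counts toward a fixed golden-ratio proportion, which is the asymptotic regime the half-golden rate of Eq.(6) refers to.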
Samples following half-golden background-to-effect ratios, Eq.(6), observe effects across approximately the same number of backgrounds across all their populations, leading to effect estimates that have, simultaneously, increasing EV and small CF sustained throughout sample growth. Fig.1(f) illustrates the expected relationship between the maximal accuracy of samples, ACC( S; X[x0, d_x0] ), and sample sizes. The diagram shows sample size divided by square size vs. ACC (Sect. A Representation for Sample Effect Observations). In complete systems (top curve), the observation of one square (the full set of sample effect observations) is enough to generate accurate effect observations. In incomplete systems (bottom curve), the number of effect observations necessary for accuracy can grow factorially with the number of unobserved factors - requiring very large samples to achieve similar levels of accuracy. This is a conservative estimate, which can be abated by increases in periodicity and independence in out-of-sample factors. In real systems, where factors are typically highly correlated, it seems to be a reasonable upper bound for accuracy. We demonstrate these alternative patterns of sample growth in Sect. Results (EV vs. EV-CF) using descriptive statistics (sample factor distributions, extreme-value distributions, rankings, and autocorrelation) and large-scale black-box prediction of economic growth (multiple state-of-the-art supervised methods), calculated across samples ranging from the street to the national spatial level, and over 100 years of the US census.

⁵ This is analogous to notions of balancedness in experimental designs [22] but requires milder conditions than equal-size populations, being observable in multi-frequency and multi-scale processes, and being observed in real-world systems, as demonstrated in Sect. Results.
3 Related Work

The Shapley value [28] has become an essential tool across disciplines to estimate the importance of variables from the output of black-box systems (i.e., whose inputs can be manipulated exhaustively at will) [21, 3, 7]. The value can be interpreted as the enumeration of all counterfactual effect observations in a fully-observed system. This makes the Shapley value an instance of a U-statistic and a permutation-based statistic [20, 13]. The value φ(a) was devised first to quantify the importance of a given player a in an m-player game, and it can be written as

φ(a) = (1 / (m − 1)!) Σ_{π ∈ Π(X−{a})} [ y(P^π_a ∪ {a}) − y(P^π_a) ],   (7)

where Π(X − {a}) enumerates all permutations π of a set of size m − 1, y is a game utility measure, and P^π_a is a possible coalition⁶ among players (not including a), formed in the order π. Each bracketed quantity is a counterfactual observation of the effect of a (i.e., under all distinct backgrounds and their orderings). Eq.(3) counted the number of such observations for each population in a sample. The value is an ideal, as its calculation is NP-complete [5] and, when quantifying variable importance, it assumes there are no unobserved causes in the sample. Due to sample correlations, this equation cannot be used with random sampling either. Despite these shortcomings, the Shapley value formulates crucial relationships among permutations of inputs when calculating sample statistics and their unbiasedness or fairness [7, 3]. Calculating the expected number of permutations that can be enumerated in samples, or locations, as proposed here, should thus set strict bounds for their unbiasedness, and offer a finer-grained illustration of these relationships. While the relationship between the Shapley value and the fairness of black-box predictors is known [28, 21], its relation to generalization is perhaps more surprising [27].
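The permutation form of the value can be computed by brute force for small m (its NP-completeness makes this infeasible at scale, as noted above). A sketch with a hypothetical 3-player utility:

```python
from itertools import combinations, permutations

def shapley(players, y):
    """Exact Shapley value: average each player's marginal contribution
    over all orderings in which coalitions can form."""
    phi = dict.fromkeys(players, 0.0)
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for a in order:
            before = y[frozenset(coalition)]
            coalition.add(a)
            phi[a] += y[frozenset(coalition)] - before  # counterfactual step
    return {a: v / len(orders) for a, v in phi.items()}

# Hypothetical utility: a alone adds 1, b alone adds 2, and the pair
# {a, b} adds a further 1 (an interaction); c adds nothing.
def utility(s):
    return 1 * ("a" in s) + 2 * ("b" in s) + 1 * ("a" in s and "b" in s)

y = {frozenset(s): utility(s) for r in range(4) for s in combinations("abc", r)}
phi = shapley("abc", y)
print(phi)   # {'a': 1.5, 'b': 2.5, 'c': 0.0}
```

Note that the interaction term is split evenly between a and b (each receives 0.5 on top of its solo contribution), and the values sum to the grand-coalition utility, 4.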
A quantity that becomes central to the formulation of accuracy bounds, Eq.(1), and of tradeoffs between the two previous learning problems, is the growth rate, ω, in enumerable permutations across a system's spatial levels. Because the Shapley value cannot be calculated in practice, random sampling is often employed as an approximation in Shapley-based importance ranking [21]. The notion of sample squares (Sect. A Representation for Sample Effect Observations) was devised from the sample's set of observed permutations, with which effects can be calculated without assumptions of independent and identically distributed factors. Using square sampling is advantageous not only for samples with factor correlations, but particularly for incomplete samples, where, as formulated, assumptions of large random sampling become unrealistic due to their factorial requirements on effect observations. These gains were demonstrated in [27] and are revisited in Sect. Results. Furthermore, the theoretic relationships here elucidate sampling size requirements and EV-CF tradeoffs and limits for non-parametric variable importance and effect estimation. A key element of the previous solution is that sample units in the same location share a large number of unobserved ("external") and uncontrolled factors. Studying samples with increasingly inclusive and distinct counterfactual effect observations can reveal conditions for effect generalization across samples and space. Consider the single-factor case. Let X − {a} be the set of external factors for population {a}. In a random sample with a single treatment indicator a, it follows that IE{ p[ a = 1 | D(X − {a}) ] } = 2^−1, as, every 2 time intervals, we are expected to rebalance (random variables are bold). This is the rationale underlying, for example, Randomized Control Trials [22]. A location with this property has a single balanced population, {a}, and common external factors, X − {a}.
We can alternatively say that p[ a = 1 | D(X − {a}) ] = 0.5, or, a ⊥ (X − {a}) | D(X − {a}), which are typical non-confounding conditions [29, 26]. If units in the location share the same uncontrolled factors and have the same number of members with a as without a, then expected outcome differences between them correspond to a's effect, conditional on the common variation, IE[ Δy(a) | x0 = D(X − {a}) ] = y(x0 + {a}) − y(x0 − {a}). Learning systems and agents in such locations operate with fair estimates of a's impact (albeit with low EV). In a square, in contrast, all m sample factors are balanced simultaneously (m > 1).

⁶ a player set describing a possible cooperation structure in the game, with value y(P^π) when formed.

Fig. 2. (a) an m × m Latin square ("square") for a sample unit x0 (m = 4) and the combinatorial relationships among the sample units placed across its different cells (Venn diagram; factor intersections are shown in grey and singleton differences in color), (b) Binomial (1/2), Fibonacci (1/φ), and Exponential (1/e) rates across squares lead to hyperbolic relations among population sizes; each square's triangle altitude (dashed) is related to the sample's effect-to-background rate, ω, (c) a sample population sweep for factor a, where the rate of insertion of a into populations is held constant; the figure illustrates population sizes as dots and three phases of sample growth: initial (no population has a), balanced (the same number of populations have and don't have factor a), and possibly selected (where all populations have a).

While single-factor balance requires a binomial series, balancing several requires Fibonacci - i.e., square altitude expansion (Sect.
Sample EV-CF Growth Patterns). Each population, in this case, asymptotically follows the half-golden sample size rates of Eq.(6). Square accumulation thus increases the EV of all its populations simultaneously [27] - making squares useful, for example, to understand sample accuracy limits across scales, Eq.(1).

4 The Combinatorial-Hyperbolic Relationship in Sample Growth

4.1 A Representation for Sample Effect Observations

The set of all counterfactuals accumulated by sample growth at one instant can be visualized with a Latin square ("square"), Fig.2(a). Fig.1(a) illustrates two standard ways of visualizing the complete set of 2^|x0| in-sample effects: as the number of edges of a hypercube of dimension |x0|, or as the sums of the |x0|-th row of Pascal's triangle. The square will serve, in addition, as the basis for non-parametric effect estimates across sample factors. For a fixed unit or population x0, it represents a stratification, or placement, of all other populations, x_i, across square cells, with repetition. The completeness or incompleteness of squares, for each x0, will have a stipulated impact on the EV or CF of their effect observations. In particular, for m factors (a, b, c, ..., [m]) the square enumerates all singleton effect observations possible from the sample's m-way effect differences. Its first column contains counterfactual effect observations for {a, b, c, ..., [m]} (i.e., conditioned on all other m − 1 factors being the same as, or "overlapping with", x0). The second column contains singleton effect observations possible from the previous observations (with size-1 difference and m − 2 intersection with x0). These effect observations are thus conditioned on one further factor observation (i.e., on the factor in the preceding column). The third column contains singleton effect observations possible from the previous observations (size-1 difference and m − 3 intersection), etc.
Fig.2(a) illustrates these combinatorial patterns with Venn diagrams for each cell, where a cell's pairwise intersecting factors are grey and singleton differences are colored. This iterative procedure enumerates all possible singleton effect observations in a sample. The square diagram shows only the singleton effects (cells), with their conditioning factors implied. Fig.1(a) illustrates that the square contains all in-sample backgrounds (each a Pascal's triangle) for a fixed factor a, and thus 2·2^{m−1} unique effect observations of a. Each of its diagonals contains all observations for the effect of a given sample factor. Taking columns to mark time progression, the main diagonal thus marks the point of insertion of factor a (e.g., insert a in populations {d}, {d,c}, {d,c,b}). Its lower triangle records the populations without a (with size N_{−a}) and the upper triangle those with a (with size N_{+a}). The square of order m×m, as a whole, contains effect observations where all factors are observed under all m-cycles of a fixed permutation (e.g., {a,b,c,d} in Fig.2(a)). Squares of increasing orders thus capture effect observations under increasing Markovian orders (i.e., conditioned across larger times or backgrounds). The relationship of sample permutations to measurement unbiasedness is a cornerstone of the most widely accepted theory of non-parametric statistics, U-Statistics [20, 13], and of Shapley-value-based estimates of variable importance (Sect. Related Work). The relationship to generalizability has been discussed in [27], and is reviewed, and expanded, below. The full set of Σ_{d=0}^{m} C(m,d) = 2^m effect observations observable in a sample of dimension m collects one square for each of its sample populations, x_i ∈ X^m.
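As a sanity check on these counts, the singleton effect observations of one factor can be enumerated directly: one counterfactual pair per "background" subset of the remaining m−1 factors. This is a minimal sketch under that reading, not the paper's square-placement procedure, and the factor names are illustrative:

```python
from itertools import combinations

def effect_observations(factors, a):
    """Enumerate singleton effect observations of factor `a`: one pair
    (background with a, background without a) per subset of the other
    m-1 factors (the in-sample 'backgrounds')."""
    rest = [f for f in factors if f != a]
    obs = []
    for d in range(len(rest) + 1):
        for background in combinations(rest, d):
            obs.append((frozenset(background) | {a}, frozenset(background)))
    return obs

factors = ["a", "b", "c", "d"]              # m = 4
obs = effect_observations(factors, "a")
m = len(factors)
assert len(obs) == 2 ** (m - 1)             # one effect pair per background
# total populations (cells of the Venn structure): sum_d C(m,d) = 2^m
assert sum(1 for d in range(m + 1)
           for _ in combinations(factors, d)) == 2 ** m
```

With m = 4 this yields 8 effect pairs for a, i.e. 2·2^{m−1} = 16 unique observations of a, matching the count in the text.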
It suggests, then, a natural sample limit for the generalization of effects. Fig.1(c) illustrates the resulting phases for samples under growth with a factor a: no unit includes a (initial), as many units include a as not (balanced), and all include a (selected). Eq.(5) should hold across all such scenarios (N_{+a}, N_{−a} > 0).

4.2 Combinations, Permutations and Partial Permutations

The statistical concept of a population is often associated with combinatorial combinations, as a set of sample units with a given attribute combination (e.g., high-income white males). There are thus C(m,d) = m!/(d!(m−d)!) populations of size d. A problem with this definition is that it leaves all non-population factors unspecified. To define a population we imagine, instead, that we fix the d population factors and vary (i.e., permute) all m−d non-population (i.e., external) factors. This leads to a combinatorial structure known as a partial permutation. The number of partial permutations for a population of size d is C(m,d)·D_{m−d}, where D_{m−d} is the number of derangements (permutations without overlaps). The full set of m! permutations of size m, and all sample growth trajectories, can then be formulated as sets of partial permutations, using a well-known definition for factorials,

m! = [cosh(m−1) + sinh(m−1)] [cosh(m−1) − sinh(m−1)] · m (m−1)!,   (10)

where the bracketed product, e^{m−1} e^{−(m−1)}, equals one. The term C_m = Σ_{d=0}^{m} C(m,d) in Eq.(9) corresponds to a single Pascal's triangle and half-square (i.e., one set of all differences) for each sample population, and Eq.(9) to all squares. The number of observed permutations in a sample can thus be specified succinctly by its number of squares and their derangements. Samples with no missing variables require the observation of few derangements (no relevant exogenous variation) for accurate effect observations, while incomplete samples require the observation of many derangements [27].
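The partial-permutation accounting can be checked numerically: every permutation fixes some set of points and deranges the rest, so the counts C(m,d)·D_{m−d}, summed over population sizes d, must recover all m! permutations. A small sketch (helper names are our own):

```python
from math import comb, factorial

def derangements(n):
    # D_0 = 1, D_1 = 0, D_n = (n-1)(D_{n-1} + D_{n-2})
    if n == 0:
        return 1
    if n == 1:
        return 0
    d0, d1 = 1, 0
    for k in range(2, n + 1):
        d0, d1 = d1, (k - 1) * (d0 + d1)
    return d1

def partial_permutations(m, d):
    """Fix d 'population' factors, derange the m-d external ones."""
    return comb(m, d) * derangements(m - d)

assert [derangements(n) for n in range(6)] == [1, 0, 1, 2, 9, 44]
m = 6
assert sum(partial_permutations(m, d) for d in range(m + 1)) == factorial(m)
```

The final assertion is the identity m! = Σ_d C(m,d)·D_{m−d}, i.e., the full permutation set decomposed into partial permutations.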
The odd and even parts of the Taylor expansion of Eq.(9) lead to the hyperbolic trigonometric functions of Eq.(10) (proof in the Supplementary Material). They indicate the period in which full sets of permutations are collected. The parametric equations for the hyperbola's right branch in (x,y) Cartesian coordinates, Eq.(5), are x = ω cosh(N_{+a}) and y = ω sinh(N_{−a}), Fig.1(f). We will see that these quantities are related to in-sample effect-to-background sampling rates, ω. This quantity will be essential to describe statistical tradeoffs across growing samples.

4.3 Sample EV-CF Growth Patterns

According to the previous, an unbiased effect estimate for a is an average across possible effect backgrounds, and constitutes a U-Statistic [20]. There are F_{m,n} = Σ_{d=0}^{m−1} C(m−d, d) such sequential observations⁷. To generate all of them, we need to fix each effect observation's first, second, third, etc. factors in order. F_{m,n} corresponds to the sum of the number of observations necessary to fix any first factor, C(m−1, 1) = m−1, then C(m−2, 2) to fix a second from the remaining, etc., until all m−1 factors are used. The relationship in Eq.(5) corresponds to the Cartesian equation of a rectangular hyperbola, N_{+x0} · N_{−x0} = c, where c is constant (although well known, this is formulated in the Supplementary Material for completeness). According to the previous, these two quantities have different limits, however: N_{+x0} ∈ [1, C_m] and N_{−x0} ∈ [1, F_{m,n}]. The relationship can thus describe sample limits in large populations by substituting N_{+x0} = C_m and N_{−x0} = F_{m,n} in Eq.(5). As formulated next, the same result can be established from known rates across Pascal's triangle. The two previous quantities, C_m and F_{m,n}, appear in Pascal's triangle (adjacent side and altitude), Fig.2(b).
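The two Pascal's-triangle quantities can be illustrated directly. In the sketch below, C is the row sum (binomial rate) and F is taken, as a reading of the altitude description above, to be the shallow-diagonal sums of Pascal's triangle, which produce Fibonacci numbers; the constant-product property of the rectangular hyperbola is also checked:

```python
from math import comb

def C(m):
    """Row sums of Pascal's triangle: C_m = 2^m."""
    return sum(comb(m, d) for d in range(m + 1))

def F(m):
    """Shallow-diagonal sums of Pascal's triangle (Fibonacci numbers);
    comb(m - d, d) is 0 when d > m - d."""
    return sum(comb(m - d, d) for d in range(m + 1))

fib = [1, 1]
for _ in range(12):
    fib.append(fib[-1] + fib[-2])

assert all(C(m) == 2 ** m for m in range(12))    # binomial rate
assert all(F(m) == fib[m] for m in range(12))    # Fibonaccian rate
# rectangular hyperbola: points (x, y) with x * y = c, a constant
c = 24
pts = [(x, c / x) for x in (1, 2, 3, 4, 6)]
assert all(abs(x * y - c) < 1e-9 for x, y in pts)
```

The exact form of F_{m,n} in the paper depends on its Eq.(9), which is not reproduced here; the sketch only verifies the binomial and Fibonaccian rates the text attributes to the triangle's side and altitude.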
Since the main diagonal marks a's possible "times-of-insertion", the square's upper triangle contains the set of all counterfactuals with a, and the lower, without a, Fig.2(b). With respect to effect observations, we say that each individual effect observation is observed under F_{m,n} in-sample backgrounds for each derangement D_n (or twice this value in balanced samples). The sample effect-to-background enumeration rate ω, at time t, is thus defined as ω = F_{m,n}(t)/D_n(t), or, the number of in-sample background observations, F_{m,n}(t), per derangement, D_n(t), across all populations in the sample. The growth of C_m, D_n and F_{m,n} for N_{+a} or N_{−a} assumes Pythagorean relations⁸, Fig.2(b); Eq.(11) suggests the visualization of sample growth as hyperbolae⁹ with increasing radius D_n, Fig.1(f). In this limiting expression of Eq.(5), C_m corresponds to all possible individual sample populations, N_{+x0} and N_{−x0}, and F_{m,n} to in-sample effect-observation backgrounds. Fig.1(f) shows the hyperbolic asymptotes C_m = F_{m,n} and C_m = −F_{m,n} (dashed). They represent growth with constant EV, D_n = 0. The figure also shows the asymptotic sample (vertical black line) where exactly all observations have factor a, F_{m,n} = 0. Under this condition, no estimator, algorithm, or agent is able to estimate a's effect non-parametrically. It represents the sample with minimum EV, while outward hyperbolae represent samples with increasing EV. Unique background-count growth in this direction follows a Fibonacci series, whose rate is the Golden number. It is well known that the rows, columns and diagonals of Pascal's triangle are associated with binomial, exponential and Fibonaccian rates. Notice then that D_n/C_m ∈ [1/e, 1/2], as growth can range between C_m^{1/m} = 2 and D_n/n! = 1/e, Fig.2(b). The first is due to C_m = 2^m, and the second was famously established by Euler [30].
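The three limiting rates invoked here, binomial, exponential (Euler's derangement limit), and Fibonaccian (golden ratio), can each be confirmed numerically with a few lines:

```python
from math import e, factorial, isclose

def derangements(n):
    # D_0 = 1, D_1 = 0, D_n = (n-1)(D_{n-1} + D_{n-2})
    if n < 2:
        return 1 - n
    d0, d1 = 1, 0
    for k in range(2, n + 1):
        d0, d1 = d1, (k - 1) * (d0 + d1)
    return d1

# binomial rate: row sums C_m = 2^m double at each step
assert 2 ** 21 / 2 ** 20 == 2.0
# exponential rate (Euler): D_n / n! -> 1/e
assert isclose(derangements(15) / factorial(15), 1 / e, rel_tol=1e-9)
# Fibonaccian rate: successive diagonal sums grow by the golden ratio
fib = [1, 1]
for _ in range(40):
    fib.append(fib[-1] + fib[-2])
phi = (1 + 5 ** 0.5) / 2
assert isclose(fib[-1] / fib[-2], phi, rel_tol=1e-9)
```

These are standard facts about Pascal's triangle and derangements; the code is illustrative of the rates only, not of the sample-growth process itself.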
In the previous nomenclature, the first is associated with balanced or Unconfounded (CF) growth, and the second with EV sample growth. The golden ratio is associated, in contrast, with high-dimensional balanced growth of samples and populations, EV-CF growth, and with squares, Eq.(11).

⁷ C(m−d, d) = 0 when d > m−d.
⁸ The equation uses the Pythagorean theorem in its reciprocal form, as it includes the triangle's altitude.
⁹ The equation for a hyperbola is (x/a)² − (y/b)² = r, with a and b its vertices and r its radius.

More specifically, squares are associated with the assumption that C_m/F_{m,n} is constant across factors (i.e., hyperbolae with constant radius), Eq.(11). It indicates that factors' diagonals are the same size, and the population structure is, overall, a "square". Finally, the known hyperbolic relationship

tanh(n) = sinh(n)/cosh(n) = ω⁻¹,   (12)

suggests expressing sample effect-to-background enumeration rates ω in terms of tanh(n)¹⁰. Also note that this definition of ω coincides with that of the Lorentz factor γ [8, 14], best known as a time correction between frames-of-reference in the physical sciences. Here, it preserves frequency relations among factors, N_{+a}/N_{−a}, under changes of basis of the type x = x0+{a} and x = x0−{a}. As suggested by Borel [4], it is natural to think of the transformation as a hyperbolic rotation (analogously to the usual trigonometric one). We will illustrate many of these mathematical abstractions using real-world spatial data in Sect. Results.

5 Results

We will now illustrate the formulated combinatorial and statistical generalizability limits in an important real-world problem: out-of-sample economic growth prediction across increasing spatial extensions (i.e., samples with increasing census individuals).
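Before turning to the data, the hyperbolic rotation of the previous section can be sketched as a boost that preserves the quadratic form x² − y², the analogue of preserving the frequency relation N_{+a}/N_{−a} under the change of basis. This is illustrative code for the standard hyperbolic rotation, not the paper's construction:

```python
from math import cosh, sinh, isclose

def hyperbolic_rotation(x, y, n):
    """Rotate (x, y) by hyperbolic angle n (a 'boost'); the Lorentz-like
    factor is cosh(n). Since cosh^2 - sinh^2 = 1, the quadratic form
    x^2 - y^2 is invariant under the rotation."""
    return (x * cosh(n) + y * sinh(n),
            x * sinh(n) + y * cosh(n))

x, y = 5.0, 3.0
for n in (0.1, 0.5, 1.3):
    xr, yr = hyperbolic_rotation(x, y, n)
    assert isclose(xr * xr - yr * yr, x * x - y * y, rel_tol=1e-9)
```

Just as an ordinary rotation preserves x² + y², the hyperbolic one preserves x² − y², which is why it is the natural transformation for quantities on a fixed hyperbola.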
The data used encompass microdata of the American decennial censuses from 1840 to 1940, and approximately 65 billion individual-level records. This time range corresponds to the decades of American urbanization. We consider the economic and demographic changes as we go, spatially, from the household spatial level, d0 in lat-lon distances, all the way to the national level, for each studied year. We thus create samples with units at arithmetically increasing levels, d_{t+1} = d_t + Δd (starting from d0). We repeat this for approximately 60K American locations, x0. Each full spatial analysis is then reproduced independently across years (thus avoiding issues related to extended longitudinal data). Fig.3(b, top-right) shows two locations in New York City, which share a large amount of external variation (i.e., economic and demographic variation across the rest of the country). The resulting nation-wide transversal captures combinatorial patterns of population differences and overlaps in samples, for all x0, as we increase scale. Our main goal is to illustrate how, consequently, generalizability changes across spatial scales, according to the stipulated model and limits. We first study sample correlations and sizes, demonstrating that they follow the previous hyperbolic relationships. We then repeat previous out-of-sample prediction tasks with this new census data and increasing spatial levels, thus adding to previous evidence presented for a combinatorial counterfactual model of sample generalizability [27].

5.1 Descriptive Statistics (Sample Sizes and Correlations)

We illustrate the consequences of Eq.(12) for sample properties using autocorrelation functions (correlations) and hyperbolic co-tangent (coth) regressions (sample sizes) in large-scale census data. These considerations will be key to solving our main problem, Eq.(1), as the accuracy of agents and algorithms operating in samples is directly determined by sample sizes and their combinatorial patterns [27].
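The arithmetically growing samples d_{t+1} = d_t + Δd can be sketched as nested distance filters around a focal location. This is a toy planar version (plain Euclidean distance on made-up coordinates); the actual pipeline's geodesic distance computation and data structures are not specified here:

```python
from math import hypot

def growing_samples(points, x0, d0, delta, levels):
    """Nested samples at arithmetically increasing radii d_t = d0 + t*delta
    around a focal location x0 (toy lat-lon treated as planar)."""
    samples = []
    for t in range(levels):
        d = d0 + t * delta
        samples.append([p for p in points
                        if hypot(p[0] - x0[0], p[1] - x0[1]) <= d])
    return samples

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.3), (0.5, 0.5), (2.0, 2.0)]
samples = growing_samples(points, (0.0, 0.0), 0.05, 0.25, 4)
sizes = [len(s) for s in samples]
assert sizes == sorted(sizes)            # samples are nested: sizes grow
assert sizes[0] == 1 and sizes[-1] == 4  # household level .. wider levels
```

Each level's sample contains all previous ones, which is what lets the same location x0 be compared across its own growing spatial extensions.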
Economic distribution across space can be described by the primary occupation and industry of all census individuals [1, 16] (e.g., "carpenter" or "executive assistant"). We start with this set of variables, and discuss the full set of variables, including non-economic ones, in the next section. Fig.3(c) illustrates empirical frequencies for all occupations (each a curve) at 4 different spatial levels in Massachusetts (MA) and New York (NY), 1880. They were the country's economic centers until the 19th century. The distribution has the familiar shape of a wave that moves to the left. New York reaches a stationary shape at a lower level d_sq. We demonstrate that these correspond to levels where squares are completed across factors.

¹⁰ With n = arctanh(ω⁻¹) = arctanh(D_n/F_{m,n}), which lets n be the number of accumulated derangements per fixed F_{m,n} (i.e., per square), as expected.

Figure 3. (a) sinh and cosh functions, (b) increasing spatial levels at two example locations (national and city levels), (rightmost panel) finest spatial level for New York City, (c) occupation frequency ranks vs. location across 4 example scales, each curve an occupation, (d) enumerated Latin-Square histograms for Massachusetts and New York, the latter has a square with almost all occupations, (e) periodogram of cosh(100)+sinh(100), sinh(100) and cosh(100), (f) per-occupation periodogram and series example from (c), (g) auto-correlation vs.
spatial-level curves trace catenaries (free-hanging ropes) for each occupation (1880); the probability distribution of their slack (red, side panel) indicates a fixed ω per factor at 0.81 (red horizontal line), Eq.(13), (h) standardized catenaries across all years; boxplots (red, side panel) show slack invariance and a constant ratio between sinh and cosh growth for all locations, years and occupations; m(1 − 1/e) (red vertical line) is a fixed point in binomial-exponential (EV-CF) to exponential (EV) rate transitions.

All squares in a location can be enumerated through an expensive computational procedure [27]. Fig.3(d) shows histograms, where each color corresponds to one of 220 occupations. NY has a spatial square that extends to almost all occupations, while MA has missing occupations (horizontal gaps) in comparison.

5.1.1 A Hanging-Rope Model for Unbiased Sample Growth. The catenary is a hyperbolic curve with a long scientific history, describing a free-hanging rope [9]. Unlike circles and geodesics, catenaries are sums of exponentials. Their equation in (x,y) Cartesian coordinates is y = cosh(x), and their length is l = sinh(x), making them useful to demonstrate the previous model, Eq.(12), and increases in enumerable permutations across spatial levels. We demonstrate that both the observed shape, Fig.3(h), and parameters, Fig.3(h, boxplots), of spatial correlations follow predictions from the previous model. Before considering catenaries, however, Fig.3(a,e) illustrates the overall shape of the previous hyperbolic functions and their frequency-based representations. Fig.3(a) depicts cosh(n) and sinh(n), and Fig.3(e) the periodogram of cosh(n)+sinh(n), sinh(n) and cosh(n), where n = 100. Fig.3(f) illustrates the empirical periodogram of the curves in Fig.3(c), which resembles the simulated one.
Fig.3(g) shows the auto-correlation function (ACF) for all locations across 5K spatial levels (such as those illustrated in Fig.3(c)), up to the state level (Sect. Methods, Supplementary Material). They trace catenaries. The horizontal line y = 1.0, of unitary correlation, is associated with the limit F_{m,n} = 0 where, despite the increasing scale, no population differences are added to the increasing samples. Each single catenary is a set of samples with constant C_m/F_{m,n}, which is a defining property of squares, Eq.(11). Fig.3(g) illustrates 4 typical cases among states. Plots for all states are available in the Supplementary Material. Maryland has linear decreases in auto-correlation. From 1840, the USA economy and cities became increasingly interdependent. After 1900, no state any longer had such linear correlation signatures. Periodic and linear (zig-zag) auto-correlations, with period m/2, are related to non-increasing EV, Fig.1(f, black vertical line). Periodic and exponential correlations, without growth, correspond to catenaries with h = 0 (where a system returns to its original state after a lag). The defining characteristic of the catenary is Δy/Δx = l/h, where l is its length and h its "slack", or difference in height, y, between its two hanging points. Standardizing catenaries [9] (i.e., making l unitary and h constant)¹¹ thus makes the slack h indicate ω during sample growth, which, according to Eq.(6), should assume half-golden values for small D_{x0}. Fig.3(h) shows standardized catenaries for all years and locations. It indicates that tanh per factor remains constant across a range of levels, up to d_sq, starting at the local. This was anticipated by Eq.(12). The rate, up to d_sq, is 81% of correlation. Fig.3(h, side panel, red) shows box plots for h, across all levels, years, occupations, and locations.
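The catenary quantities used above, height cosh(x), arc length sinh(x), and their tanh ratio, which is what survives standardization, can be verified directly:

```python
from math import cosh, sinh, tanh, isclose

# a catenary y = cosh(x): its arc length from 0 to x is sinh(x),
# so the length-to-height ratio is governed by tanh
for x in (0.25, 1.0, 2.0):
    height = cosh(x)
    length = sinh(x)
    assert isclose(length / height, tanh(x), rel_tol=1e-12)

# standardizing (rescaling to unit length) leaves the sinh/cosh
# growth ratio, i.e. tanh, intact
x = 1.5
scale = 1.0 / sinh(x)
assert isclose((scale * sinh(x)) / (scale * cosh(x)), tanh(x), rel_tol=1e-12)
```

This is only the geometric identity behind the plots; the slack statistic h computed from the empirical ACF curves involves fitting steps that the paper leaves to its Methods section.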
For all spatial levels below d_sq, factors remain balanced, with binomial-exponential rates (i.e., hyperbolic functions with period m/2). Levels above d_sq revert to exponential growth. We called this a transition between EV-CF and EV growth. It is indicated in the plots by the dislocation of the catenary center from m/2 to m(1 − 1/e) (red vertical lines). We reproduce the same results, Fig.3(g,h), with standard Pareto regressions in Sect. Methods of the Supplementary Material, as an alternative to these graphical depictions. Fig.4(d) shows estimated levels d_sq for all states, across years. Levels d_sq converge across years for all states. New York has 2-level squares. Fig.3(h, upright panel) shows catenaries for its lower-level square, and Fig.4(b) illustrates the levels cartographically. The two squares' factors are disjoint (gray, lower factors taking exponential rates in higher ones). American states have had, through their histories, very different work forces and regional distributions. While catenary lengths are different across occupations, Fig.3(g), their slack (and cosh-sinh growth ratio) remains invariant across all locations, years and occupations, Fig.3(h). These plots thus graphically illustrate the previous combinatorial constraints on the unbiasedness and predictiveness of learning systems across space. We return to this discussion in Sect. Predictive Statistics.

5.1.2 Permutations in Heterogeneous Samples. Zipf's law and distribution are central to the study of city-size distributions [24, 12, 2]. The law is based on a frequency ranking of the studied factors, and thus on one of their

¹¹ Δy/Δx = sinh(x)/cosh(x) = tanh(x).
Figure 4. (a) coth and tanh functions; colored arrows illustrate square factor-frequency increases with scale, (b) New York (NY) state population density (left); NY has squares at two levels d_sq, at 0.018 and 0.65 lat-lon distances (red circles), (c) schematic depiction of the frequency-rank vs. spatial-scale plots in (h), (d) d_sq (spatial level of the first square) for states and years, (e) BIC likelihood of coth over a Zipf model, (f) state-of-the-art growth-prediction accuracy with increasing spatial level d; d_sq is the diagonal (dashed), (g) minimum (green) and maximum (red) frequency ranks across locations; each curve is a scale; blue curves indicate square size, which follows a tanh function, (h) the coth model, as illustrated in (a), with empirical data.

permutations. It is, here, associated with homogeneous samples (i.e., samples with little across-factor variation). Fig.4(a) depicts the overall shapes of the tanh(n) and coth(n) functions. Fig.4(h) shows occupations' minimum frequency rank, r0 (green), across all locations at increasing spatial levels, as well as their maximum rank, r_ω (red).
The former is the minimum frequency-ranking order of one given occupation across all of the level's locations. The latter is the maximum (these are formulated explicitly in Sect. Methods). The latter is related to Zipf's frequency rankings and the Pareto distribution (Sect. Methods), as the three are power laws. Each curve in the figure corresponds to one spatial level and occupation. With a homogeneous sample, we expect one highest-rank industry across all locations, and thus r_ω = r_0 = 1. What we observe, however, is that factors are ranked in constant-sized ranges, as visualized in squares. Each factor is the highest ranked in some location, the second in another, etc. These rankings define an arithmetic series r_0, r_0+1, r_0+2, ..., r_ω for each factor. The series has mean r̄ = (r_0 + r_ω)/2, which is also shown (blue). The previous model predicts both that r_ω − r_0 is constant, and that it reflects the enumeration rate ω. Fig.4(h) shows that empirical rankings have constant r_ω − r_0, with increasing r_0. A closer examination of both branches (red and green) reveals that they correspond to the positive and negative sections of the coth(n) = 1/tanh(n) function, Fig.4(i,a). Imagine the following process: pick a location x0, and its most and least frequent factors (i.e., with ranks 1 and m). Label them, respectively, a and z. Balance z to match a's frequency. Move one spatial level up, pick another z, balance, and repeat. This is the process described by Eq.(5). Each square row corresponds to a single derangement and background, where n_{+a} = ω·D_a is the number of units in cell a. The cost to balance each z is thus n_{+a}/ω per row. For all locations x0, and levels d0 ≤ d ≤ d_sq,

ω − n_{+a}/n_{+z|X∖{a}} = 0,  or,  n_{+a} − n_{+z|X∖{a}} coth(n) = 0.
(14)

The coth function has the interesting property of separating, by sign, each location's background and effect phases, and describes more directly how squares are completed. This is illustrated in Fig.4(a) as one hyperbolic rotation, with subsequent square derangements leading to others. Methodologically, this suggests fitting a coth function to observed frequency ranks. A Zipf distribution can be fit by power-law or Pareto distribution regressions (Sect. Methods). Enumeration-rate increases imply increasing permutations, and thus differences between minimum and maximum frequency ranks. This predicts that Zipf-Pareto regressions will become increasingly inaccurate (compared to coth) as cities become more heterogeneous. Fig.4(e) shows an increase of up to 18 times in fit likelihood favoring the coth model throughout the studied period, according to the Bayesian Information Criterion.

5.2 Predictive Statistics

What impact does the presence of squares in samples have statistically (with respect to bounds on their predictiveness and biasedness)? Fig.4(f) demonstrates a result, using census microdata, with an accuracy vs. spatial-level plot (see [27] for others). Samples in the previous section contained sample units' primary occupations [6, 25]. This led to binary samples of dimension m = 543 (each unit seen as a binary vector of length 543). For Fig.4(f), all variables in the American micro census were, instead, used [17]. Each census binary variable led to one field, each categorical variable to as many binary variables as the size of its domain (as defined by the census), and continuous variables to 8-bit vectors (corresponding to their 8 quantiles). The final sample had 10,055 variables, including information on a broad range of population characteristics: fertility, nuptiality, life-course transitions, immigration, internal migration, labor-force participation, occupational structure, education, ethnicity, and household composition.
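The variable encoding just described can be sketched as follows. The schema and field names are purely illustrative, not the IPUMS layout: a binary variable yields one bit, a categorical variable one bit per category in its domain, and a continuous variable 8 bits marking its octile against precomputed cut points:

```python
def encode_unit(record, schema):
    """Toy version of the encoding described in the text. `schema` is a
    list of (name, spec) pairs with spec being ('binary',),
    ('categorical', domain) or ('continuous', octile_cuts)."""
    bits = []
    for name, spec in schema:
        v = record[name]
        if spec[0] == "binary":
            bits.append(int(bool(v)))
        elif spec[0] == "categorical":
            bits.extend(int(v == c) for c in spec[1])    # one-hot
        else:                       # continuous -> 8 quantile bits
            cuts = spec[1]          # 7 cut points define 8 octiles
            q = sum(v > c for c in cuts)
            bits.extend(int(q == i) for i in range(8))
    return bits

schema = [("employed", ("binary",)),
          ("occ", ("categorical", ["carpenter", "clerk", "farmer"])),
          ("age", ("continuous", [10, 20, 30, 40, 50, 60, 70]))]
x = encode_unit({"employed": 1, "occ": "clerk", "age": 34}, schema)
assert len(x) == 1 + 3 + 8      # one field, a one-hot, an octile vector
assert sum(x[1:4]) == 1 and sum(x[4:]) == 1
```

Applied to the full census schema, an encoding of this kind is what produces the 10,055-dimensional binary units the classifiers consume.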
These variables can be correlated, collinear and spurious. Each of the multiple state-of-the-art classifiers employed next deals with these statistical pitfalls in its own proposed way. The classification task in Fig.4(f) is to predict whether a given occupation will grow (enlist further members) in the next time interval (10 years ahead), for each location x0. A detailed description of the algorithms used, and of their hyperparameter optimization, can be found in [27]. They include neural network models, generalized linear models, boosting models, generalized additive models, random trees, LASSO and Ridge regressions, ANOVA, support vector machines, and stacked meta-learners over all the previous algorithms. Spatial levels (and aggregated data) ranged from the local to the national. One million location-year pairs were chosen randomly, each leading to a full set of spatially growing samples. The figure thus shows the maximum accuracy of 24 state-of-the-art supervised algorithms at predicting whether a given occupation will grow, or not, in a location, as we use data from increasing spatial levels (starting with the local and reaching the national). Accuracy is defined as the number of accurately classified observations in the held-out sample. Spatial levels d_sq for each state are mapped to the diagonal (dashed) in the figure, and each state is a curve. The way accuracy changes across locations largely follows the shape expected from Fig.1(f). Accuracy was averaged across same-state locations to generate these curves. Bootstrap accuracy-variation bands (across states' locations) are shown for the two most accurate states, New York and Illinois. We observe that New York gains little from external data above d_sq, as it already contains, within its boundaries, high levels of variation.
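Held-out accuracy of this kind can be illustrated with a deliberately minimal classifier, a stand-in for the 24 tuned algorithms rather than any of them, on a toy version of the growth-prediction label (all names and thresholds below are invented for the sketch):

```python
from math import exp
import random

def train_logistic(data, lr=1.0, epochs=400):
    """Tiny batch-gradient logistic regression (illustrative only)."""
    w0 = w1 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x1, x2), label in data:
            p = 1.0 / (1.0 + exp(-(w0 * x1 + w1 * x2 + b)))
            g = (p - label) / n
            g0 += g * x1; g1 += g * x2; gb += g
        w0 -= lr * g0; w1 -= lr * g1; b -= lr * gb
    return w0, w1, b

def accuracy(model, held_out):
    """Share of correctly classified observations in the held-out sample."""
    w0, w1, b = model
    hits = sum((w0 * x1 + w1 * x2 + b > 0) == bool(label)
               for (x1, x2), label in held_out)
    return hits / len(held_out)

random.seed(0)
# toy stand-in for "occupation grows next decade": label = 1 iff x1 + x2 > 1
pts = [(random.random(), random.random()) for _ in range(300)]
data = [(p, int(p[0] + p[1] > 1.0)) for p in pts]
train, held_out = data[:200], data[200:]
acc = accuracy(train_logistic(train), held_out)
assert acc > 0.85       # linearly separable toy task: high held-out accuracy
```

The held-out split is the essential part: the paper's accuracy curves are this statistic, computed per location and spatial level, then averaged within states.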
This also implies that, without unobservables, 81% of the sample is sufficient for prediction. Homogeneous locations, in contrast, have incomplete squares, and observed predictions are susceptible to external and unobserved variation. The right panels in Fig.4(f) show the increase in accuracy, ∂ACC/∂d, of samples encompassing increasing distances (at 0.05 lat-lon intervals, normalized over their spatial and accuracy ranges) for New York and Illinois. It is estimated by the differences in accuracy of models trained at a level d and at its predecessor. The top panel shows the accuracy of algorithms on samples with observations d > d_sq, and the bottom panel d ≤ d_sq, across all locations in those states (gray ribbons show their standard deviation). These patterned changes in the accuracy of supervised black-box algorithms mirror the shape of the functions in Fig.3(g,h). This is expected, as the accuracy of systems with the characteristics described above increases with pairwise correlations, Fig.2(a). The functional form for accuracy, F(d, μ) = μ sinh(d), therefore takes hyperbolic forms with distinct parameters¹², μ = 0.5m and μ = (1 − 1/e)m, and constant tanh, 0 ≤ F(d, μ) ≤ 1. The 80-20 ratios observed for correlations in Sect. A Hanging-Rope Model for Unbiased Sample Growth are thus also observed in the outcome of black-box predictors, Fig.4(f). As before, the functional F(d, μ) is an apt description of these systems only because tanh remains constant throughout them (i.e., in squares and balanced samples), Eq.(11). Given the balance of samples with d < d_sq, effect estimation is expected to be accurate in these samples, in contrast to samples with d > d_sq. In [27], we use multiple simulated scenarios to show that samples having combinatorial conditions like those with d < d_sq facilitate causal effect estimation.
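Effect estimates of the Shapley kind, evaluated next, average a factor's marginal contribution over permuted backgrounds, the same permutation-averaging the text relates to U-statistics. A toy exact computation (invented three-factor outcome) shows how an interaction, the analogue of an incomplete background set, spreads those marginals, while a context-free factor shows no spread:

```python
from itertools import permutations
from statistics import pvariance

def shapley(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings of the player set."""
    marginals = {p: [] for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            marginals[p].append(value(frozenset(coalition)) - before)
    return {p: sum(ms) / len(ms) for p, ms in marginals.items()}, marginals

# toy outcome: 'a' adds 2, 'b' adds 1, 'c' adds 1 only alongside 'a'
def y(s):
    return 2 * ('a' in s) + 1 * ('b' in s) + 1 * ('a' in s and 'c' in s)

phi, marginals = shapley(('a', 'b', 'c'), y)
assert abs(sum(phi.values()) - y(frozenset('abc'))) < 1e-9  # efficiency
assert pvariance(marginals['b']) == 0    # context-free factor: no spread
assert pvariance(marginals['a']) > 0     # interacting factor: spread
```

The variance of the marginals, zero for b, positive for a and c, is the toy counterpart of the uncertainty over effect estimates plotted in Fig.4(g).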
This is possible because, there, we have ground-truth information for the effect of variables. We do not have ground truth in this large real-world system, but the central claim here is that in samples with d > d_sq the problem becomes precipitously harder. Fig.4(g) shows the variance of the popular Shapley-based effect estimator [21] across all locations x0 and distances d, in the previous samples. There is a discontinuity, and a significant increase in uncertainty over effect estimates above d_sq (and very little below), across all locations. Together, the previous considerations suggest constraints, described by F(d, μ), for supervised prediction and effect estimation in spatial systems, Eq.(1). The functions F(d, (1 − 1/e)m) and F(d, 0.5m) describe, respectively, spatial growth patterns of externally-valid (EV) and unconfounded (EV-CF) samples.

¹² The same is expected from d/dx sinh(x) = cosh(x) and the results in Sect. A Hanging-Rope Model for Unbiased Sample Growth.

6 Conclusion

We studied applications of concepts from non-parametric and counterfactual statistics to sample growth processes, common, for example, in the study of spatial systems. We highlighted sample conditions where m populations can have their effect observations remain unbiased while, at the same time, increasing in generalizability. Hyperbolic functions offered a natural description and visualization of these alternative growth patterns. We used them to formulate, in particular, the asymptotic number of backgrounds that effects are observed under in large-scale data, and thus limits to effect generalizability in these data. Increases in generalizability require exponential sample-size growth. Increases with unbiasedness require Fibonaccian growth, with a half-golden growth ratio.
Future work will describe whether the in- and out-of-sample variable-shuffling models described here can also lead to finer-grained and non-asymptotic limits on effect generalization based on background randomization. We demonstrated the model empirically (functional form F(d_{x0}), enumerative and combinatorial properties, 3 predicted rates), and connected sample growth to the statistical environment (biases and predictability) it creates for its populations.

Acknowledgments

Funding provided by the Sao Paulo Research Foundation (FAPESP). The datasets analyzed are available in the IPUMS repository [17].

References

[1] Pierre-Alexandre Balland, Cristian Jara-Figueroa, Sergio G. Petralia, Mathieu P. A. Steijn, David L. Rigby, and César A. Hidalgo. "Complex economic activities concentrate in large cities". In: Nature Human Behaviour 4.3 (2020), pp. 248-254. doi: 10.1038/s41562-019-0803-3.
[2] Marcus Berliant and Axel H. Watanabe. "A scale-free transportation network explains the city-size distribution". In: Quantitative Economics 9.3 (2018), pp. 1419-1451. doi: 10.3982/QE619.
[3] Paul de Boer and João F. D. Rodrigues. "Decomposition analysis: when to use which method?" In: Economic Systems Research 32.1 (2020), pp. 1-28. doi: 10.1080/09535314.2019.1652571.
[4] Émile Borel. Introduction géométrique à quelques théories physiques. Paris, 1914.
[5] Guy Van den Broeck, Anton Lykov, Maximilian Schleich, and Dan Suciu. "On the Tractability of SHAP Explanations". In: Journal of Artificial Intelligence Research (JAIR) (2020). doi: 10.1613/jair.1.13283.
[6] Census Bureau. Alphabetical index of industries and occupations. 1950 census of population. Washington, 1951.
[7] Nadia Burkart and Marco F. Huber. "A Survey on the Explainability of Supervised Machine Learning". In: Journal of Artificial Intelligence Research 70 (2021), pp. 245-317. doi: 10.1613/jair.1.12228.
url: https://doi.org/10.1613/jair.1.12228.
[8] Sean Carroll. Spacetime and Geometry: An Introduction to General Relativity. Benjamin Cummings, 2003. isbn: 0805387323.
[9] Paul Cella. "Reexamining the Catenary". In: The College Mathematics Journal 30.5 (1999), pp. 391–393. doi: 10.1080/07468342.1999.11974093.
[10] Persi Diaconis. Group Representations in Probability and Statistics. Institute of Mathematical Statistics Lecture Notes–Monograph Series, 11. Hayward, CA: Institute of Mathematical Statistics, 1988, pp. vi+198. isbn: 0-940600-14-5. url: http://projecteuclid.org/euclid.lnms/1215467407.
[11] Persi Diaconis, R. L. Graham, and William M. Kantor. "The mathematics of perfect shuffles". In: Advances in Applied Mathematics 4.2 (1983), pp. 175–196. doi: 10.1016/0196-8858(83)90009-X.
[12] X. Gabaix. "Zipf's Law for Cities: An Explanation". In: The Quarterly Journal of Economics 114.3 (1999), pp. 739–767. doi: 10.1162/003355399556133.
[13] Wassily Hoeffding. "A Class of Statistics with Asymptotically Normal Distribution". In: The Annals of Mathematical Statistics 19.3 (1948), pp. 293–325. doi: 10.1214/aoms/1177730196.
[14] Joseph Hucks. "Hyperbolic complex structures in physics". In: Journal of Mathematical Physics 34.12 (1993), pp. 5986–6008. doi: 10.1063/1.530244.
[15] Daniela Inclezan and Luis I. PrΓ‘danos. "Viewpoint: A Critical View on Smart Cities and AI". In: Journal of Artificial Intelligence Research 60.1 (2017), pp. 681–686. issn: 1076-9757.
[16] Inho Hong, Morgan R. Frank, Iyad Rahwan, Woo-Sung Jung, and Hyejin Youn. "The universal pathway to innovative urban economies". In: Science Advances 6.34 (), eaba4934. doi: 10.1126/sciadv.aba4934.
[17] IPUMS. U.S. Individual-level Census (United States Bureau of the Census). 2022.
url: https://usa.ipums.org/usa/complete_count.shtml.
[18] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. "Prediction Policy Problems". In: The American Economic Review 105.5 (2015), pp. 491–495. doi: 10.1257/aer.p20151023.
[19] Christopher Krapu, Robert Stewart, and Amy Rose. "A Review of Bayesian Networks for Spatial Data". In: ACM Trans. Spatial Algorithms Syst. (2022). Just Accepted. issn: 2374-0353. doi: 10.1145/3516523.
[20] A. J. Lee. U-Statistics: Theory and Practice. New York: M. Dekker, 1990. isbn: 0824782534.
[21] Scott M. Lundberg and Su-In Lee. "A Unified Approach to Interpreting Model Predictions". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS '17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 4768–4777. isbn: 9781510860964.
[22] Douglas C. Montgomery. Design and Analysis of Experiments. New York: John Wiley, 2001. isbn: 0471316490.
[23] Stephen L. Morgan and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge: Cambridge University Press, 2007. isbn: 9780521856157. doi: 10.1017/CBO9780511804564.
[24] M. E. J. Newman. "Power laws, Pareto distributions and Zipf's law". In: Contemporary Physics 46.5 (Sept. 2005), pp. 323–351. doi: 10.1080/00107510500052444.
[25] Anastasiya M. Osborne and Peter B. Meyer. Proposed Category System for 1960-2000 Census Occupations. [Washington, D.C.]: U.S. Dept. of Labor, Bureau of Labor Statistics, Office of Productivity and Technology, 2005.
[26] Hans Reichenbach. The Direction of Time. Berkeley: University of California Press, 1956.
[27] Andre F. Ribeiro. "Sample observed effects: enumeration, randomization and generalization". In: Scientific Reports 15.1 (2025), p. 8423. doi: 10.1038/s41598-024-80839-8.
url: https://doi.org/10.1038/s41598-024-80839-8.
[28] Alvin E. Roth. The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge: Cambridge University Press, 1988. isbn: 9780521361774. doi: 10.1017/CBO9780511528446.
[29] Donald B. Rubin. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions". In: Journal of the American Statistical Association 100.469 (2005), pp. 322–331. doi: 10.1198/016214504000001880.
[30] Charles Edward Sandifer. How Euler Did It (Chapter 17, p. 103). The MAA Tercentenary Euler Celebration, v. 3. [Washington, DC]: Mathematical Association of America, 2007. isbn: 9780883855638.
[31] Bernhard SchΓΆlkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. "Toward Causal Representation Learning". In: Proceedings of the IEEE 109.5 (2021), pp. 612–634. doi: 10.1109/JPROC.2021.3058954.
[32] Shashi Shekhar. "What is Special about Spatial Data Science and Geo-AI?" In: 33rd International Conference on Scientific and Statistical Database Management. SSDBM 2021. Tampa, FL, USA: Association for Computing Machinery, 2021, p. 271. isbn: 9781450384131. doi: 10.1145/3468791.3472263.
[33] Shashi Shekhar and Pamela Vold. Spatial Computing. MIT Press Essential Knowledge series. Cambridge: The MIT Press, 2020. isbn: 0-262-35681-3.
[34] Ilaria Tiddi and Stefan Schlobach. "Knowledge graphs as tools for explainable machine learning: A survey". In: Artificial Intelligence 302 (2022), p. 103627. doi: 10.1016/j.artint.2021.103627.

Received 14 November 2023; accepted 26 July 2024