# An Entropy-Based Model for Hierarchical Learning

Journal of Machine Learning Research 25 (2024) 1-45. Submitted 1/23; Revised 6/24; Published 6/24.

Amir R. Asadi (asadi@statslab.cam.ac.uk), Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Cambridge CB3 0WA, United Kingdom.

Editor: Gabor Lugosi

**Abstract.** Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance's norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.
Keywords: machine learning, neural network, chaining, information theory, scale-invariant distribution, curriculum learning, logarithmic binning

©2024 Amir R. Asadi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0096.html.

## 1. Introduction

This paper introduces a hierarchical learning model with a multiscale entropy-based training mechanism. Designed to exploit the multiscale structure in many real-world datasets, the proposed model aims to provide statistical and computational efficiency while featuring interpretability.

### 1.1 Background

In contemporary times, machine learning is the predominant approach in artificial intelligence, enabling computers to acquire knowledge from data and experience. Within the supervised learning paradigm, training data is postulated to arise randomly as pairs $(X, Y)$, denoting the data instance and its corresponding label, respectively. A computer is presented with a sequence of such instances and labels independently drawn from the underlying data probability distribution. Its task is to discern the relationship between the data instances and their labels, ultimately predicting the label of new, randomly drawn instances. In this paper, for simplicity, we adopt the function learning setting, part of the realizability assumption in statistical learning theory. That is, we assume that the dependency between data instances and their corresponding labels is modeled noiselessly by a deterministic target function $Y = T(X)$, mapping any data instance to its label (Shalev-Shwartz and Ben-David, 2014; Mendelson, 2008). An insightful observation is that the sequence of training examples typically lacks comprehensive information about the target function or the data domain. Beyond the training data, providing the computer with auxiliary information through the learning model is crucial.
Richer auxiliary information enables the computer to learn the target function more accurately and more efficiently. To illustrate this point, consider the extreme scenario where no auxiliary information is given to the computer, implying a total lack of knowledge about the target function or the data domain beyond the training data. In such a case, the optimal strategy for the computer becomes merely memorizing the training examples, leaving it unable to predict the label of any new instance not present in the training data. This approach is very prone to overfitting, most likely resulting in poor performance on new and unseen data instances. This notion of auxiliary information aligns with the idea of regularization studied in statistical learning theory; see, for example, (Vapnik, 1999). Regularization manifests in various forms, such as restricting the hypothesis class and explicitly or implicitly regularizing the training mechanism. Moreover, it is intricately linked to the no-free-lunch theorem, which underscores the necessity for every learning algorithm to possess some level of prior knowledge about the underlying assumptions of the learning problem in order to attain success; see, for instance, (Shalev-Shwartz and Ben-David, 2014, Theorem 5.1). Analogously, in most tasks, human knowledge is acquired through a combination of training data, prior knowledge, and intuition.

A prevalent characteristic of many real-world datasets is the multiscale nature of their data instances. In other words, in these datasets, data manifest across different scales of magnitude, exhibiting a diverse range of sizes and complexities. This inherent property is exploited in applications in various fields, such as wavelet theory, Fourier analysis, and signal processing, as detailed in (E, 2011) and references therein.
In particular, empirical data distributions across various domains such as physics, biology, medicine, finance, natural language processing, and the social sciences frequently exhibit power-law behavior. Notable instances of power-law probability distributions include the distributions of people's incomes, city populations, stars' brightness, file sizes in computing, earthquake magnitudes, and word frequencies in human languages, among other examples. These phenomena have been extensively studied, as demonstrated in works such as (Newman, 2005; Clauset et al., 2009) and references therein. The generation of power-law distributions in natural and artificial systems involves diverse mechanisms, and each is believed to apply to specific applications; see (Sornette, 2006, Chapter 14) and (Newman, 2005; Mitzenmacher, 2004). The most important of such mechanisms are considered to be growth with preferential attachment (Yule's process) and critical phenomena (Newman, 2005). Furthermore, the generalized central limit theorem posits that the normalized sum of independent and identically distributed random variables with infinite variance can only converge to a stable distribution; see, for example, (Nolan, 2020). All stable distributions with infinite variance exhibit power-law tails, whereas the Gaussian distribution is the sole stable distribution with finite variance (Samoradnitsky and Taqqu, 2017). Power-law distributions are examples of scale-invariant probability distributions. Additionally, target functions in the real world, relating continuous data instances to their real-valued labels, tend to be smooth and well-behaved. Leveraging such characteristics of real-world data instances and target functions as auxiliary information in a machine-learning model presents an opportunity to yield statistical and computational benefits in real-world applications.
### 1.2 Overview of the Contributions

In this paper, we introduce a hierarchical learning model that takes advantage of the aforementioned multiscale nature of data instances and the smoothness of their target functions. The model comprises a compositional learning architecture with a sequential multiscale training mechanism. First, the training data is partitioned at different scales, and the training mechanism starts by learning from the batch of the smallest data, progressing step by step toward the batches of larger data. The learning process at each batch of data takes advantage of the model learned over smaller data as prior information. This is inspired by the logical learning mechanism observed in humans, who progressively learn different tasks, commencing with small (easy) examples and advancing toward large (complex) examples. Due to the ubiquity of such multiscale data, our proposed learning model holds promise for many applications. More precisely, we consider the following two assumptions:

(a) Data instances $X_i \in \mathbb{R}^m$ emerge in different scales of magnitude from a distribution $\mu$ defined on the data domain $\mathcal{X}$. Here, we assume that for $0 < \varepsilon < R$, the data domain is $\mathcal{X} = \{x \in \mathbb{R}^m : \varepsilon \le |x| < R\}$, where $|x|$ denotes the Euclidean norm of datum $x$. Typically, $\varepsilon$ is much smaller than $R$. Later in the paper, in Subsection 5.2, we assume that $\mu$ is a scale-invariant probability distribution with shape parameter $\alpha$, whose probability density $q(x)$ satisfies the following condition: for all $x \in \mathcal{X}$ and any $\gamma \ge 1$ such that $x/\gamma \in \mathcal{X}$, we have
$$q(x/\gamma) = \gamma^{\alpha} q(x). \tag{1}$$

(b) The target function $T : B^m_R \to \mathbb{R}^m$ is well-behaved, where $B^m_R = \{x \in \mathbb{R}^m : |x| < R\}$ denotes the $m$-dimensional Euclidean ball with radius $R > 0$, centered at the origin. In this paper, we assume that $T$ is differentiable and smooth, and we further assume both the invertibility of $T$ and the Lipschitz continuity of its inverse, $T^{-1}$.
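Condition (1) can be verified directly for a one-dimensional power-law density. The following sketch is our own illustration, not part of the paper: the constants `eps`, `R`, and `alpha` are hypothetical choices, and the density $q(x) = c\,x^{-\alpha}$ on $[\varepsilon, R)$ satisfies the scale-invariance condition exactly.

```python
import numpy as np

eps, R, alpha = 0.01, 10.0, 2.5                # hypothetical domain and shape

# Normalizing constant so that q integrates to 1 on [eps, R).
c = (1 - alpha) / (R ** (1 - alpha) - eps ** (1 - alpha))

def q(x):
    return c * x ** (-alpha)                   # power-law density on [eps, R)

# Check condition (1): q(x / gamma) = gamma**alpha * q(x) whenever both
# x and x / gamma lie in the domain and gamma >= 1.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.uniform(10 * eps, R)
    gamma = rng.uniform(1.0, x / eps)          # keeps x / gamma >= eps
    assert np.isclose(q(x / gamma), gamma ** alpha * q(x))
```

Any density proportional to $x^{-\alpha}$ works here, since $q(x/\gamma) = c(x/\gamma)^{-\alpha} = \gamma^{\alpha} q(x)$ holds identically.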
For the special case of $m = 1$, we show (in Theorem 19) that an additional smoothness assumption on $T^{-1}$ leads to a strengthened result. We further assume that, from prior knowledge or intuition, the learning model knows the behavior of the function $T$ on $B^m_\varepsilon$, that is, on data instances at very small scales. We take advantage of such assumptions on the data in our hierarchical learning model based on the following contributions:

**Ladder decompositions of functions.** The first main result of the paper studies ladder decompositions of invertible functions $T : B^m_R \to \mathbb{R}^m$. The decomposition is defined in terms of dilations $T_{[\gamma]} : B^m_R \to \mathbb{R}^m$ of $T$, where for any scale $0 < \gamma \le 1$ and all $x \in B^m_R$,
$$T_{[\gamma]}(x) := \frac{T(\gamma x)}{\gamma}.$$
The dilation $T_{[\gamma]}$ can be conceptualized as a zoomed version of the original function $T$ into the origin, where the degree of zooming is determined by the scale $\gamma$. Given a sequence of dilation scales $0 < \gamma_0 < \cdots < \gamma_d = 1$, the ladder decomposition is defined based on the sequence of dilations $\{T_{[\gamma_k]} : k = 0, \ldots, d\}$ interpolating between $T_{[\gamma_0]}$ and $T_{[1]} = T$. The decomposition is constructed by forming the successive compositions $T_k := T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}$ for all $k = 1, \ldots, d$, and writing $T_{[\gamma_k]} = T_k \circ \cdots \circ T_1 \circ T_{[\gamma_0]}$. Building on an earlier result from (Bartlett et al., 2018), we show that the smaller the value of $\gamma_k - \gamma_{k-1}$ is, the closer $T_k$ is to the identity mapping, thus being more well-behaved and easier to learn. To elaborate, for all $1 \le k \le d$, let $\psi_k(x) := T_k(x) - x$. The first main contribution of the paper, Corollaries 20 and 26 in Section 3, states that if an $M_2$-smooth invertible function $T : B^m_R \to \mathbb{R}^m$ is such that $T^{-1}$ is $M_1$-Lipschitz, then $\|\psi_k\|_{\mathrm{Lip}} \le C(\gamma_k - \gamma_{k-1})$, where $\|\psi_k\|_{\mathrm{Lip}}$ denotes the Lipschitz norm of $\psi_k$ and the constant $C$ only depends on $M_1$, $M_2$ and $R$. The definitions of the ladder decomposition and the functions $\psi_k$ yield that, for all $1 \le k \le d$, we have
$$T_{[\gamma_k]} = T_k \circ T_{[\gamma_{k-1}]} = T_{[\gamma_{k-1}]} + \psi_k \circ T_{[\gamma_{k-1}]}.$$
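The claim that $T_k$ approaches the identity as consecutive scales get closer can be observed numerically. The sketch below is our own illustration, assuming $T = \tanh$ on $(-1, 1)$; the grid resolution, the choice of scales, and the loose constant `5.0` in the sanity check are arbitrary, not the paper's constants.

```python
import numpy as np

T, Tinv = np.tanh, np.arctanh

def psi(g_prev, g_next, x):
    # T_k = T_[g_next] o (T_[g_prev])^{-1};  psi_k(x) = T_k(x) - x
    v = Tinv(g_prev * x) / g_prev          # (T_[g_prev])^{-1}(x)
    return T(g_next * v) / g_next - x

R, d = 1.0, 8
gammas = np.linspace(0.1, 1.0, d + 1)      # gamma_0 < ... < gamma_d = 1
for k in range(1, d + 1):
    gp, gn = gammas[k - 1], gammas[k]
    # grid inside the image of T_[gp], where psi_k is defined
    xs = np.linspace(-0.9, 0.9, 400) * T(gp * R) / gp
    slopes = np.abs(np.diff(psi(gp, gn, xs)) / np.diff(xs))
    lip = slopes.max()                      # crude Lipschitz estimate
    # Loose sanity check: the estimate shrinks with the scale gap.
    assert lip <= (gn - gp) * 5.0
```

The finite-difference estimate of $\|\psi_k\|_{\mathrm{Lip}}$ stays small in every level, consistent with the $C(\gamma_k - \gamma_{k-1})$ behavior described above.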
This dependency between the levels of the ladder decomposition fits with a hierarchical learning model with residual levels of the following form:
$$h_k(x) := h_{k-1}(x) + f(h_{k-1}(x); w_k). \tag{2}$$
Here, $h_0 := T_{[\gamma_0]}$, and $f(\cdot\,; w)$ is a learnable model with parameters $w$ (a vector), for example, a neural network. A specific example of $f$, a two-layer network with step activation functions, is explored in Section 6. Let $w = (w_1, \ldots, w_d)$ denote the parameters of the whole hierarchical model. For any $1 \le k \le d$, define $w_{1:k} := (w_1, \ldots, w_k)$. Level $h_k$ is a function of the input datum $x$ and the weights $w_{1:k}$; therefore, we sometimes denote $h_k$ by $h_k(x; w_{1:k})$. Parameters $w_1, \ldots, w_d$ will be chosen sequentially by the training mechanism such that $f(\cdot\,; w_k)$ approximates $\psi_k$ well for all $k = 1, \ldots, d$. Thus, for all $k = 1, \ldots, d$, the $k$th level of our learning model ($h_k$) aims to approximate the dilation $T_{[\gamma_k]}$ of the target function.

**The hierarchical learning model.** By defining $\gamma := (R/\varepsilon)^{1/d}$, we split the data domain $\mathcal{X}$ into $d$ scales:
$$\mathcal{X}_k := \left\{ x \in \mathbb{R}^m : \varepsilon\gamma^{k-1} \le |x| < \varepsilon\gamma^{k} \right\}, \quad \text{for all } 1 \le k \le d.$$
This multiscale partitioning of data domains is sometimes referred to as logarithmic binning in the literature; see, for instance, (Newman, 2005). We then consider the ladder decomposition of the target function $T$ with respect to $\gamma_k := \gamma^{k-d}$, for $k = 0, \ldots, d$. The logarithm of the norm of a datum $x \in \mathcal{X}$ is viewed as a measure of its complexity; that is, the sets $\mathcal{X}_0, \mathcal{X}_1, \ldots, \mathcal{X}_d$ represent inputs of increasing complexity. We assume that the target function is known at level $\gamma_0$; that is, $T_{[\gamma_0]}$ is given. In other words, the behavior of the function $T$ at minimal values of $|x|$ is known, and we shall sequentially learn $T_{[\gamma_1]}, \ldots, T_{[\gamma_d]}$.
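The logarithmic binning above can be sketched in a few lines. The helper name `scale_of` and all constants below are our own illustrative choices, not from the paper.

```python
import numpy as np

eps, R, d = 0.01, 10.0, 6
gamma = (R / eps) ** (1 / d)                   # so that eps * gamma**d = R

def scale_of(x):
    """Return k with eps * gamma**(k-1) <= |x| < eps * gamma**k."""
    norm = np.linalg.norm(x)
    return int(np.floor(np.log(norm / eps) / np.log(gamma))) + 1

rng = np.random.default_rng(1)
data = rng.uniform(-R / 2, R / 2, size=(1000, 3))
norms = np.linalg.norm(data, axis=1)
data = data[(norms >= eps) & (norms < R)]      # keep points inside the domain
ks = np.array([scale_of(x) for x in data])
assert ks.min() >= 1 and ks.max() <= d
for x, k in zip(data[:50], ks[:50]):           # spot-check shell membership
    n = np.linalg.norm(x)
    assert eps * gamma ** (k - 1) <= n * (1 + 1e-12)
    assert n < eps * gamma ** k * (1 + 1e-12)
```

Because the shells grow geometrically, the bin index is simply the floor of $\log(|x|/\varepsilon)/\log\gamma$ plus one.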
For all $1 \le k \le d$, since the training mechanism aims to make $h_k(x)$ closely approximate the dilation $T_{[\gamma_k]}(x) = T(\gamma_k x)/\gamma_k$, we define the output of the learned model for any $x \in \mathcal{X}_k$ as
$$h_w(x) := \gamma_k h_k\!\left(\frac{x}{\gamma_k}\right). \tag{3}$$
Given training examples $s = (x_i, T(x_i))_{i=1}^n$, the training of the model is performed by sampling the parameters $w_1, w_2, \ldots, w_d$ in a sequential manner from a sequence of Gibbs measures. Namely, we start by considering training data examples whose instances belong to the smallest scale $\mathcal{X}_1$, corresponding to the easiest instances. An approximation of $\psi_1$ is then learned by sampling $w_1$ from a Gibbs measure, a maximum-entropy distribution, defined with the empirical risk over these small-scaled examples with the absolute error loss function and with hyperparameter $\lambda_1$. More precisely, $w_1$ is sampled from a discrete set $\mathcal{W}_1$ according to the Gibbs measure
$$P_1(W_1 = w_1) := \frac{\exp\left(-\frac{1}{n\lambda_1}\sum_{x_i \in s \cap \mathcal{X}_1}\left|\gamma_1 h_1\!\left(\frac{x_i}{\gamma_1}; w_1\right) - T(x_i)\right|\right)}{\sum_{w'_1 \in \mathcal{W}_1}\exp\left(-\frac{1}{n\lambda_1}\sum_{x_i \in s \cap \mathcal{X}_1}\left|\gamma_1 h_1\!\left(\frac{x_i}{\gamma_1}; w'_1\right) - T(x_i)\right|\right)}.$$
Subsequently, we observe data at the next scale $\mathcal{X}_2$, characterized by a higher magnitude. Similarly, given the learned approximation $f(\cdot\,, w_1)$ of $\psi_1$, an approximation $f(\cdot\,, w_2)$ of $\psi_2$ is learned by sampling $w_2$ from a Gibbs measure with hyperparameter $\lambda_2$, and this process is iteratively repeated. That is, for $k = 2, \ldots, d$, we sample $w_k \in \mathcal{W}_k$ conditionally on $w_{1:(k-1)}$ with probability
$$P_{W_k \mid W_{1:(k-1)}}\!\left(w_k \mid w_{1:(k-1)}\right) := \frac{\exp\left(-\frac{1}{n\lambda_k}\sum_{x_i \in s \cap \mathcal{X}_k}\left|\gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:k}\right) - T(x_i)\right|\right)}{\sum_{w'_k \in \mathcal{W}_k}\exp\left(-\frac{1}{n\lambda_k}\sum_{x_i \in s \cap \mathcal{X}_k}\left|\gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:(k-1)} w'_k\right) - T(x_i)\right|\right)}.$$
Consequently, learning the target function for data at smaller scales is inherently easier, serving as a foundational step for tackling more challenging learning tasks at higher scales. Thus, this hierarchical model is also a model for implementing the concept of easy-to-hard learning, commonly known as curriculum learning (Bengio et al., 2009). Here, scale serves as a temporal progression akin to how humans learn a course by starting with simpler examples and gradually advancing to more complex ones.
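A single step of this sequential Gibbs sampling can be sketched as follows. This is a schematic toy (a one-parameter residual model on a single scale, with an arbitrary finite candidate set), not the paper's model or experiment.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_sample(cands, emp_risk, lam, n):
    """Sample w with probability proportional to exp(-emp_risk(w)/(n*lam))."""
    energies = np.array([emp_risk(w) for w in cands]) / (n * lam)
    logits = -(energies - energies.min())      # stabilize before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return cands[rng.choice(len(cands), p=probs)], probs

# Toy setup: target T(x) = 1.5*x on one scale, residual level h1(x) = x + w*x.
xs = rng.uniform(0.5, 1.0, size=50)            # instances in a single shell
ys = 1.5 * xs
cands = np.linspace(0.0, 1.0, 21)              # finite parameter set W_1

def emp_risk(w):                               # absolute-error empirical risk
    return np.abs((xs + w * xs) - ys).sum()

w1, probs = gibbs_sample(cands, emp_risk, lam=0.01, n=len(xs))
# With small lambda, the Gibbs measure concentrates near the minimizer w = 0.5.
assert abs(cands[np.argmax(probs)] - 0.5) < 1e-9
```

In the full mechanism, the sampled $w_1$ would be frozen and the same routine repeated on the next shell with the conditional Gibbs measure over $w_2$, and so on.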
This hierarchical approach to learning the ladder decomposition is metaphorically akin to the step-by-step ascent of a ladder. The total training is then modeled with the following measure:
$$P^*_W(w) := P_{W_1}(w_1)\, P_{W_2|W_1}(w_2|w_1) \cdots P_{W_d|W_{1:(d-1)}}\!\left(w_d|w_{1:(d-1)}\right).$$
The next main result of the paper, Theorem 29, states that $P^*_W$ minimizes a multiscale loss regularized by a multiscale form of entropy. That is,
$$P^*_W = \operatorname*{arg\,min}_{P_W}\; \mathbb{E}\!\left[\ell^{(\lambda)}(W, s)\right] - \sum_{k=1}^{d} (\lambda_k - \lambda_{k+1}) H(W_{1:k}), \tag{4}$$
where $\ell^{(\lambda)}$ is related to the total loss over the multiple scales and $H(W_{1:k})$ is the Shannon entropy. The idea of the proof, which expands upon the proof technique of (Asadi and Abbe, 2020, Theorem 13), is to show that the functional optimized by $P^*_W$ can be decomposed into a sum of conditional entropies of $W_k$ given $W_{1:(k-1)}$, for all $1 \le k \le d$.

**The statistical risk.** The paper's final main result bounds the model's statistical risk when the parameters are chosen according to the above multiscale training mechanism. The result proceeds through a chaining argument, where the loss is decomposed over the successive stages of the training procedure, during which the parameters $w = (w_1, \ldots, w_d)$ are chosen. Under the assumption that the model is realizable, that is, there exists a choice of parameters $\hat{w}_1, \ldots, \hat{w}_d$ such that $\psi_k(\cdot) = f(\cdot\,; \hat{w}_k)$ for all $1 \le k \le d$, we define the chained risk as
$$L^{(C)}_{\mu}(w) := \mathbb{E}\!\left[\sum_{k=1}^{d} \left( \ell_k(w_{1:k}, X) - \ell_k\!\left(w_{1:(k-1)}\hat{w}_k, X\right) \right)\right],$$
where $X \sim \mu$ and, for all $1 \le k \le d$,
$$\ell_k(w_{1:k}, x) := \begin{cases} |h_w(x) - T(x)| = \left|\gamma_k h_k\!\left(\frac{x}{\gamma_k}; w_{1:k}\right) - T(x)\right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k, \end{cases}$$
is the loss of the model on data at scale $k$. Theorem 33 then states that when $(S, W) \sim \mu^{\otimes n} \otimes P_{W|S}$, that is, when the training data $S$ are $n$ independent samples from the distribution $\mu$ and the parameters $W$ are the outputs of the multiscale entropy-based training mechanism given the random training data $S$, then the expected chained risk satisfies
$$\mathbb{E}\!\left[L^{(C)}_{\mu}(W)\right] \le 2\sum_{k=1}^{d}\left( (\lambda_k - \lambda_{k+1}) \sum_{j=1}^{k} \log|\mathcal{W}_j| + \frac{4\gamma_k^2\rho_k^2}{n(\lambda_k - \lambda_{k+1})} \right),$$
where $\lambda_{d+1} := 0$ and $\rho_k = O(\gamma_k)$ is the maximum output norm of $\psi_k(\cdot)$, for all $1 \le k \le d$. Optimizing the hyperparameters $\lambda_1, \ldots, \lambda_d$ of our learning mechanism in the bound above yields that, for all $1 \le k \le d$, we have
$$\lambda_k - \lambda_{k+1} = \frac{2\gamma_k\rho_k}{\sqrt{n\sum_{j=1}^{k}\log|\mathcal{W}_j|}}.$$
This, in turn, establishes the following bound on the expected chained risk:
$$\mathbb{E}\!\left[L^{(C)}_{\mu}(W)\right] \le 8\sqrt{\frac{1}{n}\sum_{j=1}^{d}\log|\mathcal{W}_j|}. \tag{5}$$
Although the true parameter $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_d)$ achieves zero chained risk, in general it is not clear whether a low expected chained risk implies a low statistical risk. In fact, the chained risk is always a lower bound for the statistical risk. However, for any scale-invariant probability distribution satisfying equation (1) with sufficiently large shape parameter $\alpha$, we demonstrate in Subsection 5.2 an upper bound on the statistical risk based on the chained risk. More precisely, we show that for such scale-invariant $\mu$, there exists a constant $\hat{C} > 0$ independent of $w$ such that $\hat{C} L_{\mu}(w) \le L^{(C)}_{\mu}(w)$, yielding an upper bound on the expected statistical risk from (5).

**Bounded-norm parameterization example.** In Section 6, we study a particular example of the function $f$ in (2), where invertible functions that are Lipschitz and smooth are approximated by two-layer neural networks with step activation functions and bounded-norm parameters. We study the approximation error, a bound on the network's output, and a bound on the norm of the network parameters. The parameters of the example are discretized; hence, the hypothesis set of the whole hierarchical model is finite. We then apply the bound on the statistical risk of the hierarchical model derived in the earlier sections to this example.

### 1.3 Related Work

This work draws inspiration primarily from integrating concepts presented in (Bartlett et al., 2018) and (Asadi and Abbe, 2020).
The paper (Bartlett et al., 2018) establishes that any smooth bi-Lipschitz function $T$ can be represented as a composition of $d$ functions $T_d \circ \cdots \circ T_1$, where each function $T_k$, for $1 \le k \le d$, approximates the identity function, manifested by the Lipschitz norm of $T_k(x) - x$ decreasing inversely with $d$. Notably, the proof of (Bartlett et al., 2018, Theorem 2) employs the concepts of function dilation and ladder decomposition. On the other hand, the paper (Asadi and Abbe, 2020) addresses the solution to the multiscale entropy regularization problem, an extension of the Gibbs probability distribution to a multiscale context. However, efficient sampling from the optimal probability distribution, the output of the Marginalize-Tilt algorithm of (Asadi and Abbe, 2020), remains challenging due to multiple steps of marginalization of probability distributions. To alleviate this problem, (Asadi and Loh, 2024) considers finding self-similar approximations to the optimal probability distribution. Developing on the proof technique of (Asadi and Abbe, 2020), this paper introduces the multiscale loss function $\ell^{(\lambda)}$ that, when regularized with multiscale entropy as in (4), is minimized by the self-similar and computable distribution $P^*_W$. Distinct from these works, our approach emphasizes interpretability by hierarchically learning ladder decompositions of target functions and involves reading the output of the multilevel learning model from different levels and depths, contingent on the scale of the input data instance, as in (3). In essence, we align the multiscale architecture of the learning model with the multiscale data domain. Information-theoretic methods for analyzing the statistical risk and the generalization error of learning algorithms have been pioneered within the framework of PAC-Bayesian bounds (McAllester, 1999).
This line of inquiry later evolved, adopting a related form that utilizes mutual information, exemplified by the works (Russo and Zou, 2019), (Xu and Raginsky, 2017), and (Bu et al., 2020). In a related vein, (Raginsky et al., 2016) introduces alternative information-theoretic measures to assess the stability of learning algorithms and bounds on their generalization capabilities. The extension of these information-theoretic methods to multiscale techniques, inspired by the method of chaining in probability theory (see, for example, Talagrand, 2014), is presented in (Audibert and Bousquet, 2007), (Asadi et al., 2018), (Asadi and Abbe, 2020), and (Clerico et al., 2022). Multiscale entropies, combinations of entropies at different scales, play an implicit role in chaining. Notably, Dudley's inequality (Dudley, 1967) can be variationally transformed, expressing the bound as a linear mixture of metric entropies across multiple scales. The work (Xu and Raginsky, 2017) also studies the expected statistical risk of sampling from the Gibbs distribution, sometimes referred to as maximum-entropy training. A subsequent multiscale extension of this analysis has likewise been derived in (Asadi and Abbe, 2020). The work (Fletcher and Markovic, 2012) establishes that any diffeomorphism defined on the sphere can be decomposed as the composition of bi-Lipschitz functions with small distortion. Hierarchical learning models comprising layers that are nearly identity functions find applications in deep learning, evident in the context of residual networks (He et al., 2016) and through dynamical systems approaches; see, for example, (E, 2017). A plausible rationale behind the superior performance of deep neural networks compared to shallow networks is their capability to learn different aspects of the data distribution across various layers.
This hypothesis finds support in empirical evidence where pre-trained bottom layers combined with task-specific layers achieve excellent performance in image classification tasks; see, for instance, (Devlin et al., 2018) and (Girshick et al., 2014).

### 1.4 Organization of This Paper

The rest of the paper is organized as follows. We provide the preliminaries and notation in Section 2. Following that, in Section 3, we introduce the concept of ladder decompositions of functions, examining the Lipschitz continuity and smoothness of their components. Section 4 explains our proposed learning model. Section 5 comprises two parts: Subsection 5.1 demonstrates the efficacy of multiscale entropy-based training in achieving low chained risk, a new analytical tool. Subsection 5.2 establishes that if the data distribution $\mu$ is scale-invariant, the chained risk can serve as an upper bound on the statistical risk, thereby providing an overall upper bound on the statistical risk of the learned model. Section 6 gives an example where diffeomorphisms can be represented using a parameterized model with bounded-norm parameters. We then compute the bound on the statistical risk from Section 5 for this example. Finally, Section 7 encapsulates our work's conclusions.

## 2. Preliminaries and Notation

The sets of real numbers and integers are denoted by $\mathbb{R}$ and $\mathbb{Z}$, respectively. Throughout the paper, $|x|$ denotes the Euclidean norm of a vector $x \in \mathbb{R}^m$, $|x|_1$ represents the $\ell_1$-norm of a vector $x \in \mathbb{R}^m$, $|A|_2$ denotes the spectral norm of a matrix $A \in \mathbb{R}^{m \times m}$, and $\|f\|_{\mathrm{Lip}}$ is the Lipschitz norm of a function $f$. The identity function is denoted by $\mathrm{id}$. For a pair of positive integers $k \le n$, $\binom{n}{k}$ represents the $n$-choose-$k$ binomial coefficient. For any $x \in \mathbb{R}$, the floor function $\lfloor x \rfloor$ equals the largest integer smaller than or equal to $x$. Given two matrices $A, B \in \mathbb{R}^{m \times m}$, $A \preceq B$ indicates that $B - A$ is a positive semidefinite matrix.
Random variables and vectors are denoted by capital letters, while lowercase letters represent their specific realizations. The equiprobable (uniform) probability distribution is represented by $U$, and its support is indicated with a subscript. If $X$ is a random variable and $P$ is a probability measure, the notation $X \sim P$ signifies that $X$ is distributed according to $P$. The Dirac probability measure on $w$ is denoted by $\delta_w$. The $n$-fold tensor product of a measure $\mu$ with itself is represented by $\mu^{\otimes n}$. For two distributions $P$ and $Q$, $P \ll Q$ means that $P$ is absolutely continuous with respect to $Q$.

In supervised batch learning, we define $\mathcal{X}$ as the instance domain and $\mathcal{Y}$ as the label domain. A target function $T : \mathcal{X} \to \mathcal{Y}$ exists, which maps any data instance to its label. We also define the hypothesis set $\mathcal{H} = \{h_w : w \in \mathcal{W}\}$ consisting of hypotheses indexed by the set $\mathcal{W}$. A loss function $\ell : \mathcal{W} \times \mathcal{X} \to \mathbb{R}^+$ is introduced. A learning algorithm is presented with a random training sequence $S = (X_1, \ldots, X_n)$ comprising $n$ data instances along with their corresponding labels $(T(X_1), \ldots, T(X_n))$, where $S$ is drawn i.i.d. from $\mathcal{X}$ with an unknown distribution $\mu$; in other words, $S \sim \mu^{\otimes n}$. During the training procedure, the algorithm selects $h_W \in \mathcal{H}$, modeled by a random transformation $P_{W|S}$.

**Definition 1 (Statistical Risk)** For any $w \in \mathcal{W}$, the statistical (or population) risk of $w$ is defined as
$$L_{\mu}(w) := \mathbb{E}[\ell(w, X)], \quad X \sim \mu. \tag{6}$$

A primary goal of statistical learning is to identify computationally efficient learning algorithms for which, given a random training set of size $n$, the expected statistical risk $\mathbb{E}[L_{\mu}(W)]$ is small. Here, $W$ is distributed according to the marginal distribution of $P_{WS} = \mu^{\otimes n} \otimes P_{W|S}$. Next, we present some preliminary tools used later in the paper.

**Definition 2 (Entropy)** The Shannon entropy of a discrete random variable $X$, taking values in a set $\mathcal{A}$, is defined as
$$H(X) := -\sum_{x \in \mathcal{A}} P_X(x) \log P_X(x).$$
The relative entropy between two distributions $P_X$ and $Q_X$, if $P_X \ll Q_X$, is defined as
$$D(P_X \| Q_X) := \sum_{x \in \mathcal{A}} P_X(x) \log \frac{P_X(x)}{Q_X(x)};$$
otherwise, we define $D(P_X \| Q_X) := \infty$. The conditional relative entropy is
$$D\!\left(P_{Y|X} \,\middle\|\, Q_{Y|X} \,\middle|\, P_X\right) := \sum_{x \in \mathcal{A}} D\!\left(P_{Y|X=x} \,\middle\|\, Q_{Y|X=x}\right) P_X(x) = \mathbb{E}\!\left[ D\!\left(P_{Y|X}(\cdot|X) \,\middle\|\, Q_{Y|X}(\cdot|X)\right) \right], \quad X \sim P_X.$$
The following useful property of entropy is called the chain rule. For a proof, see, for example, (Cover and Thomas, 2012, Theorem 2.5.3):

**Lemma 3 (Entropy Chain Rule)** Let $P_{XY}$ and $Q_{XY}$ be two distributions. We have
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D\!\left(P_{Y|X} \,\middle\|\, Q_{Y|X} \,\middle|\, P_X\right).$$

The next definition relates to geometric transformations of probability measures:

**Definition 4 (Escort and Tilted Distributions)** Given a discrete probability measure $P$ defined on a set $\mathcal{A}$, and any $\lambda \in [0, 1]$, we define the escort distribution $(P)_\lambda$ for all $a \in \mathcal{A}$ as
$$(P)_\lambda(a) := \frac{P^\lambda(a)}{\sum_{x \in \mathcal{A}} P^\lambda(x)}.$$
Given two discrete probability measures $P$ and $Q$ defined on a set $\mathcal{A}$, and any $\lambda \in [0, 1]$, we define the tilted distribution $(P, Q)_\lambda$ as the following geometric mixture:
$$(P, Q)_\lambda(a) := \frac{P^\lambda(a) Q^{1-\lambda}(a)}{\sum_{x \in \mathcal{A}} P^\lambda(x) Q^{1-\lambda}(x)}.$$
Evidently, if $U$ is the equiprobable distribution on $\mathcal{A}$, then $(P)_\lambda = (P, U)_\lambda$.

In our analysis in Section 5, similar to (Asadi and Abbe, 2020), we encounter linear combinations of relative entropies. The next lemma shows the role of tilted distributions in dealing with such linear combinations; see, for example, (van Erven and Harremoës, 2014, Theorem 30):

**Lemma 5 (Entropy Combination)** Let $\lambda \in [0, 1]$. For any distributions $P$, $Q$ and $R$ defined on a discrete set $\mathcal{A}$ such that $P \ll Q$ and $P \ll R$, we have
$$\lambda D(P \| Q) + (1 - \lambda) D(P \| R) = D\!\left(P \,\middle\|\, (Q, R)_\lambda\right) - \log \sum_{x \in \mathcal{A}} Q^\lambda(x) R^{1-\lambda}(x).$$

**Proof** Let $Z := \sum_{x \in \mathcal{A}} Q^\lambda(x) R^{1-\lambda}(x)$.
We have
$$\begin{aligned}
\lambda D(P \| Q) + (1 - \lambda) D(P \| R) &= \lambda \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{Q(x)} + (1 - \lambda) \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{R(x)} \\
&= \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{Q^\lambda(x) R^{1-\lambda}(x)} \\
&= \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{Z \cdot (Q, R)_\lambda(x)} \\
&= D\!\left(P \,\middle\|\, (Q, R)_\lambda\right) - \log Z.
\end{aligned}$$

We provide the following definition to later simplify the notation in the proof of Theorem 29:

**Definition 6 (Congruent Functionals)** We call two functionals $L_1(P)$ and $L_2(P)$ of a distribution $P$ congruent, and write $L_1 \cong L_2$, if $L_1 - L_2$ does not depend on $P$. For example, Lemma 5 implies that if $Q$ and $R$ are fixed distributions, then as functionals of $P$, the following congruency holds:
$$\lambda D(P \| Q) + (1 - \lambda) D(P \| R) \cong D\!\left(P \,\middle\|\, (Q, R)_\lambda\right).$$
Specifically, if $U$ is the equiprobable distribution, then
$$\lambda D(P \| Q) + (1 - \lambda) D(P \| U) \cong D\!\left(P \,\middle\|\, (Q)_\lambda\right). \tag{7}$$

The following well-known result, sometimes referred to as the Gibbs variational principle, implies that the distribution minimizing the sum of average energy (loss) and entropy (regularization) is the Gibbs measure:

**Lemma 7 (Gibbs Variational Principle)** Let $\mathcal{W}$ be an arbitrary finite set and $U_{\mathcal{W}}$ be the equiprobable distribution on $\mathcal{W}$. Given a function $g : \mathcal{W} \to \mathbb{R}$ and $\lambda > 0$, we define the following Gibbs probability distribution for all $w \in \mathcal{W}$:
$$Q_W(w) := \frac{\exp\!\left(-\frac{g(w)}{\lambda}\right)}{\sum_{w' \in \mathcal{W}} \exp\!\left(-\frac{g(w')}{\lambda}\right)}.$$
Then, for any probability measure $P_W$ defined on $\mathcal{W}$, we have
$$\mathbb{E}[g(W)] + \lambda D(P_W \| U_{\mathcal{W}}) = \lambda D(P_W \| Q_W) - \lambda \log\!\left( \frac{1}{|\mathcal{W}|} \sum_{w' \in \mathcal{W}} \exp\!\left(-\frac{g(w')}{\lambda}\right) \right),$$
where $W \sim P_W$. In particular, Lemma 7 yields the following congruency identity as functionals of $P_W$:
$$\mathbb{E}[g(W)] + \lambda D(P_W \| U_{\mathcal{W}}) \cong \lambda D(P_W \| Q_W). \tag{8}$$
We later make use of the congruency relations (7) and (8) iteratively in the proof of Theorem 29.

In Section 5, we require the following well-known result on the Log-Sum-Exp function:

**Lemma 8 (Log-Sum-Exp)** For any positive integer $N$, let $z = (z_1, z_2, \ldots, z_N) \in \mathbb{R}^N$ be arbitrarily chosen. For any $\lambda > 0$, the Log-Sum-Exp function
$$G_\lambda(z) := -\lambda \log \sum_{j=1}^{N} \exp\!\left(-\frac{z_j}{\lambda}\right)$$
satisfies
$$\min_{j=1,\ldots,N} z_j - \lambda \log N \le G_\lambda(z) \le \min_{j=1,\ldots,N} z_j.$$

The proof of Theorem 33 requires some tools on the topic of concentration of measure, as stated next.
**Definition 9 (Subgaussian)** A random variable $X$ is called $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$, its cumulant generating function satisfies
$$\log \mathbb{E}\!\left[e^{\lambda(X - \mathbb{E}X)}\right] \le \frac{\lambda^2 \sigma^2}{2}.$$

The next result is based on (Xu and Raginsky, 2017, Lemma 1), which itself can be derived from (Boucheron et al., 2013, Lemma 4.18). This result and its variants are key ingredients in information-theoretic generalization bounds.

**Lemma 10** If $g(\bar{A}, \bar{B})$ is $\sigma$-subgaussian, where $(\bar{A}, \bar{B}) \sim P_A P_B$, then for all $\lambda > 0$,
$$\mathbb{E}\!\left[g(\bar{A}, \bar{B})\right] - \mathbb{E}[g(A, B)] \le \lambda\!\left(\log|\mathcal{A}| - H(A|B)\right) + \frac{\sigma^2}{2\lambda}.$$

The Azuma-Hoeffding inequality shows the subgaussianity of the sum of independent and bounded random variables:

**Lemma 11 (Azuma-Hoeffding)** Let $X_1, \ldots, X_n$ be independent random variables such that $a \le X_i \le b$ for all $1 \le i \le n$. Then,
$$\log \mathbb{E}\!\left[e^{\frac{\lambda}{n}\sum_{i=1}^{n}(X_i - \mathbb{E}X_i)}\right] \le \frac{\lambda^2 (b-a)^2}{2n}.$$
In other words, $\sum_{i=1}^{n} X_i / n$ is $\left((b-a)/\sqrt{n}\right)$-subgaussian.

Let $g : [a, b] \to \mathbb{R}$ be a differentiable function and assume that $n$ is a positive integer. Let $\hat{\Delta} := (b-a)/n$. We define the Riemann sum at level $n$ as
$$\Sigma_n := \hat{\Delta} \sum_{i=1}^{n} g\!\left(a + (i-1)\hat{\Delta}\right).$$
The subsequent well-known lemma, which we use in Section 6, bounds the approximation error of the Riemann sum. For a proof, see, for example, (Hughes-Hallett et al., 2020).

**Lemma 12 (Riemann Sum)** The approximation error of the Riemann sum is bounded as follows:
$$\left| \int_{a}^{b} g(x)\,dx - \Sigma_n \right| \le \frac{\hat{M}}{2n}(b-a)^2,$$
where $\hat{M}$ is the maximum absolute value of the derivative of $g$ on $[a, b]$.

## 3. Ladder Decompositions of Functions

In this section, we first precisely define dilations and ladder decompositions of invertible functions. Then, we study the Lipschitzness and smoothness of the components of the ladder decompositions of smooth functions $T : B^m_R \to \mathbb{R}^m$ defined on bounded Euclidean balls with radius $R > 0$. For simplicity, we first consider the special case $m = 1$ in Subsection 3.1 and investigate multi-dimensional functions in Subsection 3.2. We first provide the precise definition of dilations of a function.
**Definition 13** For any $0 < \gamma \le 1$, let the dilation of a function $T : B^m_R \to \mathbb{R}^m$ at scale $\gamma$ be defined as
$$T_{[\gamma]}(x) := \frac{T(\gamma x)}{\gamma}, \quad \text{for all } x \in B^m_R.$$
If $T$ is differentiable and $T(0) = 0$, as a continuous extension for $\gamma = 0$, we define $T_{[0]}$ as the derivative of $T$ at the origin. Namely, $T_{[0]}(x) := J_T(0)x$, where $J_T(0)$ is the Jacobian matrix of $T$ at the origin.

The next lemma relates the inverse of the dilation with the dilation of the inverse function:

**Lemma 14** Let $T$ be an invertible function. For any $0 < \gamma \le 1$, we have $\left(T_{[\gamma]}\right)^{-1} = \left(T^{-1}\right)_{[\gamma]}$. Moreover, if $T$ is differentiable and $T(0) = 0$, then $\left(T_{[0]}\right)^{-1} = \left(T^{-1}\right)_{[0]}$.

**Proof** If $\gamma > 0$, then $\left(T_{[\gamma]}\right)^{-1}(x) = T^{-1}(\gamma x)/\gamma = \left(T^{-1}\right)_{[\gamma]}(x)$. The case $\gamma = 0$ follows from the well-known inverse function theorem; see, for example, (Baxandall and Liebeck, 1986, Chapter 4).

The next definition characterizes the concept of ladder decompositions:

**Definition 15** For all $1 \le k \le d$, let $T_k := T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}$. We call the following multiscale decomposition the ladder decomposition of $T$ at scale parameters $\{\gamma_k\}_{k=0}^{d}$:
$$T = T_d \circ \cdots \circ T_1 \circ T_{[\gamma_0]}. \tag{9}$$
For all $1 \le k \le d$, we further define $\psi_k := T_k - \mathrm{id}$. Clearly, for all $1 \le k \le d$, we have $T_{[\gamma_k]} = T_k \circ \cdots \circ T_1 \circ T_{[\gamma_0]}$.

Owing to the smoothness of the function $T$, for all $1 \le k \le d$, when the consecutive scale parameters $\gamma_k$ and $\gamma_{k-1}$ are close, we intuitively expect the transformation $T_k$, acting between the dilations of $T$ at scales $\gamma_{k-1}$ and $\gamma_k$, to be close to the identity function. In the rest of this section, we make this intuition precise, starting from the simpler case of one-dimensional functions ($m = 1$) and then studying the more general case of multi-dimensional functions ($m \ge 2$).

### 3.1 One-Dimensional Functions

We begin by providing the definition of smoothness for one-dimensional functions.

**Definition 16** Let $V$ be a bounded subset of $\mathbb{R}$. The one-dimensional function $T : V \to \mathbb{R}$ is $M$-smooth if it is differentiable and its derivative $T'$ is $M$-Lipschitz.
If $T : V \to \mathbb{R}$ is twice differentiable, then it is $M$-smooth if $|T''(x)| \le M$ for all $x \in V$, where $T''$ denotes the second derivative of $T$. The next definition concerns the concept of diffeomorphisms:

Definition 17 A function $T : (-R, R) \to \mathbb{R}$ is an $(M_1, M_2)$-diffeomorphism if it is invertible and both $T$ and its inverse $T^{-1}$ are twice differentiable, $M_1$-Lipschitz and $M_2$-smooth.

If $T$ is smooth, then one may expect $T_{[\gamma]}$ to get closer to $T_{[0]}$ as $\gamma \to 0$. The following preliminary result can be viewed as a formalization of this insight. Its proof is based on a simple extension of the proof of (Bartlett et al., 2018, Theorem 2).

Proposition 18 Let $T : (-R, R) \to \mathbb{R}$ be an $M_2$-smooth function with $T(0) = 0$. Then, for any $0 \le \gamma \le 1$,
$$\left\| T_{[\gamma]} - T_{[0]} \right\|_{\mathrm{Lip}} \le \gamma M_2 R.$$

Proof The statement is trivial if $\gamma = 0$. Assume that $0 < \gamma \le 1$, and let $x, y \in (-R, R)$ be arbitrarily chosen. Based on the mean value theorem, there exists $z$ between $x$ and $y$ such that $T(\gamma x) - T(\gamma y) = \gamma T'(\gamma z)(x - y)$. We can write
$$\left| \left( T_{[\gamma]}(x) - T_{[0]}(x) \right) - \left( T_{[\gamma]}(y) - T_{[0]}(y) \right) \right| = \left| T'(\gamma z)(x - y) - T'(0)(x - y) \right| = |x - y|\,\left| T'(\gamma z) - T'(0) \right| \le |x - y|\,|\gamma z| M_2 \le |x - y|\,\gamma M_2 R,$$
which implies the statement.

Since $T(0) = 0$, based on Proposition 18 we have $\left| T_{[\gamma]}(x) - T_{[0]}(x) \right| \le \gamma M_2 R |x|$. Thus, the smaller $\gamma_k$ is in the ladder decomposition, the closer the function $T_{[\gamma_k]}$ is to the linear function $T_{[0]}$. The next result makes precise the intuition that if two subsequent scale parameters $\gamma_k$ and $\gamma_{k-1}$ are close, then $T_k := T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}$ is a function close to the identity:

Theorem 19 Let $T : (-R, R) \to \mathbb{R}$ be an invertible $M_2$-smooth function such that $T^{-1}$ is $M_1$-Lipschitz and $T(0) = 0$. Then, for any $0 \le \gamma \le \gamma' \le 1$, we have
$$\left\| T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id} \right\|_{\mathrm{Lip}} \le (\gamma' - \gamma) M_1 M_2 R.$$
If $T$ is further assumed to be an $(M_1, M_2)$-diffeomorphism, then $T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id}$ is $(M_1^2 + M_1)M_2$-smooth as well.

Proof The statement is trivial when $\gamma = \gamma'$.
It is enough to prove that for any $0 \le \gamma < \gamma' \le 1$ and any $x, y$ in the domain of $T_{[\gamma']} \circ T_{[\gamma]}^{-1}$, the following inequality holds:
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| \le (\gamma' - \gamma) M_1 M_2 R\, |y - x|.$$
Let $v := T_{[\gamma]}^{-1}(x)$ and $w := T_{[\gamma]}^{-1}(y)$. Based on Lemma 14, $x = T_{[\gamma]}(v)$ and $y = T_{[\gamma]}(w)$. According to Definition 13, the domain of $T_{[\gamma]}$ is $(-R, R)$, thus $\max\{|v|, |w|\} < R$. By defining the function $r := T_{[\gamma']} - T_{[\gamma]}$, we observe that
$$\left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) = \left( T_{[\gamma']}(w) - T_{[\gamma]}(w) \right) - \left( T_{[\gamma']}(v) - T_{[\gamma]}(v) \right) = r(w) - r(v).$$
Note that the derivative of $r$ at a point $v$ is equal to $r'(v) = T'(\gamma' v) - T'(\gamma v)$. Based on the mean value theorem, there exists $c \in \mathbb{R}$ between $w$ and $v$ such that $r(w) - r(v) = (w - v) r'(c) = (w - v)\left( T'(\gamma' c) - T'(\gamma c) \right)$. Hence,
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| = |r(w) - r(v)| = |w - v| \left| T'(\gamma' c) - T'(\gamma c) \right| \le |w - v|\, M_2\, |\gamma' c - \gamma c| \tag{10}$$
$$= |w - v|\, M_2\, |c|\, (\gamma' - \gamma) \le |w - v|\, M_2 R\, (\gamma' - \gamma) \tag{11}$$
$$\le |y - x|\, M_1 M_2 R\, (\gamma' - \gamma), \tag{12}$$
where (10) follows from $T$ being $M_2$-smooth, (11) follows from $|c| \le \max\{|v|, |w|\} < R$, and (12) is based on the assumption that $T^{-1}$ is $M_1$-Lipschitz.

We now prove the smoothness property. Let $g(x) := T^{-1}(x)$ and $\psi := T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id}$. The chain rule of derivatives yields
$$\psi''(x) = \gamma'\, T''\!\left( \tfrac{\gamma'}{\gamma} g(\gamma x) \right) \left( g'(\gamma x) \right)^2 + \gamma\, T'\!\left( \tfrac{\gamma'}{\gamma} g(\gamma x) \right) g''(\gamma x).$$
Since $T$ and $g$ are both $M_1$-Lipschitz and $M_2$-smooth and $0 \le \gamma, \gamma' \le 1$, we deduce $|\psi''(x)| \le (M_1^2 + M_1) M_2$. Therefore, $\psi$ is $(M_1^2 + M_1)M_2$-smooth.

Corollary 20 Theorem 19 implies that for the ladder decomposition of $T$ at scale parameters $\{\gamma_k\}_{k=0}^d$, the following inequality holds for all $1 \le k \le d$:
$$\|\psi_k\|_{\mathrm{Lip}} \le (\gamma_k - \gamma_{k-1}) M_1 M_2 R. \tag{13}$$

The following is an example in which the functions $\psi_k$, $1 \le k \le d$, have a closed-form expression:

Example 1 Let $T(x) := \tanh(x)$ be the hyperbolic tangent function, for which it is known that $T^{-1}(x) = \frac{1}{2}\ln\frac{1+x}{1-x}$ for $|x| < 1$. Assume that $\gamma_k = 2^{k-d}$ for all $0 \le k \le d$.
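The closed form of $\psi_k$ derived in Example 1 can be checked numerically against a direct evaluation of $T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1} - \mathrm{id}$; the evaluation grid below is an arbitrary choice:

```python
import numpy as np

# Verify Example 1: for T = tanh and gamma_k = 2^{k-d},
# psi_k(x) = T_k(x) - x = -gamma_{k-1}^2 x^3 / (1 + gamma_{k-1}^2 x^2).
d = 5
x = np.linspace(-0.95, 0.95, 101)
for k in range(1, d + 1):
    g_k, g_km1 = 2.0 ** (k - d), 2.0 ** (k - 1 - d)
    # Direct computation of T_k(x) - x via T_k = T[gamma_k] o T[gamma_{k-1}]^{-1}
    direct = np.tanh(g_k * np.arctanh(g_km1 * x) / g_km1) / g_k - x
    closed = -g_km1**2 * x**3 / (1 + g_km1**2 * x**2)
    assert np.allclose(direct, closed)
```

Since $\gamma_{k-1}$ grows with $k$, the slope of the closed form grows with $k$ as well, consistent with the behavior shown in Figure 1.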
For all $|x| < 1$, we can derive
$$\psi_k(x) = T_k(x) - x = T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}(x) - x = \frac{\exp\left( 2\gamma_k T_{[\gamma_{k-1}]}^{-1}(x) \right) - 1}{\gamma_k \left( \exp\left( 2\gamma_k T_{[\gamma_{k-1}]}^{-1}(x) \right) + 1 \right)} - x = \frac{\exp\left( \frac{\gamma_k}{\gamma_{k-1}} \ln \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right) - 1}{\gamma_k \left( \exp\left( \frac{\gamma_k}{\gamma_{k-1}} \ln \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right) + 1 \right)} - x$$
$$= \frac{\left( \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right)^{\frac{\gamma_k}{\gamma_{k-1}}} - 1}{\gamma_k \left( \left( \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right)^{\frac{\gamma_k}{\gamma_{k-1}}} + 1 \right)} - x = \frac{2\gamma_{k-1}x}{\gamma_k \left( 1 + \gamma_{k-1}^2 x^2 \right)} - x = \frac{x}{1 + \gamma_{k-1}^2 x^2} - x = -\frac{\gamma_{k-1}^2 x^3}{1 + \gamma_{k-1}^2 x^2},$$
where we used $\gamma_k/\gamma_{k-1} = 2$. Figure 1 depicts the plot of $\psi_k(x)$ for all $1 \le k \le d$, where $d = 5$. It clearly demonstrates that the Lipschitz norm of $\psi_k$ increases with $k$.

3.2 Multi-Dimensional Functions

In this subsection, we extend the first part of Theorem 19 to multi-dimensional functions with $m \ge 2$. We commence by providing the definition of smooth multi-dimensional functions. The well-known extension of Definition 16 to real-valued functions of multiple variables is as follows:

Definition 21 The scalar-valued function $g : B_R^m \to \mathbb{R}$ is $M$-smooth if it is differentiable and its gradient $\nabla g$ is $M$-Lipschitz with respect to the Euclidean distance. Namely, for any $x, y \in B_R^m$, we have $|\nabla g(x) - \nabla g(y)| \le M|x - y|$.

[Figure 1: A plot of $\psi_k(x)$ for different values of $k$: $k = 1$ the solid line, $k = 2$ the dotted line, $k = 3$ the dashed line, $k = 4$ the dotted-dashed line, and $k = 5$ the large dashed line.]

A twice differentiable function $g : B_R^m \to \mathbb{R}$ is $M$-smooth if the absolute value of every eigenvalue of its Hessian $H_g(x)$ is smaller than or equal to $M$, for all $x \in B_R^m$. In other words, if $I$ denotes the identity matrix, then $-MI \preceq H_g(x) \preceq MI$ for all $x \in B_R^m$.

We now extend the previous definition to vector-valued functions:

Definition 22 The multi-dimensional function $T : B_R^m \to \mathbb{R}^m$ is $M$-smooth if it is differentiable and its Jacobian $J_T$ is $M$-Lipschitz with respect to the spectral distance. In other words, for any $x, y \in B_R^m$, the subsequent inequality holds:
$$\left\| J_T(x) - J_T(y) \right\|_2 \le M|x - y|.$$
The next result shows the relation between Definitions 21 and 22.

Proposition 23 Assume that $T : B_R^m \to \mathbb{R}^m$ is a function with components $T(x) = (T_1(x), \ldots, T_m(x))$. Then, the multi-dimensional function $T$ is $M$-smooth if and only if each scalar-valued function $T_i : B_R^m \to \mathbb{R}$ is $M$-smooth, where $1 \le i \le m$.

Proof Let $x, y \in B_R^m$ be arbitrarily chosen. Assume that each function $T_i$ is $M$-smooth for all $1 \le i \le m$. For any $u \in \mathbb{R}^m$, we observe that the components of $(J_T(x) - J_T(y))u$ are the inner products $\left( \nabla T_i(x) - \nabla T_i(y) \right)^T u$. Moreover, given any $1 \le i \le m$, the Cauchy–Schwarz inequality implies
$$\left| \left( \nabla T_i(x) - \nabla T_i(y) \right)^T u \right| \le \left| \nabla T_i(x) - \nabla T_i(y) \right| |u| \le M|x - y|\,|u|.$$
Thus, for any $u \in \mathbb{R}^m$, $\left| (J_T(x) - J_T(y))u \right| \le M|x - y|\,|u|$, which implies that $\| J_T(x) - J_T(y) \|_2 \le M|x - y|$. This proves the "if" part. To prove the "only if" part, assume that $T$ is $M$-smooth. For any $1 \le i \le m$, let $e_i$ be the $i$th standard unit vector. Then, for all $1 \le i \le m$,
$$\left| \nabla T_i(x) - \nabla T_i(y) \right| = \left| \left( J_T(x) - J_T(y) \right)^T e_i \right| \le \left\| J_T(x) - J_T(y) \right\|_2 |e_i| \le M|x - y|,$$
which implies that $T_i$ is $M$-smooth.

The proof of Theorem 19 relied upon the mean value theorem. The following result is a multi-dimensional extension of the mean value theorem:

Lemma 24 Let $r : B_R^m \to \mathbb{R}^m$ be a differentiable function. For any $a, b \in B_R^m$, there exists $\lambda \in (0, 1)$ such that for $c := \lambda b + (1 - \lambda)a \in B_R^m$, we have
$$|r(b) - r(a)| \le \left\| J_r(c) \right\|_2 |b - a|,$$
where $J_r(c)$ is the Jacobian of $r$ at $c$.

Proof Define $\xi : [0, 1] \to \mathbb{R}$ as $\xi(t) := (r(b) - r(a))^T r(tb + (1 - t)a)$. Based on the mean value theorem, there exists $\lambda \in (0, 1)$ such that $\xi(1) - \xi(0) = \xi'(\lambda)$. Thus, according to the chain rule of derivatives, we can write
$$(r(b) - r(a))^T (r(b) - r(a)) = (r(b) - r(a))^T J_r(c)(b - a).$$
The Cauchy–Schwarz inequality yields
$$|r(b) - r(a)|^2 \le |r(b) - r(a)|\, |J_r(c)(b - a)| \le |r(b) - r(a)|\, \left\| J_r(c) \right\|_2 |b - a|,$$
which implies the statement.

We now have sufficient tools to extend the first part of Theorem 19 to multi-dimensional functions.

Theorem 25 Let $T : B_R^m \to \mathbb{R}^m$ be an invertible $M_2$-smooth function such that $T^{-1}$ is $M_1$-Lipschitz and $T(0) = 0$. Then, for any $0 \le \gamma \le \gamma' \le 1$, we have
$$\left\| T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id} \right\|_{\mathrm{Lip}} \le (\gamma' - \gamma) M_1 M_2 R.$$

Proof The case $\gamma = \gamma'$ is easy to verify.
Thus, it is enough to prove that for any $0 \le \gamma < \gamma' \le 1$ and any $x, y$ in the domain of $T_{[\gamma']} \circ T_{[\gamma]}^{-1}$, we have
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| \le (\gamma' - \gamma) M_1 M_2 R\, |y - x|.$$
Let $v := T_{[\gamma]}^{-1}(x)$ and $w := T_{[\gamma]}^{-1}(y)$. Based on Lemma 14, $x = T_{[\gamma]}(v)$ and $y = T_{[\gamma]}(w)$. According to Definition 13, the domain of $T_{[\gamma]}$ is $B_R^m$, thus $\max\{|v|, |w|\} < R$. Let $r := T_{[\gamma']} - T_{[\gamma]}$. We observe that
$$\left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) = \left( T_{[\gamma']}(w) - T_{[\gamma]}(w) \right) - \left( T_{[\gamma']}(v) - T_{[\gamma]}(v) \right) = r(w) - r(v).$$
The Jacobian of the function $r$ at a point $z$ is equal to $J_r(z) = J_T(\gamma' z) - J_T(\gamma z)$, where $J_T$ is the Jacobian of $T$, for all $z \in B_R^m$. According to Lemma 24, there exists $c \in \mathbb{R}^m$ as a convex combination of $v$ and $w$ such that $|r(w) - r(v)| \le \| J_r(c) \|_2 |w - v|$. Hence,
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| = |r(w) - r(v)| \le |w - v| \left\| J_T(\gamma' c) - J_T(\gamma c) \right\|_2 \le |w - v|\, M_2\, |c|\, (\gamma' - \gamma) \tag{14}$$
$$\le |w - v|\, M_2 R\, (\gamma' - \gamma) \tag{15}$$
$$\le |y - x|\, M_1 M_2 R\, (\gamma' - \gamma), \tag{16}$$
where (14) follows from $T$ being $M_2$-smooth, (15) is based on the fact that $|c| \le \max\{|v|, |w|\} < R$, and (16) follows from the assumption that $T^{-1}$ is $M_1$-Lipschitz.

Corollary 26 Similar to Corollary 20, Theorem 25 implies that for the ladder decomposition of the multi-dimensional function $T$ at scale parameters $\{\gamma_k\}_{k=0}^d$, the following inequality holds for all $1 \le k \le d$:
$$\|\psi_k\|_{\mathrm{Lip}} \le (\gamma_k - \gamma_{k-1}) M_1 M_2 R. \tag{17}$$

Remark 27 The proof of (Bartlett et al., 2018, Theorem 2) establishes that, for all $1 \le k \le d$, the function $\psi_k$ is $C(\gamma_k - \gamma_{k-1})/\gamma_k$-Lipschitz for some constant $C$. However, this implication is weaker than our result. For example, when the scale parameters are chosen as $\gamma_k = k/d$ for each $1 \le k \le d$, our result yields $\|\psi_k\|_{\mathrm{Lip}} = O(1/d)$, whereas (Bartlett et al., 2018, Theorem 2) states that there exist functions $T_1, \ldots, T_d$ such that $T = T_d \circ \cdots \circ T_1$ and, for each $1 \le k \le d$, $\|T_k - \mathrm{id}\|_{\mathrm{Lip}} = O((\log d)/d)$.

Corollaries 20 and 26, when applied to the ladder decomposition of the target function $T$, are key results underpinning our learning model.

4. The Proposed Learning Model

In this section, we precisely formulate the learning model and discuss its strengths.
Assume that for $0 < \varepsilon < R$, the data instance domain is $\mathcal{X} := \{x \in \mathbb{R}^m : \varepsilon \le |x| < R\}$. Typically, $\varepsilon$ is much smaller than $R$. The label set is $\mathcal{Y} = \mathbb{R}^m$, and the target function $T$ is an invertible $M_2$-smooth function such that $T^{-1}$ is $M_1$-Lipschitz. Suppose that from previous knowledge or intuition, we know the behavior of the function $T$ at data instances at very small scales $0 \le |x| < \varepsilon$. This is equivalent to knowing $T_{[\gamma_0]}$, where $\gamma_0 := \varepsilon/R$. Without loss of generality, we further assume that $T(0) = 0$.¹ For example, it may be assumed that $T_{[\gamma_0]}$ is a linear function equal to $T_{[0]}$, the derivative of $T$ at the origin.

Let $\gamma := (R/\varepsilon)^{1/d}$. We split the data domain $\mathcal{X}$ into $d$ scales. For all $1 \le k \le d$, define
$$\mathcal{X}_k := \left\{ x \in \mathbb{R}^m : \varepsilon \gamma^{k-1} \le |x| < \varepsilon \gamma^k \right\}.$$
Clearly, $\mathcal{X} = \cup_{k=1}^d \mathcal{X}_k$, and these sets form a partition of $\mathcal{X}$. Each set $\mathcal{X}_k$ is called the domain of instances at scale $k$. We define the scale parameters $\{\gamma_k\}_{k=0}^d$ to constitute a geometric sequence such that for all $0 \le k \le d$,
$$\gamma_k := \gamma^{k-d}. \tag{18}$$
Consider the ladder decomposition of $T$ at scale parameters $\{\gamma_k\}_{k=0}^d$ (Definition 15). For all $1 \le k \le d$, we have
$$T_{[\gamma_k]} = T_k \circ T_{[\gamma_{k-1}]} = T_{[\gamma_{k-1}]} + \psi_k \circ T_{[\gamma_{k-1}]}. \tag{19}$$
To progressively learn the target function $T$ on $\mathcal{X}$ through multiple stages, we introduce a $d$-level hierarchical learning model that aligns with the relation (19). We set $h_0 := T_{[\gamma_0]}$, and for all $1 \le k \le d$, define
$$h_k(x) := h_{k-1}(x) + f(h_{k-1}(x); w_k), \tag{20}$$
where $f$ is a function dependent on the parametrization of the learning model. The choice of the function $f$ is important for ensuring enough representation power of the model, and we explore an example in Section 6.

Remark 28 One may rely on the universal approximation theorem for neural networks and assume that the function $f$ is a two-layer neural network with sufficient width; see the discussion in (Bartlett et al., 2018). We denote the parameters of the $k$th level of the learning model by $w_k$, allowing it to take values from a finite set $\mathcal{W}_k$ during training.
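The logarithmic binning of $\mathcal{X}$ into the scales $\mathcal{X}_k$, together with the geometric scale parameters of (18), can be sketched as follows (the concrete values of $\varepsilon$, $R$, and $d$ are illustrative):

```python
import numpy as np

# Logarithmic binning of X = {eps <= |x| < R} into scales
# X_k = {eps*gamma^{k-1} <= |x| < eps*gamma^k}, with gamma = (R/eps)^{1/d},
# and the matching scale parameters gamma_k = gamma^{k-d} (equation (18)).
eps, R, d = 0.01, 10.0, 6
gamma = (R / eps) ** (1 / d)
gammas = gamma ** (np.arange(d + 1) - d)      # gamma_0 = eps/R, ..., gamma_d = 1

def scale_of(x):
    """Return k with x in X_k; k - 1 = floor(log_gamma(|x|/eps))."""
    return int(np.floor(np.log(np.linalg.norm(x) / eps) / np.log(gamma))) + 1

assert np.isclose(gammas[0], eps / R) and np.isclose(gammas[-1], 1.0)
assert scale_of([eps]) == 1 and scale_of([R * 0.999]) == d
for k in range(1, d + 1):
    # an instance just below the upper boundary of X_k belongs to scale k
    assert scale_of([eps * gamma ** k * 0.999]) == k
```

The `scale_of` computation makes the complexity interpretation concrete: the scale index of an instance is an affine function of $\log|x|$.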
The training mechanism aims to make the mapping between the input and each level $h_k$ approximate $T_{[\gamma_k]}$, the dilated version of the target function $T$ at scale $\gamma_k$. For a successfully trained model, $f(\,\cdot\,; w_k)$ should effectively approximate $\psi_k(\cdot)$. Let $C_1 := M_1 M_2 R$. According to Theorem 25, we have
$$\|\psi_k\|_{\mathrm{Lip}} \le C_1(\gamma_k - \gamma_{k-1}) = C_1(\gamma - 1)\gamma^{k-d-1}. \tag{21}$$
Since $T(0) = 0$, we have $\psi_k(0) = 0$ for all $1 \le k \le d$. Thus, (21) implies that
$$|\psi_k(\cdot)| \le O(\gamma_k - \gamma_{k-1}) = O(\gamma_k). \tag{22}$$
Hence, we regularize our hypothesis set as follows:
$$|f(\,\cdot\,; w_k)| \le \rho_k, \quad \text{for all } w_k \in \mathcal{W}_k, \tag{23}$$
where $\rho_k$ is at least the maximum value of $|\psi_k(\cdot)|$.

In a successfully trained model, each level $h_k$ approximates $T_{[\gamma_k]}$. Knowing $T_{[\gamma_k]}$ is enough to determine the label of each instance $x \in \mathcal{X}_k$ with appropriate rescalings. Thus, we define the output of our model $h_w(x)$ as follows:
$$h_w(x) := \gamma_k\, h_k\!\left(\frac{x}{\gamma_k}\right) \quad \text{if } x \in \mathcal{X}_k, \qquad 1 \le k \le d.$$
Therefore, there is no necessity to propagate every data instance $x$ through all levels of our hierarchical model. Rather, the number of processing steps required to evaluate $h_w(x)$ for an input instance $x \in \mathcal{X}$ is proportional to the logarithm of the instance norm $|x|$, which determines the $\mathcal{X}_k$ to which $x$ belongs. Consequently, the logarithm of the norm of each data instance $x$ can be construed as a metric for its difficulty or complexity.

Let the $n$-tuple of training instances be denoted by $s = (x_1, \ldots, x_n)$. We assume the training set includes the instance-label pairs $(x_i, T(x_i))$ for all $1 \le i \le n$. The training mechanism commences with the simplest training examples, whose instances are at the smallest scale $\mathcal{X}_1$, and progressively trains the model's levels by using the larger-scaled (more complex) examples.

¹ In general, one can replace $T(\cdot)$ with $\tilde{T}(\cdot) := T(\cdot) - T(0)$ and learn this new function. Clearly, $\tilde{T}(0) = 0$.
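The scale-by-scale curriculum just described — sample each $w_k$ from a Gibbs measure over a finite $\mathcal{W}_k$, using only the training data at scale $k$ — can be sketched on a toy one-dimensional problem. Every concrete choice below (the residual map $f$, the parameter grid, the temperatures, the target $T = \tanh$) is an illustrative assumption, not the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

eps, R, d = 0.1, 1.0, 3
gamma = (R / eps) ** (1 / d)
gammas = gamma ** (np.arange(d + 1) - d)
T = np.tanh
h0 = lambda x: x                      # assumed prior knowledge: T[gamma_0] ~ T[0] = id

def f(h, w):                          # toy level map f(h; w_k) = w * h^3
    return w * h ** 3

# Training data and its partition into the scales s ∩ X_k
xs = rng.uniform(eps, R, size=200) * rng.choice([-1.0, 1.0], size=200)
ys = T(xs)
n = len(xs)
scales = np.minimum(np.floor(np.log(np.abs(xs) / eps) / np.log(gamma)).astype(int) + 1, d)

W_grid = np.linspace(-1.0, 1.0, 41)   # finite parameter set W_k
lambdas = [0.05] * d                  # temperature hyperparameters (lambda_k)
ws = []                               # sequentially sampled w_1, ..., w_d
for k in range(1, d + 1):
    mask = scales == k
    def level_out(x, w):              # gamma_k * h_k(x / gamma_k; w_{1:k})
        h = h0(x / gammas[k])
        for wj in ws:                 # previously sampled levels w_{1:(k-1)}
            h = h + f(h, wj)
        return gammas[k] * (h + f(h, w))
    # Empirical risk at scale k for each candidate w_k, then Gibbs sampling
    risks = np.array([np.sum(np.abs(level_out(xs[mask], w) - ys[mask])) / n
                      for w in W_grid])
    p = np.exp(-(risks - risks.min()) / lambdas[k - 1])
    ws.append(rng.choice(W_grid, p=p / p.sum()))

assert len(ws) == d
```

Subtracting `risks.min()` before exponentiating leaves the Gibbs probabilities unchanged while avoiding numerical underflow; the loop structure mirrors the sequential, scale-ordered sampling of the training mechanism.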
At each level, corresponding to each scale of the data, training is represented by sampling $w_k$ from a Gibbs measure whose loss (energy) is the empirical risk evaluated on that specific scale of the training data, with hyperparameter (temperature) $\lambda_k$. The temperature vector $(\lambda_k)_{k=1}^d$ consists of model hyperparameters that can be chosen based on cross-validation or by Corollary 34, which provides the values that optimize the derived bound on the statistical risk. It is well known that Gibbs probability measures are maximum-entropy distributions; see (Jaynes, 1957), a property that we later use in the analysis. Precisely, for all $1 \le k \le d$, given trained values for $w_{1:(k-1)}$, we sample the vector value for $w_k$ with the following probability:
$$P_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) := \frac{\exp\left( -\frac{1}{n\lambda_k} \sum_{x_i \in s \cap \mathcal{X}_k} \left| \gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:k}\right) - T(x_i) \right| \right)}{\sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{n\lambda_k} \sum_{x_i \in s \cap \mathcal{X}_k} \left| \gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:(k-1)} w'_k\right) - T(x_i) \right| \right)}. \tag{24}$$
This training mechanism is hierarchical and stochastic. Furthermore, it exhibits self-similarity, as at all steps we sample from Gibbs distributions whose loss functions have similar absolute-error forms, albeit on data at different scales and with different temperatures. We refer to this training mechanism as multiscale entropy-based training; see Algorithm 1.

Algorithm 1 Multiscale entropy-based training
Hyperparameters: Temperature vector $(\lambda_k)_{k=1}^d$.
Input: Training data $(x_i, T(x_i))_{i=1}^n$.
Output: Trained parameters $w_1, \ldots, w_d$.
1: for $k = 1$ to $d$ do
2: Given the training data at scale $k$ ($s \cap \mathcal{X}_k$) and the sampled values for $w_{1:(k-1)}$, obtain $w_k$ by sampling from the Gibbs measure:
$$P_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) \propto \exp\left( -\frac{1}{n\lambda_k} \sum_{x_i \in s \cap \mathcal{X}_k} \left| \gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:k}\right) - T(x_i) \right| \right).$$

The proposed learning model and the forthcoming analysis of its statistical risk possess the following advantages:

1. Hierarchical learning model with interpretable levels: The training objective is to ensure each level $h_k$ approximates the dilation of the target function $T$ at scale $\gamma_k$, that is, $T_{[\gamma_k]}$.
Thus, each level in our hierarchical learning model holds a meaningful interpretation, a departure from black-box hierarchical models such as commonly used neural networks. In other words, the mapping between the input of the whole compositional model
$$\left( \mathrm{id} + f(\,\cdot\,; w_d) \right) \circ \cdots \circ \left( \mathrm{id} + f(\,\cdot\,; w_1) \right) \circ T_{[\gamma_0]}$$
and any of its intermediate levels is aimed to be close to a dilation of the original target function at the corresponding scale.

2. Measuring the complexity of any data instance with the logarithm of its norm: For any data instance $x$, the logarithm of its norm $\log|x|$ can be interpreted as a measure of its complexity. This complexity determines the training and inference stage at which the label of $x$ can be predicted in the learning model. Thus, our learning model can also be viewed as a mathematical model for curriculum learning. For illustrative examples of norm-based complexity interpretations, consider the following scenarios: In finance, predicting fraud becomes increasingly challenging as a company's revenue resources grow. In image processing, higher degrees of sparsity lead to lower norms in feature vectors, making their targets easier to predict.

3. Computational savings in inference: Assume that $x \in \mathcal{X}_k$ is a new data instance whose label we want to predict using our model. To compute the model's output for $x$ and predict its label, it is sufficient to process $x/\gamma_k$ only up to the $k$th level and calculate $\gamma_k h_k(x/\gamma_k)$. In other words, there is no necessity to pass the instance through all $d$ levels of the learning model, and processing only the first $k$ levels is adequate, in contrast to commonly used neural networks. This efficiency stems from the fact that, on the one hand, $h_k$ approximates the dilation $T_{[\gamma_k]}$. On the other hand, given that $|x| < \gamma_k R$ and with proper rescaling, one can compute the value of $T(x)$ just by knowing $T_{[\gamma_k]}$. More precisely, we have $T(x) = \gamma_k T_{[\gamma_k]}(x/\gamma_k)$ whenever $|x| < \gamma_k R$.
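The early-exit inference of advantage 3 can be sketched concretely. The sketch below reuses the exact ladder components of Example 1 ($T = \tanh$, $\gamma_k = 2^{k-d}$, $R = 1$ — illustrative choices), so the early-exited output $\gamma_k h_k(x/\gamma_k)$ recovers $T(x)$ exactly while touching only the first $k$ levels:

```python
import numpy as np

d = 5
gammas = [2.0 ** (k - d) for k in range(d + 1)]   # gamma_k = 2^{k-d} (Example 1)

def psi(k, h):
    """Closed-form ladder component psi_k from Example 1."""
    g = gammas[k - 1]
    return -g**2 * h**3 / (1 + g**2 * h**2)

def predict(x):
    """Early-exit evaluation: stop at the first scale k with |x| < gamma_k * R (R = 1)."""
    k = next(j for j in range(1, d + 1) if abs(x) < gammas[j])
    h = np.tanh(gammas[0] * (x / gammas[k])) / gammas[0]   # h_0 = T[gamma_0]
    for j in range(1, k + 1):
        h = h + psi(j, h)                                  # h_j = h_{j-1} + psi_j(h_{j-1})
    return gammas[k] * h, k

y, k = predict(0.05)          # small-norm instance exits after k = 1 of d = 5 levels
assert k < d
assert np.isclose(y, np.tanh(0.05))
```

The number of iterations of the inner loop equals the scale index $k$, i.e., it grows with $\log|x|$ rather than with the total depth $d$.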
In practical terms, when using the trained model to predict the target of new data, the computational workload for computing the output of the learned model on a particular data instance is directly tied to the complexity of that instance. Since, by assumption (such as scale-invariant distributions (1)), data instances are distributed heterogeneously across different scales and difficulties, this characteristic may lead to substantial computational savings and increased inference speed.

4. Statistical guarantee stronger than uniform convergence: The statistical analysis of the risk of the trained compositional model is tailored to the hierarchical training mechanism and takes its multiscale structure into account when deriving the bound on its statistical risk. Consequently, the bound can be much sharper than a uniform convergence bound for the classical empirical-risk-minimizing training algorithm, as shown in the next section.

5. Robustness to interruptions during training: The training of current hierarchical learning models (such as neural networks) on massive datasets may take extensive amounts of time and is prone to interruptions. Our sequential training mechanism consists of $d$ stages. If, for any reason, this mechanism terminates prematurely after stage $k$ for any $1 \le k \le d-1$, we can still ensure the availability of a useful model. More precisely, this model remains capable of accurately predicting the labels of data instances belonging to $\mathcal{X}_1 \cup \cdots \cup \mathcal{X}_k$, which are all $x \in \mathbb{R}^m$ such that $\varepsilon \le |x| < \varepsilon\gamma^k$.

6. Computational savings in training for diverse users: The proposed learning model can provide computational savings when there are $d$ different users, each requiring accurate prediction of the labels of data instances at scale $k$ (set $\mathcal{X}_k$) for $k = 1, \ldots, d$. Instead of training separate models for each of the $d$ users, we streamline the process by using the trained model for user $k-1$ as prior information in training the model for user $k$, for all $2 \le k \le d$.

5. Statistical Analysis of the Learning Model

In this section, we statistically analyze the learning model's performance. In Subsection 5.1, we prove an upper bound on the training mechanism's chained risk. Then, in Subsection 5.2, we show that if the data instance distribution $\mu$ is scale-invariant, we can bound the statistical risk from above based on the chained risk.

5.1 Multiscale Entropy-Based Training and the Chained Risk

For all $1 \le k \le d$, we define the loss at scale $k$ as
$$\ell_k(w_{1:k}, x) := \begin{cases} |h_w(x) - T(x)| = \left| \gamma_k h_k\!\left(\frac{x}{\gamma_k}; w_{1:k}\right) - T(x) \right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k. \end{cases} \tag{25}$$
Furthermore, suppose that
$$\ell_k(w_{1:k}, s) := \frac{1}{n} \sum_{i=1}^n \ell_k(w_{1:k}, x_i).$$
Equation (24) in the previous section can be written more succinctly as
$$P^*_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) = \frac{\exp\left( -\frac{1}{\lambda_k} \ell_k(w_{1:k}, s) \right)}{\sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right)}.$$
We denote the whole tuple of random parameters by $W := (W_1, \ldots, W_d)$. The measure $P^*_W := P^*_{W_1} P^*_{W_2|W_1} \cdots P^*_{W_d|W_{1:(d-1)}}$ models the total training mechanism. For ease of presentation, define the extra parameter $\lambda_{d+1} := 0$. The next theorem indicates that the self-similar measure $P^*_W$ is the minimizing distribution of the sum of a multiscale loss and a multiscale entropy. Consider the definition of the multiscale loss
$$\ell^{(\lambda)}(w, s) := \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) \right), \tag{26}$$
where, for all $1 \le k \le d$,
$$\tilde{\ell}_k\left(w_{1:(k-1)}, s\right) := -\lambda_k \log \sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right) \tag{27}$$
is a Log-Sum-Exp function.

Theorem 29 We have
$$P^*_W = \arg\min_{P_W} \left\{ \mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) \right\}. \tag{28}$$

Proof We develop what we call the multiscale congruent technique used in the proof of (Asadi and Abbe, 2020, Theorem 13). Specifically, recalling the definition of congruent functionals (Definition 6), we aim to show that, as a functional of $P_W$,
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) = \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, P^*_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right). \tag{29}$$
This will then immediately imply (28), as setting $P_W = P^*_W$ makes the relative entropies on the right side of (29) vanish altogether.
For all $1 \le k \le d$, define
$$L_k(w_{1:k}, s) := \sum_{j=1}^k \ell_j(w_{1:j}, s), \qquad Q^{(k)}_{W_{1:k}}(w_{1:k}) := \frac{\exp\left( -\frac{1}{\lambda_k} L_k(w_{1:k}, s) \right)}{\sum_{w'_{1:k} \in \mathcal{W}_1 \times \cdots \times \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} L_k\left(w'_{1:k}, s\right) \right)}.$$
Recall that $U_{W_{1:d}}$ denotes the equiprobable distribution, and let $\cong$ denote congruence of functionals of $P_W$, that is, equality up to an additive constant independent of $P_W$ (Definition 6). We can write
$$\mathbb{E}\left[ \sum_{j=1}^d \ell_j(W_{1:j}, s) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) = \sum_{k=1}^{d-1} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \left( \mathbb{E}[L_d(W_{1:d}, s)] + \lambda_d D\left( P_{W_{1:d}} \| U_{W_{1:d}} \right) \right)$$
$$\cong \sum_{k=1}^{d-1} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_d D\left( P_{W_{1:d}} \,\middle\|\, Q^{(d)}_{W_{1:d}} \right) \tag{30}$$
$$= \sum_{k=1}^{d-1} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_d D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right) \tag{31}$$
$$= \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + (\lambda_{d-1} - \lambda_d) D\left( P_{W_{1:(d-1)}} \| U_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right)$$
$$\cong \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_{d-1} D\left( P_{W_{1:(d-1)}} \,\middle\|\, \left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right), \tag{32}$$
where (30) is based on the Gibbs variational principle (Lemma 7), (31) is based on the chain rule of relative entropy (Lemma 3), and (32) is based on Lemma 5 on entropy combination, with $[\,\cdot\,]^{\lambda_d/\lambda_{d-1}}$ denoting the escort distribution with exponent $\lambda_d/\lambda_{d-1}$. To reiterate, we have derived
$$\mathbb{E}\left[ \sum_{j=1}^d \ell_j(W_{1:j}, s) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) \cong \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_{d-1} D\left( P_{W_{1:(d-1)}} \,\middle\|\, \left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right). \tag{33}$$
Marginalizing the distribution $Q^{(d)}_{W_{1:d}}$ over $W_d$ yields
$$Q^{(d)}_{W_{1:(d-1)}}(w_{1:(d-1)}) \propto \exp\left( -\frac{1}{\lambda_d} L_{d-1}\left(w_{1:(d-1)}, s\right) \right) \sum_{w'_d \in \mathcal{W}_d} \exp\left( -\frac{1}{\lambda_d} \ell_d\left(w_{1:(d-1)} w'_d, s\right) \right).$$
Thus, its escort distribution is
$$\left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}}(w_{1:(d-1)}) \propto \exp\left( -\frac{1}{\lambda_{d-1}} L_{d-1}\left(w_{1:(d-1)}, s\right) \right) \left( \sum_{w'_d \in \mathcal{W}_d} \exp\left( -\frac{1}{\lambda_d} \ell_d\left(w_{1:(d-1)} w'_d, s\right) \right) \right)^{\frac{\lambda_d}{\lambda_{d-1}}}.$$
Let $Z$ and $Z'$ denote the normalizing constants (partition functions) of $\left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\lambda_d/\lambda_{d-1}}$ and $Q^{(d-1)}_{W_{1:(d-1)}}$, respectively.
We have
$$D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d-1)}_{W_{1:(d-1)}} \right) - D\left( P_{W_{1:(d-1)}} \,\middle\|\, \left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}} \right) = \sum_{w_{1:(d-1)}} P_{W_{1:(d-1)}}\left(w_{1:(d-1)}\right) \log \frac{\left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}}\left(w_{1:(d-1)}\right)}{Q^{(d-1)}_{W_{1:(d-1)}}\left(w_{1:(d-1)}\right)}$$
$$= \frac{\lambda_d}{\lambda_{d-1}} \sum_{w_{1:(d-1)}} P_{W_{1:(d-1)}}\left(w_{1:(d-1)}\right) \log \sum_{w'_d \in \mathcal{W}_d} \exp\left( -\frac{1}{\lambda_d} \ell_d\left(w_{1:(d-1)} w'_d, s\right) \right) + \log \frac{Z'}{Z} \cong -\frac{1}{\lambda_{d-1}}\, \mathbb{E}\left[ \tilde{\ell}_d\left(W_{1:(d-1)}, s\right) \right]. \tag{34}$$
By adding $\lambda_{d-1}$ times both sides of (34) to (33), we obtain
$$\mathbb{E}\left[ \sum_{j=1}^d \ell_j(W_{1:j}, s) - \tilde{\ell}_d\left(W_{1:(d-1)}, s\right) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right)$$
$$\cong \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_{d-1} D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d-1)}_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right).$$
By iterating this argument for $k = d-1, \ldots, 1$, we deduce that
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) \cong \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, Q^{(k)}_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right). \tag{35}$$
For all $1 \le k \le d$, it is apparent that $D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) = \log\left( |\mathcal{W}_1 \times \cdots \times \mathcal{W}_k| \right) - H(W_{1:k})$. Thus, based on (35), we obtain
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) \cong \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, Q^{(k)}_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right).$$
Since, for all $1 \le k \le d$, $Q^{(k)}_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) = P^*_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right)$, we deduce that
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) \cong \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, P^*_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right).$$
Finally, since both sides of this congruence vanish at $P_W = P^*_W$, the additive constant is zero, and the congruence is in fact the equality (29), as desired.

Straightforwardly, the result extends to random training sequences $S := (X_1, \ldots, X_n) \sim \mu^{\otimes n}$.

Corollary 30 Assume that for all $1 \le k \le d$,
$$P^*_{W_k|W_{1:(k-1)} S}\left(w_k \mid w_{1:(k-1)} s\right) := \frac{\exp\left( -\frac{1}{\lambda_k} \ell_k(w_{1:k}, s) \right)}{\sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right)}.$$
Let $P^*_{W|S} = P^*_{W_1|S} P^*_{W_2|W_1 S} \cdots P^*_{W_d|W_{1:(d-1)} S}$. Then,
$$P^*_{W|S} = \arg\min_{P_{W|S}} \left\{ \mathbb{E}\left[ \ell^{(\lambda)}(W, S) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}|S) \right\}, \tag{36}$$
where $(S, W) \sim \mu^{\otimes n} P_{W|S}$.

Proof The expression in (36) is linear in $P_S = \mu^{\otimes n}$. Conditioned on any realization $S = s$, we can use Theorem 29 to yield the result.

Recall the definition of the function $\psi_k$, for all $1 \le k \le d$, in the ladder decomposition (19).
We assume that our hypothesis set is realizable, meaning that there exist parameters $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_d)$ in our hypothesis index set such that, for all $1 \le k \le d$,
$$\psi_k(\cdot) = f(\,\cdot\,; \hat{w}_k). \tag{37}$$
In other words, we assume that the function $T$ belongs to the hypothesis set and is represented with parameters $\hat{w}$ as $h_{\hat{w}} = T$. In case $T$ is represented by several different parameterizations, we choose one of them and denote it by $(\hat{w}_1, \ldots, \hat{w}_d)$. Now, consider the definition of the chained risk:

Definition 31 (Chained Risk) Let $X \sim \mu$. For any $w \in \mathcal{W}$, we define the chained risk of $w$ as
$$L^{(C)}_\mu(w) := \sum_{k=1}^d \mathbb{E}\left[ \ell_k(w_{1:k}, X) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right]. \tag{38}$$
The training mechanism of the learning model chooses the values of the parameters $w_1, \ldots, w_d$ sequentially. At the $k$th stage, choosing $w_k$ instead of the true target function parameter $\hat{w}_k$ results in the following difference between the risks at scale $k$:
$$\mathbb{E}\left[ \ell_k(w_{1:k}, X) \right] - \mathbb{E}\left[ \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right].$$
The chained risk is equal to the accumulation of these deviations of risk at each of the $d$ stages of the training mechanism. Obviously, we have $L^{(C)}_\mu(\hat{w}) = 0$. In an intuitive and approximate sense, if the chained risk of $w$ is small, then it suggests that $h_w$ could be close to the target function $h_{\hat{w}} = T$.

It is informative to compare the chained risk with the statistical risk (6). Based on the definition of $\ell_k(w_{1:k}, x)$ in (25), for all $1 \le k \le d$, the loss of the model with parameters $w$ on any data instance $x \in \mathcal{X}$ is equal to $\ell(w, x) = \sum_{k=1}^d \ell_k(w_{1:k}, x)$. Assume that $X \sim \mu$. The statistical risk of the model with parameters $w \in \mathcal{W}$ is
$$L_\mu(w) := \mathbb{E}[\ell(w, X)] = \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] = \mathbb{E}\left[ \sum_{k=1}^d \left( \ell_k(w_{1:k}, X) - \ell_k(\hat{w}_{1:k}, X) \right) \right], \tag{39}$$
where (39) is based on the realizability assumption $L_\mu(\hat{w}) = 0$. The difference between the right sides of (38) and (39) lies between the terms $\ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right)$ and $\ell_k(\hat{w}_{1:k}, X)$.
In fact, since the losses are nonnegative, we have
$$L^{(C)}_\mu(w) = \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] - \mathbb{E}\left[ \sum_{k=1}^d \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right] \le \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] = L_\mu(w).$$
Hence, the chained risk is always smaller than or equal to the statistical risk. However, in the next subsection, for scale-invariant data instance distributions $\mu$ with sufficiently large shape parameters, we find a reverse form of this inequality: an upper bound on the statistical risk based on the chained risk.

Remark 32 The difference between the chained risk and the statistical risk can be large when the data instance distribution $\mu$ is not multiscale. For example, assume that $\mu$ is supported only on $\mathcal{X}_d$. Consider parameters $w = (w_1, \ldots, w_{d-1}, \hat{w}_d)$. We claim that the chained risk of $w$ is zero. Since the probability that $X \sim \mu$ takes values in $\mathcal{X}_1 \cup \cdots \cup \mathcal{X}_{d-1}$ is zero, for any $1 \le k \le d-1$ we have $\mathbb{E}[\ell_k(w_{1:k}, X)] = \mathbb{E}\left[ \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right] = 0$. Therefore,
$$L^{(C)}_\mu(w) = \mathbb{E}[\ell_d(w_{1:d}, X)] - \mathbb{E}\left[ \ell_d\left(w_{1:(d-1)} \hat{w}_d, X\right) \right] = 0.$$
On the other hand, we have
$$L_\mu(w) = \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] = \mathbb{E}[\ell_d(w_{1:d}, X)].$$
However, if the parameters $w_1, \ldots, w_{d-1}$ are chosen significantly far from the optimal values $\hat{w}_1, \ldots, \hat{w}_{d-1}$, then $h_w$ can be far from the target function $T$ and $\mathbb{E}[\ell_d(w_{1:d}, X)]$ tends to be large, thereby resulting in a high statistical risk.

In the next theorem, we derive an upper bound on the expected value of the chained risk $\mathbb{E}\left[ L^{(C)}_\mu(W) \right]$ for the output of the training mechanism $P^*_{W|S}$. Recall the definition of $\{\rho_k\}_{k=1}^d$ from the regularization of the hypothesis set in (23).

Theorem 33 Let $(S, W) \sim P_S P^*_{W|S}$. Then,
$$\mathbb{E}\left[ L^{(C)}_\mu(W) \right] \le 2 \sum_{k=1}^d \left( (\lambda_k - \lambda_{k+1}) \sum_{j=1}^k \log|\mathcal{W}_j| + \frac{4\gamma_k^2 \rho_k^2}{n(\lambda_k - \lambda_{k+1})} \right).$$

Proof Define
$$\ell^{(C)}(w, s) := \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, s\right) \right).$$
Recall from (26) and (27) that
$$\ell^{(\lambda)}(w, s) := \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) \right), \quad \text{where} \quad \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) := -\lambda_k \log \sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right)$$
for all $1 \le k \le d$. Each $\tilde{\ell}_k\left(w_{1:(k-1)}, s\right)$ is a Log-Sum-Exp function, thus Lemma 8 yields
$$\ell^{(C)}(w, s) = \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, s\right) \right) \le \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \min_{w'_k \in \mathcal{W}_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right) \le \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) \right) = \ell^{(\lambda)}(w, s). \tag{40}$$
Lemma 8 also implies that
$$\ell^{(\lambda)}(\hat{w}, s) = \sum_{k=1}^d \left( \ell_k(\hat{w}_{1:k}, s) - \tilde{\ell}_k\left(\hat{w}_{1:(k-1)}, s\right) \right) \le \sum_{k=1}^d \left( \ell_k(\hat{w}_{1:k}, s) - \min_{w'_k \in \mathcal{W}_k} \ell_k\left(\hat{w}_{1:(k-1)} w'_k, s\right) + \lambda_k \log|\mathcal{W}_k| \right) \le \sum_{k=1}^d \lambda_k \log|\mathcal{W}_k|. \tag{41}$$
Let $\bar{W}$ and $\bar{S} = (\bar{X}_1, \ldots, \bar{X}_n)$ be independent copies of $W$ and $S$, respectively, and assume that $\bar{W}$ and $\bar{S}$ are independent from each other; thus $(\bar{W}, \bar{S}) \sim P_W P_S$. We have $\mathbb{E}\left[ L^{(C)}_\mu(W) \right] = \mathbb{E}\left[ \ell^{(C)}(\bar{W}, \bar{S}) \right]$. Based on (20) and (25), for any fixed $w = (w_1, \ldots, w_d)$ and all $1 \le k \le d$, we can write
$$\ell_k(w_{1:k}, x) = \begin{cases} \left| \gamma_k \left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right) + f\!\left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right); w_k \right) \right) - T(x) \right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k, \end{cases} \qquad \ell_k\left(w_{1:(k-1)} \hat{w}_k, x\right) = \begin{cases} \left| \gamma_k \left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right) + f\!\left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right); \hat{w}_k \right) \right) - T(x) \right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k. \end{cases}$$
The triangle inequality implies that, for any $a, b \in \mathbb{R}^m$, we have $||a| - |b|| \le |a - b|$. Thus, for any $x \in \mathcal{X}$,
$$\left| \ell_k(w_{1:k}, x) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, x\right) \right| \le \gamma_k \left| f\!\left( h_{k-1}\!\left(\tfrac{x}{\gamma_k}\right); w_k \right) - f\!\left( h_{k-1}\!\left(\tfrac{x}{\gamma_k}\right); \hat{w}_k \right) \right| \le 2\gamma_k \rho_k, \tag{42}$$
where (42) is due to (23). Hence, based on the Azuma–Hoeffding inequality (Lemma 11), for any fixed $w$ and all $1 \le k \le d$,
$$\ell_k(w_{1:k}, \bar{S}) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, \bar{S}\right) = \frac{1}{n} \sum_{i=1}^n \left( \ell_k(w_{1:k}, \bar{X}_i) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, \bar{X}_i\right) \right) \tag{43}$$
is $4\gamma_k\rho_k/\sqrt{n}$-subgaussian. Since $\bar{W}$ and $\bar{S}$ are independent, $\ell_k(\bar{W}_{1:k}, \bar{S}) - \ell_k\left(\bar{W}_{1:(k-1)} \hat{w}_k, \bar{S}\right)$ is $4\gamma_k\rho_k/\sqrt{n}$-subgaussian as well. Writing $\Delta\lambda_k := \lambda_k - \lambda_{k+1}$ for all $1 \le k \le d$, using Lemma 10 for all $1 \le k \le d$ and adding up the derived inequalities, we can write
$$\mathbb{E}\left[ \ell^{(C)}(\bar{W}, \bar{S}) \right] - \mathbb{E}\left[ \ell^{(C)}(W, S) \right] \le \sum_{k=1}^d \left( \Delta\lambda_k \left( \log|\mathcal{W}_1 \times \cdots \times \mathcal{W}_k| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) = \sum_{k=1}^d \left( \Delta\lambda_k \left( \sum_{j=1}^k \log|\mathcal{W}_j| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right). \tag{44}$$
Therefore,
$$\mathbb{E}\left[ L^{(C)}_\mu(W) \right] = \mathbb{E}\left[ \ell^{(C)}(\bar{W}, \bar{S}) \right] \le \mathbb{E}\left[ \ell^{(C)}(W, S) \right] + \sum_{k=1}^d \left( \Delta\lambda_k \left( \sum_{j=1}^k \log|\mathcal{W}_j| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{45}$$
$$\le \mathbb{E}\left[ \ell^{(\lambda)}(W, S) \right] + \sum_{k=1}^d \left( \Delta\lambda_k \left( \sum_{j=1}^k \log|\mathcal{W}_j| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{46}$$
$$\le \mathbb{E}\left[ \ell^{(\lambda)}(\hat{w}, S) \right] + \sum_{k=1}^d \left( \Delta\lambda_k \sum_{j=1}^k \log|\mathcal{W}_j| + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{47}$$
$$\le \sum_{k=1}^d \left( \Delta\lambda_k \sum_{j=1}^k \log|\mathcal{W}_j| + \lambda_k \log|\mathcal{W}_k| + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{48}$$
$$= 2\sum_{k=1}^d \left( \Delta\lambda_k \sum_{j=1}^k \log|\mathcal{W}_j| + \frac{4\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right), \tag{49}$$
where (45) is obtained by rewriting (44), (46) is based on (40), (47) is obtained from Corollary 30 by replacing $P^*_{W|S}$ with the conditional distribution $P_{W|S} = \delta_{\hat{w}}$ (the Dirac measure on $\hat{w}$), (48) is based on (41), and (49) is a simple calculation based on summation by parts.
Optimizing the bound in (49) over the values of $(\lambda_1, \ldots, \lambda_d)$ gives the following result:

Corollary 34 Assume that $(\lambda_1, \ldots, \lambda_d)$ are chosen such that for all $1 \le k \le d$,
$$\lambda_k - \lambda_{k+1} = 2\gamma_k \rho_k \sqrt{\frac{1}{n \sum_{j=1}^k \log|\mathcal{W}_j|}}. \tag{50}$$
Then, the right side of (49) is minimized with respect to $(\lambda_1, \ldots, \lambda_d)$. In this case, the bound simplifies to the following form:
$$\mathbb{E}\left[ L^{(C)}_\mu(W) \right] \le \frac{8}{\sqrt{n}} \sum_{k=1}^d \gamma_k \rho_k \sqrt{\sum_{j=1}^k \log|\mathcal{W}_j|}. \tag{51}$$

5.2 Bounding the Statistical Risk Based on the Chained Risk

Until now, the analysis did not require any restrictions on the data instance distribution $\mu$; however, we could derive an upper bound only on the expected chained risk of the training mechanism. We also observed that the chained risk is always smaller than or equal to the statistical risk. In this subsection, we show that if $\mu$ is scale-invariant, a small chained risk can imply a small statistical risk.

Assume that the instance distribution $\mu$ is scale-invariant: if $q(x)$ denotes the density function of $\mu$, then there exists a shape parameter $\alpha > 0$ such that for all $x \in \mathcal{X}$ such that $x/\gamma \in \mathcal{X}$,
$$q\!\left(\frac{x}{\gamma}\right) = \gamma^\alpha q(x). \tag{52}$$
Thus, scale-invariant distributions have homogeneous density functions. One particular example of such distributions is the following power-law probability density function with shape parameter $\alpha$:
$$q(x) = \frac{1}{C_q |x|^\alpha} \quad \text{for all } x \in \mathcal{X}, \quad \text{where } C_q = \int_{\mathcal{X}} |x|^{-\alpha}\,dx.$$
Given this assumption, in the following result we show an upper bound on the statistical risk based on the chained risk:

Theorem 35 If $\mu$ is a scale-invariant probability distribution with shape parameter $\alpha$ defined on $\mathcal{X} \subseteq \mathbb{R}^m$, then for any $w \in \mathcal{W}$, we have
$$\left( 1 - \gamma^{m+1-\alpha} \left( 1 + C_1(\gamma - 1)\gamma^{-1} \right) \right) L_\mu(w) \le L^{(C)}_\mu(w).$$

Proof For ease of presentation, let $\hat{h}_k(x)$ denote the $k$th level of the model given parameters $\hat{w}$ and input $x$, for any $1 \le k \le d$.
For any 2 k d and any x Xk, we have ℓk w1:(k 1) ˆwk, x = γk(id + ψk) hk 1 = γk(id + ψk) hk 1 γk(id + ψk) ˆhk 1 + C1(γ 1)γk d 1γk = γk 1 + C1(γ 1)γk d 1 hk 1 where in (53), we used the fact that ψk Lip C1(γ 1)γk d 1, according to (21). Now, define x := γk 1 We observe that x Xk 1, and that the transformation x x establishes a bijection between Xk and Xk 1. Therefore, we obtain ℓk w1:(k 1) ˆwk, x γk 1 + C1(γ 1)γk d 1 hk 1 = γk 1 + C1(γ 1)γk d 1 hk 1 1 + C1(γ 1)γk d 1 γk 1hk 1 = γ 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x . (54) Since x Rm, the change of variables formula for differentials is dx = γmdx . Let X be distributed according to density q(x). Based on (54), for all 2 k d, we derive E ℓk w1:(k 1) ˆwk, X x Xk ℓk w1:(k 1) ˆwk, x q(x)dx x Xk γ 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x q(x)dx x Xk γ 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x γ αq x dx (55) x Xk 1 γm+1 α 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x q x dx = γm+1 α 1 + C1(γ 1)γk d 1 E ℓk 1 w1:(k 1), X γm+1 α 1 + C1(γ 1)γ 1 E ℓk 1 w1:(k 1), X , (56) where (55) is due to the scale invariance property (52). We can extend inequality (56) to the case k = 1 by defining ℓ0 0. Thus, L(C) µ (w) = E[ℓk(w1:k, X)] E ℓk w1:(k 1) ˆwk, X E[ℓk(w1:k, X)] γm+1 α 1 + C1(γ 1)γ 1 E ℓk 1 w1:(k 1), X (57) k=1 ℓk(w1:k, X) γm+1 α 1 + C1(γ 1)γ 1 E k=1 ℓk 1 w1:(k 1), X # k=1 ℓk(w1:k, X) γm+1 α 1 + C1(γ 1)γ 1 E k=1 ℓk(w1:k, X) = 1 γm+1 α 1 + C1(γ 1)γ 1 Lµ(w), where (57) is based on (56). An Entropy-Based Model for Hierarchical Learning Theorem 35 in conjunction with Theorem 33 or Corollary 34 yields an upper bound on the expected statistical risk of multiscale entropy-based training, as follows: Corollary 36 If the shape parameter α of density q satisfies α > m + 1 + log 1 + C1(γ 1)γ 1 log γ , (58) then the following upper bound on the expected statistical risk holds: E[Lµ(W)] 2 Pd k=1 (λk λk+1) Pk j=1 log |Wj| + 4γ2 kρ2 k n(λk λk+1) 1 γm+1 α(1 + C1(γ 1)γ 1) . (59) If the hyperparameters (λ1, . . . 
, λ_d) are chosen as in (50), then (59) reduces to

E[L_µ(W)] ≤ (8/√n) Σ_{k=1}^{d} γ_k ρ_k √(Σ_{j=1}^{k} log|W_j|) / (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1})).    (60)

Proof Inequality (58) is equivalent to 1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1}) > 0, which, in conjunction with Theorem 35 and Theorem 33 or Corollary 34, yields the results.

Given the realizability assumption on the hypothesis set, the regular union bound applied to the empirical-risk-minimizing hypothesis yields

E[L_µ(W_ERM)] ≤ Σ_{k=1}^{d} ρ_k √((1/n) Σ_{j=1}^{d} log|W_j|).    (61)

Even when ignoring the effect of the factor (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1}))/8, the right side of (60) can be considerably smaller than the right side of (61). Consider the following example:

Example 2 Recall that γ_0 := R/ε, γ = (γ_0)^{1/d} and γ_k = γ^{k−d} for all 1 ≤ k ≤ d, as in (18). Assume that ρ_k = ρ_0 γ_k for all 1 ≤ k ≤ d and |W_1| = · · · = |W_d|. We compute the following ratio:

Λ := (Σ_{k=1}^{d} γ_k ρ_k √(Σ_{j=1}^{k} log|W_j|) / Σ_{k=1}^{d} ρ_k √(Σ_{j=1}^{d} log|W_j|))² = (Σ_{k=1}^{d} γ^{2(k−d)} √(k/d) / Σ_{k=1}^{d} γ^{k−d})².

The power of two exists to compare the bounds on the required number of samples n. For example, given γ_0 = 10 and d = 20, we obtain Λ ≈ 0.2648.

6. Bounded-Norm Parameterization Example

In this section, for the case of one-dimensional functions (m = 1), we show that any (M_1, M_2)-diffeomorphism defined on (−R, R) can be represented by a hierarchical parameterized model with bounded-norm parameters of the form (20). The approach to proving this is to construct the ladder decomposition of the diffeomorphism. We then proceed to discretize the parameters of this model and apply the bound of Corollary 36 to the corresponding hypothesis set.

Assume that the target function T is an (M_1, M_2)-diffeomorphism with T(0) = 0, and consider its ladder decomposition at scale parameters {γ_k}_{k=1}^{d}. The range of T^{[γ_{k−1}]}, which is the domain of ψ_k, is an interval subset of (−M_1R, M_1R) that includes 0. Note that since an (M_1, M_2)-diffeomorphism is M_1-bilipschitz, it is easy to observe that M_1 ≥ 1. Therefore (−M_1R, M_1R) includes (−R, R).
Let Ψ : R → R be an arbitrary invertible, φ_1-Lipschitz and φ_2-smooth function with support D_Ψ = (a_1, a_2) ⊆ (−M_1R, M_1R) such that 0 ∈ D_Ψ (that is, a_1 < 0 < a_2) and Ψ(0) = 0. We will later replace Ψ with each function ψ_k of the ladder decomposition of the target function T. For any x ∈ D_Ψ, we can write

Ψ(x) = ∫_{a_1}^{x} Ψ′(b) db + Ψ(a_1) = ∫_{a_1}^{a_2} Ψ′(b) Θ(x − b) db + Ψ(a_1),    (62)

where Θ(·) is the Heaviside (unit) step function. Since Ψ is φ_1-Lipschitz, for all b ∈ D_Ψ we have |Ψ′(b)| ≤ φ_1. Moreover, since Ψ(0) = 0, we conclude that |Ψ(a_1)| ≤ φ_1|a_1| ≤ φ_1 M_1R. The idea is now to derive the Riemann sum approximation to the integral representation of Ψ(x) in (62) and to view it as a two-layer neural network. For a given integer τ ≥ 2, let a τ-width two-layer network with continuous parameters w := {w^{(1)}, . . . , w^{(τ)}, w^{(c)}} be defined for any x ∈ (−M_1R, M_1R) as

ψ^{(τ)}_w(x) := Σ_{j=1}^{τ} w^{(j)} Θ(x − b_j) + w^{(c)}.    (63)

For all 1 ≤ j ≤ τ, we set b_j := −M_1R + jΔ. Clearly, {b_j}_{j=1}^{τ} is an arithmetic progression with common difference Δ := 2M_1R/τ, where b_1 > −M_1R and b_τ = M_1R. We allow the parameters of (63) to take the values

w^{(j)} := Δ Ψ′(b_j) if b_j ∈ D_Ψ,   w^{(j)} := 0 if b_j ∉ D_Ψ,   and   w^{(c)} := Ψ(a_1).

Using the Riemann sum approximation bound, given as Lemma 12 in Section 2, we deduce the following result:

Lemma 37 For all x ∈ D_Ψ, we have

|ψ^{(τ)}_w(x) − Ψ(x)| ≤ (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2.

Proof Recall that since Ψ is φ_1-Lipschitz, for all b ∈ D_Ψ we have |Ψ′(b)| ≤ φ_1. If x − a_1 < Δ, then ψ^{(τ)}_w(x) = Ψ(a_1). Based on the triangle inequality, we have

|ψ^{(τ)}_w(x) − Ψ(x)| = |∫_{a_1}^{x} Ψ′(b) db| ≤ ∫_{a_1}^{x} |Ψ′(b)| db < φ_1 Δ = 2M_1Rφ_1/τ,

which implies the result. Therefore, assume that x − a_1 ≥ Δ, and let j_1 and j_2 be the smallest and largest integers j such that a_1 ≤ b_j ≤ x, respectively. We have

|ψ^{(τ)}_w(x) − Ψ(x)| = |Σ_{j=1}^{τ} w^{(j)} Θ(x − b_j) − ∫_{a_1}^{a_2} Ψ′(b) Θ(x − b) db|
 = |Σ_{j=j_1}^{j_2} Δ Ψ′(b_j) − ∫_{a_1}^{x} Ψ′(b) db|
 ≤ |Σ_{j=j_1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| + |∫_{a_1}^{b_{j_1}} Ψ′(b) db| + |∫_{b_{j_2}}^{x} Ψ′(b) db|
 ≤ |Σ_{j=j_1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| + 2Δφ_1    (64)
 ≤ |Σ_{j=j_1+1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| + 3Δφ_1,    (65)

where (64) and (65) are based on the triangle inequality.
If j_2 = j_1, then the sum over j is empty and

|Σ_{j=j_1+1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| = 0,

which implies the result. Otherwise, we have j_2 > j_1. In this case, we obtain

|Σ_{j=j_1+1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| ≤ φ_2 (b_{j_2} − b_{j_1})² / (2(j_2 − j_1))    (66)
 = φ_2 (j_2 − j_1)² Δ² / (2(j_2 − j_1)) = φ_2 (j_2 − j_1) Δ²/2    (67)
 ≤ φ_2 · 2M_1R · Δ/2 = (2(M_1R)²/τ) φ_2,    (68)

where (66) is based on Lemma 12, and (68) is based on the fact that (j_2 − j_1)Δ = b_{j_2} − b_{j_1} ≤ b_τ − b_1 < 2M_1R. The proof of the statement is now complete.

We proceed to discretize the parameters of the network. Let η > 0 be the precision level of the parameters. For any y ∈ R, we denote by [y]_η the closest real number in ηZ = {. . . , −2η, −η, 0, η, 2η, . . . } to y. For any function Ψ, the approximate τ-width two-layer network with discretized parameters at precision level η is

ψ^{(τ,η)}_w(x) := Σ_{j=1}^{τ} [w^{(j)}]_η Θ(x − b_j) + [w^{(c)}]_η.

We define the norm of the network ψ^{(τ,η)}_w as the sum of the absolute values of its parameters:

norm(ψ^{(τ,η)}_w) := Σ_{j=1}^{τ} |[w^{(j)}]_η| + |[w^{(c)}]_η|.

The next result studies the approximation error, the output norm, and the parameter norm of the network ψ^{(τ,η)}_w.

Proposition 38 The two-layer network ψ^{(τ,η)}_w, approximating the function Ψ, satisfies the following properties:

(a) For all x ∈ D_Ψ, |ψ^{(τ,η)}_w(x) − Ψ(x)| ≤ (τ + 1)η/2 + (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2.

(b) For all x ∈ (−M_1R, M_1R), |ψ^{(τ,η)}_w(x)| ≤ norm(ψ^{(τ,η)}_w).

(c) We have norm(ψ^{(τ,η)}_w) ≤ (τ + 1)η/2 + (3M_1R + 2M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2 =: ϱ(φ_1, φ_2).    (70)

Proof (a) For all x ∈ D_Ψ, we have

|ψ^{(τ,η)}_w(x) − Ψ(x)| ≤ |ψ^{(τ,η)}_w(x) − ψ^{(τ)}_w(x)| + |ψ^{(τ)}_w(x) − Ψ(x)|
 ≤ Σ_{j=1}^{τ} |w^{(j)} − [w^{(j)}]_η| + |w^{(c)} − [w^{(c)}]_η| + (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2    (71)
 ≤ (τ + 1)η/2 + (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2,

where (71) is based on Lemma 37 and the triangle inequality.

(b) Based on the triangle inequality,

|ψ^{(τ,η)}_w(x)| = |Σ_{j=1}^{τ} [w^{(j)}]_η Θ(x − b_j) + [w^{(c)}]_η| ≤ Σ_{j=1}^{τ} |[w^{(j)}]_η| + |[w^{(c)}]_η| = norm(ψ^{(τ,η)}_w).

(c) We have

norm(ψ^{(τ,η)}_w) ≤ Σ_{j=1}^{τ} |w^{(j)}| + |w^{(c)}| + (τ + 1)η/2 ≤ Σ_{j=1}^{τ} |w^{(j)}| + M_1Rφ_1 + (τ + 1)η/2,    (72)

where (72) is based on |w^{(c)}| = |Ψ(a_1)| ≤ M_1Rφ_1. If a_2 − a_1 < Δ, then Σ_{j=1}^{τ} |w^{(j)}| = 0 and the result is proven. Thus, assume that a_2 − a_1 ≥ Δ, and let j_1 and j_2 be the smallest and largest integers j such that a_1 ≤ b_j ≤ a_2, respectively. If j_1 = j_2, then Σ_{j=1}^{τ} |w^{(j)}| = Δ|Ψ′(b_{j_1})| ≤ Δφ_1, which implies the result.
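The step-network construction (62)–(63), its discretization, and the norm bound of Proposition 38(b) can be illustrated numerically. The sketch below uses Ψ = tanh as a stand-in for an invertible Lipschitz, smooth component; the values of M_1R, (a_1, a_2), and η are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of the two-layer step network (63) and its eta-discretized version,
# with Psi = tanh (illustrative; M1R, a1, a2, eta are assumed values).
M1R = 2.0
a1, a2 = -1.5, 1.5                        # support D_Psi, with a1 < 0 < a2
Psi = np.tanh
dPsi = lambda b: 1.0 / np.cosh(b) ** 2    # Psi'

def step_net(x, tau, eta=0.0):
    delta = 2 * M1R / tau
    b = -M1R + delta * np.arange(1, tau + 1)            # b_j = -M1R + j*delta
    w = np.where((b > a1) & (b < a2), delta * dPsi(b), 0.0)
    wc = Psi(a1)                                        # w_c = Psi(a1)
    if eta > 0:                                         # round to closest point of eta*Z
        w, wc = eta * np.round(w / eta), eta * np.round(wc / eta)
    heav = (x[:, None] - b[None, :] > 0).astype(float)  # Theta(x - b_j)
    return heav @ w + wc, np.abs(w).sum() + abs(wc)

x = np.linspace(a1 + 0.1, a2 - 0.1, 400)
err = lambda tau, eta=0.0: np.max(np.abs(step_net(x, tau, eta)[0] - Psi(x)))
assert err(2000) < err(100) < 0.2     # error shrinks roughly like 1/tau (Lemma 37)
out, nrm = step_net(x, 500, eta=1e-3)
assert np.max(np.abs(out)) <= nrm     # output bounded by parameter norm (Prop. 38(b))
```

The O(1/τ) decay of the sup-error and the output-vs-norm inequality match the qualitative content of Lemma 37 and Proposition 38, under the assumed Ψ.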
Therefore, assume that j_2 > j_1. Since Ψ is φ_2-smooth, Ψ′ is φ_2-Lipschitz; thus |Ψ′| is φ_2-Lipschitz as well. Moreover, since Ψ is invertible, |Ψ′| is identical to either Ψ′ or −Ψ′, and is therefore differentiable. We derive

Σ_{j=1}^{τ} |w^{(j)}| = Σ_{j=j_1}^{j_2} Δ|Ψ′(b_j)| ≤ ∫_{b_{j_1}}^{b_{j_2}} |Ψ′(b)| db + φ_2 (b_{j_2} − b_{j_1})² / (2(j_2 − j_1)) + Δφ_1    (73)
 ≤ ∫_{b_{j_1}}^{b_{j_2}} |Ψ′(b)| db + (2(M_1R)²/τ) φ_2 + Δφ_1    (74)
 ≤ (b_{j_2} − b_{j_1})φ_1 + Δφ_1 + (2(M_1R)²/τ) φ_2 ≤ (2M_1R + Δ)φ_1 + (2(M_1R)²/τ) φ_2,

where (73) is due to Lemma 12 and (74) is based on the same derivation of (68) from the right side of (66). The result is now proven.

Now, consider the ladder decomposition of the target function T at scale parameters {γ_k}_{k=1}^{d}. For any 1 ≤ k ≤ d, we replace the function ψ_k with the function Ψ in Proposition 38. Let C_2 := (M_1² + M_1)M_2. Based on Theorem 19, ψ_k is C_1γ^{k−d−1}(γ − 1)-Lipschitz and C_2-smooth. Thus, we take φ_1 := C_1γ^{k−d−1}(γ − 1) and φ_2 := C_2. For all 1 ≤ k ≤ d, define ρ_k := ϱ(C_1γ^{k−d−1}(γ − 1), C_2), where the function ϱ is defined in (70). Note that, as a function of k, ρ_k = O(γ^k), conforming with (22). Suppose W_k, the set of parameters of our learning model at level k, is

W_k := { w = (w^{(1)}, . . . , w^{(τ)}, w^{(c)}) ∈ ηZ^{τ+1} : |w|_1 ≤ ρ_k }.

In (20), the recursive definition of the model, we define f(h_{k−1}(x); w_k) := ψ^{(τ,η)}_{w_k}(h_{k−1}(x)) for all 1 ≤ k ≤ d. Therefore, the kth level of our learning model is the following function:

h_k = (ψ^{(τ,η)}_{w_k} + id) ∘ · · · ∘ (ψ^{(τ,η)}_{w_2} + id) ∘ (ψ^{(τ,η)}_{w_1} + id).

Notice that since W_k is a discretization of the ℓ_1-ball {|w|_1 ≤ ρ_k}, it is a finite set. The next result finds the cardinality of W_k:

Proposition 39 For all 1 ≤ k ≤ d, we have

|W_k| = Σ_{r=0}^{τ} 2^{τ+1−r} binom(τ+1, r) binom(⌊ρ_k/η⌋, τ+1−r).

Proof For all 1 ≤ k ≤ d, let W′_k := { w ∈ Z^{τ+1} : |w|_1 ≤ ⌊ρ_k/η⌋ }. Based on the one-to-one correspondence w ↔ ηw, we have |W_k| = |W′_k|. We will employ a counting argument to find |W′_k|. Assume that r components of w ∈ W′_k are equal to zero and the rest are non-zero, where 0 ≤ r ≤ τ. There are binom(τ+1, r) ways to choose the r zero components. We now count the number of configurations in which τ + 1 − r positive integers have a sum less than or equal to ⌊ρ_k/η⌋.
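As a sanity check, the counting formula of Proposition 39 can be verified by brute force for small widths. In the sketch below, N plays the role of ⌊ρ_k/η⌋ and the tested (τ, N) pairs are illustrative choices.

```python
from itertools import product
from math import comb

# Brute-force check of the count in Proposition 39: the number of nonzero
# integer vectors w in Z^(tau+1) with |w|_1 <= N (after rescaling by eta)
# equals  sum_r 2^(tau+1-r) * C(tau+1, r) * C(N, tau+1-r).
def count_formula(tau, N):
    return sum(2 ** (tau + 1 - r) * comb(tau + 1, r) * comb(N, tau + 1 - r)
               for r in range(tau + 1))

def count_brute(tau, N):
    n = tau + 1
    return sum(1 for w in product(range(-N, N + 1), repeat=n)
               if 0 < sum(abs(c) for c in w) <= N)

for tau, N in [(1, 2), (2, 3), (3, 4)]:
    assert count_formula(tau, N) == count_brute(tau, N)
```

For instance, τ = 1 and N = 2 gives 12 on both sides, matching the sign-pattern and slack-variable argument in the proof.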
For each such configuration, there are 2^{τ+1−r} ways to let each component retain its positive value or be negated. By adding a slack variable, it is easy to observe that the number of configurations in which τ + 1 − r positive integers have a sum less than or equal to ⌊ρ_k/η⌋ equals the number of configurations in which (τ + 1 − r) + 1 positive integers have a sum equal to ⌊ρ_k/η⌋ + 1. Based on a basic result in combinatorics (Niven, 1965, (4.7) in Section 4.2), this number is equal to binom(⌊ρ_k/η⌋, τ+1−r). Therefore,

|W_k| = |W′_k| = Σ_{r=0}^{τ} 2^{τ+1−r} binom(τ+1, r) binom(⌊ρ_k/η⌋, τ+1−r).

Corollary 40 Based on Proposition 39, the optimized bound on the statistical risk in Corollary 36 for the example studied in this section is as follows:

E[L_µ(W)] ≤ (8/√n) Σ_{k=1}^{d} γ_k ρ_k √(Σ_{j=1}^{k} log|W_j|) / (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1}))
 = (8/√n) Σ_{k=1}^{d} γ^{k−d} ρ_k √(Σ_{j=1}^{k} log(Σ_{r=0}^{τ} 2^{τ+1−r} binom(τ+1, r) binom(⌊ρ_j/η⌋, τ+1−r))) / (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1})),

where ρ_k = ϱ(C_1γ^{k−d−1}(γ − 1), C_2) for all 1 ≤ k ≤ d.

7. Conclusions

In this paper, we introduced an entropy-based hierarchical learning model designed to leverage the multiscale structure of data instance distributions and the smoothness inherent in real-world target functions. We started with the definition of ladder decompositions of invertible functions, followed by an examination of the Lipschitz continuity and smoothness of the components of this decomposition for smooth functions. Subsequently, we analyzed the effectiveness of the proposed multiscale entropy-based training, demonstrating its capability to achieve low chained risk. Notably, when the data distribution µ is scale-invariant with a sufficiently large shape parameter, this paper showed an upper bound on the statistical risk in terms of the chained risk. Consequently, we derived a guarantee on the statistical risk of our training mechanism for these data distributions.
Finally, for the specific case of one-dimensional functions, the paper provided an illustrative example of a parameterized model featuring bounded-norm parameters, showcasing a simple application of our methodology.

Our proposed learning model offers several noteworthy advantages. First, it adopts a hierarchical structure with interpretable levels: each level in the model is trained to approximate a dilation of the target function, giving the different levels interpretable roles, in contrast to opaque black-box hierarchical models. Second, our model introduces a new perspective on data instance complexity: the logarithm of the norm of a data instance serves as a measure of its individual complexity, and the training procedure can be conceptualized as a mathematical model for curriculum learning. Another merit of the proposed model is computational: the amount of computation required to evaluate the learned model on a given data instance is directly proportional to the complexity of that instance. Given the heterogeneous distribution of data instances across scales and complexities, this characteristic may translate into significant computational savings and faster inference speeds. Furthermore, our model's statistical analysis accounts for its hierarchical and multiscale structure, leading to sharper bounds on the statistical risk for scale-invariant distributions than uniform convergence bounds for empirical-risk-minimizing mechanisms. Additionally, the multi-staged nature of our training mechanism can be beneficial when dealing with the challenge of extensive training times on massive datasets: even if training terminates prematurely, the mechanism yields a useful model capable of accurately predicting the labels of data instances with norms smaller than a specific threshold, depending on the terminated stage.
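The inference-cost point above can be made concrete with a small sketch. The threshold rule below, which maps an instance's norm to the number of levels evaluated, is an illustrative assumption consistent with the scale schedule γ_0 = R/ε and γ = γ_0^{1/d}, not the paper's exact stopping rule.

```python
import math

# Hedged sketch of complexity-proportional inference: an instance of norm
# |x| only needs the levels up to the first scale covering it, so the cost
# grows with log|x| rather than with the full depth d.  R, eps, d are
# illustrative values.
def levels_needed(x_norm, R=10.0, eps=0.1, d=20):
    gamma = (R / eps) ** (1.0 / d)      # gamma_0 = R/eps, gamma = gamma_0^(1/d)
    if x_norm <= eps:
        return 0
    # smallest k with eps * gamma^k >= |x|, capped at the full depth d
    return min(d, math.ceil(math.log(x_norm / eps, gamma)))

assert levels_needed(0.05) == 0      # tiny instances exit immediately
assert levels_needed(10.0) == 20     # largest instances use all d levels
assert levels_needed(1.0) < levels_needed(5.0)
```

Under this rule, the number of levels evaluated is logarithmic in the instance norm, which is the sense in which computation tracks the log-norm complexity measure described above.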
Finally, our learning model proves advantageous in scenarios involving multiple users learning the target function at particular scales of data instances: it provides computational savings by using each user's learned model as prior information when training the model for the next user.

Our current work is limited in that the derived risk bounds apply only when the parameters of our model are discrete and the hypothesis set is finite. Section 6 provided an example of a parameterization in which diffeomorphisms are expressed using parameters with bounded norms; the hypothesis set then becomes finite when these parameters are discretized due to, for instance, real-world memory constraints. While the multiscale entropy-based training procedure (Algorithm 1) remains well-defined when the hypothesis set is infinite, an extension of our technique is needed to analyze the statistical risk comprehensively. It is plausible that leveraging the Lipschitz property of the hypothesis set, as demonstrated in the derivation of the risk bound for the classical Gibbs distribution with an infinite hypothesis set in (Xu and Raginsky, 2017), could prove beneficial. We defer the exploration of this direction to future investigations.

Acknowledgments

I would like to express my sincere appreciation to the anonymous reviewers for their constructive comments and suggestions, which significantly enriched the quality of this manuscript. This research was supported by Leverhulme Trust grant ECF-2023-189 and Isaac Newton Trust grant 23.08(b).

References

Amir R. Asadi and Emmanuel Abbe. Chaining meets chain rule: Multilevel entropic regularization and training of neural networks. Journal of Machine Learning Research, 21(139):1–32, 2020.

Amir R. Asadi and Po-Ling Loh. Entropic regularization of neural networks: Self-similar approximations. Journal of Statistical Planning and Inference, 233:106181, 2024.

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú.
Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pages 7234–7243, 2018.

Jean-Yves Audibert and Olivier Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–889, 2007.

Peter L. Bartlett, Steven N. Evans, and Philip M. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018.

Peter Baxandall and Hans Liebeck. Vector Calculus. New York: Oxford University Press, 1986.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 2020.

Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.

Eugenio Clerico, Amitis Shidani, George Deligiannidis, and Arnaud Doucet. Chained generalisation bounds. In Proceedings of Machine Learning Research, pages 4212–4212, 2022.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

Weinan E. Principles of Multiscale Modeling.
Cambridge University Press, 2011.

Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Alastair Fletcher and Vladimir Markovic. Decomposing diffeomorphisms of the sphere. Bulletin of the London Mathematical Society, 44(3):599–609, 2012.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Deborah Hughes-Hallett, Andrew M. Gleason, and William G. McCallum. Calculus: Single and Multivariable. John Wiley & Sons, 2020.

Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.

Shahar Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797–3803, 2008.

Michael Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2), 2004.

M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.

Ivan Niven. Mathematics of Choice: Or, How to Count Without Counting, volume 15. MAA, 1965.

John P. Nolan. Univariate Stable Distributions. Springer, 2020.

Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop, pages 26–30. IEEE, 2016.

Daniel Russo and James Zou. How much does your data exploration overfit?
Controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.

Gennady Samoradnitsky and Murad S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Routledge, 2017.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Didier Sornette. Critical Phenomena in Natural Sciences: Chaos, Fractals, Self-organization and Disorder: Concepts and Tools. Springer Science & Business Media, 2006.

Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.

Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2524–2533, 2017.