# An Entropy-Based Model for Hierarchical Learning

Journal of Machine Learning Research 25 (2024) 1-45. Submitted 1/23; Revised 6/24; Published 6/24.

Amir R. Asadi (asadi@statslab.cam.ac.uk), Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Cambridge CB3 0WA, United Kingdom.

Editor: Gabor Lugosi

**Abstract.** Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance's norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.
Keywords: machine learning, neural network, chaining, information theory, scale-invariant distribution, curriculum learning, logarithmic binning

©2024 Amir R. Asadi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0096.html.

## 1. Introduction

This paper introduces a hierarchical learning model with a multiscale entropy-based training mechanism. Designed to exploit the multiscale structure in many real-world datasets, the proposed model aims to provide statistical and computational efficiency while featuring interpretability.

### 1.1 Background

In contemporary times, machine learning is the predominant approach in artificial intelligence, enabling computers to acquire knowledge from data and experience. Within the supervised learning paradigm, training data is postulated to arise randomly as pairs $(X, Y)$, denoting the data instance and its corresponding label, respectively. A computer is presented with a sequence of such instances and labels independently drawn from the underlying data probability distribution. Its task is to discern the relationship between the data instances and their labels, ultimately predicting the label of new, randomly drawn instances. In this paper, for simplicity, we adopt the function learning setting, part of the realizability assumption in statistical learning theory. That is, we assume that the dependency between data instances and their corresponding labels is modeled noiselessly by a deterministic target function $Y = T(X)$, mapping any data instance to its label (Shalev-Shwartz and Ben-David, 2014; Mendelson, 2008). An insightful observation is that the sequence of training examples typically lacks comprehensive information about the target function or the data domain. Beyond the training data, providing the computer with auxiliary information through the learning model is crucial.
Richer auxiliary information enables the computer to learn the target function more accurately and more efficiently. To illustrate this point, consider the extreme scenario where no auxiliary information is given to the computer, implying a total lack of knowledge about the target function or the data domain beyond the training data. In such a case, the optimal strategy for the computer becomes merely memorizing the training examples, leaving it unable to predict the label of any new instance not present in the training data. This approach is very prone to overfitting, most likely resulting in poor performance on new and unseen data instances. This notion of auxiliary information aligns with the idea of regularization studied in statistical learning theory; see, for example, (Vapnik, 1999). Regularization manifests in various forms, such as restricting the hypothesis class and explicitly or implicitly regularizing the training mechanism. Moreover, it is intricately linked to the no-free-lunch theorem, which underscores the necessity for every learning algorithm to possess some level of prior knowledge about the underlying assumptions of the learning problem in order to attain success; see, for instance, (Shalev-Shwartz and Ben-David, 2014, Theorem 5.1). Analogously, in most tasks, human knowledge is acquired through a combination of training data, prior knowledge, and intuition.

A prevalent characteristic of many real-world datasets is the multiscale nature of their data instances. In other words, in these datasets, data manifest across different scales of magnitude, exhibiting a diverse range of sizes and complexities. This inherent property is exploited in applications in various fields, such as wavelet theory, Fourier analysis, and signal processing, as detailed in (E, 2011) and references therein.
In particular, empirical data distributions across various domains such as physics, biology, medicine, finance, natural language processing, and the social sciences frequently exhibit power-law behavior. Notable instances of power-law probability distributions include the distributions of people's incomes, city populations, stars' brightness, file sizes in computing, earthquake magnitudes, and word frequencies in human languages, among other examples. These phenomena have been extensively studied, as demonstrated in works such as (Newman, 2005; Clauset et al., 2009) and references therein. The generation of power-law distributions in natural and artificial systems involves diverse mechanisms, and each is believed to apply to specific applications; see (Sornette, 2006, Chapter 14) and (Newman, 2005; Mitzenmacher, 2004). The most important of such mechanisms are considered to be growth with preferential attachment (Yule's process) and critical phenomena (Newman, 2005). Furthermore, the generalized central limit theorem posits that the normalized sum of independent and identically distributed random variables with infinite variance can only converge to a stable distribution; see, for example, (Nolan, 2020). All stable distributions with infinite variance exhibit power-law tails, whereas the Gaussian distribution is the sole stable distribution with finite variance (Samoradnitsky and Taqqu, 2017). Power-law distributions are examples of scale-invariant probability distributions. Additionally, target functions in the real world, relating continuous data instances to their real-valued labels, tend to be smooth and well-behaved. Leveraging such characteristics of real-world data instances and target functions as auxiliary information in a machine-learning model presents an opportunity to yield statistical and computational benefits in real-world applications.
### 1.2 Overview of the Contributions

In this paper, we introduce a hierarchical learning model that takes advantage of the aforementioned multiscale nature of data instances and the smoothness of their target functions. The model comprises a compositional learning architecture with a sequential multiscale training mechanism. First, the training data is partitioned at different scales, and the training mechanism starts by learning from the batch of the smallest data, progressing step by step toward the batches of larger data. The learning process at each batch of data takes advantage of the model learned over smaller data as prior information. This is inspired by the logical learning mechanism observed in humans, who progressively learn different tasks, commencing with small (easy) examples and advancing toward large (complex) examples. Due to the ubiquity of such multiscale data, our proposed learning model holds promise for many applications. More precisely, we consider the following two assumptions:

(a) Data instances $X_i \in \mathbb{R}^m$ emerge in different scales of magnitude from a distribution $\mu$ defined on the data domain $\mathcal{X}$. Here, we assume that for $0 < \varepsilon < R$, the data domain is $\mathcal{X} = \{x \in \mathbb{R}^m : \varepsilon \le |x| < R\}$, where $|x|$ denotes the Euclidean norm of datum $x$. Typically, $\varepsilon$ is much smaller than $R$. Later in the paper, in Subsection 5.2, we assume that $\mu$ is a scale-invariant probability distribution with shape parameter $\alpha$, whose probability density $q(x)$ satisfies the following condition: for all $x \in \mathcal{X}$ and any $\gamma \ge 1$ such that $x/\gamma \in \mathcal{X}$, we have
$$q(x/\gamma) = \gamma^{\alpha} q(x). \tag{1}$$

(b) The target function $T : B^m_R \to \mathbb{R}^m$ is well-behaved, where $B^m_R = \{x \in \mathbb{R}^m : |x| < R\}$ denotes the $m$-dimensional Euclidean ball with radius $R > 0$, centered at the origin. In this paper, we assume that $T$ is differentiable and smooth, and we further assume both the invertibility of $T$ and the Lipschitz continuity of its inverse, $T^{-1}$.
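Condition (1) can be verified directly for a one-dimensional power-law density. The following sketch is our own illustration, not part of the paper: the constants `eps`, `R`, and `alpha` are hypothetical choices, and the density $q(x) = c\,x^{-\alpha}$ on $[\varepsilon, R)$ satisfies the scale-invariance condition exactly.

```python
import numpy as np

eps, R, alpha = 0.01, 10.0, 2.5                # hypothetical domain and shape

# Normalizing constant so that q integrates to 1 on [eps, R).
c = (1 - alpha) / (R ** (1 - alpha) - eps ** (1 - alpha))

def q(x):
    return c * x ** (-alpha)                   # power-law density on [eps, R)

# Check condition (1): q(x / gamma) = gamma**alpha * q(x) whenever both
# x and x / gamma lie in the domain and gamma >= 1.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.uniform(10 * eps, R)
    gamma = rng.uniform(1.0, x / eps)          # keeps x / gamma >= eps
    assert np.isclose(q(x / gamma), gamma ** alpha * q(x))
```

Any density proportional to $x^{-\alpha}$ works here, since $q(x/\gamma) = c(x/\gamma)^{-\alpha} = \gamma^{\alpha} q(x)$ holds identically.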
For the special case of $m = 1$, we show (in Theorem 19) that an additional smoothness assumption on $T^{-1}$ leads to a strengthened result. We further assume that, from prior knowledge or intuition, the learning model knows the behavior of the function $T$ on $B^m_\varepsilon$, that is, on data instances at very small scales. We take advantage of such assumptions on the data in our hierarchical learning model based on the following contributions:

**Ladder decompositions of functions.** The first main result of the paper studies ladder decompositions of invertible functions $T : B^m_R \to \mathbb{R}^m$. The decomposition is defined in terms of dilations $T_{[\gamma]} : B^m_R \to \mathbb{R}^m$ of $T$, where for any scale $0 < \gamma \le 1$ and all $x \in B^m_R$,
$$T_{[\gamma]}(x) := \frac{T(\gamma x)}{\gamma}.$$
The dilation $T_{[\gamma]}$ can be conceptualized as a zoomed version of the original function $T$ into the origin, where the degree of zooming is determined by the scale $\gamma$. Given a sequence of dilation scales $0 < \gamma_0 < \cdots < \gamma_d = 1$, the ladder decomposition is defined based on the sequence of dilations $\{T_{[\gamma_k]} : k = 0, \ldots, d\}$ interpolating between $T_{[\gamma_0]}$ and $T_{[1]} = T$. The decomposition is constructed by forming the successive compositions $T_k := T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}$ for all $k = 1, \ldots, d$, and writing $T_{[\gamma_k]} = T_k \circ \cdots \circ T_1 \circ T_{[\gamma_0]}$. Building on an earlier result from (Bartlett et al., 2018), we show that the smaller the value of $\gamma_k - \gamma_{k-1}$ is, the closer $T_k$ is to the identity mapping, thus being more well-behaved and easier to learn. To elaborate, for all $1 \le k \le d$, let $\psi_k(x) := T_k(x) - x$. The first main contribution of the paper, Corollaries 20 and 26 in Section 3, states that if an $M_2$-smooth invertible function $T : B^m_R \to \mathbb{R}^m$ is such that $T^{-1}$ is $M_1$-Lipschitz, then $\|\psi_k\|_{\mathrm{Lip}} \le C(\gamma_k - \gamma_{k-1})$, where $\|\psi_k\|_{\mathrm{Lip}}$ denotes the Lipschitz norm of $\psi_k$ and the constant $C$ only depends on $M_1$, $M_2$ and $R$. The definitions of the ladder decomposition and the functions $\psi_k$ yield that, for all $1 \le k \le d$, we have
$$T_{[\gamma_k]} = T_k \circ T_{[\gamma_{k-1}]} = T_{[\gamma_{k-1}]} + \psi_k \circ T_{[\gamma_{k-1}]}.$$
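The claim that $T_k$ approaches the identity as consecutive scales get closer can be observed numerically. The sketch below is our own illustration, assuming $T = \tanh$ on $(-1, 1)$; the grid resolution, the choice of scales, and the loose constant `5.0` in the sanity check are arbitrary, not the paper's constants.

```python
import numpy as np

T, Tinv = np.tanh, np.arctanh

def psi(g_prev, g_next, x):
    # T_k = T_[g_next] o (T_[g_prev])^{-1};  psi_k(x) = T_k(x) - x
    v = Tinv(g_prev * x) / g_prev          # (T_[g_prev])^{-1}(x)
    return T(g_next * v) / g_next - x

R, d = 1.0, 8
gammas = np.linspace(0.1, 1.0, d + 1)      # gamma_0 < ... < gamma_d = 1
for k in range(1, d + 1):
    gp, gn = gammas[k - 1], gammas[k]
    # grid inside the image of T_[gp], where psi_k is defined
    xs = np.linspace(-0.9, 0.9, 400) * T(gp * R) / gp
    slopes = np.abs(np.diff(psi(gp, gn, xs)) / np.diff(xs))
    lip = slopes.max()                      # crude Lipschitz estimate
    # Loose sanity check: the estimate shrinks with the scale gap.
    assert lip <= (gn - gp) * 5.0
```

The finite-difference estimate of $\|\psi_k\|_{\mathrm{Lip}}$ stays small in every level, consistent with the $C(\gamma_k - \gamma_{k-1})$ behavior described above.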
This dependency between the levels of the ladder decomposition fits with a hierarchical learning model with residual levels of the following form:
$$h_k(x) := h_{k-1}(x) + f(h_{k-1}(x); w_k). \tag{2}$$
Here, $h_0 := T_{[\gamma_0]}$, and $f(\cdot\,; w)$ is a learnable model with parameters $w$ (a vector), for example, a neural network. A specific example of $f$, a two-layer network with step activation functions, is explored in Section 6. Let $w = (w_1, \ldots, w_d)$ denote the parameters of the whole hierarchical model. For any $1 \le k \le d$, define $w_{1:k} := (w_1, \ldots, w_k)$. Level $h_k$ is a function of the input datum $x$ and the weights $w_{1:k}$; therefore, we sometimes denote $h_k$ by $h_k(x; w_{1:k})$. Parameters $w_1, \ldots, w_d$ will be chosen sequentially by the training mechanism such that $f(\cdot\,; w_k)$ approximates $\psi_k$ well for all $k = 1, \ldots, d$. Thus, for all $k = 1, \ldots, d$, the $k$th level of our learning model ($h_k$) aims to approximate the dilation $T_{[\gamma_k]}$ of the target function.

**The hierarchical learning model.** By defining $\gamma := (R/\varepsilon)^{1/d}$, we split the data domain $\mathcal{X}$ into $d$ scales:
$$\mathcal{X}_k := \left\{ x \in \mathbb{R}^m : \varepsilon\gamma^{k-1} \le |x| < \varepsilon\gamma^{k} \right\}, \quad \text{for all } 1 \le k \le d.$$
This multiscale partitioning of data domains is sometimes referred to as logarithmic binning in the literature; see, for instance, (Newman, 2005). We then consider the ladder decomposition of the target function $T$ with respect to $\gamma_k := \gamma^{k-d}$, for $k = 0, \ldots, d$. The logarithm of the norm of a datum $x \in \mathcal{X}$ is viewed as a measure of its complexity; that is, the sets $\mathcal{X}_0, \mathcal{X}_1, \ldots, \mathcal{X}_d$ represent inputs of increasing complexity. We assume that the target function is known at level $\gamma_0$; that is, $T_{[\gamma_0]}$ is given. In other words, the behavior of the function $T$ at minimal values of $|x|$ is known, and we shall sequentially learn $T_{[\gamma_1]}, \ldots, T_{[\gamma_d]}$.
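The logarithmic binning above can be sketched in a few lines. The helper name `scale_of` and all constants below are our own illustrative choices, not from the paper.

```python
import numpy as np

eps, R, d = 0.01, 10.0, 6
gamma = (R / eps) ** (1 / d)                   # so that eps * gamma**d = R

def scale_of(x):
    """Return k with eps * gamma**(k-1) <= |x| < eps * gamma**k."""
    norm = np.linalg.norm(x)
    return int(np.floor(np.log(norm / eps) / np.log(gamma))) + 1

rng = np.random.default_rng(1)
data = rng.uniform(-R / 2, R / 2, size=(1000, 3))
norms = np.linalg.norm(data, axis=1)
data = data[(norms >= eps) & (norms < R)]      # keep points inside the domain
ks = np.array([scale_of(x) for x in data])
assert ks.min() >= 1 and ks.max() <= d
for x, k in zip(data[:50], ks[:50]):           # spot-check shell membership
    n = np.linalg.norm(x)
    assert eps * gamma ** (k - 1) <= n * (1 + 1e-12)
    assert n < eps * gamma ** k * (1 + 1e-12)
```

Because the shells grow geometrically, the bin index is simply the floor of $\log(|x|/\varepsilon)/\log\gamma$ plus one.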
For all $1 \le k \le d$, since the training mechanism aims to make $h_k(x)$ closely approximate the dilation $T_{[\gamma_k]}(x) = T(\gamma_k x)/\gamma_k$, we define the output of the learned model for any $x \in \mathcal{X}_k$ as
$$h_w(x) := \gamma_k h_k\!\left(\frac{x}{\gamma_k}\right). \tag{3}$$
Given training examples $s = (x_i, T(x_i))_{i=1}^n$, the training of the model is performed by sampling the parameters $w_1, w_2, \ldots, w_d$ in a sequential manner from a sequence of Gibbs measures. Namely, we start by considering training data examples whose instances belong to the smallest scale $\mathcal{X}_1$, corresponding to the easiest instances. An approximation of $\psi_1$ is then learned by sampling $w_1$ from a Gibbs measure, a maximum-entropy distribution, defined with the empirical risk over these small-scaled examples with the absolute error loss function and with hyperparameter $\lambda_1$. More precisely, $w_1$ is sampled from a discrete set $\mathcal{W}_1$ according to the Gibbs measure
$$P_1(W_1 = w_1) := \frac{\exp\left(-\frac{1}{n\lambda_1}\sum_{x_i \in s \cap \mathcal{X}_1}\left|\gamma_1 h_1\!\left(\frac{x_i}{\gamma_1}; w_1\right) - T(x_i)\right|\right)}{\sum_{w'_1 \in \mathcal{W}_1}\exp\left(-\frac{1}{n\lambda_1}\sum_{x_i \in s \cap \mathcal{X}_1}\left|\gamma_1 h_1\!\left(\frac{x_i}{\gamma_1}; w'_1\right) - T(x_i)\right|\right)}.$$
Subsequently, we observe data at the next scale $\mathcal{X}_2$, characterized by a higher magnitude. Similarly, given the learned approximation $f(\cdot\,, w_1)$ of $\psi_1$, an approximation $f(\cdot\,, w_2)$ of $\psi_2$ is learned by sampling $w_2$ from a Gibbs measure with hyperparameter $\lambda_2$, and this process is iteratively repeated. That is, for $k = 2, \ldots, d$, we sample $w_k \in \mathcal{W}_k$ conditionally on $w_{1:(k-1)}$ with probability
$$P_{W_k \mid W_{1:(k-1)}}\!\left(w_k \mid w_{1:(k-1)}\right) := \frac{\exp\left(-\frac{1}{n\lambda_k}\sum_{x_i \in s \cap \mathcal{X}_k}\left|\gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:k}\right) - T(x_i)\right|\right)}{\sum_{w'_k \in \mathcal{W}_k}\exp\left(-\frac{1}{n\lambda_k}\sum_{x_i \in s \cap \mathcal{X}_k}\left|\gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:(k-1)} w'_k\right) - T(x_i)\right|\right)}.$$
Consequently, learning the target function for data at smaller scales is inherently easier, serving as a foundational step for tackling more challenging learning tasks at higher scales. Thus, this hierarchical model is also a model for implementing the concept of easy-to-hard learning, commonly known as curriculum learning (Bengio et al., 2009). Here, scale serves as a temporal progression akin to how humans learn a course by starting with simpler examples and gradually advancing to more complex ones.
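A single step of this sequential Gibbs sampling can be sketched as follows. This is a schematic toy (a one-parameter residual model on a single scale, with an arbitrary finite candidate set), not the paper's model or experiment.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_sample(cands, emp_risk, lam, n):
    """Sample w with probability proportional to exp(-emp_risk(w)/(n*lam))."""
    energies = np.array([emp_risk(w) for w in cands]) / (n * lam)
    logits = -(energies - energies.min())      # stabilize before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return cands[rng.choice(len(cands), p=probs)], probs

# Toy setup: target T(x) = 1.5*x on one scale, residual level h1(x) = x + w*x.
xs = rng.uniform(0.5, 1.0, size=50)            # instances in a single shell
ys = 1.5 * xs
cands = np.linspace(0.0, 1.0, 21)              # finite parameter set W_1

def emp_risk(w):                               # absolute-error empirical risk
    return np.abs((xs + w * xs) - ys).sum()

w1, probs = gibbs_sample(cands, emp_risk, lam=0.01, n=len(xs))
# With small lambda, the Gibbs measure concentrates near the minimizer w = 0.5.
assert abs(cands[np.argmax(probs)] - 0.5) < 1e-9
```

In the full mechanism, the sampled $w_1$ would be frozen and the same routine repeated on the next shell with the conditional Gibbs measure over $w_2$, and so on.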
This hierarchical approach to learning the ladder decomposition is metaphorically akin to the step-by-step ascent of a ladder. The total training is then modeled with the following measure:
$$P^*_W(w) := P_{W_1}(w_1)\, P_{W_2|W_1}(w_2|w_1) \cdots P_{W_d|W_{1:(d-1)}}\!\left(w_d|w_{1:(d-1)}\right).$$
The next main result of the paper, Theorem 29, states that $P^*_W$ minimizes a multiscale loss regularized by a multiscale form of entropy. That is,
$$P^*_W = \operatorname*{arg\,min}_{P_W}\; \mathbb{E}\!\left[\ell^{(\lambda)}(W, s)\right] - \sum_{k=1}^{d} (\lambda_k - \lambda_{k+1}) H(W_{1:k}), \tag{4}$$
where $\ell^{(\lambda)}$ is related to the total loss over the multiple scales and $H(W_{1:k})$ is the Shannon entropy. The idea of the proof, which expands upon the proof technique of (Asadi and Abbe, 2020, Theorem 13), is to show that the functional optimized by $P^*_W$ can be decomposed into a sum of conditional entropies of $W_k$ given $W_{1:(k-1)}$, for all $1 \le k \le d$.

**The statistical risk.** The paper's final main result bounds the model's statistical risk when the parameters are chosen according to the above multiscale training mechanism. The result proceeds through a chaining argument, where the loss is decomposed over the successive stages of the training procedure, during which the parameters $w = (w_1, \ldots, w_d)$ are chosen. Under the assumption that the model is realizable, that is, there exists a choice of parameters $\hat{w}_1, \ldots, \hat{w}_d$ such that $\psi_k(\cdot) = f(\cdot\,; \hat{w}_k)$ for all $1 \le k \le d$, we define the chained risk as
$$L^{(C)}_{\mu}(w) := \mathbb{E}\!\left[\sum_{k=1}^{d} \left( \ell_k(w_{1:k}, X) - \ell_k\!\left(w_{1:(k-1)}\hat{w}_k, X\right) \right)\right],$$
where $X \sim \mu$ and, for all $1 \le k \le d$,
$$\ell_k(w_{1:k}, x) := \begin{cases} |h_w(x) - T(x)| = \left|\gamma_k h_k\!\left(\frac{x}{\gamma_k}; w_{1:k}\right) - T(x)\right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k, \end{cases}$$
is the loss of the model on data at scale $k$. Theorem 33 then states that when $(S, W) \sim \mu^{\otimes n} \otimes P_{W|S}$, that is, when the training data $S$ are $n$ independent samples from the distribution $\mu$ and the parameters $W$ are the outputs of the multiscale entropy-based training mechanism given the random training data $S$, then the expected chained risk satisfies
$$\mathbb{E}\!\left[L^{(C)}_{\mu}(W)\right] \le 2\sum_{k=1}^{d}\left( (\lambda_k - \lambda_{k+1}) \sum_{j=1}^{k} \log|\mathcal{W}_j| + \frac{4\gamma_k^2\rho_k^2}{n(\lambda_k - \lambda_{k+1})} \right),$$
where $\lambda_{d+1} := 0$ and $\rho_k = O(\gamma_k)$ is the maximum output norm of $\psi_k(\cdot)$, for all $1 \le k \le d$. Optimizing the hyperparameters $\lambda_1, \ldots, \lambda_d$ of our learning mechanism in the bound above yields that, for all $1 \le k \le d$, we have
$$\lambda_k - \lambda_{k+1} = \frac{2\gamma_k\rho_k}{\sqrt{n\sum_{j=1}^{k}\log|\mathcal{W}_j|}}.$$
This, in turn, establishes the following bound on the expected chained risk:
$$\mathbb{E}\!\left[L^{(C)}_{\mu}(W)\right] \le 8\sqrt{\frac{1}{n}\sum_{j=1}^{d}\log|\mathcal{W}_j|}. \tag{5}$$
Although the true parameter $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_d)$ achieves zero chained risk, in general it is not clear whether a low expected chained risk implies a low statistical risk. In fact, the chained risk is always a lower bound for the statistical risk. However, for any scale-invariant probability distribution satisfying equation (1) with sufficiently large shape parameter $\alpha$, we demonstrate in Subsection 5.2 an upper bound on the statistical risk based on the chained risk. More precisely, we show that for such scale-invariant $\mu$, there exists a constant $\hat{C} > 0$ independent of $w$ such that $\hat{C} L_{\mu}(w) \le L^{(C)}_{\mu}(w)$, yielding an upper bound on the expected statistical risk from (5).

**Bounded-norm parameterization example.** In Section 6, we study a particular example of the function $f$ in (2), where invertible functions that are Lipschitz and smooth are approximated by two-layer neural networks with step activation functions and bounded-norm parameters. We study the approximation error, a bound on the network's output, and a bound on the norm of the network parameters. The parameters of the example are discretized; hence, the hypothesis set of the whole hierarchical model is finite. We then apply the bound on the statistical risk of the hierarchical model derived in the earlier sections to this example.

### 1.3 Related Work

This work draws inspiration primarily from integrating concepts presented in (Bartlett et al., 2018) and (Asadi and Abbe, 2020).
The paper (Bartlett et al., 2018) establishes that any smooth bi-Lipschitz function $T$ can be represented as a composition of $d$ functions $T_d \circ \cdots \circ T_1$, where each function $T_k$, for $1 \le k \le d$, approximates the identity function, manifested by the Lipschitz norm of $T_k(x) - x$ decreasing inversely with $d$. Notably, the proof of (Bartlett et al., 2018, Theorem 2) employs the concepts of function dilation and ladder decomposition. On the other hand, the paper (Asadi and Abbe, 2020) addresses the solution to the multiscale entropy regularization problem, an extension of the Gibbs probability distribution to a multiscale context. However, efficient sampling from the optimal probability distribution, the output of the Marginalize-Tilt algorithm of (Asadi and Abbe, 2020), remains challenging due to multiple steps of marginalization of probability distributions. To alleviate this problem, (Asadi and Loh, 2024) considers finding self-similar approximations to the optimal probability distribution. Developing on the proof technique of (Asadi and Abbe, 2020), this paper introduces the multiscale loss function $\ell^{(\lambda)}$ that, when regularized with multiscale entropy as in (4), is minimized by the self-similar and computable distribution $P^*_W$. Distinct from these works, our approach emphasizes interpretability by hierarchically learning ladder decompositions of target functions and involves reading the output of the multilevel learning model from different levels and depths, contingent on the scale of the input data instance, as in (3). In essence, we align the multiscale architecture of the learning model with the multiscale data domain. Information-theoretic methods for analyzing the statistical risk and the generalization error of learning algorithms have been pioneered within the framework of PAC-Bayesian bounds (McAllester, 1999).
This line of inquiry later evolved, adopting a related form that utilizes mutual information, exemplified by the works (Russo and Zou, 2019), (Xu and Raginsky, 2017), and (Bu et al., 2020). In a related vein, (Raginsky et al., 2016) introduces alternative information-theoretic measures to assess the stability of learning algorithms and bounds on their generalization capabilities. The extension of these information-theoretic methods to multiscale techniques, inspired by the method of chaining in probability theory (see, for example, Talagrand, 2014), is presented in (Audibert and Bousquet, 2007), (Asadi et al., 2018), (Asadi and Abbe, 2020), and (Clerico et al., 2022). Multiscale entropies, combinations of entropies at different scales, play an implicit role in chaining. Notably, Dudley's inequality (Dudley, 1967) can be variationally transformed, expressing the bound as a linear mixture of metric entropies across multiple scales. The work (Xu and Raginsky, 2017) also studies the expected statistical risk of sampling from the Gibbs distribution, sometimes referred to as maximum-entropy training. A subsequent multiscale extension of this analysis has likewise been derived in (Asadi and Abbe, 2020). The work (Fletcher and Markovic, 2012) establishes that any diffeomorphism defined on the sphere can be decomposed as the composition of bi-Lipschitz functions with small distortion. Hierarchical learning models comprising layers that are nearly identity functions find applications in deep learning, evident in the context of residual networks (He et al., 2016) and through dynamical systems approaches; see, for example, (E, 2017). A plausible rationale behind the superior performance of deep neural networks compared to shallow networks is their capability to learn different aspects of the data distribution across various layers.
This hypothesis finds support in empirical evidence where pre-trained bottom layers combined with task-specific layers achieve excellent performance in image classification tasks; see, for instance, (Devlin et al., 2018) and (Girshick et al., 2014).

### 1.4 Organization of This Paper

The rest of the paper is organized as follows. We provide the preliminaries and notation in Section 2. Following that, in Section 3, we introduce the concept of ladder decompositions of functions, examining the Lipschitz continuity and smoothness of their components. Section 4 explains our proposed learning model. Section 5 comprises two parts: Subsection 5.1 demonstrates the efficacy of multiscale entropy-based training in achieving low chained risk, a new analytical tool. Subsection 5.2 establishes that if the data distribution $\mu$ is scale-invariant, the chained risk can serve as an upper bound on the statistical risk, thereby providing an overall upper bound on the statistical risk of the learned model. Section 6 gives an example where diffeomorphisms can be represented using a parameterized model with bounded-norm parameters. We then compute the bound on the statistical risk from Section 5 for this example. Finally, Section 7 encapsulates our work's conclusions.

## 2. Preliminaries and Notation

The sets of real numbers and integers are denoted by $\mathbb{R}$ and $\mathbb{Z}$, respectively. Throughout the paper, $|x|$ denotes the Euclidean norm of a vector $x \in \mathbb{R}^m$, $|x|_1$ represents the $\ell_1$-norm of a vector $x \in \mathbb{R}^m$, $|A|_2$ denotes the spectral norm of a matrix $A \in \mathbb{R}^{m \times m}$, and $\|f\|_{\mathrm{Lip}}$ is the Lipschitz norm of a function $f$. The identity function is denoted by $\mathrm{id}$. For a pair of positive integers $k \le n$, $\binom{n}{k}$ represents the $n$-choose-$k$ binomial coefficient. For any $x \in \mathbb{R}$, the floor function $\lfloor x \rfloor$ equals the largest integer smaller than or equal to $x$. Given two matrices $A, B \in \mathbb{R}^{m \times m}$, $A \preceq B$ indicates that $B - A$ is a positive semidefinite matrix.
Random variables and vectors are denoted by capital letters, while lowercase letters represent their specific realizations. The equiprobable (uniform) probability distribution is represented by $U$, and its support is indicated with a subscript. If $X$ is a random variable and $P$ is a probability measure, the notation $X \sim P$ signifies that $X$ is distributed according to $P$. The Dirac probability measure on $w$ is denoted by $\delta_w$. The $n$-fold tensor product of a measure $\mu$ with itself is represented by $\mu^{\otimes n}$. For two distributions $P$ and $Q$, $P \ll Q$ means that $P$ is absolutely continuous with respect to $Q$.

In supervised batch learning, we define $\mathcal{X}$ as the instance domain and $\mathcal{Y}$ as the label domain. A target function $T : \mathcal{X} \to \mathcal{Y}$ exists, which maps any data instance to its label. We also define the hypothesis set $\mathcal{H} = \{h_w : w \in \mathcal{W}\}$ consisting of hypotheses indexed by the set $\mathcal{W}$. A loss function $\ell : \mathcal{W} \times \mathcal{X} \to \mathbb{R}^+$ is introduced. A learning algorithm is presented with a random training sequence $S = (X_1, \ldots, X_n)$ comprising $n$ data instances along with their corresponding labels $(T(X_1), \ldots, T(X_n))$, where $S$ is drawn i.i.d. from $\mathcal{X}$ with an unknown distribution $\mu$; in other words, $S \sim \mu^{\otimes n}$. During the training procedure, the algorithm selects $h_W \in \mathcal{H}$, modeled by a random transformation $P_{W|S}$.

**Definition 1 (Statistical Risk)** For any $w \in \mathcal{W}$, the statistical (or population) risk of $w$ is defined as
$$L_{\mu}(w) := \mathbb{E}[\ell(w, X)], \quad X \sim \mu. \tag{6}$$

A primary goal of statistical learning is to identify computationally efficient learning algorithms for which, given a random training set of size $n$, the expected statistical risk $\mathbb{E}[L_{\mu}(W)]$ is small. Here, $W$ is distributed according to the marginal distribution of $P_{WS} = \mu^{\otimes n} \otimes P_{W|S}$. Next, we present some preliminary tools used later in the paper.

**Definition 2 (Entropy)** The Shannon entropy of a discrete random variable $X$, taking values in a set $\mathcal{A}$, is defined as
$$H(X) := -\sum_{x \in \mathcal{A}} P_X(x) \log P_X(x).$$
The relative entropy between two distributions $P_X$ and $Q_X$, if $P_X \ll Q_X$, is defined as
$$D(P_X \| Q_X) := \sum_{x \in \mathcal{A}} P_X(x) \log \frac{P_X(x)}{Q_X(x)};$$
otherwise, we define $D(P_X \| Q_X) := \infty$. The conditional relative entropy is
$$D\!\left(P_{Y|X} \,\middle\|\, Q_{Y|X} \,\middle|\, P_X\right) := \sum_{x \in \mathcal{A}} D\!\left(P_{Y|X=x} \,\middle\|\, Q_{Y|X=x}\right) P_X(x) = \mathbb{E}\!\left[ D\!\left(P_{Y|X}(\cdot|X) \,\middle\|\, Q_{Y|X}(\cdot|X)\right) \right], \quad X \sim P_X.$$
The following useful property of entropy is called the chain rule. For a proof, see, for example, (Cover and Thomas, 2012, Theorem 2.5.3):

**Lemma 3 (Entropy Chain Rule)** Let $P_{XY}$ and $Q_{XY}$ be two distributions. We have
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D\!\left(P_{Y|X} \,\middle\|\, Q_{Y|X} \,\middle|\, P_X\right).$$

The next definition relates to geometric transformations of probability measures:

**Definition 4 (Escort and Tilted Distributions)** Given a discrete probability measure $P$ defined on a set $\mathcal{A}$, and any $\lambda \in [0, 1]$, we define the escort distribution $(P)_\lambda$ for all $a \in \mathcal{A}$ as
$$(P)_\lambda(a) := \frac{P^\lambda(a)}{\sum_{x \in \mathcal{A}} P^\lambda(x)}.$$
Given two discrete probability measures $P$ and $Q$ defined on a set $\mathcal{A}$, and any $\lambda \in [0, 1]$, we define the tilted distribution $(P, Q)_\lambda$ as the following geometric mixture:
$$(P, Q)_\lambda(a) := \frac{P^\lambda(a) Q^{1-\lambda}(a)}{\sum_{x \in \mathcal{A}} P^\lambda(x) Q^{1-\lambda}(x)}.$$
Evidently, if $U$ is the equiprobable distribution on $\mathcal{A}$, then $(P)_\lambda = (P, U)_\lambda$.

In our analysis in Section 5, similar to (Asadi and Abbe, 2020), we encounter linear combinations of relative entropies. The next lemma shows the role of tilted distributions in dealing with such linear combinations; see, for example, (van Erven and Harremoës, 2014, Theorem 30):

**Lemma 5 (Entropy Combination)** Let $\lambda \in [0, 1]$. For any distributions $P$, $Q$ and $R$ defined on a discrete set $\mathcal{A}$ such that $P \ll Q$ and $P \ll R$, we have
$$\lambda D(P \| Q) + (1 - \lambda) D(P \| R) = D\!\left(P \,\middle\|\, (Q, R)_\lambda\right) - \log \sum_{x \in \mathcal{A}} Q^\lambda(x) R^{1-\lambda}(x).$$

**Proof** Let $Z := \sum_{x \in \mathcal{A}} Q^\lambda(x) R^{1-\lambda}(x)$.
We have
$$\begin{aligned}
\lambda D(P \| Q) + (1 - \lambda) D(P \| R) &= \lambda \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{Q(x)} + (1 - \lambda) \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{R(x)} \\
&= \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{Q^\lambda(x) R^{1-\lambda}(x)} \\
&= \sum_{x \in \mathcal{A}} P(x) \log \frac{P(x)}{Z \cdot (Q, R)_\lambda(x)} \\
&= D\!\left(P \,\middle\|\, (Q, R)_\lambda\right) - \log Z.
\end{aligned}$$

We provide the following definition to later simplify the notation in the proof of Theorem 29:

**Definition 6 (Congruent Functionals)** We call two functionals $L_1(P)$ and $L_2(P)$ of a distribution $P$ congruent, and write $L_1 \cong L_2$, if $L_1 - L_2$ does not depend on $P$. For example, Lemma 5 implies that if $Q$ and $R$ are fixed distributions, then as functionals of $P$, the following congruency holds:
$$\lambda D(P \| Q) + (1 - \lambda) D(P \| R) \cong D\!\left(P \,\middle\|\, (Q, R)_\lambda\right).$$
Specifically, if $U$ is the equiprobable distribution, then
$$\lambda D(P \| Q) + (1 - \lambda) D(P \| U) \cong D\!\left(P \,\middle\|\, (Q)_\lambda\right). \tag{7}$$

The following well-known result, sometimes referred to as the Gibbs variational principle, implies that the distribution minimizing the sum of average energy (loss) and entropy (regularization) is the Gibbs measure:

**Lemma 7 (Gibbs Variational Principle)** Let $\mathcal{W}$ be an arbitrary finite set and $U_{\mathcal{W}}$ be the equiprobable distribution on $\mathcal{W}$. Given a function $g : \mathcal{W} \to \mathbb{R}$ and $\lambda > 0$, we define the following Gibbs probability distribution for all $w \in \mathcal{W}$:
$$Q_W(w) := \frac{\exp\!\left(-\frac{g(w)}{\lambda}\right)}{\sum_{w' \in \mathcal{W}} \exp\!\left(-\frac{g(w')}{\lambda}\right)}.$$
Then, for any probability measure $P_W$ defined on $\mathcal{W}$, we have
$$\mathbb{E}[g(W)] + \lambda D(P_W \| U_{\mathcal{W}}) = \lambda D(P_W \| Q_W) - \lambda \log\!\left( \frac{1}{|\mathcal{W}|} \sum_{w' \in \mathcal{W}} \exp\!\left(-\frac{g(w')}{\lambda}\right) \right),$$
where $W \sim P_W$. In particular, Lemma 7 yields the following congruency identity as functionals of $P_W$:
$$\mathbb{E}[g(W)] + \lambda D(P_W \| U_{\mathcal{W}}) \cong \lambda D(P_W \| Q_W). \tag{8}$$
We later make use of the congruency relations (7) and (8) iteratively in the proof of Theorem 29.

In Section 5, we require the following well-known result on the Log-Sum-Exp function:

**Lemma 8 (Log-Sum-Exp)** For any positive integer $N$, let $z = (z_1, z_2, \ldots, z_N) \in \mathbb{R}^N$ be arbitrarily chosen. For any $\lambda > 0$, the Log-Sum-Exp function
$$G_\lambda(z) := -\lambda \log \sum_{j=1}^{N} \exp\!\left(-\frac{z_j}{\lambda}\right)$$
satisfies
$$\min_{j=1,\ldots,N} z_j - \lambda \log N \le G_\lambda(z) \le \min_{j=1,\ldots,N} z_j.$$

The proof of Theorem 33 requires some tools on the topic of concentration of measure, as stated next.
**Definition 9 (Subgaussian)** A random variable $X$ is called $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$, its cumulant generating function satisfies
$$\log \mathbb{E}\!\left[e^{\lambda(X - \mathbb{E}X)}\right] \le \frac{\lambda^2 \sigma^2}{2}.$$

The next result is based on (Xu and Raginsky, 2017, Lemma 1), which itself can be derived from (Boucheron et al., 2013, Lemma 4.18). This result and its variants are key ingredients in information-theoretic generalization bounds.

**Lemma 10** If $g(\bar{A}, \bar{B})$ is $\sigma$-subgaussian, where $(\bar{A}, \bar{B}) \sim P_A P_B$, then for all $\lambda > 0$,
$$\mathbb{E}\!\left[g(\bar{A}, \bar{B})\right] - \mathbb{E}[g(A, B)] \le \lambda\!\left(\log|\mathcal{A}| - H(A|B)\right) + \frac{\sigma^2}{2\lambda}.$$

The Azuma-Hoeffding inequality shows the subgaussianity of the sum of independent and bounded random variables:

**Lemma 11 (Azuma-Hoeffding)** Let $X_1, \ldots, X_n$ be independent random variables such that $a \le X_i \le b$ for all $1 \le i \le n$. Then,
$$\log \mathbb{E}\!\left[e^{\frac{\lambda}{n}\sum_{i=1}^{n}(X_i - \mathbb{E}X_i)}\right] \le \frac{\lambda^2 (b-a)^2}{2n}.$$
In other words, $\sum_{i=1}^{n} X_i / n$ is $\left((b-a)/\sqrt{n}\right)$-subgaussian.

Let $g : [a, b] \to \mathbb{R}$ be a differentiable function and assume that $n$ is a positive integer. Let $\hat{\Delta} := (b-a)/n$. We define the Riemann sum at level $n$ as
$$\Sigma_n := \hat{\Delta} \sum_{i=1}^{n} g\!\left(a + (i-1)\hat{\Delta}\right).$$
The subsequent well-known lemma, which we use in Section 6, bounds the approximation error of the Riemann sum. For a proof, see, for example, (Hughes-Hallett et al., 2020).

**Lemma 12 (Riemann Sum)** The approximation error of the Riemann sum is bounded as follows:
$$\left| \int_{a}^{b} g(x)\,dx - \Sigma_n \right| \le \frac{\hat{M}}{2n}(b-a)^2,$$
where $\hat{M}$ is the maximum absolute value of the derivative of $g$ on $[a, b]$.

## 3. Ladder Decompositions of Functions

In this section, we first precisely define dilations and ladder decompositions of invertible functions. Then, we study the Lipschitzness and smoothness of the components of the ladder decompositions of smooth functions $T : B^m_R \to \mathbb{R}^m$ defined on bounded Euclidean balls with radius $R > 0$. For simplicity, we first consider the special case $m = 1$ in Subsection 3.1 and investigate multi-dimensional functions in Subsection 3.2. We first provide the precise definition of dilations of a function.
**Definition 13** For any $0 < \gamma \le 1$, let the dilation of a function $T : B^m_R \to \mathbb{R}^m$ at scale $\gamma$ be defined as
$$T_{[\gamma]}(x) := \frac{T(\gamma x)}{\gamma}, \quad \text{for all } x \in B^m_R.$$
If $T$ is differentiable and $T(0) = 0$, as a continuous extension for $\gamma = 0$, we define $T_{[0]}$ as the derivative of $T$ at the origin. Namely, $T_{[0]}(x) := J_T(0)x$, where $J_T(0)$ is the Jacobian matrix of $T$ at the origin.

The next lemma relates the inverse of the dilation with the dilation of the inverse function:

**Lemma 14** Let $T$ be an invertible function. For any $0 < \gamma \le 1$, we have $\left(T_{[\gamma]}\right)^{-1} = \left(T^{-1}\right)_{[\gamma]}$. Moreover, if $T$ is differentiable and $T(0) = 0$, then $\left(T_{[0]}\right)^{-1} = \left(T^{-1}\right)_{[0]}$.

**Proof** If $\gamma > 0$, then $\left(T_{[\gamma]}\right)^{-1}(x) = T^{-1}(\gamma x)/\gamma = \left(T^{-1}\right)_{[\gamma]}(x)$. The case $\gamma = 0$ follows from the well-known inverse function theorem; see, for example, (Baxandall and Liebeck, 1986, Chapter 4).

The next definition characterizes the concept of ladder decompositions:

**Definition 15** For all $1 \le k \le d$, let $T_k := T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}$. We call the following multiscale decomposition the ladder decomposition of $T$ at scale parameters $\{\gamma_k\}_{k=0}^{d}$:
$$T = T_d \circ \cdots \circ T_1 \circ T_{[\gamma_0]}. \tag{9}$$
For all $1 \le k \le d$, we further define $\psi_k := T_k - \mathrm{id}$. Clearly, for all $1 \le k \le d$, we have $T_{[\gamma_k]} = T_k \circ \cdots \circ T_1 \circ T_{[\gamma_0]}$.

Owing to the smoothness of the function $T$, for all $1 \le k \le d$, when the consecutive scale parameters $\gamma_k$ and $\gamma_{k-1}$ are close, we intuitively expect the transformation $T_k$, acting between the dilations of $T$ at scales $\gamma_{k-1}$ and $\gamma_k$, to be close to the identity function. In the rest of this section, we make this intuition precise, starting from the simpler case of one-dimensional functions ($m = 1$) and then studying the more general case of multi-dimensional functions ($m \ge 2$).

### 3.1 One-Dimensional Functions

We begin by providing the definition of smoothness for one-dimensional functions.

**Definition 16** Let $V$ be a bounded subset of $\mathbb{R}$. The one-dimensional function $T : V \to \mathbb{R}$ is $M$-smooth if it is differentiable and its derivative $T'$ is $M$-Lipschitz.
If $T : V \to \mathbb{R}$ is twice differentiable, then it is $M$-smooth if $|T''(x)| \le M$ for all $x \in V$, where $T''$ denotes the second derivative of $T$. The next definition concerns the concept of diffeomorphisms:

Definition 17 A function $T : (-R, R) \to \mathbb{R}$ is an $(M_1, M_2)$-diffeomorphism if it is invertible and both $T$ and its inverse $T^{-1}$ are twice differentiable, $M_1$-Lipschitz and $M_2$-smooth.

If $T$ is smooth, then one may expect $T_{[\gamma]}$ to get closer to $T_{[0]}$ as $\gamma \to 0$. The following preliminary result can be viewed as a formalization of this insight. Its proof is based on a simple extension of the proof of (Bartlett et al., 2018, Theorem 2).

Proposition 18 Let $T : (-R, R) \to \mathbb{R}$ be an $M_2$-smooth function with $T(0) = 0$. Then, for any $0 \le \gamma \le 1$,
$$\left\| T_{[\gamma]} - T_{[0]} \right\|_{\mathrm{Lip}} \le \gamma M_2 R.$$

Proof The statement is trivial if $\gamma = 0$. Assume that $0 < \gamma \le 1$, and let $x, y \in (-R, R)$ be arbitrarily chosen. Based on the mean value theorem, there exists $z$ between $x$ and $y$ such that $T(\gamma x) - T(\gamma y) = \gamma T'(\gamma z)(x - y)$. We can write
$$\left| \left( T_{[\gamma]}(x) - T_{[0]}(x) \right) - \left( T_{[\gamma]}(y) - T_{[0]}(y) \right) \right| = \left| T'(\gamma z)(x - y) - T'(0)(x - y) \right| = |x - y|\,\left| T'(\gamma z) - T'(0) \right| \le |x - y|\,|\gamma z| M_2 \le |x - y|\,\gamma M_2 R,$$
which implies the statement.

Since $T(0) = 0$, based on Proposition 18 we have $\left| T_{[\gamma]}(x) - T_{[0]}(x) \right| \le \gamma M_2 R |x|$. Thus, the smaller $\gamma_k$ is in the ladder decomposition, the closer the function $T_{[\gamma_k]}$ is to the linear function $T_{[0]}$. The next result makes precise the intuition that if two subsequent scale parameters $\gamma_k$ and $\gamma_{k-1}$ are close, then $T_k := T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}$ is a function close to the identity:

Theorem 19 Let $T : (-R, R) \to \mathbb{R}$ be an invertible $M_2$-smooth function such that $T^{-1}$ is $M_1$-Lipschitz and $T(0) = 0$. Then, for any $0 \le \gamma \le \gamma' \le 1$, we have
$$\left\| T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id} \right\|_{\mathrm{Lip}} \le (\gamma' - \gamma) M_1 M_2 R.$$
If $T$ is further assumed to be an $(M_1, M_2)$-diffeomorphism, then $T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id}$ is $(M_1^2 + M_1)M_2$-smooth as well.

Proof The statement is trivial when $\gamma = \gamma'$.
It is enough to prove that for any $0 \le \gamma < \gamma' \le 1$ and any $x, y$ in the domain of $T_{[\gamma']} \circ T_{[\gamma]}^{-1}$, the following inequality holds:
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| \le (\gamma' - \gamma) M_1 M_2 R\, |y - x|.$$
Let $v := T_{[\gamma]}^{-1}(x)$ and $w := T_{[\gamma]}^{-1}(y)$. Based on Lemma 14, $x = T_{[\gamma]}(v)$ and $y = T_{[\gamma]}(w)$. According to Definition 13, the domain of $T_{[\gamma]}$ is $(-R, R)$, thus $\max\{|v|, |w|\} < R$. By defining the function $r := T_{[\gamma']} - T_{[\gamma]}$, we observe that
$$\left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) = \left( T_{[\gamma']}(w) - T_{[\gamma]}(w) \right) - \left( T_{[\gamma']}(v) - T_{[\gamma]}(v) \right) = r(w) - r(v).$$
Note that the derivative of $r$ at a point $v$ is equal to $r'(v) = T'(\gamma' v) - T'(\gamma v)$. Based on the mean value theorem, there exists $c \in \mathbb{R}$ between $w$ and $v$ such that $r(w) - r(v) = (w - v) r'(c) = (w - v)\left( T'(\gamma' c) - T'(\gamma c) \right)$. Hence,
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| = |r(w) - r(v)| = |w - v| \left| T'(\gamma' c) - T'(\gamma c) \right| \le |w - v|\, M_2\, |\gamma' c - \gamma c| \tag{10}$$
$$= |w - v|\, M_2\, |c|\, (\gamma' - \gamma) \le |w - v|\, M_2 R\, (\gamma' - \gamma) \tag{11}$$
$$\le |y - x|\, M_1 M_2 R\, (\gamma' - \gamma), \tag{12}$$
where (10) follows from $T$ being $M_2$-smooth, (11) follows from $|c| \le \max\{|v|, |w|\} < R$, and (12) is based on the assumption that $T^{-1}$ is $M_1$-Lipschitz.

We now prove the smoothness property. Let $g(x) := T^{-1}(x)$ and $\psi := T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id}$. The chain rule of derivatives yields
$$\psi''(x) = \gamma'\, T''\!\left( \tfrac{\gamma'}{\gamma} g(\gamma x) \right) \left( g'(\gamma x) \right)^2 + \gamma\, T'\!\left( \tfrac{\gamma'}{\gamma} g(\gamma x) \right) g''(\gamma x).$$
Since $T$ and $g$ are both $M_1$-Lipschitz and $M_2$-smooth and $0 \le \gamma, \gamma' \le 1$, we deduce $|\psi''(x)| \le (M_1^2 + M_1) M_2$. Therefore, $\psi$ is $(M_1^2 + M_1)M_2$-smooth.

Corollary 20 Theorem 19 implies that for the ladder decomposition of $T$ at scale parameters $\{\gamma_k\}_{k=0}^d$, the following inequality holds for all $1 \le k \le d$:
$$\|\psi_k\|_{\mathrm{Lip}} \le (\gamma_k - \gamma_{k-1}) M_1 M_2 R. \tag{13}$$

The following is an example in which the functions $\psi_k$, $1 \le k \le d$, have a closed-form expression:

Example 1 Let $T(x) := \tanh(x)$ be the hyperbolic tangent function, for which it is known that $T^{-1}(x) = \frac{1}{2}\ln\frac{1+x}{1-x}$ for $|x| < 1$. Assume that $\gamma_k = 2^{k-d}$ for all $0 \le k \le d$.
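The closed form of $\psi_k$ derived in Example 1 can be checked numerically against a direct evaluation of $T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1} - \mathrm{id}$; the evaluation grid below is an arbitrary choice:

```python
import numpy as np

# Verify Example 1: for T = tanh and gamma_k = 2^{k-d},
# psi_k(x) = T_k(x) - x = -gamma_{k-1}^2 x^3 / (1 + gamma_{k-1}^2 x^2).
d = 5
x = np.linspace(-0.95, 0.95, 101)
for k in range(1, d + 1):
    g_k, g_km1 = 2.0 ** (k - d), 2.0 ** (k - 1 - d)
    # Direct computation of T_k(x) - x via T_k = T[gamma_k] o T[gamma_{k-1}]^{-1}
    direct = np.tanh(g_k * np.arctanh(g_km1 * x) / g_km1) / g_k - x
    closed = -g_km1**2 * x**3 / (1 + g_km1**2 * x**2)
    assert np.allclose(direct, closed)
```

Since $\gamma_{k-1}$ grows with $k$, the slope of the closed form grows with $k$ as well, consistent with the behavior shown in Figure 1.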
For all $|x| < 1$, we can derive
$$\psi_k(x) = T_k(x) - x = T_{[\gamma_k]} \circ T_{[\gamma_{k-1}]}^{-1}(x) - x = \frac{\exp\left( 2\gamma_k T_{[\gamma_{k-1}]}^{-1}(x) \right) - 1}{\gamma_k \left( \exp\left( 2\gamma_k T_{[\gamma_{k-1}]}^{-1}(x) \right) + 1 \right)} - x = \frac{\exp\left( \frac{\gamma_k}{\gamma_{k-1}} \ln \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right) - 1}{\gamma_k \left( \exp\left( \frac{\gamma_k}{\gamma_{k-1}} \ln \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right) + 1 \right)} - x$$
$$= \frac{\left( \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right)^{\frac{\gamma_k}{\gamma_{k-1}}} - 1}{\gamma_k \left( \left( \frac{1+\gamma_{k-1}x}{1-\gamma_{k-1}x} \right)^{\frac{\gamma_k}{\gamma_{k-1}}} + 1 \right)} - x = \frac{2\gamma_{k-1}x}{\gamma_k \left( 1 + \gamma_{k-1}^2 x^2 \right)} - x = \frac{x}{1 + \gamma_{k-1}^2 x^2} - x = -\frac{\gamma_{k-1}^2 x^3}{1 + \gamma_{k-1}^2 x^2},$$
where we used $\gamma_k/\gamma_{k-1} = 2$. Figure 1 depicts the plot of $\psi_k(x)$ for all $1 \le k \le d$, where $d = 5$. It clearly demonstrates that the Lipschitz norm of $\psi_k$ increases with $k$.

3.2 Multi-Dimensional Functions

In this subsection, we extend the first part of Theorem 19 to multi-dimensional functions with $m \ge 2$. We commence by providing the definition of smooth multi-dimensional functions. The well-known extension of Definition 16 to real-valued functions of multiple variables is as follows:

Definition 21 The scalar-valued function $g : B_R^m \to \mathbb{R}$ is $M$-smooth if it is differentiable and its gradient $\nabla g$ is $M$-Lipschitz with respect to the Euclidean distance. Namely, for any $x, y \in B_R^m$, we have $|\nabla g(x) - \nabla g(y)| \le M|x - y|$.

[Figure 1: A plot of $\psi_k(x)$ for different values of $k$: $k = 1$ the solid line, $k = 2$ the dotted line, $k = 3$ the dashed line, $k = 4$ the dotted-dashed line, and $k = 5$ the large dashed line.]

A twice differentiable function $g : B_R^m \to \mathbb{R}$ is $M$-smooth if the absolute value of every eigenvalue of its Hessian $H_g(x)$ is smaller than or equal to $M$, for all $x \in B_R^m$. In other words, if $I$ denotes the identity matrix, then $-MI \preceq H_g(x) \preceq MI$ for all $x \in B_R^m$.

We now extend the previous definition to vector-valued functions:

Definition 22 The multi-dimensional function $T : B_R^m \to \mathbb{R}^m$ is $M$-smooth if it is differentiable and its Jacobian $J_T$ is $M$-Lipschitz with respect to the spectral distance. In other words, for any $x, y \in B_R^m$, the subsequent inequality holds:
$$\left\| J_T(x) - J_T(y) \right\|_2 \le M|x - y|.$$
The next result shows the relation between Definitions 21 and 22.

Proposition 23 Assume that $T : B_R^m \to \mathbb{R}^m$ is a function with components $T(x) = (T_1(x), \ldots, T_m(x))$. Then, the multi-dimensional function $T$ is $M$-smooth if and only if each scalar-valued function $T_i : B_R^m \to \mathbb{R}$ is $M$-smooth, where $1 \le i \le m$.

Proof Let $x, y \in B_R^m$ be arbitrarily chosen. Assume that each function $T_i$ is $M$-smooth for all $1 \le i \le m$. For any $u \in \mathbb{R}^m$, we observe that the components of $(J_T(x) - J_T(y))u$ are the inner products $\left( \nabla T_i(x) - \nabla T_i(y) \right)^T u$. Moreover, given any $1 \le i \le m$, the Cauchy–Schwarz inequality implies
$$\left| \left( \nabla T_i(x) - \nabla T_i(y) \right)^T u \right| \le \left| \nabla T_i(x) - \nabla T_i(y) \right| |u| \le M|x - y|\,|u|.$$
Thus, for any $u \in \mathbb{R}^m$, $\left| (J_T(x) - J_T(y))u \right| \le M|x - y|\,|u|$, which implies that $\| J_T(x) - J_T(y) \|_2 \le M|x - y|$. This proves the "if" part. To prove the "only if" part, assume that $T$ is $M$-smooth. For any $1 \le i \le m$, let $e_i$ be the $i$th standard unit vector. Then, for all $1 \le i \le m$,
$$\left| \nabla T_i(x) - \nabla T_i(y) \right| = \left| \left( J_T(x) - J_T(y) \right)^T e_i \right| \le \left\| J_T(x) - J_T(y) \right\|_2 |e_i| \le M|x - y|,$$
which implies that $T_i$ is $M$-smooth.

The proof of Theorem 19 relied upon the mean value theorem. The following result is a multi-dimensional extension of the mean value theorem:

Lemma 24 Let $r : B_R^m \to \mathbb{R}^m$ be a differentiable function. For any $a, b \in B_R^m$, there exists $\lambda \in (0, 1)$ such that for $c := \lambda b + (1 - \lambda)a \in B_R^m$, we have
$$|r(b) - r(a)| \le \left\| J_r(c) \right\|_2 |b - a|,$$
where $J_r(c)$ is the Jacobian of $r$ at $c$.

Proof Define $\xi : [0, 1] \to \mathbb{R}$ as $\xi(t) := (r(b) - r(a))^T r(tb + (1 - t)a)$. Based on the mean value theorem, there exists $\lambda \in (0, 1)$ such that $\xi(1) - \xi(0) = \xi'(\lambda)$. Thus, according to the chain rule of derivatives, we can write
$$(r(b) - r(a))^T (r(b) - r(a)) = (r(b) - r(a))^T J_r(c)(b - a).$$
The Cauchy–Schwarz inequality yields
$$|r(b) - r(a)|^2 \le |r(b) - r(a)|\, |J_r(c)(b - a)| \le |r(b) - r(a)|\, \left\| J_r(c) \right\|_2 |b - a|,$$
which implies the statement.

We now have sufficient tools to extend the first part of Theorem 19 to multi-dimensional functions.

Theorem 25 Let $T : B_R^m \to \mathbb{R}^m$ be an invertible $M_2$-smooth function such that $T^{-1}$ is $M_1$-Lipschitz and $T(0) = 0$. Then, for any $0 \le \gamma \le \gamma' \le 1$, we have
$$\left\| T_{[\gamma']} \circ T_{[\gamma]}^{-1} - \mathrm{id} \right\|_{\mathrm{Lip}} \le (\gamma' - \gamma) M_1 M_2 R.$$

Proof The case $\gamma = \gamma'$ is easy to verify.
Thus, it is enough to prove that for any $0 \le \gamma < \gamma' \le 1$ and any $x, y$ in the domain of $T_{[\gamma']} \circ T_{[\gamma]}^{-1}$, we have
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| \le (\gamma' - \gamma) M_1 M_2 R\, |y - x|.$$
Let $v := T_{[\gamma]}^{-1}(x)$ and $w := T_{[\gamma]}^{-1}(y)$. Based on Lemma 14, $x = T_{[\gamma]}(v)$ and $y = T_{[\gamma]}(w)$. According to Definition 13, the domain of $T_{[\gamma]}$ is $B_R^m$, thus $\max\{|v|, |w|\} < R$. Let $r := T_{[\gamma']} - T_{[\gamma]}$. We observe that
$$\left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) = \left( T_{[\gamma']}(w) - T_{[\gamma]}(w) \right) - \left( T_{[\gamma']}(v) - T_{[\gamma]}(v) \right) = r(w) - r(v).$$
The Jacobian of the function $r$ at a point $z$ is equal to $J_r(z) = J_T(\gamma' z) - J_T(\gamma z)$, where $J_T$ is the Jacobian of $T$, for all $z \in B_R^m$. According to Lemma 24, there exists $c \in \mathbb{R}^m$ as a convex combination of $v$ and $w$ such that $|r(w) - r(v)| \le \| J_r(c) \|_2 |w - v|$. Hence,
$$\left| \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(y) - y \right) - \left( T_{[\gamma']} \circ T_{[\gamma]}^{-1}(x) - x \right) \right| = |r(w) - r(v)| \le |w - v| \left\| J_T(\gamma' c) - J_T(\gamma c) \right\|_2 \le |w - v|\, M_2\, |c|\, (\gamma' - \gamma) \tag{14}$$
$$\le |w - v|\, M_2 R\, (\gamma' - \gamma) \tag{15}$$
$$\le |y - x|\, M_1 M_2 R\, (\gamma' - \gamma), \tag{16}$$
where (14) follows from $T$ being $M_2$-smooth, (15) is based on the fact that $|c| \le \max\{|v|, |w|\} < R$, and (16) follows from the assumption that $T^{-1}$ is $M_1$-Lipschitz.

Corollary 26 Similar to Corollary 20, Theorem 25 implies that for the ladder decomposition of the multi-dimensional function $T$ at scale parameters $\{\gamma_k\}_{k=0}^d$, the following inequality holds for all $1 \le k \le d$:
$$\|\psi_k\|_{\mathrm{Lip}} \le (\gamma_k - \gamma_{k-1}) M_1 M_2 R. \tag{17}$$

Remark 27 The proof of (Bartlett et al., 2018, Theorem 2) establishes that, for all $1 \le k \le d$, the function $\psi_k$ is $C(\gamma_k - \gamma_{k-1})/\gamma_k$-Lipschitz for some constant $C$. However, this implication is weaker than our result. For example, when the scale parameters are chosen as $\gamma_k = k/d$ for each $1 \le k \le d$, our result yields $\|\psi_k\|_{\mathrm{Lip}} = O(1/d)$, whereas (Bartlett et al., 2018, Theorem 2) states that there exist functions $T_1, \ldots, T_d$ such that $T = T_d \circ \cdots \circ T_1$ and, for each $1 \le k \le d$, $\|T_k - \mathrm{id}\|_{\mathrm{Lip}} = O((\log d)/d)$.

Corollaries 20 and 26, when applied to the ladder decomposition of the target function $T$, are key results underpinning our learning model.

4. The Proposed Learning Model

In this section, we precisely formulate the learning model and discuss its strengths.
Assume that for $0 < \varepsilon < R$, the data instance domain is $\mathcal{X} := \{x \in \mathbb{R}^m : \varepsilon \le |x| < R\}$. Typically, $\varepsilon$ is much smaller than $R$. The label set is $\mathcal{Y} = \mathbb{R}^m$, and the target function $T$ is an invertible $M_2$-smooth function such that $T^{-1}$ is $M_1$-Lipschitz. Suppose that from previous knowledge or intuition, we know the behavior of the function $T$ at data instances at very small scales $0 \le |x| < \varepsilon$. This is equivalent to knowing $T_{[\gamma_0]}$, where $\gamma_0 := \varepsilon/R$. Without loss of generality, we further assume that $T(0) = 0$.¹ For example, it may be assumed that $T_{[\gamma_0]}$ is a linear function equal to $T_{[0]}$, the derivative of $T$ at the origin.

Let $\gamma := (R/\varepsilon)^{1/d}$. We split the data domain $\mathcal{X}$ into $d$ scales. For all $1 \le k \le d$, define
$$\mathcal{X}_k := \left\{ x \in \mathbb{R}^m : \varepsilon \gamma^{k-1} \le |x| < \varepsilon \gamma^k \right\}.$$
Clearly, $\mathcal{X} = \cup_{k=1}^d \mathcal{X}_k$, and these sets form a partition of $\mathcal{X}$. Each set $\mathcal{X}_k$ is called the domain of instances at scale $k$. We define the scale parameters $\{\gamma_k\}_{k=0}^d$ to constitute a geometric sequence such that for all $0 \le k \le d$,
$$\gamma_k := \gamma^{k-d}. \tag{18}$$
Consider the ladder decomposition of $T$ at scale parameters $\{\gamma_k\}_{k=0}^d$ (Definition 15). For all $1 \le k \le d$, we have
$$T_{[\gamma_k]} = T_k \circ T_{[\gamma_{k-1}]} = T_{[\gamma_{k-1}]} + \psi_k \circ T_{[\gamma_{k-1}]}. \tag{19}$$
To progressively learn the target function $T$ on $\mathcal{X}$ through multiple stages, we introduce a $d$-level hierarchical learning model that aligns with the relation (19). We set $h_0 := T_{[\gamma_0]}$, and for all $1 \le k \le d$, define
$$h_k(x) := h_{k-1}(x) + f(h_{k-1}(x); w_k), \tag{20}$$
where $f$ is a function dependent on the parametrization of the learning model. The choice of the function $f$ is important for ensuring enough representation power of the model, and we explore an example in Section 6.

Remark 28 One may rely on the universal approximation theorem for neural networks and assume that the function $f$ is a two-layer neural network with sufficient width; see the discussion in (Bartlett et al., 2018). We denote the parameters of the $k$th level of the learning model by $w_k$, allowing it to take values from a finite set $\mathcal{W}_k$ during training.
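The logarithmic binning of $\mathcal{X}$ into the scales $\mathcal{X}_k$, together with the geometric scale parameters of (18), can be sketched as follows (the concrete values of $\varepsilon$, $R$, and $d$ are illustrative):

```python
import numpy as np

# Logarithmic binning of X = {eps <= |x| < R} into scales
# X_k = {eps*gamma^{k-1} <= |x| < eps*gamma^k}, with gamma = (R/eps)^{1/d},
# and the matching scale parameters gamma_k = gamma^{k-d} (equation (18)).
eps, R, d = 0.01, 10.0, 6
gamma = (R / eps) ** (1 / d)
gammas = gamma ** (np.arange(d + 1) - d)      # gamma_0 = eps/R, ..., gamma_d = 1

def scale_of(x):
    """Return k with x in X_k; k - 1 = floor(log_gamma(|x|/eps))."""
    return int(np.floor(np.log(np.linalg.norm(x) / eps) / np.log(gamma))) + 1

assert np.isclose(gammas[0], eps / R) and np.isclose(gammas[-1], 1.0)
assert scale_of([eps]) == 1 and scale_of([R * 0.999]) == d
for k in range(1, d + 1):
    # an instance just below the upper boundary of X_k belongs to scale k
    assert scale_of([eps * gamma ** k * 0.999]) == k
```

The `scale_of` computation makes the complexity interpretation concrete: the scale index of an instance is an affine function of $\log|x|$.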
The training mechanism aims to make the mapping between the input and each level $h_k$ approximate $T_{[\gamma_k]}$, the dilated version of the target function $T$ at scale $\gamma_k$. For a successfully trained model, $f(\,\cdot\,; w_k)$ should effectively approximate $\psi_k(\cdot)$. Let $C_1 := M_1 M_2 R$. According to Theorem 25, we have
$$\|\psi_k\|_{\mathrm{Lip}} \le C_1(\gamma_k - \gamma_{k-1}) = C_1(\gamma - 1)\gamma^{k-d-1}. \tag{21}$$
Since $T(0) = 0$, we have $\psi_k(0) = 0$ for all $1 \le k \le d$. Thus, (21) implies that
$$|\psi_k(\cdot)| \le O(\gamma_k - \gamma_{k-1}) = O(\gamma_k). \tag{22}$$
Hence, we regularize our hypothesis set as follows:
$$|f(\,\cdot\,; w_k)| \le \rho_k, \quad \text{for all } w_k \in \mathcal{W}_k, \tag{23}$$
where $\rho_k$ is at least the maximum value of $|\psi_k(\cdot)|$.

In a successfully trained model, each level $h_k$ approximates $T_{[\gamma_k]}$. Knowing $T_{[\gamma_k]}$ is enough to determine the label of each instance $x \in \mathcal{X}_k$ with appropriate rescalings. Thus, we define the output of our model $h_w(x)$ as follows:
$$h_w(x) := \gamma_k\, h_k\!\left(\frac{x}{\gamma_k}\right) \quad \text{if } x \in \mathcal{X}_k, \qquad 1 \le k \le d.$$
Therefore, there is no necessity to propagate every data instance $x$ through all levels of our hierarchical model. Rather, the number of processing steps required to evaluate $h_w(x)$ for an input instance $x \in \mathcal{X}$ is proportional to the logarithm of the instance norm $|x|$, which determines the $\mathcal{X}_k$ to which $x$ belongs. Consequently, the logarithm of the norm of each data instance $x$ can be construed as a metric for its difficulty or complexity.

Let the $n$-tuple of training instances be denoted by $s = (x_1, \ldots, x_n)$. We assume the training set includes the instance-label pairs $(x_i, T(x_i))$ for all $1 \le i \le n$. The training mechanism commences with the simplest training examples, whose instances are at the smallest scale $\mathcal{X}_1$, and progressively trains the model's levels by using the larger-scaled (more complex) examples.

¹ In general, one can replace $T(\cdot)$ with $\tilde{T}(\cdot) := T(\cdot) - T(0)$ and learn this new function. Clearly, $\tilde{T}(0) = 0$.
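The scale-by-scale curriculum just described — sample each $w_k$ from a Gibbs measure over a finite $\mathcal{W}_k$, using only the training data at scale $k$ — can be sketched on a toy one-dimensional problem. Every concrete choice below (the residual map $f$, the parameter grid, the temperatures, the target $T = \tanh$) is an illustrative assumption, not the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

eps, R, d = 0.1, 1.0, 3
gamma = (R / eps) ** (1 / d)
gammas = gamma ** (np.arange(d + 1) - d)
T = np.tanh
h0 = lambda x: x                      # assumed prior knowledge: T[gamma_0] ~ T[0] = id

def f(h, w):                          # toy level map f(h; w_k) = w * h^3
    return w * h ** 3

# Training data and its partition into the scales s ∩ X_k
xs = rng.uniform(eps, R, size=200) * rng.choice([-1.0, 1.0], size=200)
ys = T(xs)
n = len(xs)
scales = np.minimum(np.floor(np.log(np.abs(xs) / eps) / np.log(gamma)).astype(int) + 1, d)

W_grid = np.linspace(-1.0, 1.0, 41)   # finite parameter set W_k
lambdas = [0.05] * d                  # temperature hyperparameters (lambda_k)
ws = []                               # sequentially sampled w_1, ..., w_d
for k in range(1, d + 1):
    mask = scales == k
    def level_out(x, w):              # gamma_k * h_k(x / gamma_k; w_{1:k})
        h = h0(x / gammas[k])
        for wj in ws:                 # previously sampled levels w_{1:(k-1)}
            h = h + f(h, wj)
        return gammas[k] * (h + f(h, w))
    # Empirical risk at scale k for each candidate w_k, then Gibbs sampling
    risks = np.array([np.sum(np.abs(level_out(xs[mask], w) - ys[mask])) / n
                      for w in W_grid])
    p = np.exp(-(risks - risks.min()) / lambdas[k - 1])
    ws.append(rng.choice(W_grid, p=p / p.sum()))

assert len(ws) == d
```

Subtracting `risks.min()` before exponentiating leaves the Gibbs probabilities unchanged while avoiding numerical underflow; the loop structure mirrors the sequential, scale-ordered sampling of the training mechanism.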
At each level, corresponding to each scale of the data, training is represented by sampling $w_k$ from a Gibbs measure whose loss (energy) is the empirical risk evaluated on that specific scale of the training data, with hyperparameter (temperature) $\lambda_k$. The temperature vector $(\lambda_k)_{k=1}^d$ consists of model hyperparameters that can be chosen based on cross-validation or by Corollary 34, which provides the values that optimize the derived bound on the statistical risk. It is well known that Gibbs probability measures are maximum-entropy distributions; see (Jaynes, 1957), a property that we later use in the analysis. Precisely, for all $1 \le k \le d$, given trained values for $w_{1:(k-1)}$, we sample the vector value for $w_k$ with the following probability:
$$P_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) := \frac{\exp\left( -\frac{1}{n\lambda_k} \sum_{x_i \in s \cap \mathcal{X}_k} \left| \gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:k}\right) - T(x_i) \right| \right)}{\sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{n\lambda_k} \sum_{x_i \in s \cap \mathcal{X}_k} \left| \gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:(k-1)} w'_k\right) - T(x_i) \right| \right)}. \tag{24}$$
This training mechanism is hierarchical and stochastic. Furthermore, it exhibits self-similarity, as at all steps we sample from Gibbs distributions whose loss functions have similar absolute-error forms, albeit on data at different scales and with different temperatures. We refer to this training mechanism as multiscale entropy-based training; see Algorithm 1.

Algorithm 1 Multiscale entropy-based training
Hyperparameters: Temperature vector $(\lambda_k)_{k=1}^d$.
Input: Training data $(x_i, T(x_i))_{i=1}^n$.
Output: Trained parameters $w_1, \ldots, w_d$.
1: for $k = 1$ to $d$ do
2: Given the training data at scale $k$ ($s \cap \mathcal{X}_k$) and the sampled values for $w_{1:(k-1)}$, obtain $w_k$ by sampling from the Gibbs measure:
$$P_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) \propto \exp\left( -\frac{1}{n\lambda_k} \sum_{x_i \in s \cap \mathcal{X}_k} \left| \gamma_k h_k\!\left(\frac{x_i}{\gamma_k}; w_{1:k}\right) - T(x_i) \right| \right).$$

The proposed learning model and the forthcoming analysis of its statistical risk possess the following advantages:

1. Hierarchical learning model with interpretable levels: The training objective is to ensure each level $h_k$ approximates the dilation of the target function $T$ at scale $\gamma_k$, that is, $T_{[\gamma_k]}$.
Thus, each level in our hierarchical learning model holds a meaningful interpretation, a departure from black-box hierarchical models such as commonly used neural networks. In other words, the mapping between the input of the whole compositional model
$$\left( \mathrm{id} + f(\,\cdot\,; w_d) \right) \circ \cdots \circ \left( \mathrm{id} + f(\,\cdot\,; w_1) \right) \circ T_{[\gamma_0]}$$
and any of its intermediate levels is aimed to be close to a dilation of the original target function at the corresponding scale.

2. Measuring the complexity of any data instance with the logarithm of its norm: For any data instance $x$, the logarithm of its norm $\log|x|$ can be interpreted as a measure of its complexity. This complexity determines the training and inference stage at which the label of $x$ can be predicted in the learning model. Thus, our learning model can also be viewed as a mathematical model for curriculum learning. For illustrative examples of norm-based complexity interpretations, consider the following scenarios: In finance, predicting fraud becomes increasingly challenging as a company's revenue resources grow. In image processing, higher degrees of sparsity lead to lower norms in feature vectors, making their targets easier to predict.

3. Computational savings in inference: Assume that $x \in \mathcal{X}_k$ is a new data instance whose label we want to predict using our model. To compute the model's output for $x$ and predict its label, it is sufficient to process $x/\gamma_k$ only up to the $k$th level and calculate $\gamma_k h_k(x/\gamma_k)$. In other words, there is no necessity to pass the instance through all $d$ levels of the learning model, and processing only the first $k$ levels is adequate, in contrast to commonly used neural networks. This efficiency stems from the fact that, on the one hand, $h_k$ approximates the dilation $T_{[\gamma_k]}$. On the other hand, given that $|x| < \gamma_k R$ and with proper rescaling, one can compute the value of $T(x)$ just by knowing $T_{[\gamma_k]}$. More precisely, we have $T(x) = \gamma_k T_{[\gamma_k]}(x/\gamma_k)$ whenever $|x| < \gamma_k R$.
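The early-exit inference of advantage 3 can be sketched concretely. The sketch below reuses the exact ladder components of Example 1 ($T = \tanh$, $\gamma_k = 2^{k-d}$, $R = 1$ — illustrative choices), so the early-exited output $\gamma_k h_k(x/\gamma_k)$ recovers $T(x)$ exactly while touching only the first $k$ levels:

```python
import numpy as np

d = 5
gammas = [2.0 ** (k - d) for k in range(d + 1)]   # gamma_k = 2^{k-d} (Example 1)

def psi(k, h):
    """Closed-form ladder component psi_k from Example 1."""
    g = gammas[k - 1]
    return -g**2 * h**3 / (1 + g**2 * h**2)

def predict(x):
    """Early-exit evaluation: stop at the first scale k with |x| < gamma_k * R (R = 1)."""
    k = next(j for j in range(1, d + 1) if abs(x) < gammas[j])
    h = np.tanh(gammas[0] * (x / gammas[k])) / gammas[0]   # h_0 = T[gamma_0]
    for j in range(1, k + 1):
        h = h + psi(j, h)                                  # h_j = h_{j-1} + psi_j(h_{j-1})
    return gammas[k] * h, k

y, k = predict(0.05)          # small-norm instance exits after k = 1 of d = 5 levels
assert k < d
assert np.isclose(y, np.tanh(0.05))
```

The number of iterations of the inner loop equals the scale index $k$, i.e., it grows with $\log|x|$ rather than with the total depth $d$.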
In practical terms, when using the trained model to predict the target of new data, the computational workload for computing the output of the learned model on a particular data instance is directly tied to the complexity of that instance. Since, by assumption (such as scale-invariant distributions (1)), data instances are distributed heterogeneously across different scales and difficulties, this characteristic may lead to substantial computational savings and increased inference speed.

4. Statistical guarantee stronger than uniform convergence: The statistical analysis of the risk of the trained compositional model is tailored to the hierarchical training mechanism and takes its multiscale structure into account when deriving the bound on its statistical risk. Consequently, the bound can be much sharper than a uniform convergence bound for the classical empirical-risk-minimizing training algorithm, as shown in the next section.

5. Robustness to interruptions during training: The training of current hierarchical learning models (such as neural networks) on massive datasets may take extensive amounts of time and is prone to interruptions. Our sequential training mechanism consists of $d$ stages. If, for any reason, this mechanism terminates prematurely after stage $k$ for any $1 \le k \le d-1$, we can still ensure the availability of a useful model. More precisely, this model remains capable of accurately predicting the labels of data instances belonging to $\mathcal{X}_1 \cup \cdots \cup \mathcal{X}_k$, which are all $x \in \mathbb{R}^m$ such that $\varepsilon \le |x| < \varepsilon\gamma^k$.

6. Computational savings in training for diverse users: The proposed learning model can provide computational savings when there are $d$ different users, each requiring accurate prediction of the labels of data instances at scale $k$ (set $\mathcal{X}_k$) for $k = 1, \ldots, d$. Instead of training separate models for each of the $d$ users, we streamline the process by using the trained model for user $k-1$ as prior information in training the model for user $k$, for all $2 \le k \le d$.

5. Statistical Analysis of the Learning Model

In this section, we statistically analyze the learning model's performance. In Subsection 5.1, we prove an upper bound on the training mechanism's chained risk. Then, in Subsection 5.2, we show that if the data instance distribution $\mu$ is scale-invariant, we can bound the statistical risk from above based on the chained risk.

5.1 Multiscale Entropy-Based Training and the Chained Risk

For all $1 \le k \le d$, we define the loss at scale $k$ as
$$\ell_k(w_{1:k}, x) := \begin{cases} |h_w(x) - T(x)| = \left| \gamma_k h_k\!\left(\frac{x}{\gamma_k}; w_{1:k}\right) - T(x) \right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k. \end{cases} \tag{25}$$
Furthermore, suppose that
$$\ell_k(w_{1:k}, s) := \frac{1}{n} \sum_{i=1}^n \ell_k(w_{1:k}, x_i).$$
Equation (24) in the previous section can be written more succinctly as
$$P^*_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) = \frac{\exp\left( -\frac{1}{\lambda_k} \ell_k(w_{1:k}, s) \right)}{\sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right)}.$$
We denote the whole tuple of random parameters by $W := (W_1, \ldots, W_d)$. The measure $P^*_W := P^*_{W_1} P^*_{W_2|W_1} \cdots P^*_{W_d|W_{1:(d-1)}}$ models the total training mechanism. For ease of presentation, define the extra parameter $\lambda_{d+1} := 0$. The next theorem indicates that the self-similar measure $P^*_W$ is the minimizing distribution of the sum of a multiscale loss and a multiscale entropy. Consider the definition of the multiscale loss
$$\ell^{(\lambda)}(w, s) := \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) \right), \tag{26}$$
where, for all $1 \le k \le d$,
$$\tilde{\ell}_k\left(w_{1:(k-1)}, s\right) := -\lambda_k \log \sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right) \tag{27}$$
is a Log-Sum-Exp function.

Theorem 29 We have
$$P^*_W = \arg\min_{P_W} \left\{ \mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) \right\}. \tag{28}$$

Proof We develop what we call the multiscale congruent technique used in the proof of (Asadi and Abbe, 2020, Theorem 13). Specifically, recalling the definition of congruent functionals (Definition 6), we aim to show that, as a functional of $P_W$,
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) = \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, P^*_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right). \tag{29}$$
This will then immediately imply (28), as setting $P_W = P^*_W$ makes the relative entropies on the right side of (29) vanish altogether.
For all $1 \le k \le d$, define
$$L_k(w_{1:k}, s) := \sum_{j=1}^k \ell_j(w_{1:j}, s), \qquad Q^{(k)}_{W_{1:k}}(w_{1:k}) := \frac{\exp\left( -\frac{1}{\lambda_k} L_k(w_{1:k}, s) \right)}{\sum_{w'_{1:k} \in \mathcal{W}_1 \times \cdots \times \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} L_k\left(w'_{1:k}, s\right) \right)}.$$
Recall that $U_{W_{1:d}}$ denotes the equiprobable distribution, and let $\cong$ denote congruence of functionals of $P_W$, that is, equality up to an additive constant independent of $P_W$ (Definition 6). We can write
$$\mathbb{E}\left[ \sum_{j=1}^d \ell_j(W_{1:j}, s) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) = \sum_{k=1}^{d-1} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \left( \mathbb{E}[L_d(W_{1:d}, s)] + \lambda_d D\left( P_{W_{1:d}} \| U_{W_{1:d}} \right) \right)$$
$$\cong \sum_{k=1}^{d-1} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_d D\left( P_{W_{1:d}} \,\middle\|\, Q^{(d)}_{W_{1:d}} \right) \tag{30}$$
$$= \sum_{k=1}^{d-1} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_d D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right) \tag{31}$$
$$= \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + (\lambda_{d-1} - \lambda_d) D\left( P_{W_{1:(d-1)}} \| U_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right)$$
$$\cong \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_{d-1} D\left( P_{W_{1:(d-1)}} \,\middle\|\, \left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right), \tag{32}$$
where (30) is based on the Gibbs variational principle (Lemma 7), (31) is based on the chain rule of relative entropy (Lemma 3), and (32) is based on Lemma 5 on entropy combination, with $[\,\cdot\,]^{\lambda_d/\lambda_{d-1}}$ denoting the escort distribution with exponent $\lambda_d/\lambda_{d-1}$. To reiterate, we have derived
$$\mathbb{E}\left[ \sum_{j=1}^d \ell_j(W_{1:j}, s) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) \cong \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_{d-1} D\left( P_{W_{1:(d-1)}} \,\middle\|\, \left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right). \tag{33}$$
Marginalizing the distribution $Q^{(d)}_{W_{1:d}}$ over $W_d$ yields
$$Q^{(d)}_{W_{1:(d-1)}}(w_{1:(d-1)}) \propto \exp\left( -\frac{1}{\lambda_d} L_{d-1}\left(w_{1:(d-1)}, s\right) \right) \sum_{w'_d \in \mathcal{W}_d} \exp\left( -\frac{1}{\lambda_d} \ell_d\left(w_{1:(d-1)} w'_d, s\right) \right).$$
Thus, its escort distribution is
$$\left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}}(w_{1:(d-1)}) \propto \exp\left( -\frac{1}{\lambda_{d-1}} L_{d-1}\left(w_{1:(d-1)}, s\right) \right) \left( \sum_{w'_d \in \mathcal{W}_d} \exp\left( -\frac{1}{\lambda_d} \ell_d\left(w_{1:(d-1)} w'_d, s\right) \right) \right)^{\frac{\lambda_d}{\lambda_{d-1}}}.$$
Let $Z$ and $Z'$ denote the normalizing constants (partition functions) of $\left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\lambda_d/\lambda_{d-1}}$ and $Q^{(d-1)}_{W_{1:(d-1)}}$, respectively.
We have
$$D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d-1)}_{W_{1:(d-1)}} \right) - D\left( P_{W_{1:(d-1)}} \,\middle\|\, \left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}} \right) = \sum_{w_{1:(d-1)}} P_{W_{1:(d-1)}}\left(w_{1:(d-1)}\right) \log \frac{\left[ Q^{(d)}_{W_{1:(d-1)}} \right]^{\frac{\lambda_d}{\lambda_{d-1}}}\left(w_{1:(d-1)}\right)}{Q^{(d-1)}_{W_{1:(d-1)}}\left(w_{1:(d-1)}\right)}$$
$$= \frac{\lambda_d}{\lambda_{d-1}} \sum_{w_{1:(d-1)}} P_{W_{1:(d-1)}}\left(w_{1:(d-1)}\right) \log \sum_{w'_d \in \mathcal{W}_d} \exp\left( -\frac{1}{\lambda_d} \ell_d\left(w_{1:(d-1)} w'_d, s\right) \right) + \log \frac{Z'}{Z} \cong -\frac{1}{\lambda_{d-1}}\, \mathbb{E}\left[ \tilde{\ell}_d\left(W_{1:(d-1)}, s\right) \right]. \tag{34}$$
By adding $\lambda_{d-1}$ times both sides of (34) to (33), we obtain
$$\mathbb{E}\left[ \sum_{j=1}^d \ell_j(W_{1:j}, s) - \tilde{\ell}_d\left(W_{1:(d-1)}, s\right) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right)$$
$$\cong \sum_{k=1}^{d-2} (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) + \lambda_{d-1} D\left( P_{W_{1:(d-1)}} \,\middle\|\, Q^{(d-1)}_{W_{1:(d-1)}} \right) + \lambda_d D\left( P_{W_d|W_{1:(d-1)}} \,\middle\|\, Q^{(d)}_{W_d|W_{1:(d-1)}} \,\middle|\, P_{W_{1:(d-1)}} \right).$$
By iterating this argument for $k = d-1, \ldots, 1$, we deduce that
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] + \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) \cong \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, Q^{(k)}_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right). \tag{35}$$
For all $1 \le k \le d$, it is apparent that $D\left( P_{W_{1:k}} \| U_{W_{1:k}} \right) = \log\left( |\mathcal{W}_1 \times \cdots \times \mathcal{W}_k| \right) - H(W_{1:k})$. Thus, based on (35), we obtain
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) \cong \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, Q^{(k)}_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right).$$
Since, for all $1 \le k \le d$, $Q^{(k)}_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right) = P^*_{W_k|W_{1:(k-1)}}\left(w_k \mid w_{1:(k-1)}\right)$, we deduce that
$$\mathbb{E}\left[ \ell^{(\lambda)}(W, s) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}) \cong \sum_{k=1}^d \lambda_k D\left( P_{W_k|W_{1:(k-1)}} \,\middle\|\, P^*_{W_k|W_{1:(k-1)}} \,\middle|\, P_{W_{1:(k-1)}} \right).$$
Finally, since both sides of this congruence vanish at $P_W = P^*_W$, the additive constant is zero, and the congruence is in fact the equality (29), as desired.

Straightforwardly, the result extends to random training sequences $S := (X_1, \ldots, X_n) \sim \mu^{\otimes n}$.

Corollary 30 Assume that for all $1 \le k \le d$,
$$P^*_{W_k|W_{1:(k-1)} S}\left(w_k \mid w_{1:(k-1)} s\right) := \frac{\exp\left( -\frac{1}{\lambda_k} \ell_k(w_{1:k}, s) \right)}{\sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right)}.$$
Let $P^*_{W|S} = P^*_{W_1|S} P^*_{W_2|W_1 S} \cdots P^*_{W_d|W_{1:(d-1)} S}$. Then,
$$P^*_{W|S} = \arg\min_{P_{W|S}} \left\{ \mathbb{E}\left[ \ell^{(\lambda)}(W, S) \right] - \sum_{k=1}^d (\lambda_k - \lambda_{k+1}) H(W_{1:k}|S) \right\}, \tag{36}$$
where $(S, W) \sim \mu^{\otimes n} P_{W|S}$.

Proof The expression in (36) is linear in $P_S = \mu^{\otimes n}$. Conditioned on any realization $S = s$, we can use Theorem 29 to yield the result.

Recall the definition of the function $\psi_k$, for all $1 \le k \le d$, in the ladder decomposition (19).
We assume that our hypothesis set is realizable, meaning that there exist parameters $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_d)$ in our hypothesis index set such that, for all $1 \le k \le d$,
$$\psi_k(\cdot) = f(\,\cdot\,; \hat{w}_k). \tag{37}$$
In other words, we assume that the function $T$ belongs to the hypothesis set and is represented with parameters $\hat{w}$ as $h_{\hat{w}} = T$. In case $T$ is represented by several different parameterizations, we choose one of them and denote it by $(\hat{w}_1, \ldots, \hat{w}_d)$. Now, consider the definition of the chained risk:

Definition 31 (Chained Risk) Let $X \sim \mu$. For any $w \in \mathcal{W}$, we define the chained risk of $w$ as
$$L^{(C)}_\mu(w) := \sum_{k=1}^d \mathbb{E}\left[ \ell_k(w_{1:k}, X) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right]. \tag{38}$$
The training mechanism of the learning model chooses the values of the parameters $w_1, \ldots, w_d$ sequentially. At the $k$th stage, choosing $w_k$ instead of the true target function parameter $\hat{w}_k$ results in the following difference between the risks at scale $k$:
$$\mathbb{E}\left[ \ell_k(w_{1:k}, X) \right] - \mathbb{E}\left[ \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right].$$
The chained risk is equal to the accumulation of these deviations of risk at each of the $d$ stages of the training mechanism. Obviously, we have $L^{(C)}_\mu(\hat{w}) = 0$. In an intuitive and approximate sense, if the chained risk of $w$ is small, then it suggests that $h_w$ could be close to the target function $h_{\hat{w}} = T$.

It is informative to compare the chained risk with the statistical risk (6). Based on the definition of $\ell_k(w_{1:k}, x)$ in (25), for all $1 \le k \le d$, the loss of the model with parameters $w$ on any data instance $x \in \mathcal{X}$ is equal to $\ell(w, x) = \sum_{k=1}^d \ell_k(w_{1:k}, x)$. Assume that $X \sim \mu$. The statistical risk of the model with parameters $w \in \mathcal{W}$ is
$$L_\mu(w) := \mathbb{E}[\ell(w, X)] = \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] = \mathbb{E}\left[ \sum_{k=1}^d \left( \ell_k(w_{1:k}, X) - \ell_k(\hat{w}_{1:k}, X) \right) \right], \tag{39}$$
where (39) is based on the realizability assumption $L_\mu(\hat{w}) = 0$. The difference between the right sides of (38) and (39) lies between the terms $\ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right)$ and $\ell_k(\hat{w}_{1:k}, X)$.
In fact, since the losses are nonnegative, we have
$$L^{(C)}_\mu(w) = \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] - \mathbb{E}\left[ \sum_{k=1}^d \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right] \le \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] = L_\mu(w).$$
Hence, the chained risk is always smaller than or equal to the statistical risk. However, in the next subsection, for scale-invariant data instance distributions $\mu$ with sufficiently large shape parameters, we find a reverse form of this inequality: an upper bound on the statistical risk based on the chained risk.

Remark 32 The difference between the chained risk and the statistical risk can be large when the data instance distribution $\mu$ is not multiscale. For example, assume that $\mu$ is supported only on $\mathcal{X}_d$. Consider parameters $w = (w_1, \ldots, w_{d-1}, \hat{w}_d)$. We claim that the chained risk of $w$ is zero. Since the probability that $X \sim \mu$ takes values in $\mathcal{X}_1 \cup \cdots \cup \mathcal{X}_{d-1}$ is zero, for any $1 \le k \le d-1$ we have $\mathbb{E}[\ell_k(w_{1:k}, X)] = \mathbb{E}\left[ \ell_k\left(w_{1:(k-1)} \hat{w}_k, X\right) \right] = 0$. Therefore,
$$L^{(C)}_\mu(w) = \mathbb{E}[\ell_d(w_{1:d}, X)] - \mathbb{E}\left[ \ell_d\left(w_{1:(d-1)} \hat{w}_d, X\right) \right] = 0.$$
On the other hand, we have
$$L_\mu(w) = \mathbb{E}\left[ \sum_{k=1}^d \ell_k(w_{1:k}, X) \right] = \mathbb{E}[\ell_d(w_{1:d}, X)].$$
However, if the parameters $w_1, \ldots, w_{d-1}$ are chosen significantly far from the optimal values $\hat{w}_1, \ldots, \hat{w}_{d-1}$, then $h_w$ can be far from the target function $T$ and $\mathbb{E}[\ell_d(w_{1:d}, X)]$ tends to be large, thereby resulting in a high statistical risk.

In the next theorem, we derive an upper bound on the expected value of the chained risk $\mathbb{E}\left[ L^{(C)}_\mu(W) \right]$ for the output of the training mechanism $P^*_{W|S}$. Recall the definition of $\{\rho_k\}_{k=1}^d$ from the regularization of the hypothesis set in (23).

Theorem 33 Let $(S, W) \sim P_S P^*_{W|S}$. Then,
$$\mathbb{E}\left[ L^{(C)}_\mu(W) \right] \le 2 \sum_{k=1}^d \left( (\lambda_k - \lambda_{k+1}) \sum_{j=1}^k \log|\mathcal{W}_j| + \frac{4\gamma_k^2 \rho_k^2}{n(\lambda_k - \lambda_{k+1})} \right).$$

Proof Define
$$\ell^{(C)}(w, s) := \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, s\right) \right).$$
Recall from (26) and (27) that
$$\ell^{(\lambda)}(w, s) := \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) \right), \quad \text{where} \quad \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) := -\lambda_k \log \sum_{w'_k \in \mathcal{W}_k} \exp\left( -\frac{1}{\lambda_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right)$$
for all $1 \le k \le d$. Each $\tilde{\ell}_k\left(w_{1:(k-1)}, s\right)$ is a Log-Sum-Exp function, thus Lemma 8 yields
$$\ell^{(C)}(w, s) = \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, s\right) \right) \le \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \min_{w'_k \in \mathcal{W}_k} \ell_k\left(w_{1:(k-1)} w'_k, s\right) \right) \le \sum_{k=1}^d \left( \ell_k(w_{1:k}, s) - \tilde{\ell}_k\left(w_{1:(k-1)}, s\right) \right) = \ell^{(\lambda)}(w, s). \tag{40}$$
Lemma 8 also implies that
$$\ell^{(\lambda)}(\hat{w}, s) = \sum_{k=1}^d \left( \ell_k(\hat{w}_{1:k}, s) - \tilde{\ell}_k\left(\hat{w}_{1:(k-1)}, s\right) \right) \le \sum_{k=1}^d \left( \ell_k(\hat{w}_{1:k}, s) - \min_{w'_k \in \mathcal{W}_k} \ell_k\left(\hat{w}_{1:(k-1)} w'_k, s\right) + \lambda_k \log|\mathcal{W}_k| \right) \le \sum_{k=1}^d \lambda_k \log|\mathcal{W}_k|. \tag{41}$$
Let $\bar{W}$ and $\bar{S} = (\bar{X}_1, \ldots, \bar{X}_n)$ be independent copies of $W$ and $S$, respectively, and assume that $\bar{W}$ and $\bar{S}$ are independent from each other; thus $(\bar{W}, \bar{S}) \sim P_W P_S$. We have $\mathbb{E}\left[ L^{(C)}_\mu(W) \right] = \mathbb{E}\left[ \ell^{(C)}(\bar{W}, \bar{S}) \right]$. Based on (20) and (25), for any fixed $w = (w_1, \ldots, w_d)$ and all $1 \le k \le d$, we can write
$$\ell_k(w_{1:k}, x) = \begin{cases} \left| \gamma_k \left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right) + f\!\left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right); w_k \right) \right) - T(x) \right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k, \end{cases} \qquad \ell_k\left(w_{1:(k-1)} \hat{w}_k, x\right) = \begin{cases} \left| \gamma_k \left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right) + f\!\left( h_{k-1}\!\left(\frac{x}{\gamma_k}\right); \hat{w}_k \right) \right) - T(x) \right| & \text{if } x \in \mathcal{X}_k, \\ 0 & \text{if } x \notin \mathcal{X}_k. \end{cases}$$
The triangle inequality implies that, for any $a, b \in \mathbb{R}^m$, we have $||a| - |b|| \le |a - b|$. Thus, for any $x \in \mathcal{X}$,
$$\left| \ell_k(w_{1:k}, x) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, x\right) \right| \le \gamma_k \left| f\!\left( h_{k-1}\!\left(\tfrac{x}{\gamma_k}\right); w_k \right) - f\!\left( h_{k-1}\!\left(\tfrac{x}{\gamma_k}\right); \hat{w}_k \right) \right| \le 2\gamma_k \rho_k, \tag{42}$$
where (42) is due to (23). Hence, based on the Azuma–Hoeffding inequality (Lemma 11), for any fixed $w$ and all $1 \le k \le d$,
$$\ell_k(w_{1:k}, \bar{S}) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, \bar{S}\right) = \frac{1}{n} \sum_{i=1}^n \left( \ell_k(w_{1:k}, \bar{X}_i) - \ell_k\left(w_{1:(k-1)} \hat{w}_k, \bar{X}_i\right) \right) \tag{43}$$
is $4\gamma_k\rho_k/\sqrt{n}$-subgaussian. Since $\bar{W}$ and $\bar{S}$ are independent, $\ell_k(\bar{W}_{1:k}, \bar{S}) - \ell_k\left(\bar{W}_{1:(k-1)} \hat{w}_k, \bar{S}\right)$ is $4\gamma_k\rho_k/\sqrt{n}$-subgaussian as well. Writing $\Delta\lambda_k := \lambda_k - \lambda_{k+1}$ for all $1 \le k \le d$, using Lemma 10 for all $1 \le k \le d$ and adding up the derived inequalities, we can write
$$\mathbb{E}\left[ \ell^{(C)}(\bar{W}, \bar{S}) \right] - \mathbb{E}\left[ \ell^{(C)}(W, S) \right] \le \sum_{k=1}^d \left( \Delta\lambda_k \left( \log|\mathcal{W}_1 \times \cdots \times \mathcal{W}_k| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) = \sum_{k=1}^d \left( \Delta\lambda_k \left( \sum_{j=1}^k \log|\mathcal{W}_j| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right). \tag{44}$$
Therefore,
$$\mathbb{E}\left[ L^{(C)}_\mu(W) \right] = \mathbb{E}\left[ \ell^{(C)}(\bar{W}, \bar{S}) \right] \le \mathbb{E}\left[ \ell^{(C)}(W, S) \right] + \sum_{k=1}^d \left( \Delta\lambda_k \left( \sum_{j=1}^k \log|\mathcal{W}_j| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{45}$$
$$\le \mathbb{E}\left[ \ell^{(\lambda)}(W, S) \right] + \sum_{k=1}^d \left( \Delta\lambda_k \left( \sum_{j=1}^k \log|\mathcal{W}_j| - H(W_{1:k}|S) \right) + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{46}$$
$$\le \mathbb{E}\left[ \ell^{(\lambda)}(\hat{w}, S) \right] + \sum_{k=1}^d \left( \Delta\lambda_k \sum_{j=1}^k \log|\mathcal{W}_j| + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{47}$$
$$\le \sum_{k=1}^d \left( \Delta\lambda_k \sum_{j=1}^k \log|\mathcal{W}_j| + \lambda_k \log|\mathcal{W}_k| + \frac{8\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right) \tag{48}$$
$$= 2\sum_{k=1}^d \left( \Delta\lambda_k \sum_{j=1}^k \log|\mathcal{W}_j| + \frac{4\gamma_k^2 \rho_k^2}{n\, \Delta\lambda_k} \right), \tag{49}$$
where (45) is obtained by rewriting (44), (46) is based on (40), (47) is obtained from Corollary 30 by replacing $P^*_{W|S}$ with the conditional distribution $P_{W|S} = \delta_{\hat{w}}$ (the Dirac measure on $\hat{w}$), (48) is based on (41), and (49) is a simple calculation based on summation by parts.
Optimizing the bound in (49) over the values of $(\lambda_1, \ldots, \lambda_d)$ gives the following result:

Corollary 34 Assume that $(\lambda_1, \ldots, \lambda_d)$ are chosen such that for all $1 \le k \le d$,
$$\lambda_k - \lambda_{k+1} = 2\gamma_k \rho_k \sqrt{\frac{1}{n \sum_{j=1}^k \log|\mathcal{W}_j|}}. \tag{50}$$
Then, the right side of (49) is minimized with respect to $(\lambda_1, \ldots, \lambda_d)$. In this case, the bound simplifies to the following form:
$$\mathbb{E}\left[ L^{(C)}_\mu(W) \right] \le \frac{8}{\sqrt{n}} \sum_{k=1}^d \gamma_k \rho_k \sqrt{\sum_{j=1}^k \log|\mathcal{W}_j|}. \tag{51}$$

5.2 Bounding the Statistical Risk Based on the Chained Risk

Until now, the analysis did not require any restrictions on the data instance distribution $\mu$; however, we could derive an upper bound only on the expected chained risk of the training mechanism. We also observed that the chained risk is always smaller than or equal to the statistical risk. In this subsection, we show that if $\mu$ is scale-invariant, a small chained risk can imply a small statistical risk.

Assume that the instance distribution $\mu$ is scale-invariant: if $q(x)$ denotes the density function of $\mu$, then there exists a shape parameter $\alpha > 0$ such that for all $x \in \mathcal{X}$ such that $x/\gamma \in \mathcal{X}$,
$$q\!\left(\frac{x}{\gamma}\right) = \gamma^\alpha q(x). \tag{52}$$
Thus, scale-invariant distributions have homogeneous density functions. One particular example of such distributions is the following power-law probability density function with shape parameter $\alpha$:
$$q(x) = \frac{1}{C_q |x|^\alpha} \quad \text{for all } x \in \mathcal{X}, \quad \text{where } C_q = \int_{\mathcal{X}} |x|^{-\alpha}\,dx.$$
Given this assumption, in the following result we show an upper bound on the statistical risk based on the chained risk:

Theorem 35 If $\mu$ is a scale-invariant probability distribution with shape parameter $\alpha$ defined on $\mathcal{X} \subseteq \mathbb{R}^m$, then for any $w \in \mathcal{W}$, we have
$$\left( 1 - \gamma^{m+1-\alpha} \left( 1 + C_1(\gamma - 1)\gamma^{-1} \right) \right) L_\mu(w) \le L^{(C)}_\mu(w).$$

Proof For ease of presentation, let $\hat{h}_k(x)$ denote the $k$th level of the model given parameters $\hat{w}$ and input $x$, for any $1 \le k \le d$.
For any 2 k d and any x Xk, we have ℓk w1:(k 1) ˆwk, x = γk(id + ψk) hk 1 = γk(id + ψk) hk 1 γk(id + ψk) ˆhk 1 + C1(γ 1)γk d 1γk = γk 1 + C1(γ 1)γk d 1 hk 1 where in (53), we used the fact that ψk Lip C1(γ 1)γk d 1, according to (21). Now, define x := γk 1 We observe that x Xk 1, and that the transformation x x establishes a bijection between Xk and Xk 1. Therefore, we obtain ℓk w1:(k 1) ˆwk, x γk 1 + C1(γ 1)γk d 1 hk 1 = γk 1 + C1(γ 1)γk d 1 hk 1 1 + C1(γ 1)γk d 1 γk 1hk 1 = γ 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x . (54) Since x Rm, the change of variables formula for differentials is dx = γmdx . Let X be distributed according to density q(x). Based on (54), for all 2 k d, we derive E ℓk w1:(k 1) ˆwk, X x Xk ℓk w1:(k 1) ˆwk, x q(x)dx x Xk γ 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x q(x)dx x Xk γ 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x γ αq x dx (55) x Xk 1 γm+1 α 1 + C1(γ 1)γk d 1 ℓk 1 w1:(k 1), x q x dx = γm+1 α 1 + C1(γ 1)γk d 1 E ℓk 1 w1:(k 1), X γm+1 α 1 + C1(γ 1)γ 1 E ℓk 1 w1:(k 1), X , (56) where (55) is due to the scale invariance property (52). We can extend inequality (56) to the case k = 1 by defining ℓ0 0. Thus, L(C) µ (w) = E[ℓk(w1:k, X)] E ℓk w1:(k 1) ˆwk, X E[ℓk(w1:k, X)] γm+1 α 1 + C1(γ 1)γ 1 E ℓk 1 w1:(k 1), X (57) k=1 ℓk(w1:k, X) γm+1 α 1 + C1(γ 1)γ 1 E k=1 ℓk 1 w1:(k 1), X # k=1 ℓk(w1:k, X) γm+1 α 1 + C1(γ 1)γ 1 E k=1 ℓk(w1:k, X) = 1 γm+1 α 1 + C1(γ 1)γ 1 Lµ(w), where (57) is based on (56). An Entropy-Based Model for Hierarchical Learning Theorem 35 in conjunction with Theorem 33 or Corollary 34 yields an upper bound on the expected statistical risk of multiscale entropy-based training, as follows: Corollary 36 If the shape parameter α of density q satisfies α > m + 1 + log 1 + C1(γ 1)γ 1 log γ , (58) then the following upper bound on the expected statistical risk holds: E[Lµ(W)] 2 Pd k=1 (λk λk+1) Pk j=1 log |Wj| + 4γ2 kρ2 k n(λk λk+1) 1 γm+1 α(1 + C1(γ 1)γ 1) . (59) If the hyperparameters (λ1, . . . 
, λ_d) are chosen as in (50), then (59) reduces to

E[L_µ(W)] ≤ (8/√n) Σ_{k=1}^{d} γ_k ρ_k √(Σ_{j=1}^{k} log|W_j|) / (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1})).    (60)

Proof Inequality (58) is equivalent to 1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1}) > 0, which, in conjunction with Theorem 35 and Theorem 33 or Corollary 34, yields the results.

Given the realizability assumption on the hypothesis set, the regular union bound applied to the empirical-risk-minimizing hypothesis yields

E[L_µ(W_ERM)] ≤ Σ_{k=1}^{d} ρ_k √((1/n) Σ_{j=1}^{d} log|W_j|).    (61)

Even when ignoring the effect of the factor (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1}))/8, the right side of (60) can be considerably smaller than the right side of (61). Consider the following example:

Example 2 Recall that γ_0 := R/ε, γ = (γ_0)^{1/d} and γ_k = γ^{k−d} for all 1 ≤ k ≤ d, as in (18). Assume that ρ_k = ρ_0 γ_k for all 1 ≤ k ≤ d and |W_1| = · · · = |W_d|. We compute the following ratio:

Λ := (Σ_{k=1}^{d} γ_k ρ_k √(Σ_{j=1}^{k} log|W_j|) / Σ_{k=1}^{d} ρ_k √(Σ_{j=1}^{d} log|W_j|))² = (Σ_{k=1}^{d} γ^{2(k−d)} √(k/d) / Σ_{k=1}^{d} γ^{k−d})².

The power of two exists to compare the bounds on the required number of samples n. For example, given γ_0 = 10 and d = 20, we obtain Λ ≈ 0.2648.

6. Bounded-Norm Parameterization Example

In this section, for the case of one-dimensional functions (m = 1), we show that any (M_1, M_2)-diffeomorphism defined on (−R, R) can be represented by a hierarchical parameterized model with bounded-norm parameters of the form (20). The approach to proving this is to construct the ladder decomposition of the diffeomorphism. We then proceed to discretize the parameters of this model and apply the bound of Corollary 36 to the corresponding hypothesis set.

Assume that the target function T is an (M_1, M_2)-diffeomorphism with T(0) = 0, and consider its ladder decomposition at scale parameters {γ_k}_{k=1}^{d}. The range of T^{[γ_{k−1}]}, which is the domain of ψ_k, is an interval subset of (−M_1R, M_1R) that includes 0. Note that since an (M_1, M_2)-diffeomorphism is M_1-bilipschitz, it is easy to observe that M_1 ≥ 1. Therefore (−M_1R, M_1R) includes (−R, R).
Let Ψ : R → R be an arbitrary invertible, φ_1-Lipschitz and φ_2-smooth function with support D_Ψ = (a_1, a_2) ⊆ (−M_1R, M_1R) such that 0 ∈ D_Ψ (that is, a_1 < 0 < a_2) and Ψ(0) = 0. We will later replace Ψ with each function ψ_k of the ladder decomposition of the target function T. For any x ∈ D_Ψ, we can write

Ψ(x) = ∫_{a_1}^{x} Ψ′(b) db + Ψ(a_1) = ∫_{a_1}^{a_2} Ψ′(b) Θ(x − b) db + Ψ(a_1),    (62)

where Θ(·) is the Heaviside (unit) step function. Since Ψ is φ_1-Lipschitz, for all b ∈ D_Ψ we have |Ψ′(b)| ≤ φ_1. Moreover, since Ψ(0) = 0, we conclude that |Ψ(a_1)| ≤ φ_1|a_1| ≤ φ_1 M_1R. The idea is now to derive the Riemann sum approximation to the integral representation of Ψ(x) in (62) and to view it as a two-layer neural network. For a given integer τ ≥ 2, let a τ-width two-layer network with continuous parameters w := {w^{(1)}, . . . , w^{(τ)}, w^{(c)}} be defined for any x ∈ (−M_1R, M_1R) as

ψ^{(τ)}_w(x) := Σ_{j=1}^{τ} w^{(j)} Θ(x − b_j) + w^{(c)}.    (63)

For all 1 ≤ j ≤ τ, we set b_j := −M_1R + jΔ. Clearly, {b_j}_{j=1}^{τ} is an arithmetic progression with common difference Δ := 2M_1R/τ, where b_1 > −M_1R and b_τ = M_1R. We allow the parameters of (63) to take the values

w^{(j)} := Δ Ψ′(b_j) if b_j ∈ D_Ψ,   w^{(j)} := 0 if b_j ∉ D_Ψ,   and   w^{(c)} := Ψ(a_1).

Using the Riemann sum approximation bound, given as Lemma 12 in Section 2, we deduce the following result:

Lemma 37 For all x ∈ D_Ψ, we have

|ψ^{(τ)}_w(x) − Ψ(x)| ≤ (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2.

Proof Recall that since Ψ is φ_1-Lipschitz, for all b ∈ D_Ψ we have |Ψ′(b)| ≤ φ_1. If x − a_1 < Δ, then ψ^{(τ)}_w(x) = Ψ(a_1). Based on the triangle inequality, we have

|ψ^{(τ)}_w(x) − Ψ(x)| = |∫_{a_1}^{x} Ψ′(b) db| ≤ ∫_{a_1}^{x} |Ψ′(b)| db < φ_1 Δ = 2M_1Rφ_1/τ,

which implies the result. Therefore, assume that x − a_1 ≥ Δ, and let j_1 and j_2 be the smallest and largest integers j such that a_1 ≤ b_j ≤ x, respectively. We have

|ψ^{(τ)}_w(x) − Ψ(x)| = |Σ_{j=1}^{τ} w^{(j)} Θ(x − b_j) − ∫_{a_1}^{a_2} Ψ′(b) Θ(x − b) db|
 = |Σ_{j=j_1}^{j_2} Δ Ψ′(b_j) − ∫_{a_1}^{x} Ψ′(b) db|
 ≤ |Σ_{j=j_1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| + |∫_{a_1}^{b_{j_1}} Ψ′(b) db| + |∫_{b_{j_2}}^{x} Ψ′(b) db|
 ≤ |Σ_{j=j_1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| + 2Δφ_1    (64)
 ≤ |Σ_{j=j_1+1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| + 3Δφ_1,    (65)

where (64) and (65) are based on the triangle inequality.
If j_2 = j_1, then the sum over j is empty and

|Σ_{j=j_1+1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| = 0,

which implies the result. Otherwise, we have j_2 > j_1. In this case, we obtain

|Σ_{j=j_1+1}^{j_2} Δ Ψ′(b_j) − ∫_{b_{j_1}}^{b_{j_2}} Ψ′(b) db| ≤ φ_2 (b_{j_2} − b_{j_1})² / (2(j_2 − j_1))    (66)
 = φ_2 (j_2 − j_1)² Δ² / (2(j_2 − j_1)) = φ_2 (j_2 − j_1) Δ²/2    (67)
 ≤ φ_2 · 2M_1R · Δ/2 = (2(M_1R)²/τ) φ_2,    (68)

where (66) is based on Lemma 12, and (68) is based on the fact that (j_2 − j_1)Δ = b_{j_2} − b_{j_1} ≤ b_τ − b_1 < 2M_1R. The proof of the statement is now complete.

We proceed to discretize the parameters of the network. Let η > 0 be the precision level of the parameters. For any y ∈ R, we denote by [y]_η the closest real number in ηZ = {. . . , −2η, −η, 0, η, 2η, . . . } to y. For any function Ψ, the approximate τ-width two-layer network with discretized parameters at precision level η is

ψ^{(τ,η)}_w(x) := Σ_{j=1}^{τ} [w^{(j)}]_η Θ(x − b_j) + [w^{(c)}]_η.

We define the norm of the network ψ^{(τ,η)}_w as the sum of the absolute values of its parameters:

norm(ψ^{(τ,η)}_w) := Σ_{j=1}^{τ} |[w^{(j)}]_η| + |[w^{(c)}]_η|.

The next result studies the approximation error, the output norm, and the parameter norm of the network ψ^{(τ,η)}_w.

Proposition 38 The two-layer network ψ^{(τ,η)}_w, approximating the function Ψ, satisfies the following properties:

(a) For all x ∈ D_Ψ, |ψ^{(τ,η)}_w(x) − Ψ(x)| ≤ (τ + 1)η/2 + (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2.

(b) For all x ∈ (−M_1R, M_1R), |ψ^{(τ,η)}_w(x)| ≤ norm(ψ^{(τ,η)}_w).

(c) We have norm(ψ^{(τ,η)}_w) ≤ (τ + 1)η/2 + (3M_1R + 2M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2 =: ϱ(φ_1, φ_2).    (70)

Proof (a) For all x ∈ D_Ψ, we have

|ψ^{(τ,η)}_w(x) − Ψ(x)| ≤ |ψ^{(τ,η)}_w(x) − ψ^{(τ)}_w(x)| + |ψ^{(τ)}_w(x) − Ψ(x)|
 ≤ Σ_{j=1}^{τ} |w^{(j)} − [w^{(j)}]_η| + |w^{(c)} − [w^{(c)}]_η| + (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2    (71)
 ≤ (τ + 1)η/2 + (6M_1R/τ) φ_1 + (2(M_1R)²/τ) φ_2,

where (71) is based on Lemma 37 and the triangle inequality.

(b) Based on the triangle inequality,

|ψ^{(τ,η)}_w(x)| = |Σ_{j=1}^{τ} [w^{(j)}]_η Θ(x − b_j) + [w^{(c)}]_η| ≤ Σ_{j=1}^{τ} |[w^{(j)}]_η| + |[w^{(c)}]_η| = norm(ψ^{(τ,η)}_w).

(c) We have

norm(ψ^{(τ,η)}_w) ≤ Σ_{j=1}^{τ} |w^{(j)}| + |w^{(c)}| + (τ + 1)η/2 ≤ Σ_{j=1}^{τ} |w^{(j)}| + M_1Rφ_1 + (τ + 1)η/2,    (72)

where (72) is based on |w^{(c)}| = |Ψ(a_1)| ≤ M_1Rφ_1. If a_2 − a_1 < Δ, then Σ_{j=1}^{τ} |w^{(j)}| = 0 and the result is proven. Thus, assume that a_2 − a_1 ≥ Δ, and let j_1 and j_2 be the smallest and largest integers j such that a_1 ≤ b_j ≤ a_2, respectively. If j_1 = j_2, then Σ_{j=1}^{τ} |w^{(j)}| = Δ|Ψ′(b_{j_1})| ≤ Δφ_1, which implies the result.
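The step-network construction (62)–(63), its discretization, and the norm bound of Proposition 38(b) can be illustrated numerically. The sketch below uses Ψ = tanh as a stand-in for an invertible Lipschitz, smooth component; the values of M_1R, (a_1, a_2), and η are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of the two-layer step network (63) and its eta-discretized version,
# with Psi = tanh (illustrative; M1R, a1, a2, eta are assumed values).
M1R = 2.0
a1, a2 = -1.5, 1.5                        # support D_Psi, with a1 < 0 < a2
Psi = np.tanh
dPsi = lambda b: 1.0 / np.cosh(b) ** 2    # Psi'

def step_net(x, tau, eta=0.0):
    delta = 2 * M1R / tau
    b = -M1R + delta * np.arange(1, tau + 1)            # b_j = -M1R + j*delta
    w = np.where((b > a1) & (b < a2), delta * dPsi(b), 0.0)
    wc = Psi(a1)                                        # w_c = Psi(a1)
    if eta > 0:                                         # round to closest point of eta*Z
        w, wc = eta * np.round(w / eta), eta * np.round(wc / eta)
    heav = (x[:, None] - b[None, :] > 0).astype(float)  # Theta(x - b_j)
    return heav @ w + wc, np.abs(w).sum() + abs(wc)

x = np.linspace(a1 + 0.1, a2 - 0.1, 400)
err = lambda tau, eta=0.0: np.max(np.abs(step_net(x, tau, eta)[0] - Psi(x)))
assert err(2000) < err(100) < 0.2     # error shrinks roughly like 1/tau (Lemma 37)
out, nrm = step_net(x, 500, eta=1e-3)
assert np.max(np.abs(out)) <= nrm     # output bounded by parameter norm (Prop. 38(b))
```

The O(1/τ) decay of the sup-error and the output-vs-norm inequality match the qualitative content of Lemma 37 and Proposition 38, under the assumed Ψ.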
Therefore, assume that j_2 > j_1. Since Ψ is φ_2-smooth, Ψ′ is φ_2-Lipschitz; thus |Ψ′| is φ_2-Lipschitz as well. Moreover, since Ψ is invertible, |Ψ′| is identical to either Ψ′ or −Ψ′, and is therefore differentiable. We derive

Σ_{j=1}^{τ} |w^{(j)}| = Σ_{j=j_1}^{j_2} Δ|Ψ′(b_j)| ≤ ∫_{b_{j_1}}^{b_{j_2}} |Ψ′(b)| db + φ_2 (b_{j_2} − b_{j_1})² / (2(j_2 − j_1)) + Δφ_1    (73)
 ≤ ∫_{b_{j_1}}^{b_{j_2}} |Ψ′(b)| db + (2(M_1R)²/τ) φ_2 + Δφ_1    (74)
 ≤ (b_{j_2} − b_{j_1})φ_1 + Δφ_1 + (2(M_1R)²/τ) φ_2 ≤ (2M_1R + Δ)φ_1 + (2(M_1R)²/τ) φ_2,

where (73) is due to Lemma 12 and (74) is based on the same derivation of (68) from the right side of (66). The result is now proven.

Now, consider the ladder decomposition of the target function T at scale parameters {γ_k}_{k=1}^{d}. For any 1 ≤ k ≤ d, we replace the function ψ_k with the function Ψ in Proposition 38. Let C_2 := (M_1² + M_1)M_2. Based on Theorem 19, ψ_k is C_1γ^{k−d−1}(γ − 1)-Lipschitz and C_2-smooth. Thus, we take φ_1 := C_1γ^{k−d−1}(γ − 1) and φ_2 := C_2. For all 1 ≤ k ≤ d, define ρ_k := ϱ(C_1γ^{k−d−1}(γ − 1), C_2), where the function ϱ is defined in (70). Note that, as a function of k, ρ_k = O(γ^k), conforming with (22). Suppose W_k, the set of parameters of our learning model at level k, is

W_k := { w = (w^{(1)}, . . . , w^{(τ)}, w^{(c)}) ∈ ηZ^{τ+1} : |w|_1 ≤ ρ_k }.

In (20), the recursive definition of the model, we define f(h_{k−1}(x); w_k) := ψ^{(τ,η)}_{w_k}(h_{k−1}(x)) for all 1 ≤ k ≤ d. Therefore, the kth level of our learning model is the following function:

h_k = (ψ^{(τ,η)}_{w_k} + id) ∘ · · · ∘ (ψ^{(τ,η)}_{w_2} + id) ∘ (ψ^{(τ,η)}_{w_1} + id).

Notice that since W_k is a discretization of the ℓ_1-ball {|w|_1 ≤ ρ_k}, it is a finite set. The next result finds the cardinality of W_k:

Proposition 39 For all 1 ≤ k ≤ d, we have

|W_k| = Σ_{r=0}^{τ} 2^{τ+1−r} binom(τ+1, r) binom(⌊ρ_k/η⌋, τ+1−r).

Proof For all 1 ≤ k ≤ d, let W′_k := { w ∈ Z^{τ+1} : |w|_1 ≤ ⌊ρ_k/η⌋ }. Based on the one-to-one correspondence w ↔ ηw, we have |W_k| = |W′_k|. We will employ a counting argument to find |W′_k|. Assume that r components of w ∈ W′_k are equal to zero and the rest are non-zero, where 0 ≤ r ≤ τ. There are binom(τ+1, r) ways to choose the r zero components. We now count the number of configurations in which τ + 1 − r positive integers have a sum less than or equal to ⌊ρ_k/η⌋.
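As a sanity check, the counting formula of Proposition 39 can be verified by brute force for small widths. In the sketch below, N plays the role of ⌊ρ_k/η⌋ and the tested (τ, N) pairs are illustrative choices.

```python
from itertools import product
from math import comb

# Brute-force check of the count in Proposition 39: the number of nonzero
# integer vectors w in Z^(tau+1) with |w|_1 <= N (after rescaling by eta)
# equals  sum_r 2^(tau+1-r) * C(tau+1, r) * C(N, tau+1-r).
def count_formula(tau, N):
    return sum(2 ** (tau + 1 - r) * comb(tau + 1, r) * comb(N, tau + 1 - r)
               for r in range(tau + 1))

def count_brute(tau, N):
    n = tau + 1
    return sum(1 for w in product(range(-N, N + 1), repeat=n)
               if 0 < sum(abs(c) for c in w) <= N)

for tau, N in [(1, 2), (2, 3), (3, 4)]:
    assert count_formula(tau, N) == count_brute(tau, N)
```

For instance, τ = 1 and N = 2 gives 12 on both sides, matching the sign-pattern and slack-variable argument in the proof.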
For each such configuration, there are 2^{τ+1−r} ways to let each component retain its positive value or be negated. By adding a slack variable, it is easy to observe that the number of configurations in which τ + 1 − r positive integers have a sum less than or equal to ⌊ρ_k/η⌋ equals the number of configurations in which (τ + 1 − r) + 1 positive integers have a sum equal to ⌊ρ_k/η⌋ + 1. Based on a basic result in combinatorics (Niven, 1965, (4.7) in Section 4.2), this number is equal to binom(⌊ρ_k/η⌋, τ+1−r). Therefore,

|W_k| = |W′_k| = Σ_{r=0}^{τ} 2^{τ+1−r} binom(τ+1, r) binom(⌊ρ_k/η⌋, τ+1−r).

Corollary 40 Based on Proposition 39, the optimized bound on the statistical risk in Corollary 36 for the example studied in this section is as follows:

E[L_µ(W)] ≤ (8/√n) Σ_{k=1}^{d} γ_k ρ_k √(Σ_{j=1}^{k} log|W_j|) / (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1}))
 = (8/√n) Σ_{k=1}^{d} γ^{k−d} ρ_k √(Σ_{j=1}^{k} log(Σ_{r=0}^{τ} 2^{τ+1−r} binom(τ+1, r) binom(⌊ρ_j/η⌋, τ+1−r))) / (1 − γ^{m+1−α}(1 + C_1(γ − 1)γ^{−1})),

where ρ_k = ϱ(C_1γ^{k−d−1}(γ − 1), C_2) for all 1 ≤ k ≤ d.

7. Conclusions

In this paper, we introduced an entropy-based hierarchical learning model designed to leverage the multiscale structure of data instance distributions and the smoothness inherent in real-world target functions. We started with the definition of ladder decompositions of invertible functions, followed by an examination of the Lipschitz continuity and smoothness of the components of this decomposition for smooth functions. Subsequently, we analyzed the effectiveness of the proposed multiscale entropy-based training, demonstrating its capability to achieve low chained risk. Notably, when the data distribution µ is scale-invariant with a sufficiently large shape parameter, this paper showed an upper bound on the statistical risk in terms of the chained risk. Consequently, we derived a guarantee on the statistical risk of our training mechanism for these data distributions.
Finally, for the specific case of one-dimensional functions, the paper provided an illustrative example of a parameterized model featuring bounded-norm parameters, showcasing a simple application of our methodology.

Our proposed learning model offers several noteworthy advantages. First, it adopts a hierarchical structure with interpretable levels: each level in the model is trained to approximate a dilation of the target function, giving the different levels interpretable roles, in contrast to opaque black-box hierarchical models. Second, our model introduces a new perspective on data instance complexity: the logarithm of the norm of a data instance serves as a measure of its individual complexity, and the training procedure can be conceptualized as a mathematical model for curriculum learning. Another merit of the proposed model is computational: the amount of computation required to evaluate the learned model on a given data instance is directly proportional to the complexity of that instance. Given the heterogeneous distribution of data instances across scales and complexities, this characteristic may translate into significant computational savings and faster inference speeds. Furthermore, our model's statistical analysis accounts for its hierarchical and multiscale structure, leading to sharper bounds on the statistical risk for scale-invariant distributions than uniform convergence bounds for empirical-risk-minimizing mechanisms. Additionally, the multi-staged nature of our training mechanism can be beneficial when dealing with the challenge of extensive training times on massive datasets: even if training terminates prematurely, the mechanism yields a useful model capable of accurately predicting the labels of data instances with norms smaller than a specific threshold, depending on the terminated stage.
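The inference-cost point above can be made concrete with a small sketch. The threshold rule below, which maps an instance's norm to the number of levels evaluated, is an illustrative assumption consistent with the scale schedule γ_0 = R/ε and γ = γ_0^{1/d}, not the paper's exact stopping rule.

```python
import math

# Hedged sketch of complexity-proportional inference: an instance of norm
# |x| only needs the levels up to the first scale covering it, so the cost
# grows with log|x| rather than with the full depth d.  R, eps, d are
# illustrative values.
def levels_needed(x_norm, R=10.0, eps=0.1, d=20):
    gamma = (R / eps) ** (1.0 / d)      # gamma_0 = R/eps, gamma = gamma_0^(1/d)
    if x_norm <= eps:
        return 0
    # smallest k with eps * gamma^k >= |x|, capped at the full depth d
    return min(d, math.ceil(math.log(x_norm / eps, gamma)))

assert levels_needed(0.05) == 0      # tiny instances exit immediately
assert levels_needed(10.0) == 20     # largest instances use all d levels
assert levels_needed(1.0) < levels_needed(5.0)
```

Under this rule, the number of levels evaluated is logarithmic in the instance norm, which is the sense in which computation tracks the log-norm complexity measure described above.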
Finally, our learning model proves advantageous in scenarios involving multiple users learning the target function at particular scales of data instances: it provides computational savings by using each user's learned model as prior information when training the model for the next user.

Our current work is limited in that the derived risk bounds apply only when the parameters of our model are discrete and the hypothesis set is finite. Section 6 provided an example of a parameterization in which diffeomorphisms are expressed using parameters with bounded norms; the hypothesis set then becomes finite when these parameters are discretized due to, for instance, real-world memory constraints. While the multiscale entropy-based training procedure (Algorithm 1) remains well-defined when the hypothesis set is infinite, an extension of our technique is needed to analyze the statistical risk comprehensively. It is plausible that leveraging the Lipschitz property of the hypothesis set, as demonstrated in the derivation of the risk bound for the classical Gibbs distribution with an infinite hypothesis set in (Xu and Raginsky, 2017), could prove beneficial. We defer the exploration of this direction to future investigations.

Acknowledgments

I would like to express my sincere appreciation to the anonymous reviewers for their constructive comments and suggestions, which significantly enriched the quality of this manuscript. This research was supported by Leverhulme Trust grant ECF-2023-189 and Isaac Newton Trust grant 23.08(b).

References

Amir R. Asadi and Emmanuel Abbe. Chaining meets chain rule: Multilevel entropic regularization and training of neural networks. Journal of Machine Learning Research, 21(139):1–32, 2020.

Amir R. Asadi and Po-Ling Loh. Entropic regularization of neural networks: Self-similar approximations. Journal of Statistical Planning and Inference, 233:106181, 2024.

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú.
Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pages 7234–7243, 2018.

Jean-Yves Audibert and Olivier Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–889, 2007.

Peter L. Bartlett, Steven N. Evans, and Philip M. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018.

Peter Baxandall and Hans Liebeck. Vector Calculus. New York: Oxford University Press, 1986.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 2020.

Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.

Eugenio Clerico, Amitis Shidani, George Deligiannidis, and Arnaud Doucet. Chained generalisation bounds. In Proceedings of Machine Learning Research, pages 4212–4212, 2022.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

Weinan E. Principles of Multiscale Modeling.
Cambridge University Press, 2011.

Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Alastair Fletcher and Vladimir Markovic. Decomposing diffeomorphisms of the sphere. Bulletin of the London Mathematical Society, 44(3):599–609, 2012.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Deborah Hughes-Hallett, Andrew M. Gleason, and William G. McCallum. Calculus: Single and Multivariable. John Wiley & Sons, 2020.

Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.

Shahar Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797–3803, 2008.

Michael Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2), 2004.

M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.

Ivan Niven. Mathematics of Choice: Or, How to Count Without Counting, volume 15. MAA, 1965.

John P. Nolan. Univariate Stable Distributions. Springer, 2020.

Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop, pages 26–30. IEEE, 2016.

Daniel Russo and James Zou. How much does your data exploration overfit?
Controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.

Gennady Samoradnitsky and Murad S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Routledge, 2017.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Didier Sornette. Critical Phenomena in Natural Sciences: Chaos, Fractals, Self-organization and Disorder: Concepts and Tools. Springer Science & Business Media, 2006.

Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.

Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2524–2533, 2017.