# Conditionally Strongly Log-Concave Generative Models

Florentin Guth*¹, Etienne Lempereur*¹, Joan Bruna², Stéphane Mallat³

There is a growing gap between the impressive results of deep image generative models and classical algorithms that offer theoretical guarantees. The former suffer from mode collapse or memorization issues, limiting their application to scientific data. The latter require restrictive assumptions such as log-concavity to escape the curse of dimensionality. We partially bridge this gap by introducing conditionally strongly log-concave (CSLC) models, which factorize the data distribution into a product of conditional probability distributions that are strongly log-concave. This factorization is obtained with orthogonal projectors adapted to the data distribution. It leads to efficient parameter estimation and sampling algorithms, with theoretical guarantees, although the data distribution is not globally log-concave. We show that several challenging multiscale processes are conditionally log-concave using wavelet packet orthogonal projectors. Numerical results are shown for physical fields such as the φ⁴ model and weak lensing convergence maps with higher resolution than in previous works.

1. Introduction

Generative modeling requires the ability to estimate an accurate model of a probability distribution from a training dataset, as well as the ability to efficiently sample from this model. Any such procedure necessarily introduces errors, due to limited expressivity of the model class, learning errors of selecting the best model within that class, and sampling errors due to limited computational resources. For high-dimensional data, it is highly challenging to control all errors with polynomial-time algorithms.

*Equal contribution. ¹Département d'informatique, École Normale Supérieure, Paris, France. ²Courant Institute of Mathematical Sciences and Center for Data Science, New York University, USA. ³Collège de France, Paris, France, and Flatiron Institute, New York, USA. Correspondence to: Florentin Guth, Etienne Lempereur. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Overcoming the curse of dimensionality requires exploiting structural properties of the probability distribution. For instance, theoretical guarantees can be obtained with restrictive assumptions of log-concavity, or with low-dimensional parameterized models. In contrast, recent deep-learning-based approaches such as diffusion models (Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022) have obtained impressive results for distributions which do not satisfy these assumptions. Unfortunately, in such cases, theoretical guarantees are lacking, and diffusion models have been found to memorize their training data (Carlini et al., 2023; Somepalli et al., 2022), which is inappropriate for scientific applications.

The disparity between these two approaches highlights the need for models which combine theoretical guarantees with sufficient expressive power. This paper contributes to this objective by defining the class of conditionally strongly log-concave distributions. We show that it is sufficiently rich to model the probability distributions of complex multiscale physical fields, and that such models can be sampled with fast algorithms with provable guarantees.

Sampling and learning guarantees.
While the theory for sampling log-concave distributions is well-developed (Chewi, 2023), simultaneous learning and sampling guarantees for general non-log-concave distributions are less common. Block et al. (2020) establish a fast mixing rate of multiscale Langevin dynamics under a manifold hypothesis. Koehler et al. (2022) studies the asymptotic efficiency of score-matching compared to maximum-likelihood estimation under a global log-Sobolev inequality, which is not quantitative beyond globally log-concave distributions. Chen et al. (2022b;a) establish polynomial sampling guarantees for a reverse score-based diffusion, given a sufficiently accurate estimate of the time-dependent score. Sriperumbudur et al. (2013); Sutherland et al. (2018); Domingo-Enrich et al. (2021) study density estimation with energy-based models under different infinite-dimensional parametrizations of the energy. They use various metrics including score-matching to establish statistical guarantees that avoid the curse of dimensionality, under strong smoothness or sparsity assumptions of the target distribution. Finally, Balasubramanian et al. (2022) derive sampling guarantees in Fisher divergence of Langevin Monte-Carlo beyond Conditionally Strongly Log-Concave Generative Models log-concave distributions. While these hold under a general class of target distribution, such Fisher guarantees are much weaker than Kullback-Leibler guarantees. Bridging this gap requires some structural assumptions on the distribution. Multiscale generative models. Images include structures at all scales, and several generative models have relied on decompositions with wavelet transforms (Yu et al., 2020; Gal et al., 2021). More recently, Marchand et al. (2022) established a connection between the renormalization group in physics and a conditional decomposition of the probability distribution of wavelet coefficients across scales. These models rely on maximum likelihood estimations with iterated Metropolis sampling, which leads to a high computational complexity. They have also been used with score matching (Guth et al., 2022; Kadkhodaie et al., 2023) in the context of score-based diffusion models (Song et al., 2021), which suffer from memorisation issues. Conditionally strongly log-concave distributions. We consider probability distributions whose Gibbs energy is dominated by quadratic interactions, Z e E(x) with E(x) = 1 2x TKx + V (x). The matrix K is positive symmetric and V is a non-quadratic potential. If V is non-convex, then p is a priori not logconcave. However, the Hessian of E may be dominated by the large eigenvalues of K, whose corresponding eigenvectors define directions in which p is log-concave. For multiscale stationary distributions, K is a convolution whose eigenvalues have a power-law growth at high frequencies. As a result, the conditional distribution of high frequencies given lower frequencies may be log-concave. Section 2 introduces factorizations of probability distributions into products of conditional distributions with arbitrary hierarchical projectors. If the projectors are adapted to obtain strongly log-concave factors, we prove that maximum likelihood estimation can be replaced by score matching, which is computationally more efficient. The MALA sampling algorithm also has a fast convergence due to the conditional log-concavity. Section 3 describes a class of multiscale physical processes that admit conditionally strongly log-concave (CSLC) decompositions with wavelet packet projections. 
This class includes the φ⁴ model studied in statistical physics. These results thus provide an approach to provably avoid the numerical instabilities at phase transitions observed in such models. We then show in Section 4 that wavelet packet CSLC decompositions provide accurate models of cosmological weak lensing images, synthesized as test data for the Euclid outer-space telescope mission (Laureijs et al., 2011).

The main contributions of the paper are:
- The definition of general CSLC models, which provide learning guarantees by score matching and sampling convergence bounds with MALA.
- CSLC models of multiscale physical fields using wavelet packet projectors. We show that φ⁴ and weak lensing both satisfy the CSLC property, which leads to efficient and accurate generative modeling.

The code to reproduce our numerical experiments is available at https://github.com/Elempereur/WCRG.

2. Conditionally Strongly Log-Concave Models

Section 2.1 introduces conditionally strongly log-concave models, by factorizing the probability density into conditional probabilities. For these models, Sections 2.2 and 2.3 give upper bounds on learning errors with score matching algorithms, and Section 2.4 on sampling errors with a Metropolis-Adjusted Langevin Algorithm (MALA). Proofs of the mathematical results can be found in Appendix E.

2.1. Conditional Factorization and Log-Concavity

We introduce a probability factorization based on orthogonal projections on progressively smaller-dimensional spaces. The projections are adapted to define strongly log-concave conditional distributions.

Orthogonal factorization. Let $x \in \mathbb{R}^d$. A probability distribution $p(x)$ can be decomposed into a product of autoregressive conditional probabilities

$$p(x) = p(x[1]) \prod_{i=2}^{d} p\big(x[i] \mid x[1], \dots, x[i-1]\big). \tag{1}$$

However, more general factorizations can be obtained by considering blocks of variables in an orthogonal basis. We initialize the decomposition with $x_0 = x$. For $j = 1$ to $J$, we recursively split $x_{j-1}$ into two orthogonal projections $x_j = G_j x_{j-1}$ and $\bar x_j = \bar G_j x_{j-1}$, where $G_j$ and $\bar G_j$ together define a unitary operator, so that $G_j^T G_j + \bar G_j^T \bar G_j = \mathrm{Id}$. It follows that

$$x_{j-1} = G_j^T x_j + \bar G_j^T \bar x_j. \tag{2}$$

Let $d_j = \dim(x_j)$ and $\bar d_j = \dim(\bar x_j)$, so that $d_{j-1} = d_j + \bar d_j$. Since the decomposition is orthogonal, for any probability distribution $p$ we have $p(x_{j-1}) = p(x_j, \bar x_j) = p(x_j)\,p(\bar x_j \mid x_j)$. Cascading this decomposition $J$ times gives

$$p(x) = p(x_J) \prod_{j=1}^{J} p(\bar x_j \mid x_j), \tag{3}$$

which generalizes the autoregressive factorization (1). The properties of the factors $p(\bar x_j \mid x_j)$ depend on the choice of the orthogonal projectors $G_j$ and $\bar G_j$, as we shall see below.

Model learning and sampling. A parametric model $p_\theta(x)$ of $p(x)$ can be defined from Equation (3) by computing parametric models of $p(x_J)$ and of each $p(\bar x_j \mid x_j)$:

$$p_\theta(x) = p_{\theta_J}(x_J) \prod_{j=1}^{J} p_{\bar\theta_j}(\bar x_j \mid x_j), \tag{4}$$

with $\theta = (\theta_J, (\bar\theta_j)_{j \le J})$. Learning this model then amounts to optimizing the parameters $\theta_J, (\bar\theta_j)_j$ from available data, so that the resulting distributions are close to the target. We measure the associated learning errors with the Kullback-Leibler divergences

$$\epsilon^L_J = \mathrm{KL}_{x_J}\big(p(x_J) \,\|\, p_{\theta_J}(x_J)\big) \quad \text{and} \quad \bar\epsilon^L_j = \mathbb{E}_{x_j}\Big[\mathrm{KL}_{\bar x_j}\big(p(\bar x_j \mid x_j) \,\|\, p_{\bar\theta_j}(\bar x_j \mid x_j)\big)\Big], \quad j \le J.$$
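To make the recursive factorization concrete, here is a minimal one-dimensional sketch of the orthogonal split and of the reconstruction of Equation (2). It uses a Haar average/difference pair as a stand-in for the wavelet packet projectors of Section 3; the Haar choice, signal size, and depth are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def haar_split(d):
    """Orthogonal low/high-pass pair (G, Gbar) mapping R^d to R^(d/2) each."""
    G, Gbar = np.zeros((d // 2, d)), np.zeros((d // 2, d))
    for p in range(d // 2):
        G[p, 2 * p] = G[p, 2 * p + 1] = 1 / np.sqrt(2)                        # local average
        Gbar[p, 2 * p], Gbar[p, 2 * p + 1] = 1 / np.sqrt(2), -1 / np.sqrt(2)  # local difference
    return G, Gbar

def decompose(x, J):
    """Cascade x -> (x_J, xbar_1, ..., xbar_J) as in Equation (3)."""
    details, xj = [], x
    for _ in range(J):
        G, Gbar = haar_split(xj.shape[0])
        xj, xbar = G @ xj, Gbar @ xj
        details.append(xbar)
    return xj, details

def reconstruct(xJ, details):
    """Invert the cascade with Equation (2): x_{j-1} = G^T x_j + Gbar^T xbar_j."""
    xj = xJ
    for xbar in reversed(details):
        G, Gbar = haar_split(2 * xj.shape[0])
        xj = G.T @ xj + Gbar.T @ xbar
    return xj

x = np.random.default_rng(0).normal(size=16)
xJ, details = decompose(x, J=3)
print(np.allclose(reconstruct(xJ, details), x))   # True: G^T G + Gbar^T Gbar = Id
```

Any unitary split can be substituted for the Haar pair; the choice of projectors only affects the properties of the conditional factors, as discussed next.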
Once the parameters have been estimated, we sample from $p_\theta$ as follows. We first compute a sample $x_J$ of $p_{\theta_J}$. The sampling introduces an error, which we measure with $\epsilon^S_J = \mathrm{KL}_{x_J}\big(\hat p_{\theta_J}(x_J) \,\|\, p_{\theta_J}(x_J)\big)$, where $\hat p_{\theta_J}$ is the law of the samples returned by the algorithm. For each $j \le J$, given the sampled $x_j$, we compute a sample $\bar x_j$ of $p_{\bar\theta_j}(\bar x_j \mid x_j)$ and recover $x_{j-1}$ with Equation (2), down to $j = 1$, which computes $x = x_0$. Let $\hat p_{\bar\theta_j}$ be the law of the computed samples $\bar x_j$. It also introduces an error

$$\bar\epsilon^S_j = \mathbb{E}_{x_j}\Big[\mathrm{KL}_{\bar x_j}\big(\hat p_{\bar\theta_j}(\bar x_j \mid x_j) \,\|\, p_{\bar\theta_j}(\bar x_j \mid x_j)\big)\Big], \quad j \le J.$$

Let $\hat p$ be the (joint) law of the computed samples $x$. The following proposition relates the total variation distance $\mathrm{TV}(\hat p, p)$ to the learning and sampling errors at each $j$.

Proposition 2.1 (Error decomposition).

$$\mathrm{TV}(\hat p, p) \le \frac{1}{\sqrt 2}\left(\sqrt{\epsilon^L_J + \sum_{j \le J} \bar\epsilon^L_j} \;+\; \sqrt{\epsilon^S_J + \sum_{j \le J} \bar\epsilon^S_j}\right).$$

The overall error depends on the sum of learning and sampling errors for each conditional probability distribution. Therefore, to control the total error, we need sufficient conditions ensuring that each of these sources of error is small. We introduce CSLC models for this purpose.

Figure 1. A globally log-concave distribution is conditionally log-concave (top left), but the converse is not true (top right): a non-convex support can have convex vertical slices (and horizontal projection). Conditional log-concavity also depends on the choice of orthogonal projectors: a distribution can fail to be conditionally log-concave in the canonical basis (bottom left) but be conditionally log-concave after a rotation of 45 degrees (bottom right).

Conditional strong log-concavity. We recall that a distribution $p$ is strongly log-concave (SLC) if there exist $\beta[p] \ge \alpha[p] > 0$ such that

$$\alpha[p]\,\mathrm{Id} \preceq -\nabla^2_x \log p(x) \preceq \beta[p]\,\mathrm{Id}, \quad \forall x. \tag{5}$$

Definition 2.1. We say that $p(x) = p(x_J) \prod_{j=1}^J p(\bar x_j \mid x_j)$ is conditionally strongly log-concave (CSLC) if each $p(\bar x_j \mid x_j)$ is strongly log-concave in $\bar x_j$ for all $x_j$.

Conditional log-concavity is a weaker condition than (joint) log-concavity. If $p(x)$ is log-concave, then it has a convex support. On the other hand, conditional log-concavity only constrains slices (through conditioning) and projections (through marginalization) of the support of $p(x)$. Figure 1 illustrates that a jointly log-concave distribution is conditionally log-concave (and $p(x_J)$ is furthermore log-concave), but the converse is not true. Conditional log-concavity also depends on the choice of the orthogonal projections $G_j$ and $\bar G_j$, which need to be adapted to the data. A major issue is to identify projectors that define a CSLC decomposition, if it exists. We show in Section 3 that this can be achieved for a class of physical fields with wavelet packet projectors.

The following subsections provide bounds on the learning and sampling errors $\bar\epsilon^L_j$ and $\bar\epsilon^S_j$ for CSLC models. To simplify notations, in the following we drop the index $j$ and replace $p_{\bar\theta_j}(\bar x_j \mid x_j)$ with $p_{\bar\theta}(\bar x \mid x)$. We shall suppose that the dimension $d_J = \dim(x_J)$ is sufficiently small so that $x_J$ can be modeled and generated with any standard algorithm with small errors $\epsilon^L_J$ and $\epsilon^S_J$ ($d_J = 1$ in our numerical experiments).
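As a concrete instance of the bottom row of Figure 1, the following sketch checks conditional strong log-concavity numerically on a two-variable coupled double-well energy. The energy and the coupling strength `kappa` are illustrative assumptions in the spirit of the φ⁴ model of Section 3, not the exact example used in the paper.

```python
import numpy as np

kappa = 1.5   # assumed coupling strength; each marginal has a double-well potential

def energy(x1, x2):
    """Coupled double-well energy with two modes near (1, 1) and (-1, -1)."""
    return (x1**2 - 1)**2 + (x2**2 - 1)**2 + kappa * (x1 - x2)**2

def hessian(x1, x2):
    return np.array([[12 * x1**2 - 4 + 2 * kappa, -2 * kappa],
                     [-2 * kappa, 12 * x2**2 - 4 + 2 * kappa]])

# Bimodality: two wells of equal depth separated by a higher saddle at the origin.
print(energy(1, 1), energy(-1, -1), energy(0, 0))
# Joint log-concavity fails: the Hessian at the saddle has a negative eigenvalue.
print(np.linalg.eigvalsh(hessian(0.0, 0.0)))

# After the 45-degree rotation u = (x1 + x2)/sqrt(2), v = (x2 - x1)/sqrt(2),
# the curvature of the conditional energy of v given u is
#   d^2 E / dv^2 = (12 x1^2 - 4)/2 + (12 x2^2 - 4)/2 + 4*kappa >= 4*kappa - 4 > 0.
grid = np.linspace(-3, 3, 201)
d2 = np.array([[(12 * ((u - v) / np.sqrt(2))**2 - 4) / 2
                + (12 * ((u + v) / np.sqrt(2))**2 - 4) / 2
                + 4 * kappa for v in grid] for u in grid])
print(d2.min())   # 2.0: p(v | u) is strongly log-concave although p(x1, x2) is not log-concave
```

The rotated coordinates play the role of $x_1$ and $\bar x_1$: the marginal in $u$ keeps the two modes, while every conditional slice in $v$ has curvature bounded below by $4\kappa - 4 > 0$.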
2.2. Learning Guarantees with Score Matching

Fitting probabilistic models $p_{\bar\theta}(\bar x \mid x)$ by directly minimizing the KL errors $\bar\epsilon^L$ is computationally challenging because of intractable normalization constants. Strong log-concavity enables efficient yet accurate learning via a tight relaxation to score matching.

There exist several frameworks to fit a parametric probabilistic model to the data, most notably the maximum-likelihood estimator of a general energy-based model $p_{\bar\theta}(\bar x \mid x) = \bar Z_{\bar\theta}^{-1}(x)\, e^{-\bar E_{\bar\theta}(x, \bar x)}$, where $\bar E_{\bar\theta}$ is an arbitrary parametric class. This is computationally expensive due to the need to estimate the gradients of the normalization constants $\nabla_{\bar\theta} \log \bar Z_{\bar\theta} = -\mathbb{E}_{p_{\bar\theta}}[\nabla_{\bar\theta} \bar E_{\bar\theta}]$ during training, which requires the ability to sample from $p_{\bar\theta}(\bar x \mid x)$. An appealing alternative which has enjoyed recent popularity is score matching (Hyvärinen & Dayan, 2005), which instead minimizes the Fisher divergence

$$\mathrm{FI}_{\bar x}\big(p(\bar x \mid x) \,\|\, p_{\bar\theta}(\bar x \mid x)\big) = \mathbb{E}_{\bar x \sim p(\cdot \mid x)}\Big[\big\|\nabla_{\bar x} \log p(\bar x \mid x) + \nabla_{\bar x} \bar E_{\bar\theta}(x, \bar x)\big\|^2\Big].$$

An integration by parts shows that, up to an additive constant and a multiplicative factor that do not depend on $\bar\theta$, the expected Fisher divergence $\mathbb{E}_x[\mathrm{FI}_{\bar x}(p \,\|\, p_{\bar\theta})]$ equals

$$\bar\ell(\bar\theta) = \mathbb{E}_{x, \bar x}\Big[\tfrac12 \big\|\nabla_{\bar x} \bar E_{\bar\theta}(x, \bar x)\big\|^2 - \Delta_{\bar x} \bar E_{\bar\theta}(x, \bar x)\Big],$$

showing that $\bar\ell(\bar\theta)$ can be minimized from available samples without estimating normalizing constants or sampling from $p_{\bar\theta}$. Indeed, given i.i.d. samples $\{(\bar x_1, x_1), \dots, (\bar x_n, x_n)\}$ from $p(\bar x, x)$, the empirical risk $\hat{\bar\ell}(\bar\theta)$ associated with score matching on $p(\bar x \mid x)$ is given by

$$\hat{\bar\ell}(\bar\theta) = \frac1n \sum_{i=1}^n \Big[\tfrac12 \big\|\nabla_{\bar x} \bar E_{\bar\theta}(x_i, \bar x_i)\big\|^2 - \Delta_{\bar x} \bar E_{\bar\theta}(x_i, \bar x_i)\Big]. \tag{7}$$

The score-matching objective avoids the computational barriers associated with normalization and sampling in high dimensions, at the expense of defining a weaker metric than the KL divergence. This weakening of the metric is quantified by the log-Sobolev constant $\rho[p]$ associated with $p$. It is the largest $\rho > 0$ such that $\mathrm{KL}(q \,\|\, p) \le \frac{1}{2\rho} \mathrm{FI}(q \,\|\, p)$ for any $q$. Learning via score matching can therefore be seen as a relaxation of maximum-likelihood training, whose tightness is controlled by the log-Sobolev constant of the hypothesis class (Koehler et al., 2022). This constant can be exponentially small for general multimodal distributions, making this relaxation too weak. A crucial exception, however, is given by SLC distributions (or small perturbations of them), as shown by the Bakry-Émery criterion (Bakry et al., 2014, Definition 1.16.1): if $\alpha[p_{\bar\theta}(\bar x \mid x)] \ge \bar\alpha > 0$ for all $x$, or equivalently if $\nabla^2_{\bar x} \bar E_{\bar\theta} \succeq \bar\alpha\, \mathrm{Id}$ for all $x, \bar x$, then $\rho[p_{\bar\theta}(\bar x \mid x)] \ge \bar\alpha$ for all $x$, and therefore

$$\bar\epsilon^L \le \frac{1}{\bar\alpha}\big(\bar\ell(\bar\theta) + c\big), \tag{8}$$

where $c$ is a constant that does not depend on $\bar\theta$. We remark that while Equation (8) does not make explicit CSLC assumptions on the reference distribution $p$, a consistent learning model implies that the conditional distribution $p(\bar x \mid x)$ is arbitrarily well approximated (in KL divergence) with SLC distributions, thus justifying the structural CSLC assumption on the target.

2.3. Score Matching with Exponential Families

In numerical applications, one cannot minimize the true score-matching loss $\bar\ell$ as only a finite amount of data is available. We now show that a similar control as Equation (8) can be obtained for the empirical loss minimizer, whenever prior information enables us to define low-dimensional exponential models for $p_{\bar\theta}(\bar x \mid x)$ with good accuracy. It also provides a control on the critical parameter $\bar\alpha$, addressing the optimization and statistical errors.

We consider a linear model $\bar E_{\bar\theta}(x, \bar x) = \bar\theta^T \bar\Phi(x, \bar x)$ with a fixed potential vector $\bar\Phi(x, \bar x) \in \mathbb{R}^m$ ($m$ is thus the number of parameters), and the corresponding minimization of the (conditional) score-matching objective in Equation (7). Thanks to this linear parameterization, it becomes a convex quadratic form $\hat{\bar\ell}(\bar\theta) = \frac12 \bar\theta^T \hat H \bar\theta - \bar\theta^T \hat g$, with

$$\hat H = \frac1n \sum_{i=1}^n \nabla_{\bar x} \bar\Phi(x_i, \bar x_i)\, \nabla_{\bar x} \bar\Phi(x_i, \bar x_i)^T \in \mathbb{R}^{m \times m}, \qquad \hat g = \frac1n \sum_{i=1}^n \Delta_{\bar x} \bar\Phi(x_i, \bar x_i) \in \mathbb{R}^m.$$

It can be minimized in closed form by inverting the Hessian matrix: $\hat{\bar\theta} = \hat H^{-1} \hat g$. As discussed, the sampling and learning guarantees of the model critically rely on the CSLC property, which is ensured as long as $\hat{\bar\theta} \in \bar\Theta_{\bar\alpha} := \{\bar\theta \mid \nabla^2_{\bar x} \bar E_{\bar\theta}(x, \bar x) \succeq \bar\alpha\, \mathrm{Id},\ \forall (x, \bar x)\}$ with $\bar\alpha > 0$. The following theorem leverages the finite-dimensional linear structure of the score-matching problem to establish fast non-asymptotic rates of convergence, controlling the excess risk in KL divergence.

Theorem 2.1 (Excess risk for CSLC exponential models). Let $\bar\theta^* = \arg\min_{\bar\theta} \bar\ell(\bar\theta)$ and $\hat{\bar\theta} = \arg\min_{\bar\theta} \hat{\bar\ell}(\bar\theta)$.
Assume: (i) θ Θ α for some α > 0, (ii) H = E h x Φ x ΦTi ηId with η > 0, Conditionally Strongly Log-Concave Generative Models (iii) the sufficient statistics Φ satisfy moment conditions E.2, regularity conditions E.3, and Φk(x, x) is M ΦLipschitz for any k m and all x (see Appendix E). Then when n > m, the empirical risk minimizer ˆ θ satisfies ˆ θ Θˆ α with E( xi,xi) ˆ α α O η 1 rm and, for t mℓ( θ ), α (1 + t) (10) with probability greater than 1 exp{ O(n log(tn/ m))} over the draw of the training data. The constants in O( ) only depend on moment and regularity properties of Φ. The theorem provides learning guarantees for the empirical risk minimizer ˆ θ (compare Equations (8) and (10)), and hinges on three key properties: the ability of the exponential family to approximate the true conditionals at each block (i) with small Fisher approximation error ℓ( θ ), (ii) with a sufficiently large strong log-concavity parameter α, and (iii) with a well-conditioned kernel H. In numerical applications, the number of parameters m should be small enough to control the learning error for finite number of samples n, and to be able to compute and invert the Hessian matrix ˆH. We will define in Section 3 low-dimensional models that can approximate a wide range of multiscale physical fields. The proof uses concentration of the empirical covariance ˆH, and combines both upper and lower tail probability bounds (Mourtada, 2022; Vershynin, 2012) to bound the expectation, similarly as known results for least-squares (Mourtada, 2022; Hsu et al., 2012). The statistical properties of score matching under exponential families have been studied in the infinite-dimensional setting by Sriperumbudur et al. (2013); Sutherland et al. (2018), where kernel ridge estimators achieve non-parametric rates n s, s < 1. Compared to these, as an intermediate result, we achieve the optimal rate in FI divergence in n 1 directly with the ridgeless estimator (Equation (36)). The key assumption is (i), namely that the optimal model in the exponential family is SLC. Since our structural assumption on the target p is precisely that its conditionals are SLC, it is reasonable to expect this to be generally true. For instance, this is the case if the model is well specified (p = p θ ). 2.4. Sampling Guarantees with MALA We illustrate the efficient sampling properties of CSLC distributions by focusing on a reference sampler given by the Metropolis-Adjusted Langevin Algorithm (MALA) with algorithmic warm-start, which enjoys well-understood convergence properties in this case: Proposition 2.2 (MALA Sampling, Altschuler & Chewi (2023, Theorem 5.1)). Suppose that αId 2 x E θ( x|x) βId for all x, x, and let d = dim( x). Then N steps of MALA produce a sample x with conditional law ˆp θ( x|x) satisfying MALA can thus be used to sample from CSLC distributions with an exponential convergence, whose mixing time e O( d β/ α) is sublinear in the dimension d and linear in the condition number β/ α of the Hessian 2 x E θ. We also note that similar guarantees will hold for other high-precision Metropolis-Hastings samplers, such as Hamilton Monte Carlo. Together, Propositions 2.1 and 2.2 and Theorem 2.1 imply a control on the total accumulated error for CSLC exponential models. 3. Wavelet Packet Conditional Log-Concavity The CSLC property depends on the choice of the projectors ( Gj, Gj) which need to be adapted to the data. We show that for a class of stationary multiscale physical processes, CSLC models can be obtained with wavelet packet projectors. 
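Before detailing those projectors, the following self-contained sketch illustrates the closed-form score-matching estimator $\hat{\bar\theta} = \hat H^{-1} \hat g$ of Section 2.3 on a one-dimensional Gaussian conditional. The sufficient statistics and the data-generating law are hypothetical choices made for illustration; they are not the multiscale potential vectors of Appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
xbar = -0.5 * x + np.sqrt(0.5) * rng.normal(size=n)   # p(xbar | x) = N(-0.5 x, 0.5)

# Hypothetical sufficient statistics Phi(x, xbar) = (xbar^2 / 2, x * xbar, xbar^4 / 4).
grad_Phi = np.stack([xbar, x, xbar**3], axis=1)                      # d/dxbar of each statistic
lap_Phi = np.stack([np.ones(n), np.zeros(n), 3 * xbar**2], axis=1)   # d^2/dxbar^2 of each statistic

H_hat = grad_Phi.T @ grad_Phi / n         # empirical score-matching Hessian
g_hat = lap_Phi.mean(axis=0)              # empirical linear term
theta_hat = np.linalg.solve(H_hat, g_hat)
print(np.round(theta_hat, 2))             # close to [2, 1, 0]
```

With the model containing the true conditional energy $\bar x^2 + x\bar x$ (up to terms independent of $\bar x$), the estimate is close to $(2, 1, 0)$ and the fitted conditional curvature stays close to its true value of $2$.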
These wavelet packet models exploit the dominating quadratic interactions at high frequencies by splitting the frequency domain into sufficiently narrow bands. It reveals a powerful mathematical structure in this class of complex distributions.

3.1. Energies with Scalar Potentials

In the following, $x \in \mathbb{R}^d$ is a two-dimensional field or image with $d$ pixels. We denote by $x[i]$ the value of $x$ at pixel or location $i$. An important class of stationary probability distributions $p(x) = Z^{-1} e^{-E(x)}$ is defined in physics from an energy composed of a two-point interaction term $K$ plus a potential that is a sum of scalar potentials $v$:

$$E(x) = \frac12 x^T K x + \sum_i v(x[i]). \tag{11}$$

The matrix $K$ is a positive symmetric convolution operator. Equation (11) generalizes both zero-mean Gaussian processes (if $v = 0$ then $K$ is the inverse covariance) and distributions with i.i.d. components (if $K = 0$ then $v$ is the negative log-density of the pixel values). The energy Hessian is given by

$$\nabla^2_x E(x) = K + \mathrm{diag}\big(v''(x[i])\big)_i. \tag{12}$$

If $v''(t) < 0$ for some $t \in \mathbb{R}$ then we may get negative eigenvalues for some $x$, in which case the energy is not convex.

Equation (11) provides models of a wide class of physical phenomena (Marchand et al., 2022), including ferromagnetism. An important example is the φ⁴ energy in physics, which is a non-convex energy allowing to study phase transitions and explain the nature of numerical instabilities (Zinn-Justin, 2021). It has a kinetic energy term defined by $K = -\beta \Delta$, where $\Delta$ is a discrete Laplacian that enforces spatial regularity, and its scalar potential is $v(t) = t^4 - (1 + 2\beta)t^2$. It has a double-well shape which pushes the values of each $x[i]$ towards $+1$ and $-1$, and is thus non-convex. $\beta$ is an inverse temperature parameter. In the thermodynamic limit $d \to \infty$ of infinite system size, the φ⁴ energy has a phase transition at $\beta_c \approx 0.68$ (Kaupužs et al., 2016). At small temperature ($\beta \ge \beta_c$), the local interactions in the energy give rise to long-range dependencies. Gibbs sampling then critically slows down (Podgornik, 1996; Sethna, 2021) due to these long-range dependencies. Fast sampling can nevertheless be obtained by exploiting conditional strong log-concavity.

Assume that there exists $\gamma > 0$ such that $v''(t) \ge -\gamma$ for all $t \in \mathbb{R}$. It then follows that $\nabla^2_x E \succeq K - \gamma\,\mathrm{Id}$. We can thus obtain a convex energy by restricting $K$ to a subspace where its eigenvalues are larger than $\gamma$. The convolution $K$ is diagonalized by the Fourier transform, with positive eigenvalues that we write $\hat K(\omega)$ at all frequencies $\omega$. The value $\hat K(\omega)$ typically increases when the frequency modulus $|\omega|$ increases. A convex energy is then obtained with a projector over a space of high-frequency images, as shown in the following proposition.

Proposition 3.1 (Conditional log-concavity of scalar potential energies). Consider the energy defined in Equation (11) and assume that $-\gamma \le v'' \le \delta$ for some $\gamma, \delta > 0$ and that $\hat K(\omega) = \lambda |\omega|^\eta$ for some $\eta > 0$. Let $\bar G_1$ be an orthogonal projector onto a space of signals whose Fourier transforms have a support included in the frequencies $\omega$ such that $|\omega| \ge |\omega_0|$, with $|\omega_0| > (\gamma/\lambda)^{1/\eta}$. Then the conditional probability $p(\bar x_1 \mid x_1)$ is strongly log-concave for all $x_1$.

The proof is in Appendix F and relies on a direct calculation of the Hessian of the conditional energy. This proposition proves that we obtain a strongly log-concave conditional distribution $p(\bar x_1 \mid x_1)$ with a sufficiently high-frequency filter $\bar G_1$. It is illustrated in the bottom row of Figure 1 on a simplified two-dimensional example inspired from the φ⁴ energy.
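The following one-dimensional numerical sketch checks this mechanism under assumed parameters: a circulant interaction with power-law symbol $\hat K(\omega) = \lambda|\omega|^\eta$ and a double-well scalar potential. The sizes and constants are illustrative, not the paper's two-dimensional φ⁴ configuration.

```python
import numpy as np

d, lam, eta = 64, 1.0, 2.0        # assumed power-law symbol K_hat(omega) = lam * |omega|**eta
a = 1.5                           # v(t) = t**4 - a*t**2, so v''(t) = 12*t**2 - 2*a >= -gamma
gamma = 2 * a

# Circulant interaction K with eigenvalues K_hat(omega) on the discrete frequencies.
omega = 2 * np.pi * np.fft.fftfreq(d)
kernel = np.real(np.fft.ifft(lam * np.abs(omega) ** eta))
K = kernel[(np.arange(d)[:, None] - np.arange(d)[None, :]) % d]

# High-frequency projector Gbar_1: eigenvectors of K with eigenvalue above gamma,
# i.e. frequencies |omega| > (gamma / lam) ** (1 / eta) as in Proposition 3.1.
w, U = np.linalg.eigh(K)
Gbar = U[:, w > gamma].T

rng = np.random.default_rng(0)
for x in [np.zeros(d)] + [2 * rng.normal(size=d) for _ in range(4)]:
    hess = K + np.diag(12 * x**2 - 2 * a)                     # Equation (12): K + diag(v''(x))
    print(np.linalg.eigvalsh(hess).min(),                     # full Hessian: can be negative
          np.linalg.eigvalsh(Gbar @ hess @ Gbar.T).min())     # conditional Hessian: stays positive
```

The simplified two-dimensional Figure 1 example mentioned above behaves in the same way.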
The distribution has two modes x = (1, 1) and x = ( 1, 1), and the Fourier coefficients are computed with a 45 degrees rotation: x1 = (x[1] + x[2])/ 2 and x1 = (x[2] x[1])/ 2, which leads to a log-concave conditional distribution. Multiscale physical fields with scalar potential energies (11) are often self-similar over scales, in the sense that lowerfrequency fields xj can also be described with an energy in the form of Equation (11), with different parameters (Wilson, 1971). This explains why Proposition 3.1 can be iterated to obtain a CSLC decomposition. For φ4 energies, the range of G1 is non-empty as soon as β 1 2, which includes the critical temperature βc 0.68 (though δ = ). At the critical temperature, x1 is further described by the same parameters K and v as x, so that a complete CSLC decomposition is obtained by iteratively selecting projectors Gj which isolate the highest frequencies of xj 1. Proposition 3.1 can be extended to general energies 2x TKx + V (x), by assuming that the Hessian 2V (x) is bounded above and below. Conditional log-concavity may then be found by exploiting dominating quadratic energy terms with a PCA of K. We believe that this general principle may hold beyond the case of scalar potential energies (11) considered here. 3.2. Wavelet Packets and Renormalization Group We now define wavelet packet projectors Gj and Gj, which are orthogonal projectors on localized zones of the Fourier plane. They are computed by convolutions with conjugate mirror filters and subsamplings (Coifman et al., 1992), described in Appendix A. These filters perform a recursive split of the frequency plane illustrated in Figure 2. The wavelet packet Gj is a projector on a high-frequency domain, whereas Gj is a projection on the remaining lowerfrequency domain. An orthogonal wavelet transform is a particular example, which decomposes the Fourier plane into annuli of about one octave bandwidth, as shown in the top left and bottom panels of Figure 2. However, it may not be sufficiently well localized in the Fourier domain to obtain strictly convex energies. The frequency localization is improved by refining this split, as illustrated on the top right panel of Figure 2. Each Gj then performs a projection over a frequency annulus whose bandwidth is a half octave. Wavelet packets can adjust the frequency bandwidth to 2 M+1 octave for any integer M 1. It allows reducing the support of Gj, which is necessary to obtain a CSLC decomposition according to Proposition 3.1. 3.3. Multiscale Scalar Potentials The probability distribution p(x) is approximated by pθ(x) = pθJ(x J) QJ j=1 p θj( xj|xj), where each xj and xj are computed with wavelet packet projectors Gj and Gj. We introduce a parameterization of p θj with scalar potential energies, following Marchand et al. (2022). We shall suppose that the dimension d J = dim(x J) is sufficiently small so that p(x J) may be approximated with any standard algorithm (d J = 1 in our numerical experiments). The self-similarity property of multiscale fields with scalar Conditionally Strongly Log-Concave Generative Models Figure 2. Top: frequency localization of the decomposition (x J, x J, . . . , x1) with wavelet packet projectors of 1 (left) and 1/2 (right) octave bandwidths. Bottom: iterative decomposition of x = x0 with ( Gj, Gj) implementing a wavelet packet transformation over J = 2 layers of 1 octave bandwidth. 
energies motivates the definition of each p θj( xj|xj) with an interaction energy E θj(xj, xj) = 1 2 x T j Kj xj + x T j K jxj + X i vj(xj 1[i]) = θT j Φj(xj, xj), (13) which derives from the fact that p(xj 1) defines an energy of the form (11) (Marchand et al., 2022). Φj captures the interaction terms and performs a parametrized approximation of vj, defined in Appendix B.1. The parameters θj are estimated from samples by inverting the empirical score matching Hessian as in Section 2.3. We generate samples from the resulting distribution pθ by sampling from pθJ and then iteratively from each p θj with MALA. The learning and sampling algorithms are summarized in Appendix B.2. Additionally, Appendix D explains that a parameterized model of the global energy (11), which is crucial for scientific applications, can be recovered with free-energy score matching. 4. Numerical Results This section demonstrates that a wavelet packet decomposition of φ4 scalar fields and weak-lensing cosmological fields defines strongly log-concave conditional distributions. It allows efficient learning and sampling algorithms, and leads to higher-resolution generations than in previous works. 4.1. φ4 Scalar Potential Energy We learn a wavelet packet model of φ4 scalar fields at different temperatures, using the decomposition and models presented in Section 3. The wavelet packet exploits the conditionally strongly log-concave property of φ4 scalar fields (Proposition 3.1) to obtain a small error in the generated samples, as shown in Section 2. We first verify qualitatively and quantitatively that this error is small. We evaluate the wavelet packet model at three different temperatures, which have different statistical properties: β = 0.50, the disorganized state, β = 0.68 βc the critical point, and β = 0.76 the organized state. The computational efficiency of our approach enables generating high-resolution 128 128 images, as opposed to 32 32 in Marchand et al. (2022). Indeed, learning the model parameters for 64 64 images with score matching takes seconds on GPU, whereas doing the same with maximum likelihood takes hours on CPU (as sequential MCMC steps are not easily parallelized). The generated samples are shown in Figure 3 and are qualitatively indistinguishable from the training data. The experimental setting is detailed in Appendix C. A distribution p(x) having a scalar potential energy (11) is a maximum-entropy distribution constrained by secondorder moments and hence by the power spectrum, and by the marginal distribution of all x[i]. These statistics specify the matrix K and the scalar potential v(t). Our model pθ also has a scalar potential energy in this case. To guarantee that pθ = p, it is thus sufficient to show that they have the same power spectrum and same marginal distributions. We perform a quantitative validation of generated samples by comparing their marginal densities and Fourier spectrum with the training data. Figure 3 shows that these statistics are well recovered by our model. 4.2. Conditional Log-Concavity We numerically verify that φ4 at critical temperature is CSLC (Definition 2.1), with appropriate wavelet packet projectors. It amounts to verifying that the eigenvalues of the conditional Hessian 2 xj E θj(xj, xj) are positive for all xj and xj. We can restrict xj to typical samples from p(xj). However, it is important that the Hessian be positive even for xj outside of the support of p( xj|xj). 
Indeed, negative eigenvalues occur at local directional maxima of the energy, rather than minima which would correspond to most likely samples. We thus evaluate the Hessian at xj = 0, which is expected to be such an adversarial point. Figure 4 shows distributions of eigenvalues of 2 xj E θj for decompositions ( Gj, Gj) of various frequency bandwidths. It shows that the smallest eigenvalues become larger and eventually cross zero as the frequency bandwidth of Gj Conditionally Strongly Log-Concave Generative Models Figure 3. Comparison between training and generated samples for φ4 energies. In columns: training samples, generated samples, histograms of marginal distributions p(x[i]) and power spectrum. In rows: disorganized state β = 0.50, critical point β = 0.68 βc, and organized state β = 0.76. becomes narrower, as predicted by Proposition 3.1. Furthermore, the condition number of the Hessian becomes smaller as eigenvalues concentrate towards their mean. As shown in Equation (12), both the quadratic part K and the scalar potential v contribute to the Hessian. As a way to visualize both contributions, we define the equivalent scalar potential v0 as v0(t) = v(t) + Tr(K) 2d t2. It corresponds to extracting the mean quadratic value Tr(K)/2d x 2 from the quadratic part and reinterpreting it as a scalar potential. This allows visualizing the average energy on a pixel value when neglecting spatial correlations. The right panel of Figure 4 compares these equivalent scalar potentials for the energy Ej of xj and the conditional energy Ej. It shows that the non-convex double-well potential in the global energy becomes convex after the conditioning. It verifies Proposition 3.1, as the mean quadratic value becomes larger when we restrict K to a subspace of high-frequency signals. We also verify the sampling efficiency predicted by Proposition 2.2. As we cannot evaluate the KL divergences ϵS j , we rather compute the decorrelation mixing time τ, a measure of the number of steps of conditional MALA to reach a given fixed error threshold averaged over all scales j. The precise definition is given in Appendix C.3. We compare it with the decorrelation mixing time τ of MALA on the non-convex global energy E. Sampling maps of size d from the global φ4 energy E at the critical temperature requires a number of steps τ d1.0 (Zinn-Justin, 2021). This phenomena is known as critical slowing down (Podgornik, 1996; Sethna, 2021), Figure 4. Conditional strong log-concavity of φ4 at critical temperature. All scales j yield similar results. Left: distribution of eigenvalues of 2 xj E θj for different frequency bandwidths (j = 1 is shown). Right: equivalent scalar potentials vj and vj (j = 3 is shown). Figure 5. Mixing times for direct (τ) and conditional ( τ) sampling for φ4 at critical temperature. a consequence of long-range correlations. We numerically show that our algorithm does not suffer from it. Figure 5 indeed demonstrates an empirical scaling τ d0.35. Note that this is not directly comparable with Proposition 2.2 as the decorrelation mixing time defines a different convergence rate than the KL mixing time. 4.3. Application to Cosmological Data We now apply our algorithm to generate high-resolution weak lensing convergence maps (Bartelmann & Schneider, 2001; Kilbinger, 2015) with an explicit probability model. Weak lensing convergence maps measure the bending of light near large gravitational masses on two-dimensional slices of the universe. 
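The quantitative comparisons in Figure 3, and in Figure 6 below for weak lensing, rest on two statistics: the marginal distribution of pixel values and the isotropic power spectrum. Here is a minimal sketch of how such a comparison can be computed from two batches of images; the batch shapes and random inputs are placeholders, not the paper's data.

```python
import numpy as np

def isotropic_power_spectrum(batch):
    """Radially averaged power spectrum of a batch of square images, shape (n, d, d)."""
    n, d, _ = batch.shape
    power = (np.abs(np.fft.fft2(batch)) ** 2).mean(axis=0) / d**2
    kx, ky = np.meshgrid(np.fft.fftfreq(d), np.fft.fftfreq(d), indexing="ij")
    radius = np.sqrt(kx**2 + ky**2).ravel()
    edges = np.linspace(0, radius.max(), d // 4)
    which = np.digitize(radius, edges)
    return np.array([power.ravel()[which == b].mean() for b in range(1, len(edges))])

rng = np.random.default_rng(0)
train = rng.normal(size=(16, 128, 128))       # placeholder for training maps
generated = rng.normal(size=(16, 128, 128))   # placeholder for samples from the model

# Power spectrum comparison.
print(np.abs(isotropic_power_spectrum(train) - isotropic_power_spectrum(generated)).max())

# Marginal distributions p(x[i]) pooled over pixels, compared on common bins.
edges = np.linspace(min(train.min(), generated.min()), max(train.max(), generated.max()), 101)
hist_train, _ = np.histogram(train, bins=edges, density=True)
hist_gen, _ = np.histogram(generated, bins=edges, density=True)
print(np.abs(hist_train - hist_gen).max())
```

In practice the two batches would be training maps and samples generated with the algorithms of Appendix B.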
We used simulated convergence maps computed by the Columbia lensing group (Zorrilla Matilla et al., 2016; Gupta et al., 2018) as training data. They simulate the next generation outer-space telescope Euclid of the European Space Agency (Laureijs et al., 2011), which will be launched in 2023 to accurately determine the large scale geometry of the universe governed by dark matter. Estimating the probability distribution of such maps is therefore an outstanding problem (Marchand et al., 2022). We demonstrate that the CSLC property is surprisingly verified in this real-world example, and can be used to efficiently model Conditionally Strongly Log-Concave Generative Models Figure 6. Comparison between training and generated samples for weak-lensing maps. Upper left: histograms of marginal distributions p(x[i]). Lower left: power spectrum. Center: training samples. Right: generated samples. and generate these complex fields. We use the same models and algorithms as for the φ4 energy. The experimental setting is detailed in Appendix C. Figure 6 shows that our generated samples are visually highly similar to the training data. Quantitatively, they have nearly the same power spectrum. The marginal distribution of all x[i] are also nearly the same, with a long tail corresponding to high amplitude peaks, which are typically difficult to reproduce. As opposed to microcanonical simulations with moment-matching algorithms (Cheng & M enard, 2021), we compute an explicit probability distribution model, which is exponential. As a maximum-entropy model, it has a higher entropy than the true distribution, and therefore does not suffer from lack of diversity. By relying on the CSLC property, we can use the fast score-matching algorithm and compute 128 128 images, at four times the 32 32 resolution than with a maximum-likelihood algorithm used in Marchand et al. (2022). Figure 7 shows the equivalent scalar potentials of the conditional energies at all scales, which are all convex and thus verify the CSLC property of weak lensing model. It demonstrates that this property can be used to efficiently model and generate high-resolution complex data. 5. Discussion We introduced conditionally strongly log-concave (CSLC) models and proved that they lead to efficient learning with score matching and sampling with MALA, while controlling errors. These models rely on iterated orthogonal projections of the data that are adapted to its distribution. We showed mathematically and numerically that complex multiscale physical fields satisfy the CSLC property with wavelet Figure 7. Equivalent scalar potentials vj at each scale j for weaklensing maps (normalized for viewing purposes). packet projectors. The argument is general and relies on the presence of a quadratic (kinetic) energy term which ensures strong log-concavity at high-frequencies. It provides high-quality and efficient generation of high-resolution fields even when the underlying distribution is unknown. The CSLC property guarantees diverse generations without memorization issues, which is critical in scientific applications. CSLC models can be extended by introducing latent variables. The guarantees of Section 2 extend to the case where the data is a marginal of a CSLC distribution. 
A notable example is a score-based diffusion model, for which the data x = x0 is a marginal of a higher-dimensional process (xt)t whose conditionals p(xt δ|xt) are approximately Gaussian white when δ is small, thus introducing a tradeoff between the number of terms in the CSLC decomposition and the condition number of its factors. Score diffusion is a generic transformation, but it assumes that the score xt log p(xt) can be estimated with deep networks at any t 0 (Song et al., 2021; Ho et al., 2020). For high-resolution images, the score estimation often uses conditional multiscale decompositions with or without wavelet transforms (Saharia et al., 2021; Ho et al., 2022; Dhariwal & Nichol, 2021; Guth et al., 2022). Understanding the log-concavity properties of natural image distributions under such transformations is a promising research avenue to understand the effectiveness of score-based diffusion models. Acknowledgments This work was partially supported by a grant from the PRAIRIE 3IA Institute of the French ANR-19-P3IA-0001 program. We thank Misaki Ozawa for providing the φ4 training dataset and his helpful advice on the numerical experiments. We thank the anonymous reviewers and area chair whose feedback have improved the paper significantly. Conditionally Strongly Log-Concave Generative Models Altschuler, J. M. and Chewi, S. Faster high-accuracy logconcave sampling via algorithmic warm starts. ar Xiv preprint ar Xiv:2302.10249, 2023. Bakry, D., Gentil, I., Ledoux, M., et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014. Balasubramanian, K., Chewi, S., Erdogdu, M. A., Salim, A., and Zhang, S. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for langevin monte carlo. In Conference on Learning Theory, pp. 2896 2923. PMLR, 2022. Bartelmann, M. and Schneider, P. Weak gravitational lensing. Physics Reports, 340:291 472, 2001. ISSN 0370-1573. doi: 10.1016/S0370-1573(00) 00082-X. URL https://doi.org/10.1016/ S0370-1573(00)00082-X. Block, A., Mroueh, Y., Rakhlin, A., and Ross, J. Fast mixing of multi-scale langevin dynamics under the manifold hypothesis. ar Xiv preprint ar Xiv:2006.11166, 2020. Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tram er, F., Balle, B., Ippolito, D., and Wallace, E. Extracting training data from diffusion models. ar Xiv preprint ar Xiv:2301.13188, 2023. Chen, H., Lee, H., and Lu, J. Improved analysis of scorebased generative modeling: User-friendly bounds under minimal smoothness assumptions. ar Xiv preprint ar Xiv:2211.01916, 2022a. Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. ar Xiv preprint ar Xiv:2209.11215, 2022b. Cheng, S. and M enard, B. Weak lensing scattering transform: dark energy and neutrino mass sensitivity. Monthly Notices of the Royal Astronomical Society, 507(1):1012 1020, 07 2021. ISSN 0035-8711. doi: 10.1093/mnras/stab2102. URL https://doi.org/ 10.1093/mnras/stab2102. Chewi, S. Log-Concave Sampling. draft, 2023. Coifman, R. R., Meyer, Y., and Wickerhauser, V. Wavelet analysis and signal processing. In In Wavelets and their applications. Citeseer, 1992. Daubechies, I. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992. doi: 10.1137/ 1.9781611970104. URL https://epubs.siam. org/doi/abs/10.1137/1.9781611970104. Dhariwal, P. and Nichol, A. Diffusion models beat GAN on image synthesis. 
ar Xiv preprint ar Xiv:2105.05233, 2021. Domingo-Enrich, C., Bietti, A., Vanden-Eijnden, E., and Bruna, J. On energy-based models with overparametrized shallow neural networks. In International Conference on Machine Learning, pp. 2771 2782. PMLR, 2021. Gal, R., Hochberg, D. C., Bermano, A., and Cohen-Or, D. Swagan: A style-based wavelet-driven generative model. ACM Transactions on Graphics (TOG), 40(4):1 11, 2021. Gupta, A., Matilla, J. M. Z., Hsu, D., and Haiman, Z. Non-gaussian information from weak lensing data via deep learning. Phys. Rev. D, 97: 103515, May 2018. doi: 10.1103/Phys Rev D.97. 103515. URL https://link.aps.org/doi/10. 1103/Phys Rev D.97.103515. Guth, F., Coste, S., De Bortoli, V., and Mallat, S. Wavelet score-based generative modeling. In Advances in Neural Information Processing Systems, 2022. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1 33, 2022. Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on learning theory, pp. 9 1. JMLR Workshop and Conference Proceedings, 2012. Hyv arinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. Kadkhodaie, Z., Guth, F., Mallat, S., and Simoncelli, E. P. Learning multi-scale local conditional probability models of images. In International Conference on Learning Representations, volume 11, 2023. Kaupuˇzs, J., Melnik, R. V. N., and Rimˇs ans, J. Corrections to finite-size scaling in the φ4 model on square lattices. International Journal of Modern Physics C, 27(09):1650108, 2016. doi: 10.1142/ S0129183116501084. URL https://doi.org/10. 1142/S0129183116501084. Kilbinger, M. Cosmology with cosmic shear observations: a review. Reports on Progress in Physics, 78 (8):086901, jul 2015. doi: 10.1088/0034-4885/78/8/ 086901. URL https://dx.doi.org/10.1088/ 0034-4885/78/8/086901. Conditionally Strongly Log-Concave Generative Models Koehler, F., Heckett, A., and Risteski, A. Statistical efficiency of score matching: The view from isoperimetry. ar Xiv preprint ar Xiv:2210.00726, 2022. Laureijs, R., Amiaux, J., Arduini, S., Augueres, J.-L., Brinchmann, J., Cole, R., Cropper, M., Dabin, C., Duvet, L., Ealet, A., et al. Euclid definition study report. ar Xiv preprint ar Xiv:1110.3193, 2011. Lee, G. R., Gommers, R., Waselewski, F., Wohlfahrt, K., and O Leary, A. Pywavelets: A python package for wavelet analysis. Journal of Open Source Software, 4(36): 1237, 2019. doi: 10.21105/joss.01237. URL https: //doi.org/10.21105/joss.01237. Mallat, S. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell., 11:674 693, 1989. Mallat, S. A wavelet tour of signal processing. Academic Press, third edition edition, 2009. Marchand, T., Ozawa, M., Biroli, G., and Mallat, S. Wavelet conditional renormalization group. ar Xiv preprint ar Xiv:2207.04941, 2022. Mourtada, J. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. The Annals of Statistics, 50(4):2157 2178, 2022. Podgornik, R. Principles of condensed matter physics. p. m. chaikin and t. c. lubensky, cambridge university press, cambridge, england, 1995. 
Journal of Statistical Physics, 83:1263 1265, 06 1996. doi: 10.1007/BF02179565. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. ar Xiv e-prints, art. ar Xiv:2204.06125, April 2022. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684 10695, 2022. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. ar Xiv preprint ar Xiv:2104.07636, 2021. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. ar Xiv preprint ar Xiv:2205.11487, 2022. Sethna, J. P. Statistical Mechanics: Entropy, Order Parameters, and Complexity, volume 14. Oxford University Press, USA, 2021. Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Diffusion art or digital forgery? investigating data replication in diffusion models. ar Xiv preprint ar Xiv:2212.03860, 2022. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyv arinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. ar Xiv preprint ar Xiv:1312.3516, 2013. Sutherland, D. J., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with nystr om kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pp. 652 660. PMLR, 2018. Vershynin, R. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655 686, 2012. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Wilson, K. G. Renormalization group and critical phenomena. ii. phase-space cell analysis of critical behavior. Physical Review B, 4(9):3184, 1971. Yu, J. J., Derpanis, K. G., and Brubaker, M. A. Wavelet flow: Fast training of high resolution normalizing flows. Advances in Neural Information Processing Systems, 33: 6184 6196, 2020. Zinn-Justin, J. Quantum Field Theory and Critical Phenomena: Fifth Edition. Oxford University Press, 04 2021. ISBN 9780198834625. doi: 10.1093/oso/ 9780198834625.001.0001. URL https://doi.org/ 10.1093/oso/9780198834625.001.0001. Zorrilla Matilla, J. M., Haiman, Z., Hsu, D., Gupta, A., and Petri, A. Do dark matter halos explain lensing peaks? Phys. Rev. D, 94:083506, Oct 2016. doi: 10.1103/ Phys Rev D.94.083506. URL https://link.aps. org/doi/10.1103/Phys Rev D.94.083506. Conditionally Strongly Log-Concave Generative Models Appendices A Definition of Wavelet Packet Projectors 12 B Score Matching and MALA Algorithms for CSLC Exponential Families 15 C Experimental Details 17 D Energy Estimation with Free-Energy Modeling 18 E Proofs of Section 2 20 F Proof of Proposition 3.1 27 A. Definition of Wavelet Packet Projectors The fast wavelet transform (Mallat, 1989) splits a signal in frequency into two orthogonal coarser signals, using two orthogonal conjugate mirror filters g and g. 
We review the construction of such filters in Appendix A.1. A description of the fast wavelet transform is then given in Appendix A.2. Finally, we define in Appendix A.3 the wavelet packet (Coifman et al., 1992) projectors (Gj, Gj) used in the numerical section 3. A.1. Conjugate Mirror Filters Conjugate mirror filters g and g satisfy the following orthogonal and reconstruction conditions: g T g = g Tg = 0, g Tg + g T g = Id. (14) In one dimension, the conditions (14) are satisfied (Mallat, 1989) by discrete filters (g(n))n Z, ( g(n))n Z whose Fourier transforms ˆg(ω) = P n g(n)e inω and ˆ g(ω) = P n g(n)e inω satisfy |ˆg(ω)|2 + |ˆg(ω + π)|2 = 2, ˆg(0) = 2, ˆ g(ω) = e iωˆg(ω + π). (15) We first design a low-frequency filter g such that ˆg(ω) satisfies (15), and then compute g with g(n) = ( 1)1 ng(1 n). (16) The choice of a particular low pass filter g is a trade-off between a good localization in space and a good localization in the Fourier frequency domain. Choosing a perfect low-pass filter g(ω) = 1ω [ π/2,π/2] leads to Shannon wavelets, which are well localized in the frequency domain but have a slow decay in space. On the opposite, a Haar wavelet filter g(n) = 2 1n {0,1} has a small support in space but is poorly localized in frequency. Daubechies filters (Daubechies, 1992) provide a good joint localization both in the spatial and Fourier domains. The Daubechies-4 wavelet is shown in Figure 8. In two dimensions (for images), wavelet filters which satisfy the orthogonality conditions in (14) can be defined as separable products of the one-dimensional filters g and g (Mallat, 2009), applied on each coordinate. It defines one low-pass filter g2 and 3 high-pass filters g2 = ( gk 2)1 k 3: g2(n1, n2) = g(n1)g(n2), g1 2(n1, n2) = g(n1) g(n2), g2 2(n1, n2) = g(n1)g(n2), g3 2(n1, n2) = g(n1) g(n2). Conditionally Strongly Log-Concave Generative Models Figure 8. Fourier transform of Daubechies-4 orthogonal filters ˆg(ω) (in green) and ˆ g(ω) (in orange). For simplicity we shall write g and g the filters g2 and g2. g outputs the concatenation of the 3 filters gk 2. A.2. Orthogonal Frequency Decomposition We introduce the orthogonal decomposition of a signal xj 1 with the low pass filter g and the high pass filter g, followed by a sub-sampling. It outputs (xj, xj), which has the same dimension as xj 1, defined in one dimension by n R2 g[n 2p]xj 1[n], n R2 g[n 2p]xj 1[n]. (18) The inverse transformation is n R2 g[p 2n]xj+1[n] + P n R2 g[p 2n] xj+1[n]. (19) The orthogonal frequency decomposition in two dimensions is defined similarly. It decomposes a signal x of size d into a low frequency signal and 3 high frequency signals, each of size A.3. Wavelet Packet Projectors An orthogonal frequency decomposition projects a signal into high and low frequency domains. In order to refine the decomposition (by separating different frequency bands), wavelet packets projectors are obtained by cascading this orthogonal frequency decomposition. The usual fast wavelet transform starts from a signal x0 of dimension d, decomposes it into a low-frequency x1 and a high frequency x1, and then iterates this decomposition on the low-frequency x1 only. It iteratively decomposes xj 1 into the lower frequencies xj and the high-frequencies xj. The resulting orthogonal wavelet coefficients are ( xj, x J)1 j J. The resulting decomposition remains of dimension d. To obtain a finer frequency decomposition, we use the M-band wavelet transform (Mallat, 2009), a particular case of wavelet packets (Coifman et al., 1992). 
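As a quick sanity check of the orthogonal frequency decomposition above, the following sketch performs one two-dimensional split and its inverse with Daubechies-4 filters using the PyWavelets package (Lee et al., 2019), on which the experiments rely; the array size is an illustrative assumption.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 64))    # illustrative 2D signal

# One orthogonal frequency split: low-pass output x1 and the three high-pass
# sub-bands xbar1 = (horizontal, vertical, diagonal), each subsampled by 2.
x1, xbar1 = pywt.dwt2(x, "db4", mode="periodization")

# Inverse transform: x = G^T x1 + Gbar^T xbar1, as in Equation (21) for j = 1.
x_rec = pywt.idwt2((x1, xbar1), "db4", mode="periodization")
print(np.allclose(x, x_rec))          # True: perfect reconstruction

# Energy is preserved, as expected for an orthogonal split.
energy_split = np.sum(x1**2) + sum(np.sum(b**2) for b in xbar1)
print(np.isclose(np.sum(x**2), energy_split))
```

The M-band refinement described next cascades this split on the high-frequency sub-bands as well.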
It first applies the fast wavelet transform to the signal, and obtains ( xj, x J)1 j J. Each high-frequency output xj undergoes an orthogonal decomposition using g and g. Then both outputs of the decomposition are again decomposed, and so on, (M 1)-times. The coefficients are then sorted according to their frequency support, and also labeled as ( xj, x J)1 j J , with J = J2M 1, also referred to as J in the main text. The wavelet packet decomposition corresponds to first decomposing the frequency domain dyadically into octaves, and then each dyadic frequency band is further decomposed into 2M 1 frequency annuli. We say this decomposition corresponds to a 1/2M 1 octave bandwidth. Precisely, if j = j 2M 1 + r, then xj has a frequency support over an annulus in the frequency Conditionally Strongly Log-Concave Generative Models Figure 9. In one dimension, a wavelet packet transform is obtain by cascading filterings and subsamplings with the filters g and g along a binary splitting tree which outputs x J and xj for j J. Figure 10. Low-frequency maps xj for M = 2 for a φ4 realization. domain, with frequencies with modulus of order 2 j π(1 2 M+1(r 1/2)). A two-dimensional visualization of the frequency domain can be found in Figure 2, for M = 1 and M = 2, corresponding to 1 and 1/2 octave bandwidths. Figure 9 shows the iterative use of g and g used to obtain the decomposition, in one dimension, for M = 2. Note that the filters g and g successively play the role of lowand high-pass filters because of the subsampling (Mallat, 2009). We now introduce the corresponding orthogonal projectors Gj and Gj, defined such that xj = Gjxj 1, xj = Gjxj 1, (20) where the ( xj)j, sorted in frequency, have been obtained trough the M-band wavelet transform, as described above, and xj refers to the signal reconstructed using (x J, xj )j j+1. Let us emphasize that the image xj 1 is reconstructed from xj and the higher frequencies xj, and defined on a spatial grid which is either the same as xj or twice larger. For M = 2, Figure 10 shows that x0 and x1 are defined on the same grid, although x1 has a lower-frequency support. Similarly x2 and x3 are both represented on the same grid, which is twice smaller, and so on. The orthogonal projectors satisfy GT j Gj + GT j Gj = Id. We then have the following inverse formula: xj 1 = GT j xj + GT j xj. (21) This decomposition using Gj and Gj recursively splits the signal in frequencies, from high to low frequencies. Conditionally Strongly Log-Concave Generative Models Figure 11. Sub-bands of xj for a wavelet packet decomposition with a half-octave bandwidth. B. Score Matching and MALA Algorithms for CSLC Exponential Families B.1. Multiscale Energies This section introduces the explicit parametrization of the energies E θj and EθJ. The conditional energies E θj(xj, xj) are defined with a bilinear term which represents the interaction between xj and xj and a scalar potential: E θj(xj, xj) = 1 2 x T j Kj xj + X l>j x T j K l,j xj+l + X i vj(xj 1[i]), (22) with xj 1 = GT j xj + GT j xj. Equation (22) is an equivalent reparametrization of Equation (13). Considering ( xl)l>j instead of xj allows fixing some coefficients of the K l,j to zero instead of learning them. First, we set K l,j = 0 if xj and xj+l are not defined on the same spatial grid. In the sequel, sums over l only refer to theses terms, which differ depending on the wavelet decomposition. We enforce spatial stationarity by averaging the bilinear interaction terms across space. 
We further kept only the non-negligible terms which correspond to neighboring frequencies and neighboring spatial locations. As displayed in Figure 11, xj is composed of sub-bands xk j . We kept the interaction terms xk j [i] xk+δk j+l [i + δi] for l {0, 1}, δk {0, 1}, and δi {0, 1, 2, 3, 4}2, which correspond to local interactions in both space and frequency. The scalar potential vj(t) is decomposed on a family of predefined functions ρk,j(t): k αk,j ρk,j(t). (23) ρj,k is defined in order to expand the scalar potential vj which captures the marginal distributions of the xj 1[i], which do not depend on i due to stationarity. We divide this marginal into N quantiles. Each ρk,j is chosen to be a regular bump function having a finite support on the k-th quantile. This parametrization performs a pre-conditioning of the score matching Hessian. Let ρ be a bump function with a support in [ 1/2, 1/2]. For each j, let aj,k and lj,k be respectively the center and width of the k-th quantile of the marginal distribution of xj, we define ρk,j(t) = lk Conditionally Strongly Log-Concave Generative Models with the condition ρ 2 2 = 1 Gj 2 2 , (25) in order to balance the magnitude of the scalar potentials with the quadratic potentials. The potential vector is thus Φj(xj, xj) = i xk j [i] xk+δk j+l [i + δi], X i ρk ,j(xj 1[i]) 0 l 1,0 δk 1,0 δi 4,1 k N . (26) Similarly, we define EθJ as the sum of a quadratic energy and a scalar potential: EθJ(x J) = 1 2x T J KJx J + X i v J(x J[i]). (27) The bilinear interaction terms are averaged across space to enforce stationarity. The scalar potential v J(t) is also decomposed over a family of predefined functions ρk,J(t): k αk,J ρk,J(t), (28) defined similarly as above. This yields a potential vector i x J[i]x J[i + δi], ρk,J(x J) 0 δi 4,1 k N , (29) EθJ(x J) = θT J ΦJ(x J), (30) with θJ = (KJ, αk,J)k. B.2. Pseudocode The procedure to learn the parameters ( θj)j of the conditional energies E θj(xj, xj) by score matching is detailed in Algorithm 1. The procedure to generate samples from the distribution pθ(x) with MALA is detailed in Algorithm 2. Algorithm 1 Score matching for exponential families with CSLC distributions Require: Training samples (xi)1 i n. Initialize xi 0 = xi for 1 i n. for j = 1 to J do Decompose xi j Gjxi j 1 and xi j Gjxi j 1 for 1 i n. Compute the score matching quadratic term Hj 1 n Pn i=1 xj Φj(xi j, xi j) xj Φj(xi j, xi j)T Rm m. Compute the score matching linear term gj 1 n Pn i=1 xj Φj(xi j, xi j) Rm. Set θj H 1 j gj. end for return Model parameters ( θj)j. Conditionally Strongly Log-Concave Generative Models Algorithm 2 MALA sampling from CSLC distributions Require: Model parameters ( θj)j, an initial sample x J from p(x J), step sizes (δj)j, number of steps (Tj)j. for j = J to 1 do Initialize xj,0 = 0. for t = 1 to Tj do Sample yj,t N xj,t 1 δj xj E θj(xj, xj,t 1), 2δj Id . Set a = xj E θj(xj, yj,t)) 2 + xj E θj(xj, xj,t 1)) 2 . Set b = D yj,t xj,t 1, xj E θj(xj, yj,t) xj E θj(xj, xj,t 1) E . Set c = E θj(xj, yj,t) E θj(xj, xj,t 1). Compute acceptance probability p = exp δj 2b c . Set xj,t = yj,t with probability p and xj,t = xj,t 1 with probability 1 p. end for Reconstruct xj 1 = GT j xj + GT j xj,Tj. end for return a sample x0 from ˆpθ(x). C. Experimental Details C.1. Datasets Simulations of φ4. 
C. Experimental Details

C.1. Datasets

Simulations of $\phi^4$. We used samples from the $\phi^4$ model generated with a classical MCMC algorithm, at 3 different temperatures: at the critical temperature $\beta_c \approx 0.68$, above the critical temperature at $\beta = 0.50 < \beta_c$, and below the critical temperature at $\beta = 0.76 > \beta_c$. For $\beta = 0.76$, we break the symmetry and only generate samples with positive mean. For each temperature, we generate $10^4$ images of size $128\times128$.

Weak lensing. We used down-sampled versions of the simulated convergence maps from the Columbia Lensing Group (http://columbialensing.org/; Zorrilla Matilla et al., 2016; Gupta et al., 2018). Each map, originally of size $1024\times1024$, is downsampled twice with local averaging. We then extract random patches of size $128\times128$. To pre-process the data, we subtract the minimum of the pixel values over the entire dataset, and then take the square root. This process is reversed after generating samples. We also discard the outliers (less than 1% of the dataset) with pixels above a certain cutoff, in order to reduce the extent of the tail and attenuate weak lensing peaks. Our dataset is made of $4\times10^3$ images.

C.2. Experimental Setup

Wavelet filter. We used the Daubechies-4 wavelet (Daubechies, 1992); the filter is shown in Figure 8.

Wavelet packets. We implemented wavelet packets in PyTorch, inspired by the PyWavelets software (Lee et al., 2019). The source code is available at https://github.com/Elempereur/WCRG.

Score matching. We pre-condition the score matching Hessian $H_j$ by normalizing its diagonal before computing $H_j^{-1}g_j$ in Algorithm 1. After this normalization, we obtain condition numbers $\kappa_{\bar\theta_j}$ which satisfy $\kappa_{\bar\theta_j} \le 2\cdot10^3$ at all $j$.

Sampling. The MALA step sizes $\delta_j$ are adjusted to obtain an optimal acceptance rate of 0.57. Depending on the scale $j$, the stationary distribution is reached in $T_j \approx 20$ to $400$ iterations from a white noise initialization. We used a qualitative stopping criterion based on the quality of the match between histograms and power spectra.

C.3. Mixing Times in MALA

Sampling from $p_\theta$ requires sampling from $p_{\theta_J}$, and then conditionally sampling from $\bar p_{\bar\theta_j}(\bar x_j|x_j)$. This last step is performed with a Markov chain whose stationary distribution is $\bar p_{\bar\theta_j}(\bar x_j|x_j)$ for a given $x_j$. It generates successive samples $\bar x_j(t)$, where $t$ is the step number in the Markov chain. We introduce the conditional auto-correlation function
$$A_j(t) = \frac{\mathbb{E}\big[\big(\bar x_j(t) - \mathbb{E}[\bar x_j\,|\,x_j]\big)\,\big(\bar x_j(0) - \mathbb{E}[\bar x_j\,|\,x_j]\big)\big]}{\mathbb{E}\big[\delta\bar x_j^2\big]},$$
where the expected value $\mathbb{E}$ is taken with respect to both $x_j$ and the sampled $\bar x_j$. $A_j(t)$ has an exponential decay. Let $\tau_j$ be the mixing time, defined as the time it takes for the Markov chain to generate two independent samples:
$$A_j(t) \approx A_j(0)\,e^{-t/\tau_j}.$$
$\tau_j$ is computed by regressing $\log A_j(t)$ over $t$. Each iteration of MALA on $\bar p_{\bar\theta_j}(\bar x_j|x_j)$ computes a gradient of size $\bar d_j$. In order to estimate the real computational cost of sampling $p_\theta$, we average the $\tau_j$ proportionally to the dimensions $\bar d_j$:
$$\tau = \sum_j\frac{\bar d_j}{d}\,\tau_j + \frac{d_J}{d}\,\tau_J,$$
where $d$ is the dimension of $x$.
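The following sketch illustrates how such a mixing time can be estimated by regressing $\log A_j(t)$ against $t$. It uses a synthetic AR(1) chain with a known correlation time; centering by the empirical mean (instead of the conditional mean $\mathbb{E}[\bar x_j|x_j]$) and the lag cutoffs are illustrative simplifications, not the exact procedure used in our experiments.

```python
import numpy as np

def mixing_time(chain):
    """Estimate a mixing time tau from a (T, d) Markov-chain trajectory.

    The empirical autocorrelation A(t) is averaged over coordinates and
    normalized so that A(0) = 1, and tau is obtained by regressing log A(t)
    against t over roughly the first three decay times.
    """
    centered = chain - chain.mean(axis=0)
    var = np.mean(centered**2)
    max_lag = min(200, chain.shape[0] // 4)        # illustrative cutoff
    lags = np.arange(1, max_lag)
    A = np.array([np.mean(centered[t:] * centered[:-t]) for t in lags]) / var
    below = np.nonzero(A < np.exp(-3))[0]          # stop before the noise floor
    cut = below[0] if len(below) else len(A)
    slope = np.polyfit(lags[:cut], np.log(A[:cut]), 1)[0]
    return -1.0 / slope                            # A(t) ~ exp(-t / tau)

# Synthetic AR(1) chain with known correlation time tau = -1 / log(rho) ~ 9.5.
rng = np.random.default_rng(0)
rho, T, d = 0.9, 20_000, 16
chain = np.zeros((T, d))
for t in range(1, T):
    chain[t] = rho * chain[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(d)
print(mixing_time(chain), -1.0 / np.log(rho))      # both close to ~9.5
```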
D. Energy Estimation with Free-Energy Modeling

This section explains how to recover an explicit parametrization of the negative log-likelihood $-\log p_\theta$ from the parameterized energies $\bar E_{\bar\theta_j}$. We introduce a parameterization of the normalization constants of the Gibbs energies for each $j$, and describe an efficient score-matching algorithm to learn the parameters. This leads to a decomposition of the negative log-likelihood $-\log p_\theta$ over scales.

D.1. Free-Energy Score Matching

From the decomposition $p_\theta(x) = p_{\theta_J}(x_J)\prod_{j=1}^J\bar p_{\bar\theta_j}(\bar x_j|x_j)$, we have
$$-\log p_\theta(x) = E_{\theta_J}(x_J) + \sum_j\Big(\bar E_{\bar\theta_j}(x_j,\bar x_j) + \log\bar Z_{\bar\theta_j}(x_j)\Big) + \mathrm{cst}, \tag{31}$$
where $\bar Z_{\bar\theta_j}(x_j)$ is the normalization constant of $\bar E_{\bar\theta_j}(x_j,\bar x_j)$. To retrieve the global negative log-likelihood $-\log p_\theta(x)$, we thus compute an approximation of the free energy $-\log\bar Z_{\bar\theta_j}(x_j)$ with a parametric family $\tilde F_{\tilde\theta_j}$. The parameters $\tilde\theta_j$ of the approximation of the normalizing factors $\bar Z_{\bar\theta_j}$ can be learned in a manner similar to denoising score matching. Indeed, using the identity
$$\nabla_{x_j}\log\bar Z_{\bar\theta_j}(x_j) = -\,\mathbb{E}\big[\nabla_{x_j}\bar E_{\bar\theta_j}(x_j,\bar x_j)\,\big|\,x_j\big],$$
which can be proven by a direct computation of the gradient, the parameters $\tilde\theta_j$ can be estimated by minimizing
$$\tilde\ell_j(\tilde\theta_j) = \mathbb{E}\Big[\big\|\nabla_{x_j}\tilde F_{\tilde\theta_j}(x_j) - \nabla_{x_j}\bar E_{\bar\theta_j}(x_j,\bar x_j)\big\|^2\Big]. \tag{32}$$
For an exponential model $\tilde F_{\tilde\theta_j} = \tilde\theta_j^T\tilde\Phi_j$ with a fixed potential vector $\tilde\Phi_j$, Equation (32) is quadratic in $\tilde\theta_j$ and admits a closed-form solution:
$$\tilde\theta_j = \mathbb{E}\big[\nabla_{x_j}\tilde\Phi_j\,\nabla_{x_j}\tilde\Phi_j^T\big]^{-1}\,\mathbb{E}\big[\nabla_{x_j}\tilde\Phi_j\,\nabla_{x_j}\bar E_{\bar\theta_j}(x_j,\bar x_j)\big].$$
We finally obtain the energy decomposition
$$-\log p_\theta(x) = E_{\theta_J}(x_J) + \sum_j\Big(\bar E_{\bar\theta_j}(x_j,\bar x_j) - \tilde F_{\tilde\theta_j}(x_j)\Big) + \mathrm{cst}. \tag{33}$$
This score-based method is much faster and simpler to implement than likelihood-based methods such as the thermodynamic integration of Marchand et al. (2022), which requires generating many samples while varying the parameters $\bar\theta_j$ of the conditional energy $\bar E_{\bar\theta_j}$.

D.2. Parameterized Free-Energy Models

The potential vector $\tilde\Phi_j$ is modeled in the class of Equation (11), following Marchand et al. (2022) and similarly to Appendix B.1:
$$\tilde F_{\tilde\theta_j}(x_j) = \frac12\,x_j^T\tilde K_jx_j + \tilde V_j(x_j), \qquad \tilde V_j(x_j) = \sum_i\tilde v_j(x_j[i]), \qquad \tilde v_j(t) = \sum_k\tilde\alpha_{j,k}\,\tilde\rho_{j,k}(t),$$
which gives $\tilde\theta_j = (\tilde K_j, \tilde\alpha_{j,k})_k$ and an associated potential vector
$$\tilde\Phi_j(x_j) = \Big(\tfrac12\,x_jx_j^T,\ \tilde\rho_{j,k}(x_j)\Big)_{1\le k\le N}.$$

D.3. Multiscale Energy Decomposition

We now expand the models for the conditional energies $\bar E_{\bar\theta_j}$ and the so-called free energies $\tilde F_{\tilde\theta_j}$ in Equation (33). All the quadratic terms $(K_J, \bar K_j, \tilde K_j)_j$ can be regrouped into an equivalent quadratic term $K$. We then have
$$-\log p_\theta(x) = \frac12 x^TKx + \sum_i\Big(v_J(x_J[i]) + \sum_j\big(v_j(x_{j-1}[i]) - \tilde v_j(x_j[i])\big)\Big) + \mathrm{cst} = \frac12 x^TKx + \sum_i\Big(v_1(x_0[i]) + \sum_j\big(v_{j+1}(x_j[i]) - \tilde v_j(x_j[i])\big)\Big) + \mathrm{cst},$$
with $v_{J+1} = v_J$. This defines multiscale scalar potentials $V_j$:
$$V_j = v_{j+1} - \tilde v_j, \qquad V_0 = v_1,$$
such that we obtain the global negative log-likelihood, or energy function,
$$-\log p_\theta(x) = \frac12 x^TKx + \sum_j\sum_i V_j(x_j[i]).$$
For $\phi^4$ at the critical temperature, as derived in Marchand et al. (2022), the only non-zero scalar potential is $V_0$; the other $V_j$ potentials are zero, up to a quadratic term. As a numerical test, Figure 12 verifies that for $\phi^4$ at the critical temperature, $v_{j+1}$ and $\tilde v_j$ indeed cancel out, so that $V_j = 0$ for $j > 0$. In order to ensure that the quadratic difference mentioned above vanishes, we subtract from $\tilde v_j$ the quadratic interpolation of $\tilde v_j - v_{j+1}$.

Figure 12. For $\phi^4$ at $\beta_c$, the conditional potentials $v_{j+1}$ and the free-energy potentials $\tilde v_j$ cancel out. Only $j = 1$ is shown; other scales show a similar behavior.
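The closed-form estimation of $\tilde\theta_j$ can be illustrated on a toy Gaussian example where the free energy is known exactly. The coupling $\lambda$, the features $\tilde\Phi$ and all function names below are illustrative assumptions, not the multiscale parametrization of Appendix D.2.

```python
import numpy as np

# Toy conditional energy E(x, xbar) = 0.5 * xbar^2 + lam * x * xbar, so that
# p(xbar | x) = N(-lam * x, 1) and the exact free-energy gradient is
# grad_x F(x) = E[grad_x E(x, xbar) | x] = -lam^2 * x.
lam = 0.7
rng = np.random.default_rng(0)
n = 50_000
x = rng.standard_normal(n)
xbar = -lam * x + rng.standard_normal(n)

# Linear-in-parameters free-energy model F(x) = theta^T Phi(x) with
# Phi(x) = (0.5 * x^2, x); this feature choice is illustrative.
def grad_phi(x):
    # Rows are d/dx of each feature Phi_k, evaluated at the n sample points.
    return np.stack([x, np.ones_like(x)], axis=0)

target = lam * xbar                      # grad_x E(x, xbar) for each sample
G = grad_phi(x)                          # shape (m, n)
H = G @ G.T / n                          # empirical E[grad Phi grad Phi^T]
g = G @ target / n                       # empirical E[grad Phi * grad_x E]
theta = np.linalg.solve(H, g)
print(theta)                             # close to (-lam**2, 0) = (-0.49, 0)
```

The recovered first coefficient approaches $-\lambda^2$, which matches the exact free energy of this Gaussian example up to an additive constant, mirroring the regression of $\nabla_{x_j}\tilde F_{\tilde\theta_j}$ onto $\nabla_{x_j}\bar E_{\bar\theta_j}$ above.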
E. Proofs of Section 2

E.1. Proof of Proposition 2.1

Proposition 2.1 (Error decomposition).
$$\mathrm{TV}(\hat p, p) \le \frac{1}{\sqrt 2}\left(\sqrt{\epsilon^L_J + \sum_j\epsilon^L_j} + \sqrt{\epsilon^S_J + \sum_j\epsilon^S_j}\right).$$

Proof. We use the following decomposition of the KL divergence in terms of conditional distributions.

Lemma E.1. Let $p(x) = p(x_J)\prod_{j=1}^J p(\bar x_j|x_j)$ and $q(x) = q(x_J)\prod_j q(\bar x_j|x_j)$. Then
$$\mathrm{KL}(p\,\|\,q) = \mathrm{KL}\big(p(x_J)\,\|\,q(x_J)\big) + \sum_j\mathbb{E}_{x_j\sim p}\big[\mathrm{KL}\big(p(\cdot|x_j)\,\|\,q(\cdot|x_j)\big)\big].$$

Using Lemma E.1, we obtain that $\mathrm{KL}(p\,\|\,p_\theta) = \epsilon^L_J + \sum_j\epsilon^L_j$ and $\mathrm{KL}(\hat p\,\|\,p_\theta) = \epsilon^S_J + \sum_j\epsilon^S_j$. We conclude with the triangle and Pinsker inequalities:
$$\mathrm{TV}(\hat p, p) \le \mathrm{TV}(\hat p, p_\theta) + \mathrm{TV}(p_\theta, p) \le \frac{1}{\sqrt 2}\left(\sqrt{\mathrm{KL}(p\,\|\,p_\theta)} + \sqrt{\mathrm{KL}(\hat p\,\|\,p_\theta)}\right).$$

Proof of Lemma E.1. We proceed by induction over $J$. Observe that $\log p(x) = \log p(\bar x_1|x_1) + \log p(x_1)$, so
$$\mathrm{KL}(p\,\|\,q) = \mathbb{E}_p[\log p - \log q] = \mathbb{E}_p\big[\log p(x_1) - \log q(x_1)\big] + \mathbb{E}_{x_1\sim p(x_1)}\mathbb{E}_{\bar x_1\sim p(\bar x_1|x_1)}\big[\log p(\bar x_1|x_1) - \log q(\bar x_1|x_1)\big]$$
$$= \mathrm{KL}\big(p(x_1)\,\|\,q(x_1)\big) + \mathbb{E}_{x_1\sim p(x_1)}\big[\mathrm{KL}\big(p(\cdot|x_1)\,\|\,q(\cdot|x_1)\big)\big].$$
The first term $\mathrm{KL}(p(x_1)\,\|\,q(x_1))$ now involves $J-1$ factors, and hence we can apply the induction step to conclude.

E.2. Proof of Theorem 2.1

We will use a general concentration result for the empirical covariance of general distributions with mild moment assumptions (Vershynin, 2018), as well as anticoncentration properties of the random design (Mourtada, 2022). Together, they provide enough control on the probability tails so that the inverse covariance concentrates to the precision matrix in expectation.

Assumption E.1. Let $X \in \mathbb{R}^{m\times d}$ be a random matrix. Assume that there exists $K \ge 1$ such that $\|X\|_F \le K\,\mathbb{E}[\|X\|_F^2]^{1/2}$ almost surely.

Theorem E.1 (General Covariance Estimation with High Probability, (Vershynin, 2018, Theorem 5.6.1, Ex. 5.6.4)). Let $X \in \mathbb{R}^{m\times d}$ be a random matrix satisfying Assumption E.1. Let $\Sigma = \mathbb{E}[XX^T]$, and for any $n$ let $\hat\Sigma_n = \frac1n\sum_i X_iX_i^T$ be the sample covariance matrix, where the $X_i$ are $n$ iid copies of $X$. There exists an absolute constant $C$ such that for any $\delta > 0$, it holds that
$$\|\hat\Sigma_n - \Sigma\| \le C\left(\sqrt{\frac{K^2m\big(\log(m)+\log(2/\delta)\big)}{n}} + \frac{K^2m\big(\log(m)+\log(2/\delta)\big)}{n}\right)\|\Sigma\|$$
with probability at least $1-\delta$.

Assumption E.2 (Moment Condition). Assume that there exist $K_X$ and $K_Y$ such that $X := (\nabla_{\bar x}\Phi_k(\bar x, x))_{k\le m} \in \mathbb{R}^{m\times d}$ and $Y := (\Delta_{\bar x}\Phi_k(\bar x, x))_{k\le m} \in \mathbb{R}^m$ satisfy Assumption E.1 with constants $K_X$ and $K_Y$ respectively, where $(\bar x, x)\sim p(\bar x, x)$.

Assumption E.3 (Anticoncentration Condition, (Mourtada, 2022, Assumption 1)). The random matrix $X = (\nabla_{\bar x}\Phi_k(\bar x, x))_{k\le m} \in \mathbb{R}^{m\times d}$ satisfies the following: there exist constants $\bar C \ge 1$ and $\nu \in (0,1]$ such that for every $\theta \in \mathbb{R}^m\setminus\{0\}$ and $t > 0$, $\mathbb{P}\big(\theta^TXX^T\theta \le t^2\,\theta^T\mathbb{E}[XX^T]\theta\big) \le (\bar Ct)^\nu$.

Theorem E.2 ((Mourtada, 2022, Corollary 3)). Let $X \in \mathbb{R}^{m\times d}$ be a random matrix satisfying Assumption E.3 and such that $\mathbb{E}[\|X\|_F^2] < \infty$, with $\Sigma = \mathbb{E}[XX^T]$. Then, if $m/n \le \nu/6$, for every $t \in (0,1)$, the empirical covariance matrix $\hat\Sigma_n$ obtained from an iid sample of size $n$ satisfies $\hat\Sigma_n \succeq t\Sigma$ with probability greater than $1 - (\tilde Ct)^{\nu n/6}$, where $\tilde C$ only depends on $\bar C$ and $\nu$ in Assumption E.3.

Theorem E.3 (Excess risk for CSLC exponential models, Theorem 2.1 restated). Let $\theta^* = \arg\min_\theta\ell(\theta)$ and $\hat\theta = \arg\min_\theta\hat\ell(\theta)$. Assume: (i) $\theta^* \in \Theta_\alpha$ for some $\alpha > 0$; (ii) $H = \mathbb{E}\big[\nabla_{\bar x}\Phi\,\nabla_{\bar x}\Phi^T\big] \succeq \eta\,\mathrm{Id}$ with $\eta > 0$; (iii) the sufficient statistics $\Phi$ satisfy the moment conditions E.2, the regularity conditions E.3, and $\nabla_{\bar x}\Phi_k(\bar x, x)$ is $M_\Phi$-Lipschitz for any $k \le m$ and all $x$. Then, when $n > m$, the empirical risk minimizer $\hat\theta$ satisfies $\hat\theta \in \Theta_{\hat\alpha}$ with
$$\mathbb{E}_{(\bar x^i, x^i)}\big[\hat\alpha\big] \ge \alpha - O\!\left(\eta^{-1}\sqrt{\frac{m}{n}}\right), \tag{35}$$
$$\mathbb{E}_{(\bar x^i, x^i)}\big[\ell(\hat\theta)\big] \le \ell(\theta^*) + O\!\left(\kappa(H)\,\eta^{-1}\,\frac{m}{n}\right), \tag{36}$$
and, for $t \gtrsim \sqrt m\,\ell(\theta^*)$,
$$\mathbb{E}_x\big[\mathrm{KL}\big(p(\cdot|x)\,\|\,p_{\hat\theta}(\cdot|x)\big)\big] \le \frac{\ell(\theta^*)}{\alpha}\,(1+t) \tag{37}$$
with probability greater than $1 - \exp\{-O(n\log(tn/\sqrt m))\}$ over the draw of the training data. The constants in $O(\cdot)$ only depend on moment and regularity properties of $\Phi$.

Proof. We can rewrite the score-matching population risk in terms of a joint distribution $(X,Y) \in \mathbb{R}^{m\times d}\times\mathbb{R}^m$:
$$\min_\theta\ \ell(\theta) = \mathbb{E}_{(X,Y)}\left[\frac12\theta^TXX^T\theta - \theta^TY\right] = \frac12\theta^TH\theta - \theta^Tg,$$
where $H = \mathbb{E}[XX^T]$ and $g = \mathbb{E}[Y]$. The empirical objective is the quadratic form
$$\min_\theta\ \frac12\theta^T\hat H\theta - \theta^T\hat g, \tag{38}$$
with $\hat H = \frac1n\sum_{i=1}^nX_iX_i^T$ and $\hat g = \frac1n\sum_{i=1}^nY_i$. We want to control the expected excess risk $\mathbb{E}[\ell(\hat\theta)] - \ell(\theta^*)$ and the norm $\|\hat\theta - \theta^*\|$, where $\hat\theta = \hat H^{-1}\hat g$ and $\theta^* = H^{-1}g$.
Since $\ell(\theta)$ is quadratic and $\theta^*$ is its global minimum, observe that
$$\ell(\theta) - \ell(\theta^*) = \nabla_\theta\ell(\theta^*)^T(\theta-\theta^*) + \frac12(\theta-\theta^*)^T\nabla^2_\theta\ell(\theta^*)(\theta-\theta^*) = \frac12(\theta-\theta^*)^TH(\theta-\theta^*), \tag{39}$$
which shows that the excess risk can be bounded from the mean-squared error $\mathbb{E}\|\hat\theta-\theta^*\|^2$ with
$$\mathbb{E}[\ell(\hat\theta)] - \ell(\theta^*) \le \frac{\|H\|}{2}\,\mathbb{E}\|\hat\theta-\theta^*\|^2. \tag{40}$$
Let $\upsilon := \hat g - g$ and $\Upsilon := \hat H^{-1} - H^{-1}$. By definition, we have
$$\hat\theta - \theta^* = \hat H^{-1}(g+\upsilon) - H^{-1}g = \Upsilon g + \hat H^{-1}\upsilon, \tag{41}$$
so
$$\mathbb{E}\|\hat\theta - \theta^*\|^2 \le 2\,\big(\mathbb{E}\|\Upsilon\|^2\big)\,\|g\|^2 + 2\,\mathbb{E}\|\hat H^{-1}\upsilon\|^2. \tag{42}$$

Let us begin with the first term in the RHS of (42), involving $\Upsilon$. We claim that there exists $C_0$, only depending on the assumption parameters in E.2 and E.3, such that
$$\mathbb{E}\|\Upsilon\|^2 \le C_0\,\|H^{-1}\|^2\,\frac{m^2}{n^2}. \tag{43}$$
The main technical ingredient is to exploit upper and lower tail bounds of $\hat H = \hat H_n$ to establish a control in expectation, via the following lemma.

Lemma E.2 (From tail bounds to expectation). Suppose the empirical covariance $\hat\Sigma_n$ satisfies the following lower and upper tail bounds:
$$\hat\Sigma_n \preceq (1+s)\Sigma \ \text{ with probability greater than } 1-\eta_n(s),\ s\ge 0, \qquad \hat\Sigma_n \succeq (1-t)\Sigma \ \text{ with probability greater than } 1-\delta_n(t),\ t\in(0,1). \tag{44}$$
Then
$$\mathbb{E}\big\|\hat\Sigma_n^{-1} - \Sigma^{-1}\big\| \le \|\Sigma^{-1}\|\left(\int_0^\infty\delta_n\!\Big(\frac{\beta}{1+\beta}\Big)d\beta + \int_0^1\eta_n\!\Big(\frac{\beta}{1-\beta}\Big)d\beta\right), \tag{45}$$
$$\mathbb{E}\big\|\hat\Sigma_n^{-1} - \Sigma^{-1}\big\|^2 \le 2\,\|\Sigma^{-1}\|^2\left(\int_0^\infty\beta\,\delta_n\!\Big(\frac{\beta}{1+\beta}\Big)d\beta + \int_0^1\beta\,\eta_n\!\Big(\frac{\beta}{1-\beta}\Big)d\beta\right). \tag{46}$$

Thanks to Assumptions E.3 and E.1, the tail bounds of Theorems E.1 and E.2 apply, yielding
$$\delta_n(t) = \min\Big(\big(\tilde C(1-t)\big)^{\nu n/6},\ 2m\exp\big(-n^2t^2/(Cm)\big)\Big), \qquad \eta_n(s) = 2m\exp\big(-n^2s^2/(Cm)\big). \tag{47}$$
We now apply Lemma E.2 with these values. Let us first address the term in $\eta_n$. We have
$$\eta_n\big(\beta/(1-\beta)\big) = 2m\exp\big(-n^2\beta^2(1-\beta)^{-2}/(Cm)\big),$$
and hence
$$\int_0^1\beta\,\eta_n\big(\beta/(1-\beta)\big)\,d\beta \le 2m\int_0^\infty\beta\exp\big(-n^2\beta^2/(Cm)\big)\,d\beta = \frac{Cm^2}{n^2}. \tag{48}$$
Let us now study the term in $\delta_n$. For any $\beta^* > 0$ we have
$$\int_0^\infty\beta\,\delta_n\big(\beta(1+\beta)^{-1}\big)\,d\beta \le 2m\int_0^{\beta^*}\beta\exp\big(-n^2\beta^2(1+\beta^*)^{-2}/(Cm)\big)\,d\beta + \int_{\beta^*}^\infty\beta\,\big(\tilde C(1+\beta)^{-1}\big)^{\nu n/6}\,d\beta$$
$$\le \frac{Cm^2(1+\beta^*)^2}{n^2} + \frac{\tilde C^{\nu n/6}}{\nu n/6 - 2}\,(1+\beta^*)^{-\nu n/6+2}. \tag{49}$$
Picking $\beta^* = \tilde C$ above gives
$$\int_0^\infty\beta\,\delta_n\big(\beta(1+\beta)^{-1}\big)\,d\beta \le \frac{C'm^2}{n^2} + O\big(e^{-cn}\big), \tag{50}$$
where $C'$ and $c > 0$ only depend on $\nu$, $C$, $\tilde C$. From (48) and (50) we conclude that
$$\mathbb{E}\big\|\hat H_n^{-1} - H^{-1}\big\|^2 = O\!\left(\|H^{-1}\|^2\,\frac{m^2}{n^2}\right), \tag{51}$$
proving (43).

Let us now bound the second term in the RHS of (42). We have $\|\hat H^{-1}\upsilon\|^2 \le \|\hat H^{-1}\|^2\|\upsilon\|^2$, so by Cauchy-Schwarz we obtain
$$\mathbb{E}\big[\|\hat H^{-1}\upsilon\|^2\big] \le \mathbb{E}\big[\|\hat H^{-1}\|^4\big]^{1/2}\,\mathbb{E}\big[\|\upsilon\|^4\big]^{1/2}. \tag{52}$$
By assumption, we have
$$\mathbb{E}\big[\|\upsilon\|^4\big]^{1/2} \le K_Y\,\mathbb{E}\big[\|\upsilon\|^2\big] \le \frac{K_Y}{n}\,\mathbb{E}\big[\|Y\|^2\big]. \tag{53}$$
Finally, we use the following lemma, showing that $\mathbb{E}[\|\hat H^{-1}\|^4]$ is bounded.

Lemma E.3 (Finite second and fourth moments of $\hat H^{-1}$). Assume $n > 24/\nu$. Then
$$\mathbb{E}\big[\|\hat H^{-1}\|^2\big] \le C_2\,\|H^{-1}\|^2 \quad\text{and}\quad \mathbb{E}\big[\|\hat H^{-1}\|^4\big] \le C_4\,\|H^{-1}\|^4. \tag{54}$$

From (52), (53) and (54) we obtain
$$\mathbb{E}\big[\|\hat H^{-1}\upsilon\|^2\big] \le \frac{K_Y}{n}\,\sqrt{C_4}\,\|H^{-1}\|^2\,\mathbb{E}\big[\|Y\|^2\big],$$
which, together with (43), yields
$$\mathbb{E}\|\hat\theta-\theta^*\|^2 \le 2\,\|H^{-1}\|^2\left(C_0\,\frac{m^2}{n^2}\,\|\mathbb{E}[Y]\|^2 + \sqrt{C_4}\,\frac{K_Y}{n}\,\mathbb{E}[\|Y\|^2]\right),$$
and therefore
$$\mathbb{E}[\ell(\hat\theta)] - \ell(\theta^*) \le O\!\left(\kappa(H)\,\|H^{-1}\|\left(\frac{m^2}{n^2}\,\|\mathbb{E}[Y]\|^2 + \frac{K_Y}{n}\,\mathbb{E}[\|Y\|^2]\right)\right),$$
proving (36) as claimed.

Let us now control $\hat\alpha$ such that $\hat\theta \in \Theta_{\hat\alpha}$. From $-\log p_\theta(\bar x|x) = \theta^T\Phi(\bar x, x) + \mathrm{cst}(x)$ we directly obtain
$$-\nabla^2_{\bar x}\log p_{\hat\theta}(\bar x|x) = -\nabla^2_{\bar x}\log p_{\theta^*}(\bar x|x) + \sum_{k=1}^m(\hat\theta_k - \theta^*_k)\,\nabla^2_{\bar x}\Phi_k(\bar x, x),$$
and thus, for any $(\bar x, x)$,
$$\big\|\nabla^2_{\bar x}\log p_{\hat\theta}(\bar x|x) - \nabla^2_{\bar x}\log p_{\theta^*}(\bar x|x)\big\| \le \sum_k|\hat\theta_k - \theta^*_k|\,\big\|\nabla^2_{\bar x}\Phi_k(\bar x, x)\big\| \le \|\hat\theta-\theta^*\|\,\big\|\nabla^2_{\bar x}\Phi(\bar x, x)\big\| \le \|\hat\theta-\theta^*\|\,\sqrt m\,M_\Phi, \tag{58}$$
where $\|\nabla^2_{\bar x}\Phi(\bar x, x)\|^2 := \sum_{k=1}^m\|\nabla^2_{\bar x}\Phi_k(\bar x, x)\|^2$, and $M_\Phi = \max_k\sup_{\bar x, x}\|\nabla^2_{\bar x}\Phi_k(\bar x, x)\| < \infty$ by assumption (iii). It follows from (58) that
$$\inf_{(\bar x, x)}\lambda_{\min}\big(-\nabla^2_{\bar x}\log p_{\hat\theta}(\bar x, x)\big) \ge \alpha - \|\hat\theta-\theta^*\|\,\sqrt m\,M_\Phi. \tag{59}$$
We will now use tail probability bounds for the norm $\|\hat\theta-\theta^*\|$, captured in the following lemma.

Lemma E.4 (Tail bounds for $\|\hat\theta-\theta^*\|$). We have
$$\mathbb{P}\big(\|\hat\theta-\theta^*\| > t\big) \le f_n\big(t/\|H^{-1}\|\big), \tag{60}$$
with
$$f_n(s) = 2m\exp\!\left(-\frac{n^2\,(s/(2\|g\|))^2}{\big(1+s/(2\|g\|)\big)^2\,Cm}\right) + 2m\exp\!\left(-\frac{n^2\,(s/(2\|g\|))^2}{Cm}\right) + \min\!\Big(1,\ \big(2\tilde CC_Y\,s^{-1}n^{-1/2}\big)^{\nu n/6}\Big), \tag{62}$$
where $C$, $\tilde C$, $C_Y$, $\|g\|$ and $\nu$ are constants from Assumptions E.2 and E.3.
Moreover, for $s \gtrsim 1$, we have
$$f_n(s) = \exp\big(-O(n(\log n + \log s))\big). \tag{64}$$

From (39) and (59) we obtain
$$\mathbb{E}_x\big[\mathrm{KL}(p\,\|\,p_{\hat\theta})\big] \le \frac{1}{\hat\alpha}\Big(\ell(\theta^*) + \|\hat\theta-\theta^*\|^2\,\|H\|\Big) \le \frac{\ell(\theta^*) + \|\hat\theta-\theta^*\|^2\,\|H\|}{\alpha - \|\hat\theta-\theta^*\|\,\sqrt m\,M_\Phi},$$
and therefore, by Lemma E.4,
$$\mathbb{P}\left(\mathbb{E}_x\big[\mathrm{KL}(p\,\|\,p_{\hat\theta})\big] \le \frac{\ell(\theta^*) + \|H\|\,t^2}{\alpha - b\,t}\right) \ge 1 - f_n\big(t/\|H^{-1}\|\big), \qquad\text{where } b = \sqrt m\,M_\Phi.$$
As a result, for $t \gtrsim \sqrt m\,M_\Phi\,\ell(\theta^*)\,\|H^{-1}\|$ we have
$$1 - f_n\big(t/\|H^{-1}\|\big) = 1 - \exp\big(-O(n(\log t + \log n - \log m))\big), \tag{66}$$
proving (37).

Finally, let us prove (35). From (41) we have
$$\|\hat\theta-\theta^*\| \le \|\Upsilon\|\,\|g\| + \|\hat H^{-1}\|\,\|\upsilon\|. \tag{67}$$
The same argument leading to (51) can now be applied to the first moment $\mathbb{E}\|\Upsilon\|$, yielding
$$\int_0^1\eta_n\big(\beta/(1-\beta)\big)\,d\beta = 2m\int_0^1\exp\big(-n^2\beta^2(1-\beta)^{-2}/(Cm)\big)\,d\beta \le 2m\int_0^\infty\exp\big(-n^2\beta^2/(Cm)\big)\,d\beta = \frac{\sqrt{\pi C}\,m^{3/2}}{n}, \tag{68}$$
and
$$\int_0^\infty\delta_n\big(\beta(1+\beta)^{-1}\big)\,d\beta \le 2m\int_0^{\beta^*}\exp\big(-n^2\beta^2(1+\beta^*)^{-2}/(Cm)\big)\,d\beta + \int_{\beta^*}^\infty\big(\tilde C(1+\beta)^{-1}\big)^{\nu n/6}\,d\beta \le \frac{\sqrt{\pi C}\,(1+\beta^*)\,m^{3/2}}{n} + \frac{\tilde C^{\nu n/6}}{\nu n/6 - 1}\,(1+\beta^*)^{-\nu n/6+1}. \tag{69}$$
Picking again $\beta^* = \tilde C$ above gives
$$\int_0^\infty\delta_n\big(\beta(1+\beta)^{-1}\big)\,d\beta \le \frac{C'm^{3/2}}{n} + O\big(e^{-cn}\big), \quad\text{and therefore}\quad \mathbb{E}\|\Upsilon\| = O\!\left(\|H^{-1}\|\,\frac{m^{3/2}}{n}\right). \tag{70}$$
From (67), using (70) and again Cauchy-Schwarz, we obtain
$$\mathbb{E}\|\hat\theta-\theta^*\| \le \mathbb{E}[\|\Upsilon\|]\,\|g\| + \sqrt{\mathbb{E}\big[\|\hat H^{-1}\|^2\big]\;\mathbb{E}\big[\|\upsilon\|^2\big]},$$
proving (35).

Proof of Lemma E.2. Using a crude union bound, we have
$$(1-t)\Sigma \preceq \hat\Sigma_n \preceq (1+s)\Sigma \tag{72}$$
with probability greater than $1 - \delta_n(t) - \eta_n(s)$. Under the event (72), we equivalently have
$$(1+s)^{-1}\Sigma^{-1} \preceq \hat\Sigma_n^{-1} \preceq (1-t)^{-1}\Sigma^{-1},$$
and hence
$$\big\|\hat\Sigma_n^{-1} - \Sigma^{-1}\big\| \le \|\Sigma^{-1}\|\,\max\big(|1-(1+s)^{-1}|,\ |1-(1-t)^{-1}|\big).$$
Denoting $Z = \|\hat\Sigma_n^{-1} - \Sigma^{-1}\|$, we thus have
$$\mathbb{P}\big(Z \le \|\Sigma^{-1}\|\beta\big) \ge \mathbb{P}\big((1-t_\beta)\Sigma \preceq \hat\Sigma_n \preceq (1+s_\beta)\Sigma\big) \tag{73}$$
$$\ge 1 - \delta_n(t_\beta) - \eta_n(s_\beta), \tag{74}$$
where $s_\beta$ and $t_\beta$ are defined such that $|1-(1+s_\beta)^{-1}| = \beta$ and $|1-(1-t_\beta)^{-1}| = \beta$. We thus obtain $s_\beta = \frac{\beta}{1-\beta}$ for $\beta\in(0,1)$, and $t_\beta = \frac{\beta}{1+\beta}$ for $\beta\in(0,\infty)$. For a non-negative random variable $Z$ with c.d.f. $F(\beta) = \mathbb{P}(Z\le\beta)$ we have
$$\mathbb{E}[Z^2] = \int_0^\infty\beta^2F'(\beta)\,d\beta = 2\int_0^\infty\beta\big(1-F(\beta)\big)\,d\beta,$$
and therefore
$$\mathbb{E}[Z^2] \le 2\,\|\Sigma^{-1}\|^2\int_0^\infty\beta\big(1-F(\|\Sigma^{-1}\|\beta)\big)\,d\beta \le 2\,\|\Sigma^{-1}\|^2\left(\int_0^\infty\beta\,\delta_n\big(\beta/(1+\beta)\big)\,d\beta + \int_0^1\beta\,\eta_n\big(\beta/(1-\beta)\big)\,d\beta\right).$$
The first moment bound (45) is obtained in the same way.

Proof of Lemma E.3. By directly applying Theorem E.2, we have
$$\mathbb{P}\big(\|\hat H_n^{-1}\| \le t^{-1}\|H^{-1}\|\big) \ge 1 - (\tilde Ct)^{\nu n/6}. \tag{75}$$
If $F(\beta) = \mathbb{P}(\|\hat H_n^{-1}\| \le \beta)$, it follows that
$$\mathbb{E}\big[\|\hat H^{-1}\|^4\big] = \int_0^\infty\beta^4F'(\beta)\,d\beta = 4\int_0^\infty\beta^3\big(1-F(\beta)\big)\,d\beta \le 4\int_0^\infty\beta^3\min\big(1, (\tilde C\|H^{-1}\|\beta^{-1})^{\nu n/6}\big)\,d\beta = 4\,\tilde C^4\,\|H^{-1}\|^4\int_0^\infty u^3\min\big(1, u^{-\nu n/6}\big)\,du = C_4\,\|H^{-1}\|^4,$$
where we used $\nu n/6 > 4$ in the last step. The second moment is treated analogously.

Proof of Lemma E.4. As we argued previously, from (41) we have $\|\hat\theta-\theta^*\| \le \|\Upsilon\|\|g\| + \|\hat H^{-1}\|\|\upsilon\|$. We will use tail bounds for $\|\Upsilon\|$, $\|\hat H^{-1}\|$ and $\|\upsilon\|$, and combine them with a crude union bound to yield the desired tail control. Recall from Equation (73) that
$$\mathbb{P}\big(\|\Upsilon\| \le \|H^{-1}\|t\big) \ge 1 - \bar\gamma_n(t), \tag{76}$$
with
$$\bar\gamma_n(t) = \begin{cases}\delta_n\big(\tfrac{t}{1+t}\big) + \eta_n\big(\tfrac{t}{1-t}\big) & \text{if } t \le 1,\\[2pt] \delta_n\big(\tfrac{t}{1+t}\big) & \text{otherwise},\end{cases} \tag{77}$$
with
$$\delta_n(s) = \min\Big(\big(\tilde C(1-s)\big)^{\nu n/6},\ 2m\exp\big(-n^2s^2/(Cm)\big)\Big), \qquad \eta_n(s) = 2m\exp\big(-n^2s^2/(Cm)\big). \tag{78}$$
We also obtained in (75)
$$\mathbb{P}\big(\|\hat H_n^{-1}\| \le t\,\|H^{-1}\|\big) \ge 1 - \gamma_n(t), \tag{79}$$
with
$$\gamma_n(t) = \min\big(1, (\tilde Ct^{-1})^{\nu n/6}\big), \tag{80}$$
and by Assumption E.2 we know that $\|\upsilon\| \le K_Y\,\mathbb{E}[\|Y\|^2]^{1/2}/\sqrt n =: C_Y/\sqrt n$ almost surely. Therefore, via a union bound we obtain
$$\mathbb{P}\big(\|\hat\theta-\theta^*\| \le \|H^{-1}\|t\big) \ge \mathbb{P}\Big(\max\big(\|\Upsilon\|\,\|g\|,\ \|\hat H^{-1}\|\,C_Y/\sqrt n\big) \le \|H^{-1}\|\,t/2\Big) \tag{81}$$
$$\ge 1 - \bar\gamma_n\big(t/(2\|g\|)\big) - \gamma_n\big(\sqrt n\,t/(2C_Y)\big), \tag{82}$$
and hence $\mathbb{P}(\|\hat\theta-\theta^*\| > s) \le f_n(s/\|H^{-1}\|)$ with $f_n(s) = \bar\gamma_n\big(s/(2\|g\|)\big) + \gamma_n\big(\sqrt n\,s/(2C_Y)\big)$. Finally, we verify that
$$f_n(s) = \bar\gamma_n\big(s/(2\|g\|)\big) + \min\!\Big(1, \big(2\tilde CC_Y\,s^{-1}n^{-1/2}\big)^{\nu n/6}\Big) \le 2m\exp\!\left(-\frac{n^2(s/(2\|g\|))^2}{\big(1+s/(2\|g\|)\big)^2Cm}\right) + 2m\exp\!\left(-\frac{n^2(s/(2\|g\|))^2}{Cm}\right) + \min\!\Big(1, \big(2\tilde CC_Y\,s^{-1}n^{-1/2}\big)^{\nu n/6}\Big),$$
which matches (62), and that for $\log s \gtrsim 1$ the last term dominates as $n$ increases, showing (64).

F. Proof of Proposition 3.1

We directly compute the Hessian
$$-\nabla^2_{\bar x_1}\log p(\bar x_1|x_1) = \bar G_1\big(-\nabla^2_x\log p(x)\big)\bar G_1^T = \bar G_1\Big(K + \mathrm{diag}\big(v''(x[i])\big)_i\Big)\bar G_1^T,$$
where we have used $p(\bar x_1|x_1) = \frac{p(x)}{p(x_1)}$. Both terms in the Hessian can now be bounded from below.
The assumption on the range of $\bar G_1$ implies that $\bar G_1K\bar G_1^T \succeq \lambda\,|\omega_0|^\eta\,\mathrm{Id}$, and the assumption on $v''$ implies that $\bar G_1\,\mathrm{diag}\big(v''(x[i])\big)_i\,\bar G_1^T \succeq -\gamma\,\bar G_1\bar G_1^T = -\gamma\,\mathrm{Id}$, where we have used the fact that $\bar G_1$ is an orthogonal projector. Combining the two then gives
$$-\nabla^2_{\bar x_1}\log p(\bar x_1|x_1) \succeq \big(\lambda|\omega_0|^\eta - \gamma\big)\,\mathrm{Id},$$
and the assumption on $|\omega_0|$ guarantees that $\lambda|\omega_0|^\eta - \gamma > 0$. Similarly, the assumption $v'' \le \delta$ implies that
$$-\nabla^2_{\bar x_1}\log p(\bar x_1|x_1) \preceq \big(\lambda\,\Omega^\eta + \delta\big)\,\mathrm{Id},$$
where $\Omega = \sup|\omega|$ is the maximum frequency, which concludes the proof.
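The mechanism behind Proposition 3.1 can be checked numerically: in the sketch below, the full Hessian $K + \mathrm{diag}(v''(x[i]))$ is indefinite while its high-frequency restriction $\bar G_1\big(K+\mathrm{diag}(v''(x[i]))\big)\bar G_1^T$ remains positive definite. The choice of $K$, the double-well potential and the Haar high-pass $\bar G_1$ are illustrative, not the $\phi^4$ setting of the experiments.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)

# K = (-Laplacian)^2: circulant, positive semi-definite, with eigenvalues
# growing like |omega|^4 at high frequencies (an illustrative power law).
lap = np.zeros(d)
lap[0], lap[1], lap[-1] = 2.0, -1.0, -1.0
L = np.stack([np.roll(lap, i) for i in range(d)])
K = L @ L

# Double-well potential v(t) = (t^2 - 1)^2 / 4, so v''(t) = 3 t^2 - 1 >= -1.
x = 0.1 * rng.standard_normal(d)          # near the non-convex region of v
hess = K + np.diag(3 * x**2 - 1)          # -Hessian of log p at this x

# Haar high-pass rows: (G1bar x)[k] = (x[2k] - x[2k+1]) / sqrt(2), G1bar G1bar^T = Id.
G1bar = np.zeros((d // 2, d))
for k in range(d // 2):
    G1bar[k, 2 * k], G1bar[k, 2 * k + 1] = 1 / np.sqrt(2), -1 / np.sqrt(2)

print(np.linalg.eigvalsh(hess).min())                    # negative: p is not log-concave
print(np.linalg.eigvalsh(G1bar @ hess @ G1bar.T).min())  # positive: conditionally log-concave
```

The second eigenvalue is bounded below by the smallest eigenvalue of $\bar G_1K\bar G_1^T$ minus the curvature bound $\gamma = 1$ of the double well, which is exactly the lower bound used in the proof.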