
Published in Transactions on Machine Learning Research (09/2025)

Exponential tilting of subweibull distributions

F. William Townes ftownes@andrew.cmu.edu Department of Statistics and Data Science Carnegie Mellon University

Reviewed on OpenReview: https://openreview.net/forum?id=BQBk11IE7I

The class of subweibull distributions has recently been shown to generalize the important properties of subexponential and subgaussian random variables. We describe alternative characterizations of subweibull distributions, illustrate their application to concentration inequalities, and detail the conditions under which their tail behavior is preserved after exponential tilting.

1 Introduction

Subexponential and subgaussian distributions are of fundamental importance in the application of high dimensional probability to machine learning (Vershynin, 2018; Wainwright, 2019). Recently it has been shown that the subweibull class unifies the subexponential and subgaussian families, while also incorporating distributions with heavier tails (Vladimirova et al., 2020; Kuchibhotla & Chakrabortty, 2022). Informally, a q-subweibull ($q > 0$) random variable has a survival function that decays at least as fast as $\exp(-\lambda x^q)$ for some $\lambda > 0$. For example, the exponential distribution is 1-subweibull and the Gaussian distribution is 2-subweibull. Here, we provide alternative characterizations of the subweibull class and introduce a distinction between strictly and broadly subweibull distributions. As an example, the Poisson distribution is shown to be strictly subexponential ($q = 1$) but not subweibull for any $q > 1$. We demonstrate how subweibull properties can be used to prove Bernstein concentration inequalities in both heavy and light-tailed settings. Finally, we detail the conditions under which the subweibull property is preserved after exponential tilting.

To motivate this last result, consider the setting of adapting a model to a target distribution that differs from its training distribution. Maity et al. (2023) proposed an importance weighting strategy based on the assumption that the target distribution is well approximated by an exponential tilt of the training distribution. Our results will clarify particular settings where this approximation breaks down, and when one may transfer tail bounds from the training distribution to the target distribution.

2 Preliminaries

2.1 Laplace-Stieltjes transforms

Definition 2.1. The bilateral Laplace-Stieltjes transform (BLT) of a random variable $X$ with distribution function $F$ is

$$L_X(t) = E[\exp(-tX)] = \int_{-\infty}^{\infty} \exp(-tx)\,dF(x)$$

We do not restrict $X$ to be nonnegative or to have a density function. In the special case that $L_X(t) < \infty$ for all $t$ in an open interval around $t = 0$, $X$ has a moment generating function (MGF), which is $M_X(t) = E[\exp(tX)] = L_X(-t)$. The BLT can characterize the distribution even if the MGF does not exist.
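As a concrete check of the relation $M_X(t) = L_X(-t)$, the following minimal Python sketch compares a closed-form BLT with a numerical integral. The choice of the exponential distribution, the function names, and the truncation point `upper` are illustrative assumptions, not part of the original development.

```python
import math

def blt_exponential(t, rate=1.0):
    """Closed-form BLT L_X(t) = E[exp(-t X)] for X ~ Exponential(rate)."""
    if t <= -rate:
        return math.inf  # the defining integral diverges
    return rate / (rate + t)

def blt_numeric(t, rate=1.0, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the BLT integral, truncated to [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(-t * x) * rate * math.exp(-rate * x)
    return total * h

# The MGF at 0.5 is the BLT evaluated at -0.5.
mgf_at_half = blt_exponential(-0.5)
mgf_at_half_numeric = blt_numeric(-0.5)
```

For this distribution the BLT is finite on the interval $(-\text{rate}, \infty)$, which contains zero, so the MGF exists and the distribution is subexponential in the sense used below.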

Lemma 2.0.1. If the BLTs of random variables $X$ and $Y$ satisfy $L_X(t) = L_Y(t)$ for all $t$ in any nonempty open interval $(a, b) \subset \mathbb{R}$, not necessarily containing zero, then $X \overset{d}{=} Y$.

For a proof refer to Mukherjea et al. (2006). A random variable $X$ is considered subexponential iff the MGF exists (Vershynin, 2018). If $L_X(t) = \infty$ for all $t > 0$ (respectively $t < 0$), $X$ is said to have a heavy left (respectively, right) tail (Nair et al., 2022). If a tail is not heavy it is said to be light. It is well known that the one-sided Laplace-Stieltjes transform characterizes nonnegative distributions (Feller, 1971). Lemma 2.0.1 shows that the BLT characterizes any distribution with at least one light tail.

2.2 Orlicz norms

An Orlicz function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ is a nondecreasing, convex function with $\psi(0) = 0$. Unless explicitly stated below, we further restrict it to be strictly increasing.

Definition 2.2. The $\psi$-Orlicz norm of a random variable $X$ is given by

$$\|X\|_\psi = \inf\{t > 0 : E[\psi(|X|/t)] \le 1\}$$

where $\inf\{\emptyset\} = \infty$.

Since $\psi(|X|/t)$ is a decreasing function of $t$, it is clear that $E[\psi(|X|/t)] \le 1$ for all $t > \|X\|_\psi$. Like other norms, Orlicz norms have the following properties.

Homogeneity: $\|aX\|_\psi = |a|\,\|X\|_\psi$

Subadditivity: $\|X + Y\|_\psi \le \|X\|_\psi + \|Y\|_\psi$

Positive definiteness: $\|X\|_\psi = 0$ implies $X = 0$ almost surely.

Suppose $b \in \mathbb{R}$ is a constant. If $X$ is a degenerate random variable with $\Pr(X = b) = 1$, homogeneity implies $\|b\|_\psi = |b|/\psi^{-1}(1)$. Finiteness of the $\psi$-Orlicz norm is preserved under location-scale transformations. If $\|X\|_\psi < \infty$ and $a, b \in \mathbb{R}$, then

$$\|aX + b\|_\psi \le |a|\,\|X\|_\psi + \|b\|_\psi < \infty$$

If $\|X\|_\psi = \infty$, then $\|aX\|_\psi = \infty$ as well. Here are some examples of Orlicz norms. Let $\psi(x) = x^p$ for $p \ge 1$. Then $\|X\|_\psi = (E|X|^p)^{1/p}$ and $X \in L^p$ has finite moments up to order $p$ iff $\|X\|_\psi < \infty$. Here, we will focus primarily on the norm derived from the Orlicz function $\psi_q(x) = e^{x^q} - 1$, which is convex for $q \ge 1$.

Definition 2.3. The $\psi_q$-Orlicz norm ($q \ge 1$) of random variable $X$ is given by

$$\|X\|_{\psi_q} = \inf\{t > 0 : E[\exp(|X|^q/t^q)] \le 2\}$$
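To make the definition concrete, here is a minimal sketch that computes the $\psi_q$-Orlicz norm of a finitely supported random variable by bisection on $t$. The function names and the restriction to discrete distributions are assumptions made for illustration only.

```python
import math

def _safe_exp(z):
    """exp(z) with overflow mapped to infinity."""
    return math.inf if z > 700.0 else math.exp(z)

def psi_q_moment(values, probs, q, t):
    """E[exp((|X|/t)^q)] for a discrete X with Pr(X = values[i]) = probs[i]."""
    return sum(p * _safe_exp((abs(x) / t) ** q)
               for x, p in zip(values, probs) if p > 0)

def orlicz_norm_psi_q(values, probs, q, tol=1e-12):
    """Bisection for inf{t > 0 : E[exp(|X|^q / t^q)] <= 2}."""
    if all(x == 0 for x, p in zip(values, probs) if p > 0):
        return 0.0
    hi = max(abs(x) for x in values)
    while psi_q_moment(values, probs, q, hi) > 2.0:  # grow until feasible
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if psi_q_moment(values, probs, q, mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

# A constant b has norm |b| / (log 2)^(1/q), matching |b| / psi^{-1}(1).
norm_of_constant = orlicz_norm_psi_q([3.0], [1.0], q=2.0)
```

The bisection relies on the moment being decreasing in $t$, which is the same monotonicity used after Definition 2.2.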

The condition that $\|X\|_{\psi_q} < \infty$ clearly implies that $|X|^q$ is subexponential. The following lemma, modified from Pollard (2024), is useful in establishing when a $\psi$-Orlicz norm is finite.

Lemma 2.0.2. Let $c_0, K_0 > 0$ be constants and $\psi$ an Orlicz function. The following are equivalent.

(a) $E[\psi(|X|/c_0)] \le K_0$

(b) $\|X\|_\psi \le c_0 \max\{K_0, 1\}$


3 Subweibull random variables

Definition 3.1. A random variable $X$ is q-subweibull if $E[\exp(\lambda^q|X|^q)] < \infty$ for some $\lambda > 0$. $X$ is strictly q-subweibull if the condition is satisfied for all $\lambda > 0$. If $X$ is q-subweibull but not strictly so, we refer to it as broadly q-subweibull.

The first part of this definition was also proposed by Kuchibhotla & Chakrabortty (2022) and by Vladimirova et al. (2020) using a parameterization equivalent to $1/q$. Clearly $X$ is (strictly) q-subweibull if and only if $|X|^q$ is (strictly) subexponential. As an example, the Laplace distribution is broadly 1-subweibull (i.e., broadly subexponential).

Definition 3.2. The radius of convergence of a q-subweibull random variable $X$ is defined by

$$R_q = \sup\{\lambda > 0 : E[\exp(\lambda^q|X|^q)] < \infty\}$$

and if no such $\lambda > 0$ exists we adopt the convention that $R_q = 0$.

In the case of strictly q-subweibull distributions, $R_q = \infty$. $X$ has heavy tails (in the sense of Nair et al., 2022) iff it is not subexponential ($R_1 = 0$).

Lemma 3.0.1. Random variable $X$ with $\Pr(X < 0) \notin \{0, 1\}$ is q-subweibull if and only if the nonnegative random variables $A = [-X \mid X < 0]$ and $B = [X \mid X \ge 0]$ are q-subweibull. Let $R_{qx}$, $R_{qa}$, and $R_{qb}$ denote the radii of convergence for $X$, $A$, and $B$, respectively. Then $R_{qx} = \min\{R_{qa}, R_{qb}\}$.

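Proposition 3.1. The following are equivalent characterizations of a q-subweibull random variable $X$ where $q > 0$.

1. Tail bound:

(a) $\exists K_{1a} > 0$ such that $\forall t \ge 0$,

$$\Pr(|X| > t) \le 2\exp\left(-(t/K_{1a})^q\right)$$

(b) $\exists K_{1b} > 0$ such that $\limsup_{t\to\infty} \Pr(|X| > t)\exp\left((t/K_{1b})^q\right) < \infty$

2. Growth rate of absolute moments:

(a) $\exists K_2 > 0$ such that $\forall p \ge 1$, $\left(E[|X|^p]\right)^{1/p} \le K_2\,p^{1/q}$

(b) $\limsup_{p\to\infty} \dfrac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} < \infty$

3. MGF of $|X|^q$ finite in open interval of zero: $\exists K_3 > 0$ such that

(a) $\forall\, 0 < \lambda < 2^{1/q}/K_3$,

$$E[\exp(\lambda^q|X|^q)] \le \frac{1}{1 - \lambda^q K_3^q/2}$$

(b) $\forall\, 0 < \lambda \le 1/K_3$, $E[\exp(\lambda^q|X|^q)] \le \exp(K_3^q\lambda^q)$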

A similar result (excluding conditions 1b and 2b) was proven by Vladimirova et al. (2020). Our proof provides explicit constants, which reveals the connection with the Orlicz norm.
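As an illustration of conditions (1a) and (2a), the sketch below checks them numerically for a standard Gaussian with $q = 2$. The constants $K_{1a} = \sqrt{2}$ and $K_2 = 1$ are choices made here for the example and are not claimed to be optimal; the closed forms for the Gaussian tail and absolute moments are standard.

```python
import math

def gaussian_two_sided_tail(t):
    """Pr(|X| > t) for X ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def gaussian_abs_moment(p):
    """E|X|^p for X ~ N(0, 1): 2^(p/2) * Gamma((p + 1)/2) / sqrt(pi)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)

q = 2.0
K1a = math.sqrt(2.0)  # assumed constant for condition (1a)

# Condition (1a): Pr(|X| > t) <= 2 exp(-(t / K1a)^q) on a grid of t values.
tail_ok = all(
    gaussian_two_sided_tail(t) <= 2.0 * math.exp(-((t / K1a) ** q))
    for t in [0.1 * i for i in range(81)]
)

# Condition (2a): (E|X|^p)^(1/p) / p^(1/q) stays bounded as p grows.
ratios = [gaussian_abs_moment(p) ** (1.0 / p) / p ** (1.0 / q)
          for p in range(1, 101)]
```

On this grid the ratios stay below 1, so $K_2 = 1$ is a valid (if loose) moment constant for the standard Gaussian.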


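Proposition 3.2. A random variable $X$ is q-subweibull ($q \ge 1$) if and only if the $\psi_q$-Orlicz norm is finite. Furthermore, when $q \ge 1$ the constants in Proposition 3.1 are related to the norm by the global constant

$$C_q = \frac{\exp\left(2 + \dfrac{2\Gamma(1/q)(eq)^{1/q}}{eq}\right)}{(eq\log 2)^{1/q}} \tag{1}$$

such that

$$\frac{\|X\|_{\psi_q}}{C_q} \le K_j \le C_q\,\|X\|_{\psi_q}$$

for $K_j \in \{K_{1a}, K_2, K_3\}$.

An important special case is

$$C_1 = \frac{e^3}{\log 2} \tag{2}$$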

A quasinorm for $q < 1$ (heavy tails) is discussed in Kuchibhotla & Chakrabortty (2022). This relationship to the Orlicz norm allows us to restate Proposition 2.7.1(e) of Vershynin (2018) with explicit constants.

Corollary 3.0.1. If $X$ is a subexponential random variable with mean zero, then

$$E[\exp(\lambda X)] \le \exp(2K^2\lambda^2)$$

for all $|\lambda| \le 1/(2K)$ where

$$K = e\,C_1\|X\|_{\psi_1} = \frac{e^4}{\log 2}\|X\|_{\psi_1}$$
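The global constant can be evaluated directly. The sketch below assembles $C_q$ as the product $C_1 C_2 C_3$ of the three factors that appear in the appendix proofs, as read from those proofs: $(\log 2)^{-1/q}$, $\exp(a/e)(eq)^{-1/q}$ with $a = 2\Gamma(1/q)(eq)^{1/q}/q$, and $e^2$. The function name is an assumption for illustration.

```python
import math

def global_constant(q):
    """C_q = C1 * C2 * C3, with the factors taken from the appendix proofs."""
    c1 = math.log(2.0) ** (-1.0 / q)
    a_over_e = (2.0 * math.gamma(1.0 / q) * (math.e * q) ** (1.0 / q)
                / (math.e * q))
    c2 = math.exp(a_over_e) * (math.e * q) ** (-1.0 / q)
    c3 = math.e ** 2
    return c1 * c2 * c3

c_subexponential = global_constant(1.0)   # should equal e^3 / log 2
k_over_norm = math.e * c_subexponential   # K / ||X||_{psi_1} in Corollary 3.0.1
```

For $q = 1$ this reproduces the value $e^3/\log 2 \approx 28.98$ quoted in the proof of Corollary 3.0.1, which gives a quick consistency check on the formula.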

It was shown by Vladimirova et al. (2020) that a q-subweibull distribution is also r-subweibull for all $r < q$. We now show that this also implies it is strictly r-subweibull.

Corollary 3.0.2. If $X$ is q-subweibull then it is strictly r-subweibull for all $r \in (0, q)$.

Corollary 3.0.3. Every bounded random variable is strictly q-subweibull for all $q > 0$.

Proof. If $X$ is bounded then there exists $M \ge 0$ such that $|X| \le M$. Then $E[\exp(\lambda^q|X|^q)] \le \exp(\lambda^q M^q) < \infty$ for all $\lambda > 0$ and $q > 0$.

Corollary 3.0.4. If $X$ is not strictly q-subweibull with $q \ge 1$ then it is not r-subweibull for any $r > q$.

3.1 Subweibull properties of the Poisson distribution

Corollaries 3.0.2 and 3.0.4 suggest a hierarchy of distributions based on the heaviness of the tails. Broadly q-subweibull distributions, which have a finite but nonzero radius of convergence ($R_q$), serve as critical points in the transition between the strictly r-subweibull regime ($r < q$), with $R_r = \infty$, and the not r-subweibull regime ($r > q$), with $R_r = 0$. However, the transition from strictly subweibull to not subweibull can be immediate, without passing through the stage of broadly subweibull. Here we provide a simple example: the Poisson tail is lighter than any exponential tail, but heavier than any Weibull tail with $q > 1$.

Proposition 3.3. The Poisson distribution is strictly q-subweibull for $q \le 1$ but not q-subweibull for any $q > 1$.
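The dichotomy for the Poisson distribution can be seen numerically. The sketch below works in log space and tracks $\log[\Pr(X = n)\exp(\lambda n^q)]$, which is a lower bound on $\log[\Pr(X \ge n)\exp(\lambda n^q)]$. The parameter values and function names are illustrative assumptions.

```python
import math

def log_poisson_pmf(n, mu):
    """log Pr(X = n) for X ~ Poisson(mu), using lgamma to avoid overflow."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)

def log_scaled_point_mass(n, mu, lam, q):
    """log of Pr(X = n) * exp(lam * n^q)."""
    return log_poisson_pmf(n, mu) + lam * n ** q

grid = (10, 100, 1000)
# q = 2 > 1: the scaled tail blows up, so the tail is not 2-subweibull.
values_q2 = [log_scaled_point_mass(n, mu=1.0, lam=0.1, q=2.0) for n in grid]
# q = 1: the scaled point mass vanishes, consistent with a subexponential tail.
values_q1 = [log_scaled_point_mass(n, mu=1.0, lam=1.0, q=1.0) for n in grid]
```

The factorial in the Poisson mass contributes roughly $-n\log n$, which dominates $\lambda n$ for any $\lambda$ but is dominated by $\lambda n^q$ whenever $q > 1$.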

4 Concentration inequalities

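Subweibull tail bounds can be straightforwardly used to improve the tightness of Bernstein's concentration inequality, which we restate here with explicit constants.

Proposition 4.1. If $X_1, \ldots, X_n$ are independent, subexponential random variables, then

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le \begin{cases} 2\exp\left(-\dfrac{t^2}{2K^2\sigma^2}\right) & 0 \le t \le K\sigma^2/\theta \\[2ex] 2\exp\left(-\dfrac{t}{2K\theta}\right) & t > K\sigma^2/\theta \end{cases}$$

where $\theta = \max_i\{\|X_i\|_{\psi_1}\}$, $\sigma^2 = \sum_{i=1}^n \|X_i\|_{\psi_1}^2$ and $K = 2e\,C_1 = 2e^4/\log(2)$.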


This is a standard result (e.g., Theorem 2.8.1 of Vershynin (2018)), so the proof is omitted. We now show that if the summands have lighter than exponential tails, the bound can be tightened.

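Proposition 4.2. Light-tailed Bernstein inequality

If $X_1, \ldots, X_n$ are independent, zero-mean q-subweibull random variables with $q > 1$, then

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le \begin{cases} 2\exp\left(-\dfrac{t^2}{2K^2\sigma^2}\right) & 0 \le t \le K\sigma^2/\theta \\[2ex] 2\exp\left(-\dfrac{t^q}{C_q^q\,n^{q-1}\sigma_q^q}\right) & t > K\sigma^2/\theta \end{cases}$$

where $\sigma_q^q = \sum_{i=1}^n \|X_i\|_{\psi_q}^q$, $C_q$ is the global constant from Equation 1, and $\sigma^2$, $K$, $\theta$ are from Proposition 4.1.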

For sums of heavy-tailed subweibull distributions (q < 1), the MGF does not exist, but it is still possible to produce a uniform tail bound.

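Proposition 4.3. Heavy-tailed Bernstein inequality

If $X_1, \ldots, X_n$ are independent, zero-mean q-subweibull random variables with $q < 1$, then

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le 2\exp\left(-\frac{t^q}{(C_1/\log 2)\sum_{i=1}^n \||X_i|^q\|_{\psi_1}}\right)$$

where $C_1$ is the global constant from Equation 2, and $\||X_i|^q\|_{\psi_1}$ is the subexponential norm of $|X_i|^q$.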

Similar results are also found in Vladimirova et al. (2020); Kuchibhotla & Chakrabortty (2022).
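The heavy-tailed bound is simple to evaluate once the norms $\||X_i|^q\|_{\psi_1}$ are available. The sketch below assumes the bound has the form $2\exp(-t^q\log 2/(C_1\sigma_q))$ with $\sigma_q$ the sum of those norms, as derived in the appendix proof, and caps it at 1 since it bounds a probability. The function name and the example norms are hypothetical.

```python
import math

C1 = math.e ** 3 / math.log(2.0)  # global constant for q = 1 (Equation 2)

def heavy_tail_bernstein_bound(t, q, norms):
    """Evaluate min(1, 2 exp(-t^q / ((C1 / log 2) * sum(norms)))).

    `norms` holds the psi_1-Orlicz norms of |X_i|^q, supplied by the caller.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("this bound is stated for 0 < q < 1")
    if t < 0.0:
        raise ValueError("t must be nonnegative")
    sigma_q = sum(norms)
    exponent = -(t ** q) * math.log(2.0) / (C1 * sigma_q)
    return min(1.0, 2.0 * math.exp(exponent))

# Ten summands, each with (hypothetical) norm 1 and q = 1/2.
example_norms = [1.0] * 10
bounds = [heavy_tail_bernstein_bound(t, 0.5, example_norms)
          for t in (0.0, 1e4, 1e6, 1e8)]
```

Because the exponent scales as $t^q$ with $q < 1$, the bound only becomes informative at large deviations, which reflects the heavy tails of the summands.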

5 Exponential tilting

Definition 5.1. Let $X$ be a random variable with distribution function $F$. If the BLT satisfies $L_X(-\theta) < \infty$ for some $\theta \ne 0$, then the exponentially tilted distribution is given by

$$F_\theta(x) = \int_{-\infty}^{x} \frac{\exp(\theta t)}{L_X(-\theta)}\,dF(t)$$

We adopt the convention of using $-\theta$ instead of $\theta$ so that the interpretation of the tilting parameter is consistent with other works that assume $X$ has an MGF, in which case one could equivalently require $M_X(\theta) < \infty$.

From the Radon-Nikodym theorem, $F_\theta$ is absolutely continuous with respect to $F$. Since the density function $e^{\theta x}/L_X(-\theta)$ is also strictly positive, exponential tilting does not change the support. Generally speaking it is possible to produce a subexponential distribution by exponential tilting of any distribution with at least one light tail.

Proposition 5.1. If $X \sim F$ is a random variable having at least one light tail then exponential tilting is possible for all $\theta$ in some open interval $(-S, T)$ with $S, T \ge 0$ and $S + T > 0$. The resulting tilted distribution $F_\theta$ is subexponential with MGF $M_Z(t) = L_X(-\theta - t)/L_X(-\theta)$ finite for all $t \in (-S - \theta, T - \theta)$.

As an example, if $X \sim F$ is a nonnegative, heavy tailed random variable ($T = 0$), its left tail is strictly subexponential ($S = \infty$) so exponential tilting is possible for all $\theta < 0$. By Proposition 5.1 the resulting tilted distribution is subexponential and hence has lighter tails than the original distribution. On the other hand, if $X$ is broadly subexponential, exponential tilting produces another broadly subexponential distribution, with a shifted interval of convergence.
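For a discrete distribution the tilt in Definition 5.1 reduces to reweighting the probability mass function by $e^{\theta k}$ and renormalizing. The sketch below applies this to a truncated Poisson mass function and compares the result with the known closed form, Poisson with mean $\mu e^\theta$. The truncation point `kmax` and the function names are assumptions for illustration.

```python
import math

def poisson_pmf(k, mu):
    """Pr(X = k) for X ~ Poisson(mu), computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def tilt_pmf(pmf, theta):
    """Return (tilted pmf, normalizer) for a pmf given as a list indexed by k.

    The normalizer is a truncated estimate of L_X(-theta) = E[exp(theta X)].
    """
    weights = [math.exp(theta * k) * p for k, p in enumerate(pmf)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights], normalizer

mu, theta, kmax = 2.0, 0.7, 150
base = [poisson_pmf(k, mu) for k in range(kmax + 1)]
tilted, normalizer = tilt_pmf(base, theta)
target = [poisson_pmf(k, mu * math.exp(theta)) for k in range(kmax + 1)]
max_abs_error = max(abs(a - b) for a, b in zip(tilted, target))
```

Since the Poisson MGF is finite everywhere, the tilt exists for every real $\theta$, and the tilted law stays within the Poisson family.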

While exponential tilting can alter the tail behavior of heavy tailed and broadly subexponential distributions, it does not affect the tail behavior of q-subweibull distributions with lighter than exponential tails (i.e., q > 1).

Lemma 5.0.1. Preservation of nonnegative subweibull tails under exponential tilting. Let $\theta$ be any real number. If $X \sim F$ is nonnegative and q-subweibull ($q > 1$), then the exponentially tilted variable $Z \sim F_\theta$ is also nonnegative and q-subweibull with the same radius of convergence.

1. $E[\exp(\lambda^q X^q)] < \infty$ for all $\lambda \in [0, R_q)$ implies $E[\exp(\lambda^q Z^q)] < \infty$ for all $\lambda \in [0, R_q)$.

2. $E[\exp(\lambda^q X^q)] = \infty$ for all $\lambda > R_q$ implies $E[\exp(\lambda^q Z^q)] = \infty$ for all $\lambda > R_q$.

We now extend Lemma 5.0.1 to general random variables.

Theorem 5.1. Preservation of subweibull tails under exponential tilting. Let $\theta$ be any real number.

1. If $X \sim F$ is q-subweibull ($q > 1$) with radius of convergence $R_q$, then the exponentially tilted variable $Z \sim F_\theta$ is also q-subweibull and has the same radius of convergence.

2. If $X \sim F$ is strictly q-subweibull ($q \ge 1$), the exponentially tilted variable $Z \sim F_\theta$ is also strictly q-subweibull.

3. If $X \sim F$ is not q-subweibull ($q > 1$), then $Z \sim F_\theta$ is also not q-subweibull.
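A worked example may help fix ideas for part 1. Take the standard Gaussian, which is broadly 2-subweibull; the calculation below uses only standard Gaussian identities and is offered as an illustration rather than as part of the proof.

```latex
% X ~ N(0,1): the tilt exists for every real theta.
L_X(-\theta) = E[\exp(\theta X)] = \exp(\theta^2/2)

% Tilted density relative to Lebesgue measure, with \varphi the N(0,1) density:
\frac{\exp(\theta x)}{L_X(-\theta)}\,\varphi(x)
  = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\tfrac{1}{2}(x-\theta)^2\right),
  \qquad\text{so } Z \sim N(\theta, 1).

% Subweibull moment of the tilted variable (q = 2):
E[\exp(\lambda^2 Z^2)] =
\begin{cases}
  (1 - 2\lambda^2)^{-1/2}
  \exp\!\left(\dfrac{\theta^2\lambda^2}{1 - 2\lambda^2}\right)
    & \lambda^2 < 1/2 \\[2ex]
  \infty & \lambda^2 \ge 1/2
\end{cases}
```

The radius of convergence is therefore $R_2 = 1/\sqrt{2}$ for every $\theta$, including the untilted case $\theta = 0$. The tilt shifts the location but leaves the rate of tail decay unchanged.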

5.1 Application to domain adaptation

Consider a classification problem where labeled training data are drawn from distribution P and the goal is to make accurate predictions on a different target distribution Q. We are given unlabeled samples from Q. Maity et al. (2023) proposed an exponential tilt model to facilitate importance weighting:

$$q(x, Y = k) = \exp\{\theta_k^\top T(x) + \alpha_k\}\,p(x, Y = k)$$

where $\theta_k$, $\alpha_k$ are the tilting parameters. It can be straightforwardly shown that

$$\alpha_k = \log E_{X\sim P}[\exp(\theta_k^\top T(X))] = \log M_{T(X)}(\theta_k)$$

For simplicity we consider a univariate $T(X)$. If its distribution has heavy tails on both sides, the MGF is not finite and exponential tilting is not possible. In practice, the method of Maity et al. (2023) relies on minimizing a discrepancy such as KL divergence between P and Q using samples. If the distribution of $T(X)$ is heavy tailed, it may not be possible to consistently estimate the tilting parameters. If $T(X)$ has a heavy tail on only one side, then the importance weights may still be validly estimated, but the tilting parameters involved must be constrained. On the other hand, if $T(X)$ follows a q-subweibull distribution with $q > 1$, then any tail bounds available for $T(X)$ may be readily transferred to the target distribution Q using Theorem 5.1.
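To illustrate how tilting enters importance weighting, the sketch below computes self-normalized weights proportional to $\exp(\theta T(x_i))$ for a univariate statistic, using a log-mean-exp shift for numerical stability. This is a generic illustration and not the estimator of Maity et al. (2023): the tilting parameter `theta` is assumed to be given rather than fitted, and the function name is hypothetical.

```python
import math

def tilt_weights(t_values, theta):
    """Self-normalized weights w_i = exp(theta * T_i) / mean_j exp(theta * T_j)."""
    scores = [theta * t for t in t_values]
    shift = max(scores)  # subtract the max before exponentiating
    log_mean = shift + math.log(
        sum(math.exp(s - shift) for s in scores) / len(scores)
    )
    return [math.exp(s - log_mean) for s in scores]

sample_t = [0.0, 1.0, 2.0]
weights = tilt_weights(sample_t, theta=1.0)
mean_weight = sum(weights) / len(weights)
```

The sample average in the denominator stands in for the MGF of $T(X)$ at $\theta$. When $T(X)$ is heavy tailed on the side being up-weighted, that MGF is infinite and the sample average is dominated by the largest observations, which is one way to see why the tilting parameters must then be constrained.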

6 Discussion

The theory of subexponential and subgaussian distributions is a key prerequisite to many results in theoretical machine learning and nonasymptotic statistics. That the important subexponential properties can be generalized to the broader subweibull class has been established by Vladimirova et al. (2020) and Kuchibhotla & Chakrabortty (2022). Our work differs from these in several ways: we provide explicit constants without requiring quasinorms, distinguish between strictly and broadly subweibull distributions, and address exponential tilting, which has not been previously examined to our knowledge. Exponential tilting is used in a variety of statistical areas such as causal inference (McClean et al., 2024) and Monte Carlo sampling (Fuh & Wang, 2024). If $F_\theta$ is a tilted distribution, it is a natural exponential family with parameter $\theta$. The exponential families are building blocks for generalized linear models (McCullagh & Nelder, 1989). For another application of exponential tilting to machine learning beyond domain adaptation, see Li et al. (2023).

Here, we have provided a brief overview of subweibull distributions and their Orlicz norms. We showed that the Poisson distribution is strictly 1-subweibull but not q-subweibull for any q > 1. We illustrated the application of subweibull properties to prove concentration inequalities. Finally, we detailed the conditions under which the subweibull property is preserved under exponential tilting. Specifically, if a distribution is subweibull with a lighter than exponential tail, then the tail of the exponentially tilted distribution decays at the same rate.


Acknowledgments

Thanks to Arun Kumar Kuchibhotla, Sam Power, Patrick Staples, Valerie Ventura, Larry Wasserman, and Matt Werenski for helpful comments and suggestions.

References

William Feller. An Introduction to Probability Theory and Its Applications, volume 2. John Wiley & Sons, 1971.

Cheng-Der Fuh and Chuan-Ju Wang. Efficient exponential tilting with applications. Statistics and Computing, 34(2):65, January 2024. ISSN 1573-1375. doi: 10.1007/s11222-023-10374-5. URL https://doi.org/10.1007/s11222-023-10374-5.

Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389-1456, December 2022. ISSN 2049-8772. doi: 10.1093/imaiai/iaac012. URL https://doi.org/10.1093/imaiai/iaac012.

Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. On Tilted Losses in Machine Learning: Theory and Applications. Journal of Machine Learning Research, 24(142):1-79, 2023. ISSN 1533-7928. URL http://jmlr.org/papers/v24/21-1095.html.

Subha Maity, Mikhail Yurochkin, Moulinath Banerjee, and Yuekai Sun. Understanding new tasks through the lens of training data via exponential tilting, February 2023. URL http://arxiv.org/abs/2205.13577.

Alec McClean, Zach Branson, and Edward H. Kennedy. Nonparametric estimation of conditional incremental effects. Journal of Causal Inference, 12(1), January 2024. ISSN 2193-3685. doi: 10.1515/jci-2023-0024. URL https://www.degruyter.com/document/doi/10.1515/jci-2023-0024/html.

P. McCullagh and John A. Nelder. Generalized Linear Models, Second Edition. CRC Press, August 1989. ISBN 978-0-412-31760-6.

A. Mukherjea, M. Rao, and S. Suen. A note on moment generating functions. Statistics & Probability Letters, 76(11):1185-1189, June 2006. ISSN 0167-7152. doi: 10.1016/j.spl.2005.12.026. URL https://www.sciencedirect.com/science/article/pii/S016771520500475X.

Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press, June 2022. ISBN 978-1-009-06296-1.

David Pollard. Orlicz spaces. In Probability Tools, Tricks, and Miracles. 2024. URL http://www.stat.yale.edu/~pollard/Books/Pttm/.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. ISBN 978-1-108-41519-4. doi: 10.1017/9781108231596. URL https://www.cambridge.org/core/books/highdimensional-probability/797C466DA29743D2C8213493BD2D2102.

Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-Exponential properties to heavier tailed distributions. Stat, 9(1):e318, 2020. ISSN 2049-1573. doi: 10.1002/sta4.318. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.318.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2019. ISBN 978-1-108-49802-9. doi: 10.1017/9781108627771. URL https://www.cambridge.org/core/books/highdimensional-statistics/8A91ECEEC38F46DAB53E9FF8757C7A4E.


A Proofs for Section 2 (Preliminaries)

Lemma 2.0.2. Let $c_0, K_0 > 0$ be constants and $\psi$ an Orlicz function. The following are equivalent.

(a) $E[\psi(|X|/c_0)] \le K_0$

(b) $\|X\|_\psi \le c_0 \max\{K_0, 1\}$

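Proof. (a) $\Rightarrow$ (b): if $K_0 \le 1$ the result is immediate. If $K_0 > 1$, let $\alpha \in (0, 1)$.

$$\begin{aligned}
E[\psi(\alpha|X|/c_0)] &= E\left[\psi\left((1-\alpha)(0) + \alpha|X|/c_0\right)\right] \\
&\le E\left[(1-\alpha)\psi(0) + \alpha\psi(|X|/c_0)\right] \\
&= \alpha E[\psi(|X|/c_0)] \le \alpha K_0
\end{aligned}$$

Setting $\alpha = 1/K_0$ produces the result.

(b) $\Rightarrow$ (a): Let $\|X\|_\psi \le t$. Then

$$E[\psi(|X|/t)] \le 1$$

We need to show for every $K_1 > 0$, there exists a $c_1 > 0$ such that

$$E[\psi(|X|/c_1)] \le K_1$$

For all $K_1 \ge 1$ simply choose $c_1 = t$. For $K_1 < 1$,

$$\begin{aligned}
E[\psi(K_1|X|/t)] &= E\left[\psi\left((1-K_1)(0) + K_1|X|/t\right)\right] \\
&\le E\left[(1-K_1)\psi(0) + K_1\psi(|X|/t)\right] \\
&= K_1 E[\psi(|X|/t)] \le K_1
\end{aligned}$$

So we can set $c_1 \ge \max\{t, t/K_1\}$.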

B Proofs for Section 3 (Subweibull random variables)

Lemma 3.0.1. Random variable $X$ with $\Pr(X < 0) \notin \{0, 1\}$ is q-subweibull if and only if the nonnegative random variables $A = [-X \mid X < 0]$ and $B = [X \mid X \ge 0]$ are q-subweibull. Let $R_{qx}$, $R_{qa}$, and $R_{qb}$ denote the radii of convergence for $X$, $A$, and $B$, respectively. Then $R_{qx} = \min\{R_{qa}, R_{qb}\}$.

Proof. Let $p = \Pr(X < 0)$ and define nonnegative random variables $A = [-X \mid X < 0]$ and $B = [X \mid X \ge 0]$.

$$\begin{aligned}
E[\exp(\lambda^q|X|^q)] &= E[\exp(\lambda^q(-X)^q) \mid X < 0]\,p + E[\exp(\lambda^q X^q) \mid X \ge 0]\,(1-p) \\
&= E[\exp(\lambda^q A^q)]\,p + E[\exp(\lambda^q B^q)]\,(1-p)
\end{aligned}$$

The left hand side is finite if and only if both terms on the right hand side are finite. If $R_{qx}$ is the radius of convergence for $X$ then $E[\exp(\lambda^q|X|^q)] < \infty$ for all $\lambda \in [0, R_{qx})$. Clearly $E[\exp(\lambda^q A^q)] < \infty$ and $E[\exp(\lambda^q B^q)] < \infty$ for all $\lambda \in [0, R_{qx})$ also, implying $\min\{R_{qa}, R_{qb}\} \ge R_{qx}$. However, if $\min\{R_{qa}, R_{qb}\} > R_{qx}$ then there exists some $\lambda > R_{qx}$ such that $E[\exp(\lambda^q|X|^q)] < \infty$, which by Definition 3.2 means $R_{qx}$ is not the radius of convergence of $X$, a contradiction. Therefore, $\min\{R_{qa}, R_{qb}\} = R_{qx}$.


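Proposition 3.1. The following are equivalent characterizations of a q-subweibull random variable $X$ where $q > 0$.

1. Tail bound:

(a) $\exists K_{1a} > 0$ such that $\forall t \ge 0$,

$$\Pr(|X| > t) \le 2\exp\left(-(t/K_{1a})^q\right)$$

(b) $\exists K_{1b} > 0$ such that $\limsup_{t\to\infty} \Pr(|X| > t)\exp\left((t/K_{1b})^q\right) < \infty$

2. Growth rate of absolute moments:

(a) $\exists K_2 > 0$ such that $\forall p \ge 1$, $\left(E[|X|^p]\right)^{1/p} \le K_2\,p^{1/q}$

(b) $\limsup_{p\to\infty} \dfrac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} < \infty$

3. MGF of $|X|^q$ finite in open interval of zero: $\exists K_3 > 0$ such that

(a) $\forall\, 0 < \lambda < 2^{1/q}/K_3$,

$$E[\exp(\lambda^q|X|^q)] \le \frac{1}{1 - \lambda^q K_3^q/2}$$

(b) $\forall\, 0 < \lambda \le 1/K_3$, $E[\exp(\lambda^q|X|^q)] \le \exp(K_3^q\lambda^q)$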

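Proof. (1a) $\Rightarrow$ (1b):

$$\limsup_{t\to\infty} \Pr(|X| > t)\exp\left((t/K_{1a})^q\right) \le \sup_{t\ge 0} \Pr(|X| > t)\exp\left((t/K_{1a})^q\right) \le 2 < \infty$$

So we can simply set $K_{1b} = K_{1a}$.

(1b) $\Rightarrow$ (1a): Assume

$$\limsup_{t\to\infty} \Pr(|X| > t)\exp\left((t/K_{1b})^q\right) = K < \infty$$

Then, for every $C > K$, there exists some $T$ such that for all $t > T$,

$$\Pr(|X| > t)\exp\left((t/K_{1b})^q\right) < C$$

For all $t \in [0, T]$, $\Pr(|X| > t) \le 1$ and $\exp\left((t/K_{1b})^q\right) \le \exp\left((T/K_{1b})^q\right)$. Therefore

$$\sup_{t\ge 0} \Pr(|X| > t)\exp\left((t/K_{1b})^q\right) \le \max\left\{C, \exp\left((T/K_{1b})^q\right)\right\}$$

Let $U = \max\left\{C, \exp\left((T/K_{1b})^q\right)\right\}$. If $U \le 2$ this directly implies (1a) with $K_{1a} = K_{1b}$. In the case that $U > 2$, set

$$K_{1a} = K_{1b}\left(\frac{\log U}{\log 2}\right)^{1/q}$$

Let $f(t) = U\exp\left(-(t/K_{1b})^q\right)$, $g(t) = 2\exp\left(-(t/K_{1a})^q\right)$, and $T^* = K_{1b}(\log U)^{1/q}$. Since $f(T^*) = g(T^*) = 1$ and $g(t)$ is a strictly decreasing function, this implies that $\Pr(|X| > t) \le 1 \le g(t)$ for $t \in [0, T^*]$. For $t \ge T^*$, $g(t) \ge f(t)$ since $K_{1a} > K_{1b}$, and $f(t) \ge \Pr(|X| > t)$ by assumption, therefore $2\exp\left(-(t/K_{1a})^q\right) \ge \Pr(|X| > t)$ for all $t \ge 0$.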


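(1a) $\Rightarrow$ (2a):

$$\begin{aligned}
E[|X|^p] &= \int_0^\infty \Pr(|X|^p > u)\,du \\
&= \int_0^\infty \Pr(|X| > t)\,p\,t^{p-1}\,dt \\
&\le \int_0^\infty 2\exp\left(-(t/K_{1a})^q\right)p\,t^{p-1}\,dt \\
&= 2p\int_0^\infty \left(K_{1a}s^{1/q}\right)^{p-1}\frac{K_{1a}}{q}\,s^{(1/q)-1}\exp(-s)\,ds \\
&= 2(p/q)K_{1a}^p\int_0^\infty s^{p/q-1}e^{-s}\,ds = (2p/q)K_{1a}^p\,\Gamma(p/q)
\end{aligned}$$

For $x \ge c > 0$, the function $\Gamma(x)$ is bounded above by $\Gamma(c)(e/c)^c(x/e)^x$. Therefore, with $c = 1/q$,

$$\Gamma(p/q) \le \Gamma(1/q)(eq)^{1/q}(p/q)^{p/q}\exp(-p/q)$$

Substituting this into the previous expressions,

$$E[|X|^p] \le (2p/q)K_{1a}^p\,\Gamma(1/q)(eq)^{1/q}(p/q)^{p/q}\exp(-p/q)$$

$$\left(E[|X|^p]\right)^{1/p} \le \left(\frac{2\Gamma(1/q)(eq)^{1/q}}{q}\,p\right)^{1/p} K_{1a}(eq)^{-1/q}p^{1/q}$$

Let $a = \dfrac{2\Gamma(1/q)(eq)^{1/q}}{q} > 0$. Consider the function $f(x) = (ax)^{1/x}$ which we would like to upper bound.

$$\begin{aligned}
f(x) &= \exp\left(\frac{1}{x}\log(ax)\right) \\
f'(x) &= f(x)\left[\frac{1}{x^2} - \frac{1}{x^2}\log(ax)\right] = \frac{f(x)}{x^2}\left(1 - \log(ax)\right) \\
f''(x) &= \frac{f'(x)}{x^2}\left(1 - \log(ax)\right) - \frac{2f(x)}{x^3}\left(1 - \log(ax)\right) - \frac{f(x)}{x^3}
\end{aligned}$$

The global maximum occurs at $x = e/a$, which is confirmed by checking $f'(e/a) = 0$ and $f''(e/a) < 0$. Therefore a suitable upper bound is $(ax)^{1/x} \le \exp(a/e)$. Substitute this into the above expression.

$$\left(E[|X|^p]\right)^{1/p} \le \exp\left(\frac{2\Gamma(1/q)(eq)^{1/q}}{eq}\right)(eq)^{-1/q}K_{1a}\,p^{1/q}$$

Therefore, there exists some $K_2 \le C_2K_{1a}$ such that $\left(E[|X|^p]\right)^{1/p} \le K_2\,p^{1/q}$ as required, where

$$C_2 = \exp\left(\frac{2\Gamma(1/q)(eq)^{1/q}}{eq}\right)(eq)^{-1/q}$$

Note that $C_2 > 1$ for all $q$.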
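The two analytic bounds used in this step, and the claim that $C_2 > 1$, can be spot-checked numerically. The sketch below evaluates them on finite grids; the grids and function names are illustrative assumptions, and a grid check is of course not a proof.

```python
import math

def gamma_upper_bound(x, c):
    """Right-hand side Gamma(c) (e/c)^c (x/e)^x of the Gamma bound, for x >= c."""
    return math.gamma(c) * (math.e / c) ** c * (x / math.e) ** x

def power_term(x, a):
    """The function f(x) = (a x)^(1/x) from the proof."""
    return (a * x) ** (1.0 / x)

def c2_constant(q):
    """C_2 = exp(a / e) * (e q)^(-1/q) with a = 2 Gamma(1/q) (e q)^(1/q) / q."""
    a = 2.0 * math.gamma(1.0 / q) * (math.e * q) ** (1.0 / q) / q
    return math.exp(a / math.e) * (math.e * q) ** (-1.0 / q)

q_grid = [0.5, 1.0, 2.0, 4.0]
slack = 1.0 + 1e-12  # tolerance for floating point rounding at equality

# Gamma(p/q) <= Gamma(1/q) (eq)^(1/q) (p/q)^(p/q) exp(-p/q) for p >= 1.
gamma_ok = all(
    math.gamma(p / q) <= gamma_upper_bound(p / q, 1.0 / q) * slack
    for q in q_grid for p in range(1, 51)
)

# (a x)^(1/x) <= exp(a / e), with equality at x = e / a.
power_ok = all(
    power_term(0.05 * k, a) <= math.exp(a / math.e) * slack
    for a in [0.5, 2.0, 2.0 * math.e] for k in range(1, 1001)
)

c2_values = [c2_constant(q) for q in q_grid]
```

For $q = 1$ the constant evaluates to $C_2 = e$, since $a = 2e$ in that case.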

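(2a) $\Rightarrow$ (2b):

$$\limsup_{p\to\infty} \frac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} \le \sup_{p\ge 1} \frac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} \le K_2 < \infty$$

(2b) $\Rightarrow$ (2a): Assume

$$\limsup_{p\to\infty} \frac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} = K < \infty$$

Then for every $C > K$, there exists some $p^*$ such that for all $p > p^*$,

$$\frac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} < C$$

The $L^p$ norm is increasing in $p$, so for $p \in [1, p^*]$, $\left(E[|X|^p]\right)^{1/p} \le \left(E[|X|^{p^*}]\right)^{1/p^*}$ and $p^{1/q} \ge 1$, which establishes

$$\sup_{p\ge 1} \frac{\left(E[|X|^p]\right)^{1/p}}{p^{1/q}} \le \max\left\{\left(E[|X|^{p^*}]\right)^{1/p^*}, C\right\}$$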

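(2a) $\Rightarrow$ (3a): The power series representation of the exponential function produces

$$E[\exp(\lambda^q|X|^q)] = E\left[1 + \sum_{p=1}^\infty \frac{(\lambda^q|X|^q)^p}{p!}\right] = 1 + \sum_{p=1}^\infty \frac{\lambda^{pq}E[|X|^{pq}]}{p!}$$

From (2) we have $E[|X|^{pq}] \le K_2^{pq}(pq)^p$ and $p! \ge (p/e)^p$ by Stirling approximation.

$$E[\exp(\lambda^q|X|^q)] \le 1 + \sum_{p=1}^\infty \left(\frac{\lambda^q K_2^q\,pq}{p/e}\right)^p = 1 + \sum_{p=1}^\infty \left(\lambda^q K_2^q\,eq\right)^p = \frac{1}{1 - \lambda^q K_2^q\,eq}$$

where the last series converges when $\lambda^q K_2^q\,eq < 1$, or

$$\lambda < \frac{1}{K_2(eq)^{1/q}} = \frac{2^{1/q}}{(2eq)^{1/q}K_2}$$

Since $(2eq)^{1/q} \le e^2$ for all $q > 0$, and $1/(1-x)$ is increasing for $x < 1$, set $C_3 = e^2$ so that $\lambda < \dfrac{2^{1/q}}{C_3K_2}$ implies series convergence and

$$E[\exp(\lambda^q|X|^q)] \le \frac{1}{1 - \lambda^q K_2^q\,eq} = \frac{1}{1 - \lambda^q\left((2eq)^{1/q}K_2\right)^q/2} \le \frac{1}{1 - \lambda^q(C_3K_2)^q/2}$$

Therefore there exists some $K_3 \le C_3K_2$ such that if $\lambda < 2^{1/q}/K_3$,

$$E[\exp(\lambda^q|X|^q)] \le \frac{1}{1 - \lambda^q K_3^q/2}$$

Clearly $C_3 > 1$ for all $q$ as well.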

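(3a) $\Rightarrow$ (3b): Straightforward application of the numerical inequality $\dfrac{1}{1-x} \le e^{2x}$ for $0 \le x \le 1/2$.

(3b) $\Rightarrow$ (1a): Set $C_1 = (\log 2)^{-1/q}$ and note that $C_1 > 1$ for all $q > 0$. Therefore we may choose

$$\lambda = \frac{(\log 2)^{1/q}}{K_3} = \frac{1}{C_1K_3} \le \frac{1}{K_3}$$

so that $E[\exp(\lambda^q|X|^q)] \le \exp(\lambda^q K_3^q) = 2$. Then,

$$\begin{aligned}
\Pr(|X| > t) &= \Pr\left(\exp(\lambda^q|X|^q) > \exp(\lambda^q t^q)\right) \\
&\le E[\exp(\lambda^q|X|^q)]\exp\left(-(\lambda t)^q\right) \\
&\le 2\exp\left(-\left(\frac{t}{C_1K_3}\right)^q\right)
\end{aligned}$$

So there exists some $K_{1a} \le C_1K_3$ such that condition (1a) is satisfied as desired.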


Proposition 3.2. A random variable $X$ is q-subweibull ($q \ge 1$) if and only if the $\psi_q$-Orlicz norm is finite. Furthermore, when $q \ge 1$ the constants in Proposition 3.1 are related to the norm by the global constant

$$C_q = \frac{\exp\left(2 + \dfrac{2\Gamma(1/q)(eq)^{1/q}}{eq}\right)}{(eq\log 2)^{1/q}} \tag{1}$$

such that

$$\frac{\|X\|_{\psi_q}}{C_q} \le K_j \le C_q\,\|X\|_{\psi_q}$$

for $K_j \in \{K_{1a}, K_2, K_3\}$.

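Proof. Proposition 3.1 (3b) implies finite norm: Let $\psi_q(x) = \exp(x^q) - 1$ as before. This is a convex function for $q \ge 1$. Rearranging terms produces

$$E[\psi_q(\lambda|X|)] = E[\exp(\lambda^q|X|^q)] - 1 \le \exp(\lambda^q K_3^q) - 1$$

From Lemma 2.0.2 this implies

$$\|X\|_{\psi_q} \le (1/\lambda)\max\left\{1, \exp(K_3^q\lambda^q) - 1\right\}$$

where $0 < \lambda < 1/K_3$. Observing that the second term in the max is increasing in $\lambda$ and ranges from $0$ to $e - 1 > 1$, while $1/\lambda$ is a decreasing function, the tightest bound in terms of $\lambda$ is achieved when $\exp(K_3^q\lambda^q) - 1 = 1$, which produces $\lambda = (\log 2)^{1/q}/K_3$. This establishes $\|X\|_{\psi_q} \le C_4K_3 < \infty$ with $C_4 = (\log 2)^{-1/q}$.

Finite norm implies Proposition 3.1 (1a): Let $K_4 = \|X\|_{\psi_q}(1 + \epsilon)$ for some arbitrarily small $\epsilon > 0$.

$$\begin{aligned}
\Pr(|X| \ge t) &= \Pr\left(\exp\left((|X|/K_4)^q\right) \ge \exp\left((t/K_4)^q\right)\right) \\
&\le E\left[\exp\left((|X|/K_4)^q\right)\right]\exp\left(-(t/K_4)^q\right) \\
&\le 2\exp\left(-(t/K_4)^q\right)
\end{aligned}$$

This shows there exists some $K_{1a} \le K_4 = \|X\|_{\psi_q}(1 + \epsilon)$ such that the desired condition is satisfied.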

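To show the relationship to the global constant, recall from the proof of Proposition 3.1 that

$$\begin{aligned}
C_1 &= C_4 = (\log 2)^{-1/q} \\
C_2 &= \exp\left(\frac{2\Gamma(1/q)(eq)^{1/q}}{eq}\right)(eq)^{-1/q} \\
C_3 &= e^2 \\
C_q &= C_1C_2C_3 = \frac{\exp\left(2 + \dfrac{2\Gamma(1/q)(eq)^{1/q}}{eq}\right)}{(eq\log 2)^{1/q}}
\end{aligned}$$

Since $C_1, C_2, C_3 \ge 1$, $C_q \ge 1$ also. We can choose $\epsilon \le C_1 - 1$, with $C_1 - 1 > 0$, so that

$$K_{1a} \le \|X\|_{\psi_q}(1 + \epsilon) \le C_1\|X\|_{\psi_q} \le C_q\|X\|_{\psi_q}$$

and

$$\|X\|_{\psi_q} \le C_1K_3 \le C_1C_3K_2 \le C_1C_3C_2K_{1a} = C_qK_{1a}$$

By a similar argument,

$$K_2 \le C_2K_{1a} \le C_2C_1\|X\|_{\psi_q} \le C_q\|X\|_{\psi_q}$$

and

$$\|X\|_{\psi_q} \le C_1K_3 \le C_1C_3K_2 \le C_qK_2$$

Finally,

$$K_3 \le C_3C_2K_{1a} \le C_3C_2C_1\|X\|_{\psi_q} = C_q\|X\|_{\psi_q}$$

and $\|X\|_{\psi_q} \le C_1K_3 \le C_qK_3$.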

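Corollary 3.0.1. If $X$ is a subexponential random variable with mean zero, then

$$E[\exp(\lambda X)] \le \exp(2K^2\lambda^2)$$

for all $|\lambda| \le 1/(2K)$ where

$$K = e\,C_1\|X\|_{\psi_1} = \frac{e^4}{\log 2}\|X\|_{\psi_1}$$

Proof. By Propositions 3.1 and 3.2 there is some $K_2 \le C_1\|X\|_{\psi_1}$ such that $E[|X|^p] \le K_2^p\,p^p$ for all $p \ge 1$, where $C_1 = e^3/\log(2)$ (Equation 2).

$$\begin{aligned}
E[\exp(\lambda X)] &= 1 + (0) + \sum_{p=2}^\infty \frac{\lambda^p E[X^p]}{p!} \\
&\le 1 + \sum_{p=2}^\infty \frac{|\lambda|^p K_2^p\,p^p}{(p/e)^p} = 1 + \frac{(\lambda eK_2)^2}{1 - |\lambda eK_2|}
\end{aligned}$$

where the sum converges whenever $|\lambda eK_2| < 1$. The function $1 + x^2/(1 - |x|)$ is bounded above by $\exp(2x^2)$ whenever $|x| \le 1/2$, so

$$E[\exp(\lambda X)] \le \exp\left(2(\lambda eK_2)^2\right) \le \exp\left(2\lambda^2\left(e\,C_1\|X\|_{\psi_1}\right)^2\right)$$

whenever $|\lambda eK_2| \le 1/2$, which is satisfied by

$$|\lambda| \le \frac{1}{2e\,C_1\|X\|_{\psi_1}}$$

Setting $K = e\,C_1\|X\|_{\psi_1}$ yields the desired result.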

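Corollary 3.0.2. If $X$ is q-subweibull then it is strictly r-subweibull for all $r \in (0, q)$.

Proof. If $X$ is q-subweibull, by Proposition 3.1 we may assume there exists $K > 0$ such that $\forall p \ge 1$,

$$\left(E[|X|^p]\right)^{1/p} \le K\,p^{1/q}$$

Let $r \in (0, q)$. The MGF of $|X|^r$ is given by

$$\begin{aligned}
E[\exp(\lambda^r|X|^r)] &= 1 + \sum_{p=1}^\infty \frac{\lambda^{pr}E[|X|^{pr}]}{p!} \\
&\le 1 + \sum_{p=1}^\infty \frac{\lambda^{pr}K^{pr}(pr)^{pr/q}}{(p/e)^p} \\
&= 1 + \sum_{p=1}^\infty \left(\lambda^r K^r e\,r^{r/q}\right)^p p^{p(r/q-1)}
\end{aligned}$$

Apply the root test to the series to determine convergence.

$$R(p) = \lambda^r K^r e\,r^{r/q}\,p^{r/q-1}$$

Since $r < q$, $\lim_{p\to\infty} R(p) = 0$ and the series converges regardless of the value of $\lambda$, which shows $X$ is strictly r-subweibull.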


Corollary 3.0.4. If $X$ is not strictly q-subweibull with $q \ge 1$ then it is not r-subweibull for any $r > q$.

Proof. From Proposition 3.1 we may assume $\exists \lambda > 0$ such that

$$\limsup_{t\to\infty} \Pr(|X| > t)\exp(\lambda t^q) = \infty$$

which implies there is an infinite sequence $t_n \to \infty$ such that

$$\lim_{n\to\infty} \Pr(|X| > t_n)\exp(\lambda t_n^q) = \infty$$

Now let $\rho > 0$ and $r > q$. Whenever $t \ge t^* = (\lambda/\rho)^{1/(r-q)}$, we have $\exp(\rho t^r) \ge \exp(\lambda t^q)$. Let $\{t_m\}$ be the infinite subsequence of $\{t_n\}$ excluding the elements less than $t^*$. Clearly $t_m \to \infty$ as well. Then

$$\lim_{m\to\infty} \Pr(|X| > t_m)\exp(\rho t_m^r) \ge \lim_{m\to\infty} \Pr(|X| > t_m)\exp(\lambda t_m^q) = \infty$$

which implies $X$ cannot be r-subweibull.

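Proposition 3.3. The Poisson distribution is strictly q-subweibull for $q \le 1$ but not q-subweibull for any $q > 1$.

Proof. Since the Poisson distribution has a finite MGF with infinite radius of convergence, it is strictly subexponential and by Corollary 3.0.2 strictly q-subweibull for all $q \le 1$. Let $X \sim \mathrm{Poi}(\mu)$. Without loss of generality assume $t > 1$ and let $n = \lfloor t \rfloor + 1$ with $t < n \le t + 1$.

$$\Pr(X > t) = \sum_{j=n}^\infty \Pr(X = j) \ge \Pr(X = n) = \frac{\mu^n\exp(-\mu)}{n!} = \frac{\mu^n\exp(-\mu)}{n\Gamma(n)}$$

Since $t\Gamma(t)$ is increasing for $t \ge 1$, we have $n\Gamma(n) \le (t+1)\Gamma(t+1)$. Also, $\Gamma(n) \le n^n$ for $n \ge 1$. For the $\mu^n$ term, it is increasing for $\mu \ge 1$ and decreasing for $\mu < 1$, so $\mu^n \ge \min\{\mu^{t+1}, \mu^t\} = \mu^t\min\{\mu, 1\}$. Combining these we obtain

$$\Pr(X > t) \ge \frac{\mu^t\min\{\mu, 1\}e^{-\mu}}{(t+1)\Gamma(t+1)} = \frac{\mu^t\min\{\mu, 1\}e^{-\mu}}{(t+1)(t)\Gamma(t)} \ge \frac{\mu^t\min\{\mu, 1\}e^{-\mu}}{(t+1)(t)\,t^t}$$

To assess whether the tail follows a subweibull rate of decay, choose any $\lambda > 0$ and $q > 1$, then

$$\begin{aligned}
\limsup_{t\to\infty} \Pr(X > t)\exp(\lambda t^q) &\ge \min\{\mu, 1\}e^{-\mu}\lim_{t\to\infty} \frac{\mu^t}{(t+1)(t)\,t^t}\exp(\lambda t^q) \\
&= \min\{\mu, 1\}e^{-\mu}\exp\left[\lim_{t\to\infty}\left(t\log\mu - \log(t+1) - \log t - t\log t + \lambda t^q\right)\right]
\end{aligned}$$

The expression inside brackets is of the form $\infty - \infty$ so we rearrange terms and apply L'Hôpital's rule:

$$\begin{aligned}
&\lim_{t\to\infty}\left(t\log\mu - \log(t+1) - \log t - t\log t + \lambda t^q\right) \\
&\quad= \lim_{t\to\infty}(t\log t)\cdot\lim_{t\to\infty}\left[\frac{\log\mu}{\log t} - \frac{\log(t+1)}{t\log t} - \frac{1}{t} - 1 + \frac{\lambda t^q}{t\log t}\right] \\
&\quad= \lim_{t\to\infty}(t\log t)\cdot\lim_{t\to\infty}\left[0 - \frac{1/(t+1)}{1 + \log t} - (0) - 1 + \frac{\lambda q\,t^{q-1}}{1 + \log t}\right] \\
&\quad= \lim_{t\to\infty}(t\log t)\cdot\lim_{t\to\infty}\left[\frac{\lambda q(q-1)t^{q-2}}{1/t} - 1\right] = \infty
\end{aligned}$$

Therefore

$$\limsup_{t\to\infty} \Pr(X > t)\exp(\lambda t^q) = \infty$$

Since this holds for all $\lambda > 0$, $X$ cannot satisfy Proposition 3.1 and therefore is not q-subweibull for any $q > 1$.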


C Proofs for Section 4 (Concentration inequalities)

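Proposition 4.2. Light-tailed Bernstein inequality

If $X_1, \ldots, X_n$ are independent, zero-mean q-subweibull random variables with $q > 1$, then

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le \begin{cases} 2\exp\left(-\dfrac{t^2}{2K^2\sigma^2}\right) & 0 \le t \le K\sigma^2/\theta \\[2ex] 2\exp\left(-\dfrac{t^q}{C_q^q\,n^{q-1}\sigma_q^q}\right) & t > K\sigma^2/\theta \end{cases}$$

where $\sigma_q^q = \sum_{i=1}^n \|X_i\|_{\psi_q}^q$, $C_q$ is the global constant from Equation 1, and $\sigma^2$, $K$, $\theta$ are from Proposition 4.1.

Proof. By Corollary 3.0.2 each $X_i$ is subexponential, so the bound for small deviations $t$ follows directly from Proposition 4.1. For large deviations, let $S = \sum_{i=1}^n X_i$. Since $q > 1$ the Orlicz norm exists and by subadditivity,

$$\|S\|_{\psi_q} \le \sum_{i=1}^n \|X_i\|_{\psi_q} \le n^{1-1/q}\left(\sum_{i=1}^n \|X_i\|_{\psi_q}^q\right)^{1/q}$$

By Propositions 3.1 and 3.2, there exists some $K_{1a} \le C_q\|S\|_{\psi_q}$ such that

$$\Pr(|S| \ge t) \le 2\exp\left(-(t/K_{1a})^q\right) \le 2\exp\left(-\frac{t^q}{C_q^q\,n^{q-1}\sigma_q^q}\right)$$

for all $t \ge 0$. In particular, it holds for $t \ge K\sigma^2/\theta$.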

Proposition 4.3. Heavy-tailed Bernstein inequality

If X1, . . . , Xn are independent, zero-mean q-subweibull random variables with q < 1, then

(C1/ log 2) Pn i=1 |Xi|q ψ1

where C1 is the global constant from Equation 2, and |Xi|q ψ1 is the subexponential norm of |Xi|q.

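Proof. Since $f(x) = x^q$ is concave for $x \ge 0$ and $q \in (0, 1)$, and $f(0) = 0$, it is subadditive, so

$$\left|\sum_{i=1}^n X_i\right|^q \le \left(\sum_{i=1}^n |X_i|\right)^q \le \sum_{i=1}^n |X_i|^q$$

With $\lambda > 0$ this implies, by Markov's inequality and independence,

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le \Pr\left(\sum_{i=1}^n |X_i|^q \ge t^q\right) \le \prod_{i=1}^n E\left[\exp\{\lambda|X_i|^q\}\right]\exp\{-\lambda t^q\}$$

If $X_i$ is $q$-subweibull, then $|X_i|^q$ is subexponential. By Proposition 3.1 (3b), for each $X_i$ there is some $K_{3i} \le C_1\||X_i|^q\|_{\psi_1}$ such that

$$E[\exp(\lambda|X_i|^q)] \le \exp(K_{3i}\lambda)$$

if $\lambda \le 1/K_{3i}$. Therefore set $\theta_q = \max_i\{\||X_i|^q\|_{\psi_1}\}$ and restrict $\lambda \le 1/(C_1\theta_q)$. Plugging this back into the previous expression,

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le \exp\left(\sum_{i=1}^n K_{3i}\lambda - \lambda t^q\right) \le \exp\left(\lambda(C_1\sigma_q - t^q)\right)$$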


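where $\sigma_q = \sum_{i=1}^n \||X_i|^q\|_{\psi_1}$. The bound is decreasing in $\lambda$ when $t \ge t^* = (C_1\sigma_q)^{1/q}$, in which case the tightest bound occurs with $\lambda = 1/(C_1\theta_q)$. This leads to

$$\Pr\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le \exp\left(\frac{1}{C_1\theta_q}(C_1\sigma_q - t^q)\right) = \exp\left(\frac{\sigma_q}{\theta_q} - (t/K_{1b})^q\right)$$

Referring to Proposition 3.1 (1b), we can see that

$$\limsup_{t\to\infty} \frac{\exp\left(\frac{\sigma_q}{\theta_q} - (t/K_{1b})^q\right)}{\exp\left(-(t/K_{1b})^q\right)} < \infty$$

with $K_{1b} = (C_1\theta_q)^{1/q}$. Therefore the sum of $q$-subweibull random variables is also $q$-subweibull. To get a bound over all $t \ge 0$, define

$$f(t) = \frac{\sigma_q}{\theta_q} - (t/K_{1b})^q$$

$$g(t) = \log(2) - (t/K_{1a})^q$$

Since $\sigma_q/\theta_q \ge 1 > \log(2)$, we can choose $K_{1a} > K_{1b}$ such that $f(t) \le g(t) \le 0$ for $t \ge t^*$ and $0 \le g(t) \le f(t)$ for $t \le t^*$, where $f(t^*) = g(t^*) = 0$.

$$t^* = \left(\frac{\sigma_q}{\theta_q}\right)^{1/q} K_{1b} = (\log 2)^{1/q} K_{1a}$$

$$K_{1a} = \left(\frac{\sigma_q}{\theta_q\log 2}\right)^{1/q}(C_1\theta_q)^{1/q} = \left(\frac{C_1\sigma_q}{\log 2}\right)^{1/q}$$

With this choice, for $t \ge t^*$ the tail probability is at most $\exp(f(t)) \le \exp(g(t)) = 2\exp(-(t/K_{1a})^q)$, while for $t \le t^*$ we have $\exp(g(t)) \ge 1$, so the bound holds trivially.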
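The choice of $K_{1a}$ can be checked numerically. The sketch below computes $K_{1a}$, $K_{1b}$ and $t^*$ from the formulas above and evaluates $f$ and $g$. The helper names and the values $\sigma_q = 3$, $\theta_q = 1.5$, $C_1 = 2$, $q = 0.5$ are hypothetical placeholders, since the actual value of the global constant $C_1$ is not given in this section.

```python
import math


def crossing_constants(sigma_q, theta_q, c1, q):
    """Return (K1a, K1b, t_star) as defined in the proof of Proposition 4.3."""
    k1b = (c1 * theta_q) ** (1.0 / q)
    k1a = (c1 * sigma_q / math.log(2.0)) ** (1.0 / q)
    t_star = (c1 * sigma_q) ** (1.0 / q)
    return k1a, k1b, t_star


def f(t, sigma_q, theta_q, k1b, q):
    """Log of the Chernoff-type bound, valid for t >= t_star."""
    return sigma_q / theta_q - (t / k1b) ** q


def g(t, k1a, q):
    """Log of the target bound 2 * exp(-(t / K1a)^q)."""
    return math.log(2.0) - (t / k1a) ** q


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    sigma_q, theta_q, c1, q = 3.0, 1.5, 2.0, 0.5
    k1a, k1b, t_star = crossing_constants(sigma_q, theta_q, c1, q)
    print(k1a, k1b, t_star)
    print(f(t_star, sigma_q, theta_q, k1b, q), g(t_star, k1a, q))
```

Both functions vanish at $t^*$, and because $f$ has the larger leading constant it falls below $g$ for every $t > t^*$.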

D Proofs for Section 5 (Exponential tilting)

Proposition 5.1. If $X \sim F$ is a random variable having at least one light tail then exponential tilting is possible for all $\theta$ in some open interval $(-S, T)$ with $S, T \ge 0$ and $S + T > 0$. The resulting tilted distribution $F_\theta$ is subexponential with MGF $M_Z(t) = L_X(-\theta - t)/L_X(-\theta)$ finite for all $t \in (-S - \theta, T - \theta)$.

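Proof. Without loss of generality assume the right tail is light so $L_X(-\theta) < \infty$ for some $\theta > 0$. For all $\theta' \in [0, \theta)$, since $\exp(\theta' x) \le 1 + \exp(\theta x)$ for every real $x$, we have $L_X(-\theta') = E[\exp(\theta' X)] \le 1 + E[\exp(\theta X)] < \infty$.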

Set $T = \sup\{\theta : L_X(-\theta) < \infty\} > 0$. If $X$ has a heavy left tail then $L_X(-\theta) = \infty$ for all $\theta < 0$, so the interval of convergence is $(-S, T)$ with $S = 0$. If $X$ has a light left tail then we can set $S = -\inf\{\theta : L_X(-\theta) < \infty\} > 0$. This establishes the interval is $(-S, T)$ with $S, T \ge 0$ and $S + T > 0$. Let $Z \sim F_\theta$ follow the tilted distribution with $\theta \in (-S, T)$. Its BLT is

$$L_Z(t) = E[\exp(-tZ)] = \int \exp(-tz)\,dF_\theta(z) = \int \exp(-tx)\frac{\exp(\theta x)}{L_X(-\theta)}\,dF(x)$$

$$= E[\exp(-(t - \theta)X)]/L_X(-\theta) = L_X(-(\theta - t))/L_X(-\theta)$$

This is finite when $\theta - t \in (-S, T)$ or equivalently $t \in (-T + \theta, S + \theta)$. Since $\theta \in (-S, T)$, the interval of convergence for $L_Z(t)$ is an open interval containing zero, which proves $Z$ is subexponential and has the MGF

$$M_Z(t) = L_Z(-t) = L_X(-\theta - t)/L_X(-\theta)$$

which is finite on the interval $t \in (-S - \theta, T - \theta)$.
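As a concrete check of the MGF formula, consider a unit-rate exponential variable, for which $L_X(-s) = E[\exp(sX)] = 1/(1 - s)$ when $s < 1$, so that $S = \infty$ and $T = 1$. This example is an assumption chosen for illustration and is not taken from the proof. The sketch compares the closed form $M_Z(t) = (1 - \theta)/(1 - \theta - t)$ with a midpoint-rule integral against the tilted density; the function names are hypothetical.

```python
import math


def tilted_mgf_formula(t, theta):
    """M_Z(t) = L_X(-theta - t) / L_X(-theta) for X ~ Exponential(rate 1)."""
    # L_X(-s) = 1 / (1 - s) is finite only for s < 1.
    if theta >= 1.0 or theta + t >= 1.0:
        return math.inf
    return (1.0 - theta) / (1.0 - theta - t)


def tilted_mgf_numeric(t, theta, upper=200.0, steps=200_000):
    """Midpoint-rule approximation of E[exp(t Z)]; assumes theta + t < 1."""
    h = upper / steps
    norm = 1.0 / (1.0 - theta)  # L_X(-theta)
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        # exp(t x) times the tilted density exp(theta x) exp(-x) / L_X(-theta)
        total += math.exp((t + theta - 1.0) * x) / norm
    return total * h


if __name__ == "__main__":
    print(tilted_mgf_formula(0.2, 0.3), tilted_mgf_numeric(0.2, 0.3))
```

The closed form becomes infinite as soon as $t \ge T - \theta = 1 - \theta$, matching the interval stated in the proposition.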

Lemma 5.0.1. Preservation of nonnegative subweibull tails under exponential tilting. Let $\theta$ be any real number. If $X \sim F$ is nonnegative and $q$-subweibull ($q > 1$), then the exponentially tilted variable $Z \sim F_\theta$ is also nonnegative and $q$-subweibull with the same radius of convergence.


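1. $E[\exp(\lambda^q X^q)] < \infty$ for all $\lambda \in [0, R_q)$ implies $E[\exp(\lambda^q Z^q)] < \infty$ for all $\lambda \in [0, R_q)$.

2. $E[\exp(\lambda^q X^q)] = \infty$ for all $\lambda > R_q$ implies $E[\exp(\lambda^q Z^q)] = \infty$ for all $\lambda > R_q$.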

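Proof. If $X$ is $q$-subweibull with $q > 1$ then by Corollary 3.0.2 it is strictly subexponential and $L_X(-\theta) < \infty$ for all $\theta \in \mathbb{R}$. Let $Z \sim F_\theta$. The MGF of $Z^q$ is

$$E[\exp(\lambda^q Z^q)] = \frac{1}{L_X(-\theta)}\int \exp(\lambda^q x^q + \theta x)\,dF(x)$$

Since $L_X(-\theta)$ is finite and positive, it suffices to determine when the integral is finite.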

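(1) case of $\lambda < R_q$. If $\theta \le 0$ then

$$\int \exp(\lambda^q x^q + \theta x)\,dF(x) \le \int \exp(\lambda^q x^q + 0)\,dF(x) = E[\exp(\lambda^q X^q)] < \infty$$

If $\theta > 0$, choose $\rho \in (\lambda, R_q)$ and define

$$x^* = \left(\frac{\theta}{\rho^q - \lambda^q}\right)^{1/(q-1)}$$

Then for $x > x^*$, $\lambda^q x^q + \theta x \le \rho^q x^q$. Therefore,

$$\int \exp(\lambda^q x^q + \theta x)\,dF(x) = \int_0^{x^*} \exp(\lambda^q x^q + \theta x)\,dF(x) + \int_{x^*}^\infty \exp(\lambda^q x^q + \theta x)\,dF(x)$$

$$\le \int_0^{x^*} \exp\left(\theta x^* + \lambda^q (x^*)^q\right)dF(x) + \int_{x^*}^\infty \exp(\rho^q x^q)\,dF(x)$$

$$\le \exp\left(\theta x^* + \lambda^q (x^*)^q\right)\Pr(X \le x^*) + \int_0^\infty \exp(\rho^q x^q)\,dF(x) < \infty$$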

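(2) case of $\lambda > R_q$. If $\theta \ge 0$ then

$$\int \exp(\lambda^q x^q + \theta x)\,dF(x) \ge \int \exp(\lambda^q x^q + 0)\,dF(x) = E[\exp(\lambda^q X^q)] = \infty$$

If $\theta < 0$, choose $\rho \in (R_q, \lambda)$ and define

$$x^* = \left(\frac{-\theta}{\lambda^q - \rho^q}\right)^{1/(q-1)}$$

Then for $x > x^*$, $\lambda^q x^q + \theta x \ge \rho^q x^q$. Therefore,

$$\int \exp(\lambda^q x^q + \theta x)\,dF(x) = \int_0^{x^*} \exp(\lambda^q x^q + \theta x)\,dF(x) + \int_{x^*}^\infty \exp(\lambda^q x^q + \theta x)\,dF(x)$$

$$\ge \int_0^{x^*} \exp(\lambda^q x^q + \theta x)\,dF(x) + \int_{x^*}^\infty \exp(\rho^q x^q)\,dF(x)$$

The first term is finite. We will show the second term is infinite. By assumption,

$$\infty = \int_0^\infty \exp(\rho^q x^q)\,dF(x) = \int_0^{x^*} \exp(\rho^q x^q)\,dF(x) + \int_{x^*}^\infty \exp(\rho^q x^q)\,dF(x)$$

and

$$\int_0^{x^*} \exp(\rho^q x^q)\,dF(x) \le \exp\left(\rho^q (x^*)^q\right)\Pr(X \le x^*) < \infty$$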


Therefore

$$\int_{x^*}^\infty \exp(\rho^q x^q)\,dF(x) = \infty$$

implying

$$\int \exp(\lambda^q x^q + \theta x)\,dF(x) = \infty$$
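Both cases of the proof hinge on a threshold $x^*$ beyond which the linear term $\theta x$ is absorbed by the gap between $\lambda^q x^q$ and $\rho^q x^q$. The sketch below computes the two thresholds and checks the sign of that gap on either side. The function names and numeric values are assumptions made for illustration.

```python
def threshold_case1(theta, lam, rho, q):
    """x* for theta > 0 and lam < rho: past x*, lam^q x^q + theta x <= rho^q x^q."""
    return (theta / (rho ** q - lam ** q)) ** (1.0 / (q - 1.0))


def threshold_case2(theta, lam, rho, q):
    """x* for theta < 0 and rho < lam: past x*, lam^q x^q + theta x >= rho^q x^q."""
    return (-theta / (lam ** q - rho ** q)) ** (1.0 / (q - 1.0))


def exponent_gap(x, theta, lam, rho, q):
    """rho^q x^q minus (lam^q x^q + theta x); its sign decides the comparison."""
    return rho ** q * x ** q - (lam ** q * x ** q + theta * x)


if __name__ == "__main__":
    # Hypothetical values with q = 2.
    x1 = threshold_case1(theta=2.0, lam=0.5, rho=0.6, q=2.0)
    x2 = threshold_case2(theta=-2.0, lam=0.8, rho=0.75, q=2.0)
    print(x1, exponent_gap(1.01 * x1, 2.0, 0.5, 0.6, 2.0))
    print(x2, exponent_gap(1.01 * x2, -2.0, 0.8, 0.75, 2.0))
```

In case (1) the gap is nonnegative beyond $x^*$, so the integrand is dominated by $\exp(\rho^q x^q)$; in case (2) the gap is nonpositive, so the integrand dominates $\exp(\rho^q x^q)$.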

Theorem 5.1. Preservation of subweibull tails under exponential tilting. Let θ be any real number.

1. If $X \sim F$ is $q$-subweibull ($q > 1$) with radius of convergence $R_q$, then the exponentially tilted variable $Z \sim F_\theta$ is also $q$-subweibull and has the same radius of convergence.

2. If $X \sim F$ is strictly $q$-subweibull ($q \ge 1$), the exponentially tilted variable $Z \sim F_\theta$ is also strictly $q$-subweibull.

3. If $X \sim F$ is not $q$-subweibull ($q > 1$), then $Z \sim F_\theta$ is also not $q$-subweibull.

Proof. (1) By Corollary 3.0.2, $X$ is strictly subexponential so $L_X(-\theta) < \infty$ for all $\theta \in \mathbb{R}$. Choose an arbitrary $\theta$ and set $M_1 = L_X(-\theta)$. Define nonnegative random variables $A = [-X \mid X < 0]$ and $B = [X \mid X \ge 0]$ with distributions $F^-$ and $F^+$, respectively. By Lemma 3.0.1 both $A$ and $B$ are $q$-subweibull and strictly subexponential. Let $R_{qa}$ and $R_{qb}$ be the radii of convergence of $A$ and $B$, respectively. Let $p = \Pr(X < 0)$ and assume $p \notin \{0, 1\}$ (otherwise simply apply Lemma 5.0.1 to $X$ or $-X$). Note that

$$M_1 = L_A(\theta)\,p + L_B(-\theta)(1 - p) \tag{3}$$

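$$\begin{aligned}
E[\exp(\lambda^q|Z|^q)] &= \int \exp(\lambda^q|z|^q)\,dF_\theta(z) = \int \exp(\lambda^q|x|^q)\frac{\exp(\theta x)}{M_1}\,dF(x) \\
&= \frac{1}{M_1}E[\exp(\lambda^q|X|^q + \theta X)] \\
&= \frac{p}{M_1}E[\exp(\lambda^q A^q - \theta A)] + \frac{1-p}{M_1}E[\exp(\lambda^q B^q + \theta B)] \\
&= \frac{p\,L_A(\theta)}{M_1}\int_0^\infty \exp(\lambda^q x^q)\frac{\exp(-\theta x)}{L_A(\theta)}\,dF^-(x) \\
&\qquad + \frac{(1-p)L_B(-\theta)}{M_1}\int_0^\infty \exp(\lambda^q x^q)\frac{\exp(\theta x)}{L_B(-\theta)}\,dF^+(x) \\
&= \tilde{p}\int_0^\infty \exp(\lambda^q z^q)\,dF^-_{(-\theta)}(z) + (1-\tilde{p})\int_0^\infty \exp(\lambda^q z^q)\,dF^+_\theta(z) \\
&= \tilde{p}\,E[\exp(\lambda^q U^q)] + (1-\tilde{p})\,E[\exp(\lambda^q V^q)]
\end{aligned}$$

where (see Equation 3) $\tilde{p} = p\,L_A(\theta)/M_1$, so that $\tilde{p} \in (0, 1)$. The nonnegative random variable $U$ is distributed as $F^-_{(-\theta)}$, which is the exponential tilting of $A \sim F^-$ by $-\theta$, and $V \sim F^+_\theta$ is similarly defined as the exponential tilting of $B \sim F^+$ by $\theta$. By Lemma 5.0.1, this implies $U$ and $V$ are $q$-subweibull with radii of convergence $R_{qa}$ and $R_{qb}$, respectively. Let $R_{qz}$ be the radius of convergence of $Z$. Note that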

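$$\Pr(Z \ge 0) = \int_0^\infty dF_\theta(z) = \int_0^\infty \frac{\exp(\theta x)}{L_X(-\theta)}\,dF(x) = \frac{E[\exp(\theta X) \mid X \ge 0]\Pr(X \ge 0)}{M_1}$$

$$= \frac{E[\exp(\theta B)](1 - p)}{M_1} = \frac{L_B(-\theta)(1 - p)}{M_1}$$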


So $\Pr(Z < 0) = \tilde{p}$. By Lemma 3.0.1 this implies $Z$ is $q$-subweibull with radius of convergence $\min\{R_{qa}, R_{qb}\}$, which is also the radius of convergence of $X$.

(2) For $q = 1$ apply Proposition 5.1 with $S = \infty$ and $T = \infty$. For $q > 1$, apply (1) with $R_q = \infty$.

(3) This is a direct corollary of (1) obtained in the case of $R_q = 0$.
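Part (1) can be illustrated with a standard Gaussian, an example assumed here for concreteness rather than taken from the proof. Tilting $X \sim N(0, 1)$ by $\theta$ gives $Z \sim N(\theta, 1)$, and $E[\exp(\lambda^2 Z^2)] = (1 - 2\lambda^2)^{-1/2}\exp\{\theta^2\lambda^2/(1 - 2\lambda^2)\}$, which is finite exactly when $\lambda^2 < 1/2$ regardless of $\theta$. The sketch compares this closed form with a midpoint-rule integral against the tilted density; the function names are hypothetical.

```python
import math


def closed_form(lam, theta):
    """E[exp(lam^2 Z^2)] for Z ~ N(theta, 1); infinite when lam^2 >= 1/2."""
    s = lam ** 2
    if s >= 0.5:
        return math.inf
    return math.exp(theta ** 2 * s / (1.0 - 2.0 * s)) / math.sqrt(1.0 - 2.0 * s)


def numeric(lam, theta, half_width=60.0, steps=240_000):
    """Midpoint rule for exp(lam^2 x^2) integrated against the tilted N(0, 1) law.

    Assumes lam^2 < 1/2 so that the integrand decays.
    """
    s = lam ** 2
    h = 2.0 * half_width / steps
    log_m1 = 0.5 * theta ** 2  # log L_X(-theta) for X ~ N(0, 1)
    log_norm = 0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        # log of tilted density: standard normal log density + theta x - log M1
        log_density = -0.5 * x * x - log_norm + theta * x - log_m1
        total += math.exp(s * x * x + log_density)
    return total * h


if __name__ == "__main__":
    print(closed_form(0.5, 1.0), numeric(0.5, 1.0))
```

The threshold $\lambda = 1/\sqrt{2}$ does not move as $\theta$ varies, consistent with the tilted variable sharing the radius of convergence of $X$.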