Journal of Machine Learning Research 24 (2023) 1-61. Submitted 1/23; Revised 11/23; Published 12/23.

Semiparametric Inference Using Fractional Posteriors

Alice L'Huillier* (alice.lhuillier@sorbonne-universite.fr), LPSM, Sorbonne Université, 4 place Jussieu, 75005 Paris, France

Luke Travis* (luke.travis15@imperial.ac.uk), Department of Mathematics, Imperial College London, London SW7 2AZ, United Kingdom

Ismaël Castillo (ismael.castillo@upmc.fr), LPSM, Sorbonne Université, 4 place Jussieu, 75005 Paris, France

Kolyan Ray (kolyan.ray@imperial.ac.uk), Department of Mathematics, Imperial College London, London SW7 2AZ, United Kingdom

Editor: Debdeep Pati

*. Equal contribution.

©2023 Alice L'Huillier, Luke Travis, Ismaël Castillo and Kolyan Ray. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/23-0089.html.

Abstract

We establish a general Bernstein–von Mises theorem for approximately linear semiparametric functionals of fractional posterior distributions based on nonparametric priors. This is illustrated in a number of nonparametric settings and for different classes of prior distributions, including Gaussian process priors. We show that fractional posterior credible sets can provide reliable semiparametric uncertainty quantification, but have inflated size. To remedy this, we further propose a shifted-and-rescaled fractional posterior set that is an efficient confidence set having optimal size under regularity conditions. As part of our proofs, we also refine existing contraction rate results for fractional posteriors by sharpening the dependence of the rate on the fractional exponent.

Keywords: fractional posteriors, Bernstein–von Mises theorem, uncertainty quantification, Gaussian processes, histograms.

1. Introduction

In this work, we establish theoretical guarantees for the fractional (also called tempered, or $\alpha_n$-) posterior, which is obtained in a similar way to the usual Bayesian posterior distribution, but with the likelihood raised to a power $\alpha_n \in (0,1]$. Suppose that we model data $Y = Y^n$ with a log-likelihood $\ell_n(\eta; Y^n) = \ell_n(\eta)$, and that we assign a prior distribution $\Pi = \Pi_n$ to the parameter $\eta \in S$. The fractional posterior is then defined as

$$\Pi_{\alpha_n}(B \mid Y^n) = \frac{\int_B e^{\alpha_n \ell_n(\eta)}\, d\Pi(\eta)}{\int_S e^{\alpha_n \ell_n(\eta)}\, d\Pi(\eta)}, \qquad B \text{ measurable}. \qquad (1)$$

One interpretation is that $\alpha_n$ induces a tempering effect: for $\alpha_n < 1$ the contribution of the data in Bayes' formula is downweighted, thus lowering the importance of the data relative to the prior. When $\alpha_n = 1$, this reduces to the usual posterior distribution. We study here the frequentist behaviour of the fractional posterior for semiparametric inference, that is, when estimating a low-dimensional functional $\psi(\eta)$ of the parameter $\eta$ when the latter is assigned a high- or infinite-dimensional prior. As reflected in our notation, we allow the power $\alpha_n$ to depend on $n$.

Fractional posteriors have been used in a wide variety of settings, including Bayesian model selection (O'Hagan, 1995), marginal likelihood approximation (Friel and Pettitt, 2008), empirical Bayes methods (Martin and Tang, 2020) and more recently variational inference (Alquier et al., 2016; Huang et al., 2018; Burgess et al., 2017; Alquier and Ridgway, 2020; Medina et al., 2022). One motivation for their use in statistical inference is their greater robustness to possible model misspecification compared to the usual Bayesian posterior.
Grünwald and van Ommen (2017) empirically demonstrate that in a misspecified linear regression setting, fractional posteriors can outperform traditional posteriors, motivating their "safe Bayesian" approach (Grünwald, 2012, 2018), which consists of a data-driven choice of $\alpha_n$. The C-posterior (Miller and Dunson, 2019) is another special case of the fractional posterior, which has empirically been shown to be more robust to model misspecification than the full posterior in specific examples. Bissiri et al. (2016) argue that within a decision-theoretic framework, fractional posteriors can be viewed as principled ways to update prior beliefs. In particular, under model misspecification, they show that a choice $\alpha_n \neq 1$ may be necessary for good performance. Computationally, running parallel chains that target fractionally downweighted (tempered) distributions can also improve sampling convergence and yield faster mixing times (Geyer and Thompson, 1995).

In all cases, the choice of the fractional power $\alpha_n$, often termed the learning rate, plays a key role. There are many proposals for picking this (see, e.g., Grünwald, 2012; Grünwald and van Ommen, 2017; Holmes and Walker, 2017; Lyddon et al., 2019; Syring and Martin, 2019), each aiming to achieve a different target. However, one common and major motivation for using generalized Bayesian methods is to provide uncertainty quantification via generalized posterior credible sets, whose performance is sensitive to the choice of $\alpha_n$ in practice (Wu and Martin, 2023). This motivates our work, whose main contribution is to obtain a precise theoretical characterization of the role of $\alpha_n$ for some widely used Bayesian nonparametric priors, in particular Gaussian processes and histograms. More precisely, for these common high- and infinite-dimensional priors, we obtain nonparametric convergence rates and semiparametric Bernstein–von Mises theorems having the correct dependence on both $n$ and $\alpha_n$. We further use these insights to construct rescaled credible sets from the $\alpha_n$-posterior that are optimal from an information-theoretic perspective for uncertainty quantification.

To both gain some intuition for the results ahead and relate these to the existing literature, consider the simple parametric example where we observe $Y_1, \dots, Y_n \overset{iid}{\sim} N(\theta, 1)$ with a conjugate prior $\Pi = N(\mu, \sigma^2)$ for $\theta$. A direct calculation yields the fractional posterior

$$\Pi_{\alpha_n}[\,\cdot \mid Y_1, \dots, Y_n] = N\Big( \frac{n\alpha_n \bar{Y}_n + \mu\sigma^{-2}}{n\alpha_n + \sigma^{-2}},\ \frac{1}{n\alpha_n + \sigma^{-2}} \Big) \approx N\Big( \hat\theta_{MLE},\ \frac{1}{n\alpha_n} I_0^{-1} \Big), \qquad (2)$$

where in this model the MLE equals the sample mean $\hat\theta_{MLE} = \bar{Y}_n$, the Fisher information is $I_0 = 1$, and the last (Bernstein–von Mises) approximation holds as $n\alpha_n \to \infty$. Observe that (i) the $\alpha_n$-posterior above can be obtained from the original posterior by replacing $n$ by the effective sample size $n\alpha_n$, so that the tempering effect means one effectively uses only $\bar{n} = n\alpha_n$ of the data, with the exception that the centering $\bar{Y}_n$ remains identical. Second, (ii) the posterior variance scales as $(n\alpha_n)^{-1}$ for large $n$, and hence the diameter of a credible set constructed from two-sided $\alpha_n$-posterior quantiles is enlarged by a multiplicative factor of order $1/\sqrt{\alpha_n}$ compared to the traditional posterior. Third, (iii) the choice of $\alpha_n$ does not asymptotically affect the location of the $\alpha_n$-posterior mean. Combined with (ii), this implies that credible sets from the $\alpha_n$-posterior do not have the correct frequentist coverage asymptotically, being conservative (too large).
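These three observations can be checked numerically from the closed form (2). The following is a minimal R sketch (the sample size, prior parameters and the value $\alpha_n = 1/4$ are illustrative choices):

```r
# Conjugate fractional posterior in the model Y_i ~ N(theta, 1) with a
# N(mu, sigma2) prior on theta; all formulas follow the closed form (2).
set.seed(1)
n      <- 5000
theta0 <- 2                      # true mean
mu     <- 0; sigma2 <- 1         # prior hyperparameters
y      <- rnorm(n, theta0, 1)

frac_posterior <- function(alpha) {
  m <- (n * alpha * mean(y) + mu / sigma2) / (n * alpha + 1 / sigma2)
  v <- 1 / (n * alpha + 1 / sigma2)
  c(mean = m, sd = sqrt(v))
}

full <- frac_posterior(1)
frac <- frac_posterior(1 / 4)

# (i), (iii): both centerings are close to the MLE (the sample mean);
# (ii): the fractional credible interval is wider by ~ 1/sqrt(alpha) = 2.
c(full["mean"], frac["mean"], mean(y))
unname(frac["sd"] / full["sd"])  # approximately 2
```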
In view of these observations, our main results can be heuristically summarized as implying that for $\alpha_n$-posteriors based on common Bayesian nonparametric priors in the well-specified setting:

1. The $\alpha_n$-posterior contraction rate is of the same form as the full posterior contraction rate, but with the sample size $n$ replaced by the effective sample size $\bar{n} = n\alpha_n$.

2. For semiparametric Bayesian inference involving sufficiently regular low-dimensional functionals, a Bernstein–von Mises distributional approximation holds as in (2).

3. Under regularity conditions, suitably rescaled credible sets from the $\alpha_n$-posterior have asymptotically correct frequentist coverage and information-theoretically optimal diameter, and are thus efficient confidence sets (unlike the standard credible sets).

The Bernstein–von Mises (BvM) distributional approximation in (2) has been extended for the $\alpha_n$-posterior to general regular low-dimensional parametric models (Miller, 2021; Medina et al., 2022). However, such proof techniques do not extend to the present semiparametric setting, where one wishes to estimate a finite-dimensional functional in the presence of a high- or infinite-dimensional prior, such as a Gaussian process. In Section 2, we derive semiparametric BvM results analogous to (2) for the $\alpha_n$-posterior by building on the ideas of Castillo and Rousseau (2015). We apply these results to the concrete examples of density estimation and the nonparametric Gaussian white noise model, illustrating our results using histogram and Gaussian process priors, including for the standard Matérn and squared exponential covariance kernels. Since the $\alpha_n$-posterior variance inflates the usual posterior variance, the resulting credible sets can be much larger than needed, leading to conservative uncertainty quantification in the well-specified setting. In Section 3, we further show that suitably rescaled credible sets can correct for this, yielding optimal (efficient) uncertainty quantification and potentially mitigating one of the downsides of fractional posterior inference.

Unlike semiparametric BvM results, nonparametric contraction rates for $\alpha_n$-posteriors have previously been studied in the literature. When the model is well-specified, these often require weaker conditions for convergence and sometimes lead to simpler proofs compared to the usual posterior. A remarkable result is that when $\alpha_n < 1$, the testing or metric entropy conditions typically needed for deriving posterior convergence rates as in Ghosal et al. (2000); Ghosal and van der Vaart (2017) are not needed for the fractional posterior, at least when convergence is expressed in terms of certain information-theoretic distances such as Rényi divergences. This was established by Zhang (2006) (and was earlier obtained for consistency by Walker and Hjort, 2001); see also Kruijer and van der Vaart (2013); Bhattacharya et al. (2019); Grünwald and Mehta (2020) for related results and examples (we refer to Ghosal and van der Vaart, 2017, Chapters 6 and 8, for further results and historical notes). This means that using fractional posteriors often allows one to broaden the set of priors or models for which desirable properties are obtained compared to usual posteriors, as only a prior mass condition is needed, avoiding sometimes delicate constructions with sieve sets in order to keep entropies under control.
However, the works Zhang (2006); Kruijer and van der Vaart (2013); Bhattacharya et al. (2019); Grünwald and Mehta (2020) do not seek to obtain a sharp dependence of the rate on $\alpha_n$, and do not yield sharp results in the norms we are interested in; see Section 4 for more discussion. In particular, we show that one can recover the heuristic idea that the fractional posterior uses only a fraction $\bar{n} = n\alpha_n$ of the data. Such sharp nonparametric contraction rates for the $\alpha_n$-posterior in terms of both $n$ and $\alpha_n$ are needed to obtain precise semiparametric BvM results.

In this paper, we restrict to well-specified nonparametric models. Compared to parametric models, nonparametric models attempt to be sufficiently broad that model misspecification is unlikely, so that the well-specified case covers a far larger set of situations. There are nonetheless important notions of nonparametric model misspecification (e.g. Ghosal and van der Vaart, 2017, Chapter 8.5) that will be dealt with in future work. Note that the choice $\alpha_n > 1$, which is not covered by our results, is also used in the literature, for instance in variational inference (Alquier et al., 2016; Burgess et al., 2017) and distributed Bayesian computation (Szabó and van Zanten, 2019). Finally, the fractional posterior is a special case of a Gibbs posterior (Jiang and Tanner, 2008), where one replaces the log-likelihood with (the negative of) a risk function, and with a multiplicative constant $\lambda$, also called the inverse-temperature parameter, playing the role of $\alpha_n$. Gibbs posteriors appear naturally in the study of PAC-Bayesian bounds, see Catoni (2004, 2007) and the recent overview by Alquier (2024). Although we focus here on the special case of the log-likelihood, it would be interesting to also investigate questions similar to those of the present paper for $\lambda$.

Outline. In this paper, we investigate the behaviour of fractional posteriors both for functionals of infinite-dimensional models (the semiparametric problem, see Sections 2 and 3) and for contraction rates of the overall unknown parameter (the nonparametric problem, see Section 4). We start each main section with a general result valid under fairly generic conditions, which we then apply to specific models and priors. In particular, we will consider three main example cases: the nonparametric Gaussian white noise model, density estimation with random histogram priors, and density estimation with exponentiated Gaussian process priors.

In Section 2, we study the semiparametric problem and investigate the distribution induced from the fractional posterior on a functional $\psi(\eta)$, where $\eta$ is an infinite-dimensional parameter. We show that under certain conditions, the fractional posterior distribution of $\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)$, with $\hat\psi$ an efficient estimator of $\psi$, converges to a normal distribution with variance equal to the efficient information bound for estimating the functional. In some cases, the conditions for this to hold differ slightly from those needed for the classical posterior with $\alpha_n = 1$ studied in Castillo and Rousseau (2015). Although this posterior asymptotic normality (which we shall call the $\alpha_n$-BvM result) is of interest in itself, it also implies that credible sets from the $\alpha_n$-posterior are length-inflated by a factor $1/\sqrt{\alpha_n}$ compared to the case $\alpha_n = 1$, giving them large (conservative) coverage but making them inefficient. In Section 3, we study the frequentist coverage properties of a shifted and dilated version of the $\alpha_n$-credible sets.
Under an appropriate condition on the centering of the $\alpha_n$-BvM result, which can always be verified if $\alpha_n$ is bounded from below, the transformed credible set is shown to be an asymptotically optimal credible set, thereby remedying this issue. We show that when $\alpha_n$ may go to zero, this is no longer necessarily the case, and assessing coverage becomes more delicate.

Nonparametric contraction rates are studied in Section 4. We first obtain a generic result for the contraction rate of the $\alpha_n$-posterior in terms of a Rényi divergence and under a prior mass condition only, slightly sharpening the recent result by Bhattacharya et al. (2019). We then show that under further entropy conditions (Ghosal and van der Vaart, 2017), one can improve this rate in certain regimes of $\alpha_n$, in particular deriving the expected nonparametric rate with $n$ replaced by the effective sample size $\bar{n} = n\alpha_n$, thereby generalising the very specific one-dimensional Gaussian example above to the infinite-dimensional setting. We also briefly discuss supremum-norm contraction rates, and show that the above message still holds. Our results are investigated in the three concrete example settings mentioned above. Note that we restrict to these settings for simplicity of exposition, but our results can be applied much more broadly to settings where the semiparametric BvM tools discussed in the next sections can be deployed, which includes contexts as different as inverse problems (Nickl, 2022), survival analysis (Castillo and van der Pas, 2021), inference for diffusions (Nickl and Ray, 2020), causal inference (Ray and van der Vaart, 2020), etc. We also perform simulations which confirm that the derived asymptotic theoretical properties are empirically relevant and observable at reasonable finite sample sizes: in particular, we illustrate that the modified credible sets have close to optimal coverage already at moderate sample size.

Framework and notation. Throughout the paper, we consider the following general setting. Let $(\mathcal{Y}_n, \mathcal{A}_n, P^n_\eta : \eta \in S)$ be a sequence of statistical experiments indexed by a parameter $\eta$, where $Y = Y^n$ are the observations, $S$ is a metric measure space, and $n$ is an indexing parameter quantifying the available amount of information. For each $n \in \mathbb{N}$ and $\eta \in S$, we assume that $P^n_\eta$ admits a density $p^n_\eta$ relative to a $\sigma$-finite measure $\mu_n$ defined on the measurable space $(\mathcal{Y}_n, \mathcal{A}_n)$.

Throughout the following, we make a number of notational simplifications, enumerated here. We write $P^n_{\eta_0} =: P_0$ for the probability under the true parameter $\eta_0$, $E^n_{\eta_0} =: E_0$ for the corresponding expectation under $P_0$, $o_{P^n_{\eta_0}}(1) =: o_P(1)$ for a term which is $o(1)$ in $P_0$-probability, $\Pi_n =: \Pi$ for a prior which may depend on $n$, $\Pi_{\alpha_n}(\cdot \mid Y^n)$ for the $\alpha_n$-posterior distribution, and $E_{\alpha_n}(\cdot \mid Y^n)$ for the expectation with respect to the $\alpha_n$-posterior. We study frequentist properties of the $\alpha_n$-posterior distribution as $n \to \infty$, that is, assuming the observation $Y$ is distributed according to $P^n_{\eta_0}$ for some true value $\eta_0$ of the parameter. We consider the regime $n \to \infty$ with $\alpha_n \in (0,1]$ such that $\bar{n} = n\alpha_n \to \infty$, with further conditions on $\alpha_n$ required for some results. The condition $\bar{n} \to \infty$ is minimal for asymptotic results given the interpretation of $\bar{n}$ as the effective sample size used by the fractional posterior, see (2).
Of particular interest is the regime $\alpha_n \to 0$, since several existing results in the literature hold for "$\alpha$ small enough", for instance robustness to misspecification of both fractional posteriors (Grünwald and van Ommen, 2017) and their variational approximations (Medina et al., 2022).

2. Semiparametric Bernstein–von Mises Theorems

Using a nonparametric statistical model provides generality and flexibility, and global nonparametric rates for fractional posteriors will be discussed in Section 4. Even in this general setting, it is often the case that statisticians are interested in estimating a finite-dimensional parameter or aspect of the model, the so-called semiparametric problem. Perhaps the simplest example is, say in density estimation to fix ideas, the problem of estimating a linear functional $\int_0^1 af$ of the unknown density $f$, where $a$ is a given square-integrable function (e.g. the indicator of an interval). We have seen that in the simple one-dimensional example in the introduction, the $\alpha_n$-posterior gives a distribution that is inflated by a factor of size roughly $1/\sqrt{\alpha_n}$ compared to the classical posterior. In this section, we will show that this in fact corresponds to a general phenomenon which carries over to the estimation of many semiparametric functionals. As mentioned earlier, we allow $\alpha_n \in (0,1]$ to depend on $n$ and assume $n\alpha_n \to \infty$ as $n \to \infty$.

More precisely, given a functional $\psi : S \to \mathbb{R}$ of interest, we wish to study the properties of the marginal $\alpha_n$-posterior distribution of $\psi(\eta)$, i.e. the push-forward measure of the $\alpha_n$-posterior defined by (1) through the map $\psi$. We first consider a fairly general setting and introduce sufficient conditions for the posterior distribution to be asymptotically Gaussian (in a sense given in the next paragraph) with an optimal (efficient) variance. Afterwards, we apply this general result to the Gaussian white noise model and density estimation.

We say that a distribution $Q_Y$ on $\mathbb{R}$, depending on the data $Y$, converges weakly in $P_0$-probability to a Gaussian distribution $N(0,V)$, denoted $Q_Y \rightsquigarrow N(0,V)$, if, as $n \to \infty$,

$$d_{BL}(Q_Y, N(0,V)) \to^{P_0} 0, \qquad (3)$$

where $d_{BL}$ is the bounded Lipschitz distance between probability distributions on $\mathbb{R}$ (the latter distance metrises weak convergence, see Chapter 11 of Dudley, 2002). In the sequel, we take $Q_Y$ to be a re-centered and re-scaled version of the $\alpha_n$-posterior distribution induced on the functional $\psi(\eta)$. More precisely, given a rate $v_n$ and a centering $\mu = \mu(Y)$, consider the map $\tau_\psi : \eta \mapsto v_n(\psi(\eta) - \mu)$. Below we will say that the $\alpha_n$-posterior distribution of $v_n(\psi(\eta) - \mu)$ converges weakly in $P_0$-probability to a $N(0,V)$ distribution if (3) holds for $Q_Y = \Pi_{\alpha_n}[\cdot \mid Y] \circ \tau_\psi^{-1}$, that is, for the push-forward measure of the $\alpha_n$-posterior through $\tau_\psi$. To establish (3), one can, for instance, verify that Laplace transforms converge in $P_0$-probability, see Castillo and Rousseau (2015) for details. When $v_n = \sqrt{n\alpha_n}$ and $\mu = \hat\psi$ is an efficient estimator of $\psi(\eta)$, writing $\mathcal{L}_{\alpha_n}(\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi) \mid Y)$ for the marginal $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)$, the above says that

$$\mathcal{L}_{\alpha_n}(\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi) \mid Y) \rightsquigarrow N(0,V)$$

as $n \to \infty$. Such a result, known as a semiparametric BvM theorem, says that the above marginal $\alpha_n$-posterior distribution asymptotically converges to a Gaussian distribution, with the precise form of convergence defined via (3). It is perhaps more intuitive to express this distributional approximation as $\mathcal{L}_{\alpha_n}(\psi(\eta) \mid Y) \approx N(\hat\psi, V/(n\alpha_n))$, mirroring the conjugate example (2).
Recall that we assume there is a true $P_0 = P^n_{\eta_0}$ generating the data and that we take the large-sample frequentist limit $n \to \infty$.

2.1 A generic LAN setting

Recall that the log-likelihood is denoted by $\ell_n(\eta) = \log p^n_\eta(Y^n)$ and we write $o_P(1)$ as a shorthand for $o_{P_0}(1) = o_{P_{\eta_0}}(1)$. The following setting formalises a generic semiparametric framework as in Castillo and Rousseau (2015) (see also Castillo, 2012b and Ghosal and van der Vaart, 2017, where similar settings are considered in order to derive BvM theorems). A main difference is in the control of remainder terms, which here depend on $\alpha_n$ (one recovers the conditions of Castillo and Rousseau, 2015 when $\alpha_n = 1$).

Assumption 2.1 Let $(H, \langle \cdot, \cdot \rangle_L)$ be a Hilbert space with associated norm $\|\cdot\|_L$. In the following, $R_n$ and $r$ are remainder terms which are controlled through the last part of the assumption.

LAN expansion. Suppose the log-likelihood around $\eta_0$ can be written, for suitable $\eta$'s to be specified below, as

$$\ell_n(\eta) = \ell_n(\eta_0) - \frac{n}{2}\|\eta - \eta_0\|_L^2 + \sqrt{n}\, W_n(\eta - \eta_0) + R_n(\eta, \eta_0),$$

where $W_n : h \mapsto W_n(h)$ is $P^n_0$-almost surely a linear map and $W_n(h)$ converges weakly to $N(0, \|h\|_L^2)$ as $n \to \infty$.

Functional expansion. Suppose that the functional $\psi$ around $\eta_0$ can be written, for some $\psi_0 \in H$, as

$$\psi(\eta) - \psi(\eta_0) = \langle \psi_0, \eta - \eta_0 \rangle_L + r(\eta, \eta_0).$$

Define, for any fixed $t \in \mathbb{R}$, a path through $\eta$ as

$$\eta_t = \eta - \frac{t\psi_0}{\sqrt{n\alpha_n}}. \qquad (4)$$

Remainder terms control. Suppose that there exists a sequence of measurable sets $A_n$ satisfying $\Pi_{\alpha_n}[A_n \mid Y^n] = 1 + o_P(1)$, such that $\eta - \eta_0 \in H$ for all $\eta \in A_n$ and $n$ sufficiently large, and, for any fixed $t \in \mathbb{R}$,

$$\sup_{\eta \in A_n} \big|t\sqrt{n\alpha_n}\, r(\eta, \eta_0) + \alpha_n\big(R_n(\eta, \eta_0) - R_n(\eta_t, \eta_0)\big)\big| = o_P(1).$$

For $\psi_0$ and $W_n$ as in Assumption 2.1, further define

$$\hat\psi = \psi(\eta_0) + \frac{W_n(\psi_0)}{\sqrt{n}}, \qquad V_0 = \|\psi_0\|_L^2. \qquad (5)$$

The term $V_0$ is the efficiency bound for estimating $\psi(\eta_0)$; an estimator $\tilde\psi = \tilde\psi(Y)$ is said to be linear efficient for estimating $\psi(\eta_0)$ if it can be expanded as $\tilde\psi = \psi(\eta_0) + W_n(\psi_0)/\sqrt{n} + o_P(1/\sqrt{n})$, or equivalently if $\sqrt{n}(\tilde\psi - \hat\psi) = o_P(1)$. For such an estimator, $\sqrt{n}(\tilde\psi - \psi(\eta_0))$ converges in distribution to a $N(0, V_0)$ variable. Note that $\hat\psi$ is itself not an estimator as it depends on unknown quantities. But in all the following limiting results at rate $1/\sqrt{n}$ or $1/\sqrt{n\alpha_n}$, this quantity can be replaced by any linear efficient estimator $\tilde\psi$ since $\tilde\psi = \hat\psi + o_P(1/\sqrt{n})$.

Interpretation of Assumption 2.1. The first condition requires that the log-likelihood expands around $\eta_0$ as the sum of a negative quadratic term, a stochastic term and a remainder term. This type of local asymptotic normality assumption is reminiscent of the classical LAN expansion in parametric models (see e.g. van der Vaart, 1998, Chapter 7); the main difference is that here, in the (more general) nonparametric setting, we require a control of remainder terms on typically larger neighborhoods. While in smooth parametric models the LAN expansion is formulated in a $1/\sqrt{n}$-neighborhood of the truth, $A_n$ in Assumption 2.1 will generally be chosen as a set on which the posterior for $\eta$ concentrates; since the present setting is nonparametric, the diameter of this set is typically a nonparametric convergence rate that is slower than $1/\sqrt{n}$. Finally, Assumption 2.1 involves the functional $\psi(\eta)$ and requires that it can be expanded around the true value $\psi(\eta_0)$ in a way that is compatible with the LAN inner product. These assumptions are verified below for several classes of priors in white noise regression and density estimation for a broad range of $\alpha_n$ values. More generally, we expect Assumption 2.1 to hold in a wide variety of settings.
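As a simple sanity check (a worked example, going no further than what (2) already shows), Assumption 2.1 holds exactly, with vanishing remainders, in the parametric model $Y_1, \dots, Y_n \overset{iid}{\sim} N(\theta, 1)$ of the introduction, taking $H = \mathbb{R}$, $\|h\|_L = |h|$ and $\psi(\theta) = \theta$:

$$\ell_n(\theta) - \ell_n(\theta_0) = -\frac{n}{2}(\theta - \theta_0)^2 + \sqrt{n}\,(\theta - \theta_0)\cdot \sqrt{n}(\bar{Y}_n - \theta_0),$$

so that $W_n(h) = h\,\sqrt{n}(\bar{Y}_n - \theta_0) \sim N(0, h^2)$ exactly and $R_n \equiv 0$; the functional expansion holds with $\psi_0 = 1$ and $r \equiv 0$, and (5) gives $\hat\psi = \theta_0 + W_n(1)/\sqrt{n} = \bar{Y}_n$ and $V_0 = 1$, recovering the centering and variance in (2).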
In the case $\alpha_n = 1$, since their introduction in Castillo and Rousseau (2015), these assumptions have indeed been verified in diffusion models (Nickl and Ray, 2020); inverse problems (Nickl and Söhl, 2019; Nickl, 2020, 2022); survival models (Castillo and van der Pas, 2021); the Cox model (Castillo, 2012b; Ning and Castillo, 2024); and causal inference (Ray and van der Vaart, 2020), amongst others.

2.2 General BvM Theorems

With Assumption 2.1, we can prove a general BvM-type result for the $\alpha_n$-posterior distribution of $\psi(\eta)$. For the statement below, the conditional expectation in the display is $E[G(\eta) \mid A_n] = \int_{A_n} G(\eta)\, dP(\eta)/P(A_n)$, applied here with $P = \Pi_{\alpha_n}[\cdot \mid Y^n]$ the $\alpha_n$-posterior distribution and $G$ the specific exponential function of $\eta$ appearing in the display.

Theorem 2.2 (Semiparametric BvM for the $\alpha_n$-posterior) Let $\Pi = \Pi_n$ be a prior distribution on $\eta$ and suppose that Assumption 2.1 holds with sets $A_n$. Then for any $t \in \mathbb{R}$,

$$E_{\alpha_n}\big(e^{t\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)} \,\big|\, Y^n, A_n\big) = e^{o_P(1) + t^2 V_0/2}\, \frac{\int_{A_n} e^{\alpha_n \ell_n(\eta_t)}\, d\Pi(\eta)}{\int e^{\alpha_n \ell_n(\eta)}\, d\Pi(\eta)},$$

where $E_{\alpha_n}$ denotes expectation with respect to the $\alpha_n$-posterior. Furthermore, if for any $t \in \mathbb{R}$,

$$\frac{\int_{A_n} e^{\alpha_n \ell_n(\eta_t)}\, d\Pi(\eta)}{\int e^{\alpha_n \ell_n(\eta)}\, d\Pi(\eta)} = 1 + o_P(1),$$

then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $V_0$.

The last display of Theorem 2.2 is a change-of-measure type condition. It is satisfied if a small additive perturbation of the prior (replacing $\eta$ by $\eta_t$ or vice-versa) has little effect on computing the integrals in the display. It can often be checked by doing a change of measure in the prior, see e.g. Castillo (2012b) and Castillo and Rousseau (2015).

We now apply this general result to the following two prototypical nonparametric models, which will serve as concrete examples for our main results here and in Section 4.

Model (GWN) (Gaussian white noise) For $f \in L^2[0,1]$, one observes the trajectory $Y^n = (Y^n(t) : t \in [0,1])$ with

$$dY^n(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dB(t), \qquad t \in [0,1],$$

where $B(t)$ is a standard Brownian motion. For $(\varphi_k)_{k \geq 1}$ any orthonormal basis of $L^2[0,1]$, it is statistically equivalent to observe the subprocess $(Y^n_k = \int_0^1 \varphi_k(t)\, dY^n(t) : k \geq 1)$ acting on this basis. In particular, the problem can be rewritten as observing $Y^n = (Y^n_k)_k$ with

$$Y^n_k = f_k + \frac{1}{\sqrt{n}}\,\varepsilon_k, \qquad k \geq 1,$$

where $f_k = \int_0^1 f(t)\varphi_k(t)\,dt$ and $\varepsilon_k \overset{iid}{\sim} N(0,1)$. The Gaussian white noise model is the continuous analogue of nonparametric regression with fixed or uniform random design (Reiß, 2008). It is a standard approach in statistical theory to instead consider this model (Johnstone, 2019), which behaves asymptotically identically to nonparametric regression but simplifies certain technical arguments related to the discretization. Commonly used priors for $f$ are series priors and Gaussian process priors, see below for specific examples.

Model (D) (Density estimation) For $f$ a probability density with respect to Lebesgue measure on the interval $[0,1]$, one observes $Y = Y^n = (Y_1, \dots, Y_n)$ with $Y_1, \dots, Y_n \overset{iid}{\sim} f$. Many different priors have been used for density functions; for example histograms, Pólya trees, mixture models and logistically transformed priors, see the monograph by Ghosal and van der Vaart (2017). Here we will focus on two large classes: random histograms and exponentiated Gaussian processes. Although for clarity of exposition we focus on these prototypical models, our techniques extend to others.
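As a concrete illustration, the sequence-space form of Model (GWN) is straightforward to simulate. The following minimal R sketch (coefficient decays and truncation level are illustrative choices) also computes the natural estimator $\sum_k a_k Y^n_k$ of a linear functional $\psi(f) = \sum_k a_k f_k$, which is linear efficient in this model:

```r
# Model (GWN) in sequence form: Y_k = f_k + eps_k / sqrt(n), k = 1, 2, ...
set.seed(1)
n  <- 10000
K  <- 2000                        # truncation level for the simulation
k  <- 1:K
f0 <- k^(-1.5)                    # true coefficients f_{0,k} (illustrative)
Y  <- f0 + rnorm(K) / sqrt(n)

# Efficient estimate of the linear functional psi(f) = sum_k a_k f_k:
a      <- k^(-1.5)
psihat <- sum(a * Y)              # equals psi(f_0) + W_n(a)/sqrt(n) here
c(estimate = psihat, truth = sum(a * f0))
```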
The results from this section require the form of local asymptotic normality (LAN) described in Assumption 2.1, which is expected in order to derive asymptotic normality results, while the nonparametric results from Section 4 only require a prior mass condition in the minimal case.

Gaussian white noise. In Model (GWN), the likelihood admits a LAN expansion with $\eta = f$, $\|\cdot\|_L = \|\cdot\|_2$ and $R_n \equiv 0$:

$$\ell_n(f) - \ell_n(f_0) = -\frac{n}{2}\|f - f_0\|_2^2 + \sqrt{n}\, W_n(f - f_0),$$

where, for $g = \sum_{k=1}^\infty g_k \varphi_k$, we set $W_n(g) = \sum_{k=1}^\infty g_k \varepsilon_k$. For the functional, we assume that it admits the following expansion:

$$\psi(f) - \psi(f_0) = \langle \psi_0, f - f_0 \rangle_2 + r(f, f_0) \qquad (6)$$

for some $\psi_0 \in L^2([0,1])$. This gives

$$\hat\psi = \psi(f_0) + \frac{W_n(\psi_0)}{\sqrt{n}} = \psi(f_0) + \frac{1}{\sqrt{n}}\sum_{k=1}^\infty \psi_{0,k}\,\varepsilon_k,$$

where $\psi_{0,k} = \int_0^1 \psi_0(t)\varphi_k(t)\,dt$, and $V_0 = \|\psi_0\|_2^2$. Theorem 2.2 immediately implies the following result.

Theorem 2.3 (Semiparametric BvM in Gaussian white noise) Let $\psi : L^2[0,1] \to \mathbb{R}$ be a functional of $f$ satisfying (6). Suppose that $A_n \subset L^2[0,1]$ and the remainder term $r$ in (6) satisfy Assumption 2.1, and that, for $f_t = f - t\psi_0/\sqrt{n\alpha_n}$, it holds that

$$\frac{\int_{A_n} e^{\alpha_n \ell_n(f_t)}\, d\Pi(f)}{\int e^{\alpha_n \ell_n(f)}\, d\Pi(f)} = 1 + o_P(1). \qquad (7)$$

Then, for $\hat\psi = \psi(f_0) + \frac{1}{\sqrt{n}}\sum_{k=1}^\infty \psi_{0,k}\varepsilon_k$, the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $\|\psi_0\|_2^2$.

We emphasise that the form of $\hat\psi$ and the limiting variance come from simply considering the expansion of the log-likelihood and the functional as defined in Assumption 2.1.

Density estimation. For $f$ a density and $g \in L^2[0,1]$, let $F(g) = \int g(t)f(t)\,dt$, and write $F_0(g) = \int g f_0$. For $\eta = \log f$, we have the LAN expansion

$$\ell_n(\eta) - \ell_n(\eta_0) = \sum_{i=1}^n \{\eta(Y_i) - \eta_0(Y_i)\} = -\frac{n}{2}\|\eta - \eta_0\|_L^2 + \sqrt{n}\, W_n(\eta - \eta_0) + R_n(\eta, \eta_0),$$

where, for $g \in L^2(f_0)$, $\|g\|_L^2 = \int (g - F_0(g))^2 f_0$, $W_n(g) = \frac{1}{\sqrt{n}}\sum_{i=1}^n [g(Y_i) - F_0(g)]$, and $R_n(\eta, \eta_0) = \sqrt{n}\, F_0(h) + \frac{1}{2}\|h\|_L^2$ for $h = \sqrt{n}(\eta - \eta_0)$. For the functional expansion, we assume there exists a bounded measurable function $\tilde\psi_{f_0} : [0,1] \to \mathbb{R}$ such that

$$\psi(f) - \psi(f_0) = \int \tilde\psi_{f_0} f + \tilde{r}(f, f_0) \quad \text{and} \quad \int \tilde\psi_{f_0} f_0 = 0. \qquad (8)$$

In this case,

$$\psi(f) - \psi(f_0) = \int (f - f_0)\,\tilde\psi_{f_0} + \tilde{r}(f, f_0) = \langle \eta - \eta_0, \tilde\psi_{f_0} \rangle_L + r(f, f_0),$$

with $r(f, f_0) = B(f, f_0) + \tilde{r}(f, f_0)$ and $B(f, f_0) = \int \tilde\psi_{f_0}\,\big(f - f_0 - f_0(\eta - \eta_0)\big)$. Note that the last steps are required since the functional expansion should hold in terms of the parameter $\eta = \log f$ rather than the density $f$ itself. This gives $\hat\psi = \psi(f_0) + W_n(\tilde\psi_{f_0})/\sqrt{n} = \psi(f_0) + \sum_{i=1}^n \tilde\psi_{f_0}(Y_i)/n$, and limiting variance $\|\tilde\psi_{f_0}\|_L^2 = \int \tilde\psi_{f_0}^2 f_0$. With this in mind, we obtain the following result.

Theorem 2.4 (Semiparametric BvM in density estimation) Let $f \mapsto \psi(f)$ be a functional on probability densities on $[0,1]$ and assume there exists a bounded measurable function $\tilde\psi_{f_0} : [0,1] \to \mathbb{R}$ such that (8) holds. Suppose that for some sequence $\varepsilon_n \to 0$ and sets $A_n \subset \{f : \|f - f_0\|_1 \leq \varepsilon_n\}$, with $r$ as defined following (8),

$$\Pi_{\alpha_n}(A_n \mid Y^n) = 1 + o_P(1), \qquad (9)$$

$$\sup_{f \in A_n} |r(f, f_0)| = o\Big(\frac{1}{\sqrt{n\alpha_n}}\Big). \qquad (10)$$

Denote $f_t = f e^{-t\tilde\psi_{f_0}/\sqrt{n\alpha_n}}/F\big(e^{-t\tilde\psi_{f_0}/\sqrt{n\alpha_n}}\big)$ and, for $A_n$ as above, assume that

$$\frac{\int_{A_n} e^{\alpha_n \ell_n(f_t)}\, d\Pi(f)}{\int e^{\alpha_n \ell_n(f)}\, d\Pi(f)} = 1 + o_P(1). \qquad (11)$$

Then, for $\hat\psi = \psi(f_0) + \frac{1}{n}\sum_{i=1}^n \tilde\psi_{f_0}(Y_i)$, the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $\int \tilde\psi_{f_0}^2 f_0$.

We now proceed to apply these results to concrete priors.

2.3 Random Histogram Priors

We first illustrate our main theorem for density estimation using a class of histogram priors. We will see that the $\alpha_n$-posterior, although leading to an enlarged variance in estimating functionals, can sometimes lead to weaker conditions in terms of regularities.
In particular, although the $\alpha_n$-posterior rate may then be slower, it provides more robustness against possible semiparametric bias that may occur for certain functionals. We provide an example where uncertainty quantification is unreliable for the true posterior, because credible sets will suffer from bias, whereas credible sets from the $\alpha_n$-posterior still cover the true unknown function.

Random histogram prior. For any integer $k$, we define a distribution on $\mathcal{H}^1_k$, the subset of regular histograms with $k$ equally spaced bins which are densities on $[0,1]$. Let $\mathcal{S}^1_k = \{\omega \in [0,1]^k : \sum_{i=1}^k \omega_i = 1\}$ be the unit simplex in $\mathbb{R}^k$. Denote by $\mathcal{D}(\delta_1, \dots, \delta_k)$ the Dirichlet distribution with real positive weights $(\delta_1, \dots, \delta_k)$ on $\mathcal{S}^1_k$ and consider the induced measure $\mathcal{H}(k, \delta_1, \dots, \delta_k)$ on $\mathcal{H}^1_k$ defined as

$$f_\omega(x) = \sum_{j=1}^k k\,\omega_j\, \mathbf{1}_{I_j}(x), \qquad \omega = (\omega_1, \dots, \omega_k) \sim \mathcal{D}(\delta_1, \dots, \delta_k), \qquad (12)$$

where $I_j = [(j-1)/k, j/k]$ for $j = 1, \dots, k$. We now define the random histogram prior $\Pi = \Pi_n$ that we will use throughout this section. Let $K_n$ be a diverging sequence to be chosen below and $(\delta_{1,n}, \dots, \delta_{K_n,n})$ a sequence of positive weights, and set $\Pi = \Pi_n = \mathcal{H}(K_n, \delta_{1,n}, \dots, \delta_{K_n,n})$. We assume the weights satisfy the technical condition

$$\sum_{i=1}^{K_n} \delta_{i,n} = o(\sqrt{n\alpha_n}) \qquad (13)$$

as $n \to \infty$, which ensures that the prior is not too concentrated around its mean.

Linear functionals. Let us apply Theorem 2.4 to the case of linear functionals, i.e. those of the form $\psi(f) = \int \tilde\psi_{f_0} f$. For $k \geq 1$ and $h \in L^2[0,1]$, consider the $L^2$-projection $h_{[k]}$ of $h$ onto the set of histograms with $k$ bins:

$$h_{[k]} = \sum_{j=1}^k \Big(k\int_{I_j} h\Big)\mathbf{1}_{I_j}.$$

Writing $\tilde\psi = \tilde\psi_{f_0}$ for short, define $\hat\psi_{[k]}$ and the sequence $V_k$ from the projection $\tilde\psi_{[k]}$ of $\tilde\psi$ as

$$\hat\psi_{[k]} = \psi(f_0) + \frac{1}{n}\sum_{i=1}^n \tilde\psi_{[k]}(Y_i), \qquad V_k = \int f_0\,\tilde\psi_{[k]}^2 - \Big(\int f_0\,\tilde\psi_{[k]}\Big)^2.$$

Recall that here $\hat\psi = \psi(f_0) + \sum_{i=1}^n \tilde\psi(Y_i)/n$ and $V_0 = \int f_0\,\tilde\psi^2$ (not to be confused with setting $k = 0$ in the last display).

Proposition 2.5 Let $\Pi$ be the random histogram prior (12) with $k = K_n$ and weights satisfying (13). Suppose $f_0$ is bounded and

$$\Pi_{\alpha_n}\big(\|f - f_{0,[K_n]}\|_1 \leq \varepsilon_n \,\big|\, Y^n\big) = 1 + o_P(1) \qquad (14)$$

for a sequence $\varepsilon_n \to 0$. Suppose additionally that

$$\sqrt{n\alpha_n}\,\big(\hat\psi_{[K_n]} - \hat\psi\big) = o_P(1). \qquad (15)$$

Then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $V_0$.

Assumption (15) ensures that, asymptotically, the posterior distribution is centered at an efficient estimator. Arguments in Lemma B.5 give the expansion $\hat\psi_{[K_n]} - \hat\psi = F_0(\tilde\psi_{[K_n]}) + o_P(1/\sqrt{n})$, so that (15) can also be formulated as $\sqrt{n\alpha_n}\, F_0(\tilde\psi_{[K_n]}) = o(1)$. Let us also note that, without assuming (15), the proof of Proposition 2.5 still gives that the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi_{[K_n]})$ converges weakly to a $N(0, V_0)$ variable. The marginal posterior is thus centered at $\hat\psi_{[K_n]}$, whether this is an efficient estimator or not.

To gain a quantitative understanding of the minimal smoothness assumptions required by the BvM, and to understand how these relate to those for the full posterior (Castillo and Rousseau, 2015), we next consider Hölder smoothness scales.

Corollary 2.6 Consider estimating $\psi(f_0) = \int_0^1 a f_0$ with $a \in C^\gamma([0,1])$, $f_0 \in C^\beta([0,1])$ bounded away from zero and $\beta, \gamma \in (0,1]$. Let $\Pi$ be the random histogram prior (12) with weights satisfying (13) and $(n\alpha_n)^{-b} \leq \delta_{i,n} \leq 1$ for some $b > 0$, and with $K_n = o(n\alpha_n/\log(n\alpha_n))$. If

$$\sqrt{n\alpha_n}\, K_n^{-\gamma-\beta} = o(1), \qquad (16)$$

then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $V_0$.
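Sampling from the $\alpha_n$-posterior under the prior (12) is straightforward, since the fractional likelihood is conjugate to the Dirichlet distribution: raising the histogram likelihood $\prod_j (K_n\omega_j)^{N_j}$, with $N_j$ the number of observations falling in bin $I_j$, to the power $\alpha_n$ yields a $\mathcal{D}(\delta_{1,n} + \alpha_n N_1, \dots, \delta_{K_n,n} + \alpha_n N_{K_n})$ posterior on the weights. A minimal R sketch (the data-generating density, weights and functional are illustrative choices) sampling the induced marginal $\alpha_n$-posterior of a linear functional:

```r
# Fractional posterior for the histogram prior (12): Dirichlet conjugacy with
# the bin counts multiplied by alpha.
set.seed(1)
n <- 10000; K <- 50; alpha <- 0.5
y <- rbeta(n, 2, 3)                          # data from a density on [0, 1]
Nj <- tabulate(findInterval(y, seq(0, 1, length.out = K + 1),
                            rightmost.closed = TRUE), nbins = K)
delta <- rep(1, K)                           # prior weights delta_{j,n}

# Draws omega ~ Dirichlet(delta + alpha * Nj), via normalised Gamma variables.
ndraw <- 5000
g <- matrix(rgamma(ndraw * K, shape = rep(delta + alpha * Nj, each = ndraw)),
            nrow = ndraw)
omega <- g / rowSums(g)

# Marginal alpha_n-posterior of psi(f) = int_0^1 a f, here with a(x) = x:
# since f = sum_j K * omega_j 1_{I_j}, int a f = sum_j omega_j * abar_j,
# where abar_j is the average of a over bin I_j (the midpoint for a(x) = x).
abar <- (seq_len(K) - 0.5) / K
psi_draws <- as.vector(omega %*% abar)
quantile(psi_draws, c(0.025, 0.975))         # a 95% credible interval
```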
The assumption that $K_n$ is of smaller order than $n\alpha_n$ ensures that the $\alpha_n$-posterior for $f$ at least concentrates around $f_0$ at a rate going to 0. Condition (16) is sufficient for (15) under the assumed regularity conditions and, as in Proposition 2.5, without assuming (16), the proof of Corollary 2.6 still gives that the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi_{[K_n]})$ converges weakly to a $N(0, V_0)$ variable. Many choices of $\alpha_n, K_n$ fulfill these conditions. Note that the larger $K_n$, the weaker the regularity conditions: e.g., taking $K_n$ slightly smaller than $n\alpha_n$ gives that a BvM-type result, at rate $\sqrt{n\alpha_n}$, holds if the regularities satisfy $\gamma + \beta > 1/2$.

It is interesting to compare the above result with ones for the standard posterior ($\alpha_n = 1$, which was considered in Castillo and Rousseau, 2015, Theorem 4.2, albeit under a special choice of $K_n$ only). Since the conditions on $K_n$ depend on $\alpha_n$, we underline that we compare both under the same prior (i.e. with the same choice of $K_n$).

1. Case $\alpha_n = \alpha \in (0,1)$: consider a sequence $K_n = o(n/\log n)$, weights such that $n^{-b} \leq \delta_{i,n} \leq 1$ and $\sum_{i=1}^{K_n} \delta_{i,n} = o(\sqrt{n})$, and the corresponding random histogram prior. Given this prior, the result obtained for the $\alpha$-posterior is very similar to the one obtained for the full posterior: the larger $K_n$, the smaller the regularities of the representer of the functional $a$ and of $f_0$ may be, and the choice $K_n \approx n$ leads to the condition $\gamma + \beta > 1/2$ also for the $\alpha$-posterior. However, a main difference lies in the fact that the asymptotic variance for the $\alpha$-posterior is then $\int f_0\,\tilde\psi_{f_0}^2/\alpha$, which is larger than the optimal variance $\int f_0\,\tilde\psi_{f_0}^2$ obtained for the posterior.

2. Case $\alpha_n \to 0$: to fix ideas, consider $\alpha_n = n^{-y}$ with $0 < y < 1$. Let us further choose $K_n = n^x$ with $x \in (0, 1-y)$, so that the first condition on $K_n$ holds. As before, one chooses weights such that $n^{-b} \leq \delta_{i,n} \leq 1$ for $b > 0$ and $\sum_{i=1}^{K_n} \delta_{i,n} = o(n^{(1-y)/2})$. Corollary 2.6 with $\alpha_n = 1$ implies that a BvM with optimal variance holds under the condition $\gamma + \beta > \frac{1}{2x}$. On the other hand, applying Corollary 2.6 with $\alpha_n = n^{-y}$ gives the condition $\gamma + \beta > \frac{1-y}{2x}$ for the BvM with rescaling $\sqrt{n\alpha_n}$ to hold. Thus we obtain a slower rate with the $n^{-y}$-posterior, but we have a weaker condition on the regularities of the functions $a$ and $f_0$.

Since the above are only sufficient conditions, we next explicitly construct an example where, for the same prior, the semiparametric BvM holds for the $\alpha_n$-posterior but fails for the standard posterior.

Semiparametric bias and possible lack of BvM. For the linear functional $\psi(f) = \int af$, Castillo and Rousseau (2015) give a specific counterexample in which the BvM theorem is ruled out because of a non-negligible bias appearing in the centering of the posterior distribution of $\int af$. We now investigate the behaviour of the $\alpha_n$-posterior distribution of $\int af$ in the same context. In their counterexample, Castillo and Rousseau (2015) consider a random histogram prior with a random number of bins. Here we adapt their counterexample to our setting of a random histogram prior with a deterministic number of bins and derive a result regarding the $\alpha_n$-posterior. In order to be able to explicitly compute the bias term, we consider a functional with representer of the form

$$a(x) = \sum_{l \geq 0} 2^{-l(\frac{1}{2}+\gamma)}\,\psi_{l0}(x) \qquad (17)$$

for $x \in [0,1]$, $\gamma > 0$ and $(\psi_{lk})$ the Haar wavelet basis.
Proposition 2.7 Let $f_0$ be a continuously differentiable function with derivative $f_0' \geq \rho > 0$ bounded away from zero, and let $a$ be as in (17) with $0 < \gamma \leq 1/2$. Consider the random histogram prior (12) with $K_n = 2^{p_n}$, $p_n = \lfloor \log(n^{1/3})/\log 2 \rfloor$, and $\delta_{i,n} = n^{-b}$ for all $i$ and some $b > 1/6$. Then:

1. The posterior distribution of $\sqrt{n}(\psi(f) - \hat\psi_{[K_n]})$ converges weakly in $P_0$-probability to the $N(0, V_0)$ distribution. Moreover the centering $\hat\psi_{[K_n]}$ satisfies $\hat\psi_{[K_n]} - \hat\psi = F_0(\tilde\psi_{[K_n]}) + o_P(1/\sqrt{n})$ with $|\sqrt{n}\, F_0(\tilde\psi_{[K_n]})| \geq c > 0$, and even $|\sqrt{n}\, F_0(\tilde\psi_{[K_n]})| \to \infty$ if $\gamma < 1/2$. In particular, the posterior distribution is biased and the BvM theorem does not hold.

2. Consider a sequence $\alpha_n = n^{-x}$ with $(1-2\gamma)/3 < x < 2/3$. Then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi)$ converges weakly in $P_0$-probability to the $N(0, V_0)$ distribution.

This provides an example in which a non-negligible bias appears in the centering of the posterior distribution of $\int af$ (rescaled by $\sqrt{n}$), whereas the $\alpha_n$-posterior distribution of $\int af$ (rescaled by $\sqrt{n\alpha_n}$) is not biased. This has consequences for uncertainty quantification: $1-\delta$ quantile credible sets from the posterior have less than $1-\delta$ coverage asymptotically, or even 0 coverage, whereas those for the $\alpha_n$-posterior have coverage greater than $1-\delta$ asymptotically. Uncertainty quantification using the $\alpha_n$-posterior is thus reliable, if conservative, whereas that using the standard posterior is not. Of course, note that the $\alpha_n$-posterior has a spread of order $1/\sqrt{n\alpha_n}$ (instead of the smaller $1/\sqrt{n}$ for the posterior), which makes it easier to verify confidence statements. We refer to Section 3 for more details on the coverage and size of credible sets for the $\alpha_n$-posterior.

Remark 2.8 (Approximately linear functionals) The above results extend to certain well-behaved non-linear functionals, such as the square-root $\int \sqrt{f}$, power $\int f^q$, $q \geq 2$, and entropy $\int f \log f$ functionals. This is proved in Examples 4.2-4.4 of Castillo and Rousseau (2015) by controlling the remainder of the functional expansion in Assumption 2.1, and the extension to the $\alpha_n$-posterior is similar.

2.4 Gaussian Process Priors

In this section, we apply our general semiparametric BvM results to the widely used class of Gaussian process priors. For general definitions and background material on Gaussian processes and their associated reproducing kernel Hilbert spaces (RKHS), the reader is referred to Chapter 11 of Ghosal and van der Vaart (2017) or the monograph by Rasmussen and Williams (2006). We will establish general results for Gaussian priors in density estimation and Gaussian white noise, and then apply these to specific examples commonly used in practice, such as the Matérn and squared exponential covariance kernels.

Let $W = (W(x) : x \in [0,1])$ be a mean-zero Gaussian process with covariance function $K(x,y) = E[W(x)W(y)]$. One can view $W$ as a Borel-measurable map into some Banach space $(\mathbb{B}, \|\cdot\|)$ (e.g. $(C[0,1], \|\cdot\|_\infty)$) with associated RKHS $(\mathbb{H}, \|\cdot\|_{\mathbb{H}})$. It is known that nonparametric estimation properties of Gaussian process priors depend on their sample smoothness, as measured through their small-ball probability (van der Vaart and van Zanten, 2008, 2007, 2011). This can be quantified via the concentration function $\phi_{\eta_0}$ at a point $\eta_0 \in \mathbb{B}$, defined as

$$\phi_{\eta_0}(\varepsilon) = -\log \Pi(\|W\| \leq \varepsilon) + \frac{1}{2}\inf_{h \in \mathbb{H}:\, \|h - \eta_0\| < \varepsilon} \|h\|_{\mathbb{H}}^2, \qquad (18)$$

where $\|\cdot\|$ refers to the norm on $\mathbb{B}$.
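While the small-ball probability in (18) is typically evaluated analytically, it can also be estimated by crude Monte Carlo. The sketch below does so in R for a squared exponential kernel on a grid (grid size, length scale, jitter and the values of $\varepsilon$ are illustrative choices, and the grid maximum only approximates the supremum norm):

```r
# Monte Carlo estimate of the small-ball term -log P(||W||_inf <= eps) in (18).
set.seed(1)
grid <- seq(0, 1, length.out = 100)
len  <- 0.2                                        # length scale
Kmat <- exp(-outer(grid, grid, "-")^2 / len^2)     # squared exponential kernel
L    <- chol(Kmat + 1e-8 * diag(length(grid)))     # jitter for stability
W    <- matrix(rnorm(4000 * length(grid)), 4000) %*% L  # 4000 GP paths

small_ball <- function(eps) -log(mean(apply(abs(W), 1, max) <= eps))
sapply(c(1.5, 1.2, 1), small_ball)                 # grows as eps decreases
```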
For the full posterior and standard statistical models, the contraction rate for Gaussian processes is then connected to the solution $\varepsilon_n$ of the equation $\phi_{\eta_0}(\varepsilon_n) \asymp n\varepsilon_n^2$, see van der Vaart and van Zanten (2008). As the next theorem shows, a similar result holds for the fractional posterior by instead considering the inequality

$$\phi_{\eta_0}(\varepsilon_n) \leq n\alpha_n\,\varepsilon_n^2, \qquad (19)$$

i.e. using the effective sample size $\bar{n} = n\alpha_n$ on the right-hand side; see Section 4 below for details.

Theorem 2.9 (Gaussian white noise) Consider the Gaussian white noise model and assign to $f$ a mean-zero Gaussian prior $\Pi$ in $L^2[0,1]$ with associated RKHS $\mathbb{H}$. Suppose that $\varepsilon_n \to 0$ satisfies (19) with $\eta_0 = f_0 \in L^2[0,1]$, and that Assumption 2.1 holds for $\psi(f) = \psi(f_0) + \langle \psi_0, f - f_0 \rangle_2 + r(f, f_0)$ and $A_n \subset \{f : \|f - f_0\|_2 \leq \varepsilon_n\}$. Further assume that there exist sequences $\psi_n \in \mathbb{H}$ and $\zeta_n \to 0$ such that

$$\|\psi_n - \psi_0\|_2 \leq \zeta_n, \qquad \|\psi_n\|_{\mathbb{H}} \leq \sqrt{n\alpha_n}\,\zeta_n, \qquad \sqrt{n\alpha_n}\,\varepsilon_n \zeta_n \to 0. \qquad (20)$$

Then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(f) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $\|\psi_0\|_2^2$.

The sequence $\psi_n$ allows one to approximate the Riesz representer $\psi_0$ of the functional by elements of the RKHS $\mathbb{H}$. This is helpful since, for elements of the RKHS, one can directly deal with the change-of-measure condition (7) using the Cameron-Martin theorem. Note that if $\psi_0 \in \mathbb{H}$, one may immediately take $\psi_n = \psi_0$ and $\zeta_n = 0$.

The main message from Theorem 2.9 is that for semiparametric inference, the fractional posterior mirrors the main heuristic properties of parametric models (e.g. Equation 2). In particular, all conditions are driven by the usual conditions for semiparametric BvMs for Gaussian priors, but with effective sample size $\bar{n} = n\alpha_n$ reflecting the downweighting of the data (see Section 4 for specific discussion regarding contraction rates). The resulting marginal posterior for the functional $\psi(f)$ is again centered at an efficient estimator $\hat\psi$, but has variance inflated by a $1/\alpha_n$-factor.

Turning now to density estimation, we use the standard approach of using the exponential link function (Ghosal and van der Vaart, 2017, Section 2.3.1) to ensure the Gaussian process induces a prior on the set of probability densities:

$$f(x) = f_W(x) = \frac{e^{W(x)}}{\int_0^1 e^{W(y)}\,dy}. \qquad (21)$$

Theorem 2.10 (Density estimation) Consider density estimation on $[0,1]$ and suppose $f_0 \in C[0,1]$ is bounded away from zero. Let $W$ be a mean-zero Gaussian process in $(C[0,1], \|\cdot\|_\infty)$ with RKHS $\mathbb{H}$, and consider the induced prior on densities $f$ via (21). Suppose that $\varepsilon_n \to 0$ satisfies (19) with $\eta_0 = \log f_0$. Let $\psi(f)$ be a functional with expansion (8) having continuous representer $\tilde\psi_{f_0}$ satisfying

$$\sup_{f \in A_n} |r(f, f_0)| = o_P\Big(\frac{1}{\sqrt{n\alpha_n}}\Big)$$

for some $A_n \subset \{f : \|f - f_0\|_1 \leq \varepsilon_n\}$ with $\Pi_{\alpha_n}(A_n \mid Y^n) = 1 + o_P(1)$. Further assume that there exist sequences $\psi_n \in \mathbb{H}$ and $\zeta_n \to 0$ such that

$$\|\psi_n - \tilde\psi_{f_0}\|_\infty \leq \zeta_n, \qquad \|\psi_n\|_{\mathbb{H}} \leq \sqrt{n\alpha_n}\,\zeta_n, \qquad \sqrt{n\alpha_n}\,\varepsilon_n \zeta_n \to 0.$$

Then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with mean 0 and variance $\|\tilde\psi_{f_0}\|_L^2 = \int_0^1 \tilde\psi_{f_0}^2 f_0$.

The implications of Theorem 2.10 are similar to those of Theorem 2.9. The use of the slightly stronger $\|\cdot\|_\infty$-norm compared to the $\|\cdot\|_2$-norm in Theorem 2.9 is required to deal with the nonlinear link function (21) and has little effect on our main results. We consider the following specific examples of Gaussian priors.

Example 1 (Infinite series) Let $(\varphi_k)_{k \geq 1}$ be an orthonormal basis of $L^2[0,1]$. For $\gamma > 0$, consider the random function

$$W(x) = \sum_{k=1}^\infty k^{-\gamma - 1/2}\, Z_k\, \varphi_k(x), \qquad Z_k \overset{iid}{\sim} N(0,1). \qquad (22)$$
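A draw from (22) is easy to simulate after truncation. A short R sketch, using the cosine basis of $L^2[0,1]$ and a truncation level as illustrative choices:

```r
# One draw from the (truncated) series prior (22) with regularity gamma.
set.seed(1)
gamma <- 1; K <- 500
x   <- seq(0, 1, length.out = 400)
phi <- sapply(1:K, function(k) sqrt(2) * cos(pi * k * x))  # orthonormal basis
W   <- phi %*% ((1:K)^(-gamma - 1/2) * rnorm(K))
plot(x, W, type = "l", main = "Series prior draw (gamma = 1)")
```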
Define the Sobolev scales in terms of the $(\varphi_k)$ basis:

$$H^\beta = \Big\{ f \in L^2[0,1] : \sum_{k=1}^\infty k^{2\beta} |\langle f, \varphi_k \rangle_2|^2 \leq R^2 \Big\}. \qquad (23)$$

If $(\varphi_k)$ is the Fourier basis, then $H^\beta$ coincides with the usual notion of Sobolev smoothness of periodic functions on $(0,1]$. The infinite series prior (22) models an almost $\gamma$-smooth function in the sense that it assigns probability one to $H^s$ for any $s < \gamma$.

Example 2 (Matérn) The Matérn process on $\mathbb{R}$ with parameter $\gamma > 0$ is the mean-zero stationary Gaussian process with covariance kernel (Example 11.8 in Ghosal and van der Vaart, 2017)

$$K(s,t) = K(s-t) = \int_{\mathbb{R}} e^{-i(s-t)\lambda}(1 + |\lambda|^2)^{-\gamma - 1/2}\,d\lambda.$$

The covariance function can alternatively be represented in terms of special functions, see e.g. p. 84 of Rasmussen and Williams (2006).

Example 3 (Squared exponential) The rescaled squared exponential process on $\mathbb{R}$ with parameter $\gamma > 0$ is the mean-zero stationary Gaussian process with covariance kernel

$$K(s,t) = K(s-t) = \exp\Big(-\frac{1}{k_n^2}(s-t)^2\Big),$$

where $k_n = \big(\frac{n\alpha_n}{\log^2(n\alpha_n)}\big)^{-\frac{1}{1+2\gamma}}$ is the length scale. The Matérn and squared exponential are two of the most widely used covariance kernels in statistics and machine learning (Rasmussen and Williams, 2006). The sample paths of the squared exponential process are analytic, and so are typically too smooth to effectively model a function of finite smoothness, in the sense that they yield suboptimal contraction rates. Rescaling the covariance kernel using the decaying length scale $k_n$ as in Example 3 allows one to overcome this and model a $\gamma$-smooth function (van der Vaart and van Zanten, 2007).

Example 4 (Riemann-Liouville) The Riemann-Liouville process released at zero of regularity $\gamma > 0$ is defined as

$$X(x) = \sum_{k=0}^{\lfloor\gamma\rfloor+1} Z_k x^k + \int_0^x (x-s)^{\gamma - 1/2}\,dB_s, \qquad (24)$$

where $Z_k \overset{iid}{\sim} N(0,1)$ and $B$ is an independent Brownian motion. For $\gamma = 1/2$, the Riemann-Liouville process reduces to Brownian motion released at zero.

Each of the above Gaussian processes is suitable for modelling a $\gamma$-smooth function in a suitable sense, which can differ between the processes. For simplicity, we state the following result for linear functionals, but it can be extended to certain non-linear functionals following Remark 2.8.

Corollary 2.11 Let $W$ be a mean-zero Gaussian process. In Gaussian white noise, take as prior $f = W$ and set $\eta_0 = f_0$, while in density estimation take the prior on densities $f$ induced by (21) and set $\eta_0 = \log f_0$. Let $\psi(f) = \int_0^1 fa$ be a linear functional and consider the two cases:

(i) $W$ is an infinite Gaussian series (Example 1) with parameter $\gamma$, $\eta_0 \in H^\beta$ and $a \in H^\mu$, where $H^s$ is defined in (23);

(ii) $W$ is a Matérn, rescaled squared exponential or Riemann-Liouville process (Examples 2-4) with parameter $\gamma$, $\eta_0 \in C^\beta$ and $a \in C^\mu$.

If $\gamma \wedge \beta > \frac{1}{2} + (\gamma - \mu) \vee 0$, then the $\alpha_n$-posterior distribution of $\sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)$ converges weakly in $P_0$-probability to a Gaussian distribution with:
For the infinite series prior (Example 1) in Gaussian white noise, one can also directly derive the last conclusion using the explicit form of the posterior coming from conjugacy. In particular, this allows one to consider low regularity functionals where n-estimation is not possible, which falls outside the usual Bv M setting. The following extends the computations of Theorem 5.1 of Knapik et al. (2011) to the αn-posterior. Lemma 2.12 Consider Gaussian white noise, let f have the infinite series prior (Example 1) of regularity γ > 0 and consider the linear function ψ(f) = R 1 0 af. If f0 Hβ, a Hµ, 0 < αn 1 and µ β, then Ef0Παn(f : |ψ(f) ψ(f0)| Mn max{(nαn) β ( 1 2 +γ)+µ 1+2γ , (nαn) 1/2}|Y ) 0, for every sequence Mn as n . Thus in the low regularity regime, the αn-posterior may inflate the posterior variance of ψ(f) by a factor slower than 1/αn. This corresponds to a more nonparametric regime and the conclusions here are similar to those obtained for contraction rates for the full parameter, see Section 4 for more discussion. Empirical verification of the Bv M for the rescaled squared exponential process. Consider density estimation with n = 10, 000 observations drawn from the density on [0, 1] given by f eg, with g having coefficients gk = k 1 2 β in the Fourier basis of [0, 1]. Consider estimating the linear functional given by ψ(f) = R 1 0 a(t)f(t)dt with a defined by coefficients ak = k 1 2 µ in the same basis. The estimator is ˆψ = 1 n Pn i=1 a(Yi), the efficient influence function is ψf0(t) = a(t) ψ(f0), and the information bound is ψf0 2 L = R 1 0 a(t)2f0(t)dt ψ(f0)2. We take as prior the exponentiated Gaussian process prior (21) with W a rescaled squared exponential process (Example 3) with length scale kn = n 1 1+2γ . Figure 1 displays histograms of αn posterior draws of nαn(ψ(f) ˆψ)/ ψf0 L with αn = 1/4 for combinations of β, γ; the blue distributions represent cases for which the condition γ β > 1 2 + (γ µ) in Corollary 2.11 is satisfied, while the red distributions represent cases when the condition is violated. Posterior draws were generated by MCMC using the sbde package (Tokdar et al., 2022). One can see that when the condition is verified, the marginal posterior appears to be Gaussian with the correct variance, but when the condition is violated this does not seem to be the case. This illustrates that the asymptotic results and conditions are applicable in finite sample sizes. Semiparametric Inference Using Fractional Posteriors 6 3 0 3 6 nαn(ψ(f) ψ ) Rescaled samples from the Fractional Posterior (alpha = 1/4) 6 3 0 3 6 nαn(ψ(f) ψ ) Figure 1: Draws from the fractional posterior distribution of nαn(ψ(η) ψ(f0))/ ψf0 L with αn = 1/4 for different combinations of β and γ. In all cases µ = 1, and on the left γ = 1 for different values of β, while on the right β = 1 for different values of γ. The red distributions correspond to cases where γ β < 1 2 + (γ µ) (the condition in Corollary 2.11 is violated), while the blue distributions correspond to cases where γ β > 1 2 + (γ µ) (the condition is verified). The black line is the density of a N(0, 1) random variable. 3. Construction of Efficient Confidence Intervals from αn Posteriors In Section 2, we derived semiparametric Bv M theorems for fractional posteriors. When α = 1, it is well known that the Bv M theorem implies that certain credible sets (typically built from posterior quantiles) are optimal sized confidence sets. 
3. Construction of Efficient Confidence Intervals from $\alpha_n$-Posteriors

In Section 2, we derived semiparametric BvM theorems for fractional posteriors. When $\alpha = 1$, it is well known that the BvM theorem implies that certain credible sets (typically built from posterior quantiles) are optimal-sized confidence sets. For $0 < \alpha_n < 1$, this is no longer true for $\alpha_n$-posteriors, in that the length of the resulting credible sets will overshoot the optimal length given by the semiparametric efficiency bound. We now investigate how this can be remedied. For simplicity, we focus on the case where $\psi(\eta)$ is one-dimensional.

Suppose one has obtained a BvM theorem for $\psi(\eta)$, for instance using the results from Section 2, that is,

$$\Pi_{\alpha_n}[\cdot \mid Y^n] \circ \tau_n^{-1} \rightsquigarrow N(0,V), \qquad (25)$$

where $\tau_n : \eta \mapsto \sqrt{n\alpha_n}(\psi(\eta) - \hat\psi)$, the centering $\hat\psi$ is linear efficient and $V$ is the efficiency bound for estimating $\psi(\eta)$. In particular,

$$\sqrt{n}(\hat\psi - \psi(\eta_0)) \to_{\mathcal{L}} N(0,V). \qquad (26)$$

For $0 < \delta < 1$, let $a^Y_{n,\delta}$ denote the $\delta$-quantile of the $\alpha_n$-posterior distribution of $\psi(\eta)$ and consider the quantile region

$$I_{\alpha_n} = I(\delta, \alpha_n, Y) := \big(a^Y_{n,\frac{\delta}{2}},\ a^Y_{n,1-\frac{\delta}{2}}\big].$$

By definition, $\Pi_{\alpha_n}[\psi(\eta) \in I_{\alpha_n} \mid Y] = 1 - \delta$; that is, $I_{\alpha_n}$ is a $(1-\delta)$-credible set (assuming the $\alpha_n$-posterior CDF is continuous, otherwise one takes generalised quantiles). For $\delta \in (0,1)$, denote by $q_\delta$ the $\delta$-quantile of the $N(0,1)$ distribution. From (25) and standard results recalled in Lemma B.6, one deduces that $I_{\alpha_n}$ admits the following expansion:

$$I_{\alpha_n} = \Big( \hat\psi + \frac{q_{\delta/2}\sqrt{V}}{\sqrt{n\alpha_n}} + o_P\Big(\frac{1}{\sqrt{n\alpha_n}}\Big),\ \hat\psi + \frac{q_{1-\delta/2}\sqrt{V}}{\sqrt{n\alpha_n}} + o_P\Big(\frac{1}{\sqrt{n\alpha_n}}\Big) \Big]. \qquad (27)$$

When $\alpha_n = 1$ or $\alpha_n \to 1$, it follows from (27) and the fact that $\hat\psi$ is linear efficient that $I_{\alpha_n}$ is asymptotically an efficient confidence interval of level $1-\delta$ for the parameter $\psi(\eta_0)$. When $\alpha_n \to \alpha \in [0,1)$, $I_{\alpha_n}$ has a diameter blown up by a factor $1/\sqrt{\alpha_n}$ compared to $I_1$ for $\alpha_n = 1$, and its confidence level thus exceeds $1-\delta$. Denoting by $\Phi$ the cumulative distribution function of the $N(0,1)$ distribution, it follows from (27) that:

1. if $\alpha_n \to \alpha \in (0,1)$, then $P_0[\psi(\eta_0) \in I_{\alpha_n}] \to 2\Phi(q_{1-\delta/2}/\sqrt{\alpha}) - 1 > 1 - \delta$;

2. if $\alpha_n \to 0$, then $P_0[\psi(\eta_0) \in I_{\alpha_n}] \to 1$.

An implication is that while $I_{\alpha_n}$ is a valid confidence set, it is conservative, in that its coverage is larger than the target $1-\delta$. In order to construct an efficient confidence interval from the $\alpha_n$-posterior of $\psi(\eta)$ when $\alpha_n \to \alpha \in [0,1)$, we consider a modified quantile region. Let $\tilde\psi$ be an estimator of $\psi(\eta_0)$ built from the $\alpha_n$-posterior distribution of $\psi(\eta)$ (e.g. the posterior median or mean) and set

$$J_{\alpha_n} := \big(\sqrt{\alpha_n}\,(a^Y_{n,\frac{\delta}{2}} - \tilde\psi) + \tilde\psi,\ \sqrt{\alpha_n}\,(a^Y_{n,1-\frac{\delta}{2}} - \tilde\psi) + \tilde\psi\,\big]. \qquad (28)$$

We call this a shift-and-rescale version of the quantile set (or sometimes the corrected set): this new interval is obtained by recentering $I_{\alpha_n}$ at $\tilde\psi$ and applying a shrinking factor $\sqrt{\alpha_n}$. We now provide a condition under which the shift-and-rescale set presented in (28) has the correct coverage.

Theorem 3.1 Suppose (25)-(26) hold for some $0 < \alpha_n < 1$, and suppose the estimator $\tilde\psi$ satisfies

$$\tilde\psi = \hat\psi + o_P(1/\sqrt{n}). \qquad (29)$$

Then $J_{\alpha_n}$ in (28) is an asymptotically efficient confidence interval of level $1-\delta$ for the parameter $\psi(\eta_0)$, i.e. $P_0[\psi(\eta_0) \in J_{\alpha_n}] \to 1 - \delta$ as $n \to \infty$. If $\alpha_n = \alpha \in (0,1]$ is fixed and $\tilde\psi$ is the $\alpha$-posterior median, then (29) holds. In particular, the region (28) is then an asymptotically efficient confidence interval of level $1-\delta$ for $\psi(\eta_0)$.

Theorem 3.1 states that if the recentering is close enough to the efficient estimator $\hat\psi$, then the shift-and-rescale modification leads to a confidence set of optimal size (in terms of efficiency) from an information-theoretic perspective, and this is always possible for fixed $\alpha$ if one centers at the posterior median.
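In practice, (28) is computed directly from posterior draws. A minimal R sketch, assuming psi_draws holds values of $\psi(\eta)$ sampled from the $\alpha_n$-posterior and recentering at the posterior median, as in Theorem 3.1:

```r
# Quantile credible set I_alpha and its shift-and-rescale correction (28).
shift_rescale <- function(psi_draws, alpha, delta = 0.05) {
  q  <- quantile(psi_draws, c(delta / 2, 1 - delta / 2))  # endpoints of I_alpha
  md <- median(psi_draws)                                 # recentering psi-tilde
  c(lower = sqrt(alpha) * (q[[1]] - md) + md,
    upper = sqrt(alpha) * (q[[2]] - md) + md)             # endpoints of J_alpha
}
```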
When $\alpha_n$ can possibly go to zero, the situation is more delicate. Indeed, although by definition (25) is centered around an efficient estimator at the scale $1/\sqrt{n\alpha_n}$, it is not clear in general how to deduce from this a similar result at the smaller scale $1/\sqrt{n}$. We do not provide a general answer here, but to gain some insight we consider two specific examples, the conjugate parametric setting (2) and the nonparametric Gaussian white noise model with a conjugate prior, and investigate whether the $\alpha_n$-posterior median $a^Y_{n,\frac{1}{2}}$ satisfies (29) when $\alpha_n \to 0$.

Theorem 3.1 applies to semiparametric models, but also to parametric models as a special case. In particular, in the conjugate example (2), it is easy to check that (29) holds if and only if $\sqrt{n}\,\alpha_n \to \infty$, which is a fairly mild condition. We now turn to a more complex setting.

Modified credible sets in Gaussian white noise. Consider Model (GWN) and write $f_0(t) = \sum_{k=1}^\infty f_{0,k}\varphi_k(t)$ for $(\varphi_k)_{k>0}$ an orthonormal basis of $L^2[0,1]$. We assign a prior to $f$ by placing independent priors on the basis coefficients, $f_k = \langle f, \varphi_k \rangle \sim N(0, \lambda_k)$, and consider the problem of estimating the linear functional $\psi(f) = \int_0^1 a(t)f(t)\,dt = \sum_{k=1}^\infty a_k f_k$. By conjugacy arguments, the $\alpha_n$-posterior distribution of $\psi(f) \mid Y^{(n)}$ is Gaussian (so its median and mean coincide), namely $N(a^Y_{n,1/2}, \sigma^2)$ with

$$a^Y_{n,1/2} = \sum_{k=1}^\infty \frac{n\alpha_n\lambda_k}{1 + n\alpha_n\lambda_k}\,a_k Y_k, \qquad \sigma^2 = \sum_{k=1}^\infty \frac{\lambda_k}{1 + n\alpha_n\lambda_k}\,a_k^2.$$

Suppose the smoothness of the true function $f_0$, the representer $a$ and the prior are specified through the magnitude of their basis coefficients as follows, for $\beta, \mu, \gamma > 0$:

$$f_{0,k} = k^{-\frac{1}{2}-\beta}, \qquad a_k = k^{-\frac{1}{2}-\mu}, \qquad \lambda_k = k^{-1-2\gamma}. \qquad (30)$$

Setting $\tilde\psi = a^Y_{n,1/2}$, the posterior mean/median, the shift-and-rescale set is, with $z_\delta$ the standard Gaussian quantiles,

$$J_{\alpha_n} = \big( \tilde\psi + \sqrt{\alpha_n}\, z_{\delta/2}\,\sigma,\ \tilde\psi + \sqrt{\alpha_n}\, z_{1-\delta/2}\,\sigma \big].$$

By Theorem 3.1, for the set $J_{\alpha_n}$ to have asymptotic coverage $1-\delta$ it suffices that $\tilde\psi - \hat\psi = o_P(1/\sqrt{n})$. The following result describes the behaviour of the shift-and-rescale sets.

Proposition 3.2 Consider the Gaussian white noise model with Gaussian prior $f = \sum_{k=1}^\infty f_k\varphi_k$, where $f_k \overset{ind}{\sim} N(0, \lambda_k)$, and suppose that (30) holds. Let $J_{\alpha_n}$ denote the set (28) with $\tilde\psi$ equal to the posterior mean/median. Then:

1. If $\beta + \mu > 1 + 2\gamma$, the sets $J_{\alpha_n}$ are efficient confidence intervals of level $1-\delta$ if and only if $\sqrt{n}\,\alpha_n \to \infty$.

2. If $\beta + \mu = 1 + 2\gamma$, then the $J_{\alpha_n}$ are efficient confidence intervals of level $1-\delta$ if and only if $\frac{\sqrt{n}}{\log(n)}\,\alpha_n \to \infty$.

3. If $\frac{1}{2} + \gamma < \beta + \mu < 1 + 2\gamma$, then the sets $J_{\alpha_n}$ are efficient confidence intervals of level $1-\delta$ if and only if $n^{1 - \frac{1+2\gamma}{2(\beta+\mu)}}\,\alpha_n \to \infty$.

This result assumes $\gamma + 1/2 < \beta + \mu$, which corresponds to the case where a Bernstein–von Mises result for the standard posterior ($\alpha_n \equiv 1$) holds, see Theorem 5.4 in Knapik et al. (2011), cases (ii) and (iii). In agreement with these results, we see by setting $\alpha_n = 1$ in Proposition 3.2 that in all three cases the standard credible sets $J_1$ are efficient confidence sets. The point of Proposition 3.2 is to investigate to what extent shift-and-rescale sets $J_{\alpha_n}$ centered at the posterior median remain efficient confidence sets when $\alpha_n$ goes to 0. In Cases 1 and 2, the condition is very mild and any sequence $(\alpha_n)$ essentially slower than $1/\sqrt{n}$ works (recall, as noted above, that in the basic parametric example (2) the shift-and-rescale sets are efficient under the same condition $\sqrt{n}\,\alpha_n \to \infty$). When $\beta + \mu$ approaches $1/2 + \gamma$ (Case 3), $\alpha_n$ is only allowed to decrease quite slowly to 0 to preserve efficiency. An interpretation is that the problem becomes more nonparametric and the $\alpha_n$-posterior median does not necessarily concentrate fast enough in order for (29) to be satisfied.
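All quantities above are explicit, so the sets can be computed exactly. A minimal R sketch of the construction under (30); the truncation level and the choice of $\alpha_n$ (here breaching the Case 1 condition) are illustrative:

```r
# Conjugate computation behind Proposition 3.2: marginal alpha_n-posterior of
# psi(f) = sum_k a_k f_k in the sequence model, and the corrected set J_alpha.
set.seed(1)
n <- 10000
alpha <- 1 / sqrt(n * log(n))       # sqrt(n) * alpha = 1/sqrt(log n) -> 0
beta <- 2; mu <- 2; gamma <- 0.5
K  <- 5000; k <- 1:K                # truncation of the series
f0 <- k^(-1/2 - beta); a <- k^(-1/2 - mu); lam <- k^(-1 - 2 * gamma)
Y  <- f0 + rnorm(K) / sqrt(n)

w        <- n * alpha * lam / (1 + n * alpha * lam)  # shrinkage weights
post_med <- sum(w * a * Y)                           # Gaussian: median = mean
post_sd  <- sqrt(sum(a^2 * lam / (1 + n * alpha * lam)))
J <- post_med + sqrt(alpha) * qnorm(c(0.025, 0.975)) * post_sd
J; sum(a * f0)                      # compare J_alpha with psi(f_0)
```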
Simulation study. We now illustrate the applicability of the asymptotic result of Proposition 3.2 in a finite-sample setting. We simulated 10,000 realisations of $Y^n$ from the Gaussian white noise model ($n = 10{,}000$) for three different parameter combinations of $(\beta, \mu, \gamma)$, corresponding to the three cases of Proposition 3.2. For each realisation, we produced credible sets from the full posterior, the $\alpha_n$-posterior, and the shift-and-rescale sets from the $\alpha_n$-posterior, and computed their empirical coverage (the proportion of the sets containing the true parameter $\psi(f_0)$), their length, and the mean bias of their centering. The results are presented in Table 1. For the $\alpha_n$-posterior and the corrected credible sets, we study two regimes in each case: one where $\alpha_n$ breaches the condition described in Proposition 3.2 by a $\sqrt{\log n}$ factor, and one where $\alpha_n$ verifies the condition by a $\sqrt{\log n}$ factor. This results in a large difference in the empirical coverage of the shift-and-rescale sets: when the lower bound is breached, the corrected sets have little or no coverage, but when the lower bound is respected they have approximately the target coverage. In this example, the conditions provided by Proposition 3.2 thus seem to be accurate (note that, due to the moderate sample size $n = 10{,}000$, the $\sqrt{\log n}$ factor is still not completely negligible compared to the polynomial factor specified by Proposition 3.2, which explains why the empirical behaviours clearly feature either coverage or non-coverage).

We first comment on the fact that the lengths of the corrected sets in each of the cases roughly match the lengths of the credible sets from the full posterior, but that the bias of the centering of the corrected sets is always larger than the bias of the centering of the full posterior (even in the regimes where $\alpha_n$ does not breach the lower bound). It is easy to see why this is the case in this particular model: the bias equals $-\sum_{k=1}^\infty a_k f_{0,k}/(1 + n\alpha_n\lambda_k)$, which is obviously larger in magnitude for smaller $\alpha_n$. The fact that the corrected sets have a larger bias but the same length as those from the full posterior results in a strictly lower coverage, which can be seen in the empirical results. Secondly, we observe that the lengths of the shift-and-rescale sets are roughly the same for different choices of $\alpha_n$, so it is purely the bias of the centering which affects the coverage for $\alpha_n$ breaching the lower bound versus $\alpha_n$ respecting it. This makes sense on inspection of the assumptions of Theorem 3.1, which relies on the posterior mean being within a factor $o_P(1/\sqrt{n})$ of the efficient centering; when $\alpha_n$ breaches the lower bound implied by Proposition 3.2, the bias is orders of magnitude larger than when $\alpha_n$ respects it.

Gaussian White Noise

Case 1: β + µ > 1 + 2γ (β = 2, µ = 2, γ = 0.5)
                                                               Cov.   Len.   Bias (SD)
  Full posterior                                               0.95   0.02   -0.00008 (0.005)
  αn-posterior      (√n αn = 1/√log n → 0)                     1.00   0.86   -0.05331 (0.004)
  αn-posterior      (√n αn = √log n → ∞)                       1.00   0.08   -0.00059 (0.005)
  Shift-and-rescale (√n αn = 1/√log n → 0)                     0.00   0.02   -0.05331 (0.004)
  Shift-and-rescale (√n αn = √log n → ∞)                       0.95   0.02   -0.00059 (0.005)

Case 2: β + µ = 1 + 2γ (β = 1, µ = 1, γ = 0.5)
  Full posterior                                               0.95   0.02   -0.00006 (0.005)
  αn-posterior      ((√n/log n) αn = 1/√log n → 0)             1.00   0.29   -0.01462 (0.005)
  αn-posterior      ((√n/log n) αn = √log n → ∞)               0.99   0.03   -0.00021 (0.005)
  Shift-and-rescale ((√n/log n) αn = 1/√log n → 0)             0.15   0.02   -0.01462 (0.005)
  Shift-and-rescale ((√n/log n) αn = √log n → ∞)               0.95   0.02   -0.00021 (0.005)

Case 3: 1/2 + γ < β + µ < 1 + 2γ (β = 0.75, µ = 0.75, γ = 0.5)
  Full posterior                                               0.95   0.02   -0.00061 (0.005)
  αn-posterior      (n^{1-(1+2γ)/(2(β+µ))} αn = 1/√log n → 0)  1.00   0.40   -0.04759 (0.005)
  αn-posterior      (n^{1-(1+2γ)/(2(β+µ))} αn = √log n → ∞)    1.00   0.04   -0.00145 (0.005)
  Shift-and-rescale (n^{1-(1+2γ)/(2(β+µ))} αn = 1/√log n → 0)  0.00   0.02   -0.04759 (0.005)
  Shift-and-rescale (n^{1-(1+2γ)/(2(β+µ))} αn = √log n → ∞)    0.94   0.02   -0.00145 (0.005)

Density Estimation

Case 4: 1/2 + γ < β + µ < 1 + 2γ (β = 1, µ = 1, γ = 1)
  Full posterior                                               0.95   0.01   -0.00089 (0.004)
  αn-posterior      (n^{1-(1+2γ)/(2(β+µ))} αn = 1/√log n → 0)  0.93   0.06   -0.01267 (0.005)
  αn-posterior      (n^{1-(1+2γ)/(2(β+µ))} αn = √log n → ∞)    0.94   0.02   -0.00233 (0.005)
  Shift-and-rescale (n^{1-(1+2γ)/(2(β+µ))} αn = 1/√log n → 0)  0.31   0.01   -0.01267 (0.005)
  Shift-and-rescale (n^{1-(1+2γ)/(2(β+µ))} αn = √log n → ∞)    0.92   0.01   -0.00233 (0.005)

Table 1: Empirical coverage (Cov.), length (Len.) and centering bias (standard deviation in parentheses) of the credible sets obtained in the three Gaussian white noise cases of Proposition 3.2. Case 4 represents a similar experiment in density estimation.

Finally, note that the credible sets from the $\alpha_n$-posterior always have coverage close to 1, but at the price of being considerably larger than those from the full posterior or the corrected credible sets.

Density estimation. We empirically illustrate the behaviour of shift-and-rescale sets in density estimation, where exact computations are not possible. We use the same prior, true density and linear functional as in the empirical study in Section 2.4, with $\beta = \gamma = \mu = 1$. We take $n = 10{,}000$ observations and again generate posterior samples by MCMC using the sbde R-package (Tokdar et al., 2022), with $\alpha_n = n^{-1/4}/\sqrt{\log n}$, $n^{-1/4}\sqrt{\log n}$ and $1$. We consider the (empirical) 95% credible intervals and the corresponding shift-and-rescale credible intervals. Figure 2 shows the roughly Gaussian shape of each of the posterior distributions; the comparatively large credible intervals from the $\alpha_n$-posterior (dashed vertical lines); and the fact that the shift-and-rescale intervals (solid vertical lines) and the credible interval from the full posterior have approximately the same length, which shows the correction also appears to work well in this more complex setting. For estimates of the coverage of these shift-and-rescale credible sets, see Case 4 in Table 1. The condition $n^{1-\frac{1+2\gamma}{2(\beta+\mu)}}\,\alpha_n \to \infty$ derived for Gaussian white noise in Proposition 3.2 seems to be a good guide in this setting as well, with the shift-and-rescale credible sets achieving very small coverage when this condition is breached, but approximately the right coverage when it is verified.
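For concreteness, here is a compressed reimplementation (ours, not the authors' simulation code; constants are illustrative and the numbers need not reproduce Table 1 exactly) of one cell of the experiment: the empirical coverage of $J_{\alpha_n}$ in Case 1 when $\alpha_n$ verifies the condition by a $\sqrt{\log n}$ factor.

```python
import numpy as np
from statistics import NormalDist

def sr_coverage(n, alpha_n, beta, mu, gamma, reps=2_000, kmax=2_000, delta=0.05):
    """Monte Carlo coverage of the shift-and-rescale set J_an in the
    conjugate Gaussian white noise example, under the choices (30)."""
    rng = np.random.default_rng(1)
    k = np.arange(1, kmax + 1, dtype=float)
    f0 = k ** (-0.5 - beta)
    a = k ** (-0.5 - mu)
    lam = k ** (-1.0 - 2.0 * gamma)
    psi0 = float(np.sum(a * f0))                        # true value psi(f0)
    w = n * alpha_n * lam / (1.0 + n * alpha_n * lam)   # coordinate shrinkage
    sd = np.sqrt(np.sum(lam / (1.0 + n * alpha_n * lam) * a ** 2))
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    hits = 0
    for _ in range(reps):
        y = f0 + rng.standard_normal(kmax) / np.sqrt(n)  # Y_k = f_{0,k} + eps_k/sqrt(n)
        centre = float(np.sum(w * a * y))                # posterior mean = median
        half = np.sqrt(alpha_n) * z * sd                 # rescaled half-length
        hits += centre - half <= psi0 <= centre + half
    return hits / reps

# Case 1 of Table 1, with sqrt(n) * alpha_n = sqrt(log n) -> infinity
n = 10_000
print(sr_coverage(n, np.sqrt(np.log(n) / n), beta=2.0, mu=2.0, gamma=0.5))
```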
Remark 3.3 (Multi-dimensional functionals) Though we do not formally present any multi-dimensional semiparametric BvM results in this paper, we briefly sketch the analogous construction of a multi-dimensional shift-and-rescale set given a BvM theorem. Recall that for a one-dimensional functional, one uses the $\alpha_n$-posterior quantiles to define the boundary of the credible interval. In higher dimensions, a simple possibility is to use a sample from the $\alpha_n$-posterior to compute its empirical covariance $V_Y$, and to use this as a shape for the boundary of the credible set. More precisely, for a $d$-dimensional functional, a $(1-\delta)$-credible set from an approximately Gaussian $N_d(\tilde\psi, \tilde V)$ random variable is approximately $\{\psi : (\psi - \tilde\psi)^T\tilde V^{-1}(\psi - \tilde\psi) \le \chi^2_d(1-\delta)\}$, where $\chi^2_d(1-\delta)$ is the $(1-\delta)$-quantile of the $\chi^2_d$ distribution. The corresponding empirical shift-and-rescale set from the fractional posterior would then be
$$\{\psi : (\psi - \tilde\psi)^T V_Y^{-1}(\psi - \tilde\psi) \le \alpha_n\,\chi^2_d(1-\delta)\},$$
where $V_Y$ is the empirical $\alpha_n$-posterior covariance. This provides an analogue in dimension $d \ge 1$ of the shift-and-rescale set presented in (28) when $d = 1$ (a minimal code sketch of this construction is given after Theorem 4.1 below).

4. Contraction Rates for the Fractional Posterior

A first step in proving the semiparametric BvM results of Section 2 is to localize the posterior near the true parameter by establishing a contraction rate. We therefore study nonparametric contraction rates for the $\alpha_n$-posterior distribution with a focus on obtaining the precise dependence on both $n$ and $\alpha_n$, results which are also of independent interest for full nonparametric Bayesian estimation. Given our primary focus is semiparametrics, we will consider common statistical norms which are relevant to this topic, such as $L^p$-distances.

[Figure 2: Histograms of $\alpha_n$-posterior samples of $\psi(f)$ in density estimation using a rescaled squared exponential prior, for $\alpha_n = n^{-1/4}/\sqrt{\ln n}$ and $\alpha_n = n^{-1/4}\sqrt{\ln n}$, with empirical credible intervals (dashed lines) and shift-and-rescale credible intervals (solid lines).]

Recall that, unlike for the full Bayesian posterior, testing or metric-entropy conditions are not needed to obtain contraction rates in Rényi divergence for the fractional posterior when $\alpha_n < 1$ (as derived by Walker and Hjort, 2001 for consistency and Zhang, 2006 for rates); see also Kruijer and van der Vaart (2013); Bhattacharya et al. (2019); Grünwald and Mehta (2020). Given this result is more flexible than the classic test-based approach for full posteriors, we first examine its implications for some common statistical norms. For $0 < \alpha < 1$, the Rényi divergence of order $\alpha$ between two densities $f$ and $g$ on a measurable space $(E, \mathcal A, \mu)$ is given by
$$D_\alpha(f, g) = \frac{1}{\alpha - 1}\log\int_E f^\alpha g^{1-\alpha}\,d\mu.$$
Further define the usual Kullback-Leibler divergence $K(f, g) = \int f\log(f/g)\,d\mu$ and its second variation $V(f, g) = \int f\big(\log(f/g) - K(f, g)\big)^2\,d\mu$. It is well known that posterior contraction rates are related to the prior mass assigned to a Kullback-Leibler-type neighbourhood of the true density $p_0^n = p_{\eta_0}^n$:
$$B_n(p_{\eta_0}^n, \varepsilon) = B_n(\eta_0, \varepsilon) = \big\{\eta \in S : K(p_{\eta_0}^n, p_\eta^n) \le n\varepsilon^2,\; V(p_{\eta_0}^n, p_\eta^n) \le n\varepsilon^2\big\},$$
see Chapter 8 of Ghosal and van der Vaart (2017). We first modify Theorem 3.1 of Bhattacharya et al. (2019) by introducing an explicit dependence on $\alpha_n$ in the small-ball probability.

Theorem 4.1 For any nonnegative sequence $\varepsilon_n$ and $0 < \alpha_n < 1$ such that $n\alpha_n\varepsilon_n^2 \to \infty$ and
$$\Pi(B_n(\eta_0, \varepsilon_n)) \ge e^{-n\alpha_n\varepsilon_n^2}, \qquad (31)$$
there exists $C > 0$ such that as $n \to \infty$,
$$\Pi_{\alpha_n}\Big(\eta : \frac1n D_{\alpha_n}(p_\eta^n, p_{\eta_0}^n) \ge \frac{C\alpha_n\varepsilon_n^2}{1-\alpha_n}\;\Big|\; Y^n\Big) = o_P(1).$$
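As promised in Remark 3.3 above, here is a minimal sketch (ours; hypothetical names) of the $d$-dimensional shift-and-rescale set, implemented as a membership test using the empirical $\alpha_n$-posterior covariance and the $\chi^2_d$ quantile.

```python
import numpy as np
from scipy.stats import chi2

def in_sr_ellipsoid(psi, draws, alpha_n, delta=0.05):
    """Check whether psi lies in the empirical shift-and-rescale ellipsoid
    {(psi - c)^T V_Y^{-1} (psi - c) <= alpha_n * chi2_d(1 - delta)}, with c and
    V_Y the mean and covariance of the alpha_n-posterior draws."""
    c = draws.mean(axis=0)                    # centering estimator
    V_Y = np.cov(draws, rowvar=False)         # empirical posterior covariance
    d = draws.shape[1]
    diff = psi - c
    stat = diff @ np.linalg.solve(V_Y, diff)  # squared Mahalanobis distance
    return stat <= alpha_n * chi2.ppf(1 - delta, df=d)

# toy check with d = 2
rng = np.random.default_rng(2)
draws = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.3], [0.3, 1.0]], size=4_000)
print(in_sr_ellipsoid(np.array([0.1, 0.9]), draws, alpha_n=0.25))
```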
Theorem 4.1 differs from Theorem 3.1 in Bhattacharya et al. (2019) on two points: first, the required lower bound for the small-ball probability in (31) takes the form $e^{-n\alpha_n\varepsilon_n^2}$ rather than $e^{-n\varepsilon_n^2}$, which is a natural modification in view of the interpretation that the $\alpha_n$-posterior uses effective sample size $\bar n = n\alpha_n$; second, the obtained rate in terms of $D_{\alpha_n}(p_\eta^n, p_{\eta_0}^n)/n$ is $C\alpha_n\varepsilon_n^2/(1-\alpha_n)$ instead of $C\varepsilon_n^2/(1-\alpha_n)$ (importantly, note that the sequences $\varepsilon_n$ in the two rates may differ, since the small-ball probability conditions differ; see below for more details). We illustrate the difference between these approaches in the next examples.

Note that in interpreting the rate in Theorem 4.1, one needs to take care of the dependence of $D_{\alpha_n}$ on the exponent $\alpha_n$. In typical examples for i.i.d. models, this scales as $n\alpha_n$ times squared individual distances between densities. In the Gaussian white noise model, for instance, one can directly compute $D_{\alpha_n}(f, f_0) = \frac{n\alpha_n}{2}\|f - f_0\|_2^2$, so that the conclusion of the last statement becomes
$$\Pi_{\alpha_n}\Big(f : \|f - f_0\|_2 \ge \frac{C\varepsilon_n}{\sqrt{1-\alpha_n}}\;\Big|\; Y^n\Big) = o_P(1).$$
Consider for simplicity the case of a $\beta$-smooth Gaussian process prior with $\beta$-smooth truth $f_0$, in which case condition (31) above yields the choice $\varepsilon_n = \varepsilon_{n,\alpha_n} = (n\alpha_n)^{-\frac{\beta}{2\beta+1}}$ (see Section 4.1 below for precise statements). In this case, Theorem 4.1 gives $L^2$-rate $\varepsilon_{n,\alpha_n}(1-\alpha_n)^{-1/2} = (n\alpha_n)^{-\frac{\beta}{2\beta+1}}(1-\alpha_n)^{-1/2}$, while Theorem 3.1 of Bhattacharya et al. (2019) implies the rate $\varepsilon_{n,1}\,\alpha_n^{-1/2}(1-\alpha_n)^{-1/2} = n^{-\frac{\beta}{2\beta+1}}\alpha_n^{-1/2}(1-\alpha_n)^{-1/2}$. In particular, for all $\beta > 0$ and $0 < \alpha_n < 1$, the former gives a better dependence on $\alpha_n$, particularly in the small-$\alpha_n$ regime. A similar conclusion holds in density estimation with $L^1$-loss, where one has $D_{\alpha_n}(f^n, f_0^n) \ge n\alpha_n\|f - f_0\|_1^2/2$ (van Erven and Harremoës, 2014, Theorem 31) for $f^n(x) = \prod_{i=1}^n f(x_i)$ the $n$-fold product density of $f$, thereby giving the same rates as for the $L^2$-loss in Gaussian white noise just above. Thinking of $\alpha_n$'s that go to zero polynomially in $n$ (e.g. $\alpha_n = n^{-1/4}$), one sees that the improvement is polynomial in $n$ in these examples (a worked computation is given at the end of this discussion).

Remark 4.2 One can also more generally compare the rates obtained by the two approaches. Denote $f(\varepsilon) := f_n(\varepsilon) = \Pi(B_n(\eta_0, \varepsilon))$ and $g(\varepsilon) = e^{-n\varepsilon^2}$, and suppose to fix ideas that the equations $f(\varepsilon_n) = e^{-n\alpha_n\varepsilon_n^2}$ and $f(\bar\varepsilon_n) = e^{-n\bar\varepsilon_n^2}$ have unique solutions $\varepsilon_n, \bar\varepsilon_n$. By definition $(f - g)(\bar\varepsilon_n) = 0$ while $f(\varepsilon_n) - g(\varepsilon_n) = e^{-n\alpha_n\varepsilon_n^2} - e^{-n\varepsilon_n^2} > 0$, so that $\varepsilon_n \ge \bar\varepsilon_n$, using that $f - g$ is non-decreasing. In particular, $f(\varepsilon_n) \ge f(\bar\varepsilon_n)$, which leads to $\alpha_n\varepsilon_n^2 \le \bar\varepsilon_n^2$, implying that the rate provided by Theorem 4.1 is in that case, up to constants, at least as fast as that of Theorem 3.1 of Bhattacharya et al. (2019) (and, as the examples above show, the improvement is sometimes polynomial).

Note that the above rates deteriorate as $\alpha_n \to 1$, i.e. as the fractional posterior converges to the full posterior. This is not surprising, since contraction rates for the full posterior typically require additional conditions, such as testing or bounded entropy conditions. Indeed, Barron et al. (1999) provide a counterexample of a prior which satisfies the small-ball condition (31) with $\alpha_n = 1$ but not a related entropy condition. They show the full posterior is inconsistent (Barron et al., 1999, Section 3.5), whereas the fractional posterior converges to the truth at rate at least $(1-\alpha)^{-1}n^{-1/3}$ when $\alpha \in (0,1)$ is fixed (Bhattacharya et al., 2019). This counterexample shows that one must exploit additional regularity properties of a prior beyond the prior mass condition (31) to ensure good behaviour as $\alpha_n \to 1$.
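The worked computation below (ours, for illustration) makes the size of the polynomial gain explicit in the Gaussian white noise comparison above, taking $\alpha_n = n^{-1/4}$ and $\beta = 1$.

```latex
% Worked comparison for \alpha_n = n^{-1/4} and \beta = 1, so \beta/(2\beta+1) = 1/3.
% Theorem 4.1:
%   \varepsilon_{n,\alpha_n} = (n\alpha_n)^{-1/3} = (n^{3/4})^{-1/3} = n^{-1/4}.
% Theorem 3.1 of Bhattacharya et al. (2019):
%   \varepsilon_{n,1}\,\alpha_n^{-1/2} = n^{-1/3}\, n^{1/8} = n^{-5/24}.
\[
  (n\alpha_n)^{-\frac{\beta}{2\beta+1}} = n^{-\frac14}
  \;\ll\;
  n^{-\frac{\beta}{2\beta+1}}\,\alpha_n^{-\frac12} = n^{-\frac{5}{24}},
  \qquad \alpha_n = n^{-1/4},\ \beta = 1,
\]
% and the ratio n^{-1/4}/n^{-5/24} = n^{-1/24} \to 0: a polynomial improvement in n.
```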
Note that taking a sequence $\alpha_n \to 1$ is also relevant for certain practical Bayesian computational algorithms: for instance, fractionally weighting (tempering) parallel distributions can improve sampling convergence and yield faster mixing times (Geyer and Thompson, 1995), and sequences $\alpha_n \to 1$ arise in some empirical Bayes methods (Martin and Tang, 2020). We therefore present a second $\alpha_n$-posterior convergence result following the testing approach of Ghosal and van der Vaart (2007), which removes the necessity that $\alpha_n < 1$ at the expense of an extra testing condition needed to control the complexity of the prior support. Theorem 1 of Ghosal and van der Vaart (2007) extends to the $\alpha_n$-posterior using the same proof technique as for the full posterior.

Theorem 4.3 Let $d$ be a metric on the parameter space $S$ and $\eta_0 \in S$. Suppose that there exist universal constants $K, a > 0$ such that for all $\varepsilon > 0$ and all $\eta_1 \in S$ satisfying $d(\eta_0, \eta_1) > \varepsilon$, there exist tests $\varphi_n$ satisfying
$$E_{\eta_0}\varphi_n \le e^{-Kn\varepsilon^2}, \qquad \sup_{\eta \in S:\, d(\eta, \eta_1) \le a\varepsilon} E_\eta(1 - \varphi_n) \le e^{-Kn\varepsilon^2}. \qquad (32)$$
Let $\varepsilon_n, \bar\varepsilon_n$ be positive sequences with $n\alpha_n\varepsilon_n^2 \to \infty$ and $n\alpha_n\bar\varepsilon_n^2 \to \infty$, and let $C, D > 0$ and subsets $S_n \subseteq S$ satisfy

1. $N(\bar\varepsilon_n, S_n, d) \le e^{Dn\bar\varepsilon_n^2}$,
2. $\Pi(S_n^c) \le e^{-(C+3)n\alpha_n\bar\varepsilon_n^2}$,
3. $\Pi(B_n(\eta_0, \varepsilon_n)) \ge e^{-Cn\alpha_n\varepsilon_n^2}$.

Then there exists $M > 0$ such that as $n \to \infty$,
$$\Pi_{\alpha_n}\big(\eta : d(\eta, \eta_0) \ge M(\varepsilon_n \vee \bar\varepsilon_n) \mid Y^n\big) \to 0 \quad \text{in } P_0\text{-probability}.$$

In the i.i.d. density estimation model, the testing condition (32) is satisfied for instance by the Hellinger metric, the $L^1$-distance or, for a bounded set of densities, by the $L^2$-distance (Ghosal and van der Vaart, 2017, Proposition D.8). It similarly extends to Gaussian white noise with the $L^2$-distance (Ghosal and van der Vaart, 2017, Lemma D.16) and various other non-i.i.d. models such as nonparametric regression, Markov chains and time series, see Chapter 8.3 in Ghosal and van der Vaart (2017). Having two sequences $\varepsilon_n$ and $\bar\varepsilon_n$ adds flexibility to the approach, which can prove useful in certain non-i.i.d. models. Returning to the $\beta$-smooth Gaussian process example, and assuming for simplicity that $\varepsilon_n \asymp \bar\varepsilon_n \asymp (n\alpha_n)^{-\frac{\beta}{2\beta+1}}$, Theorem 4.3 yields the rate $(n\alpha_n)^{-\frac{\beta}{2\beta+1}}$, compared with the slower rate $(n\alpha_n)^{-\frac{\beta}{2\beta+1}}(1-\alpha_n)^{-1/2}$ from Theorem 4.1. In particular, the former rate gains significantly when $\alpha_n \to 1$ and fully matches the original parametric intuition that the fractional posterior uses effective sample size $\bar n = n\alpha_n$.

We now apply these general results to the concrete examples of histogram and Gaussian process priors. In all cases we use the sharper rate from Theorem 4.3, since these priors satisfy the required entropy conditions.

Proposition 4.4 (Histogram prior) Consider density estimation on $[0,1]$ with true density $f_0 \in C^\beta([0,1])$ for some $\beta \in (0,1]$, bounded away from 0. Let $\Pi = \Pi_n$ denote the histogram prior (12) satisfying $K_n = o(n\alpha_n/\log(n\alpha_n))$ and $(n\alpha_n)^{-b} \le \delta_{i,n} \le 1$ for $i = 1, \dots, K_n$, for some $b > 0$. Then there exists $C > 0$ such that as $n \to \infty$,
$$\Pi_{\alpha_n}\Big(f : \|f - f_0\|_1 \ge C\Big(\sqrt{\tfrac{K_n\log(n\alpha_n K_n)}{n\alpha_n}} + K_n^{-\beta}\Big)\;\Big|\; Y^n\Big) \to 0 \quad \text{in } P_0\text{-probability}.$$

As expected, the rate in the last proposition matches that for the full posterior, but with the role of the sample size $n$ replaced by the effective sample size $\bar n = n\alpha_n$ (cf. Equation 4.8 in Castillo and Rousseau, 2015). Note that the optimal choice $K_n^* \asymp (n\alpha_n/\log(n\alpha_n))^{\frac{1}{2\beta+1}}$ that balances the two terms in the rate also depends on $\alpha_n$, and hence will not match the optimal truncation for the true posterior. This follows since the fractional posterior inflates the variance without significantly affecting the bias in the well-specified setting considered here. We further remark that the prior conditions required in Proposition 4.4 become more stringent as $\alpha_n \to 0$, though one may always take $K_n \to \infty$ since $n\alpha_n \to \infty$ by assumption.
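One computational remark: for the random histogram prior (12) with Dirichlet-distributed weights, the $\alpha_n$-posterior remains conjugate, since tempering the likelihood simply multiplies the bin counts by $\alpha_n$ in the Dirichlet update. A minimal Python sketch follows (ours; the flat default $\delta_i = 1$ and the functional $a(t) = \sin(\pi t)$ are illustrative choices, not taken from the paper).

```python
import numpy as np

def fractional_histogram_posterior(x, K, alpha_n, delta=None, n_draws=1000, rng=None):
    """Draws from the alpha_n-posterior of a random histogram density on [0,1].
    The Dirichlet prior on bin weights is conjugate: the tempered likelihood
    contributes alpha_n * (bin counts) to the Dirichlet parameters."""
    rng = rng or np.random.default_rng(0)
    delta = np.full(K, 1.0) if delta is None else delta   # prior parameters
    counts = np.histogram(x, bins=K, range=(0.0, 1.0))[0]
    weights = rng.dirichlet(delta + alpha_n * counts, size=n_draws)
    return K * weights   # each row: bin heights of one posterior density draw

# usage: fractional posterior of a linear functional psi(f) = int a f
rng = np.random.default_rng(3)
x = rng.beta(2.0, 3.0, size=10_000)               # data from a smooth density
K, alpha_n = 25, 0.2
heights = fractional_histogram_posterior(x, K, alpha_n, rng=rng)
mids = (np.arange(K) + 0.5) / K
psi_draws = heights @ np.sin(np.pi * mids) / K    # midpoint rule for int a f
print(psi_draws.mean(), psi_draws.std())
```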
4.1 Contraction rates for Gaussian process priors

As mentioned in Section 2.4 above, for a mean-zero Gaussian process $W$ viewed as a Borel-measurable map into a Banach space $(\mathbb B, \|\cdot\|)$ with corresponding RKHS $(\mathbb H, \|\cdot\|_{\mathbb H})$, the corresponding contraction rates are related to the behaviour of the concentration function $\varphi_w$. This connection is made explicit in Theorem 2.1 of van der Vaart and van Zanten (2008), which characterizes rates such that a Gaussian prior places sufficient mass about a given truth and concentrates on sets of bounded complexity. These conclusions are in terms of the Banach-space norm $\|\cdot\|$, which must then be related to concrete distances in standard statistical settings. The following result extends Theorem 2.1 of van der Vaart and van Zanten (2008) to the fractional posterior by considering the solution to the equation $\varphi_{\eta_0}(\varepsilon_n) \le n\alpha_n\varepsilon_n^2$, i.e. using the effective sample size on the right-hand side, see (19). Since it is well established that the support of a Gaussian process $W$ equals the closure of its RKHS $\mathbb H$ under the underlying Banach-space norm $\|\cdot\|$, we require the true parameter $\eta_0$ to lie in this space.

Lemma 4.5 Let $W$ be a mean-zero Gaussian random element in a separable Banach space $(\mathbb B, \|\cdot\|)$ with associated RKHS $(\mathbb H, \|\cdot\|_{\mathbb H})$, and suppose $\eta_0$ lies in $\overline{\mathbb H}$, the closure of $\mathbb H$ in $\mathbb B$. If $\varepsilon_n > 0$ and $\alpha_n > 0$ satisfy $\varphi_{\eta_0}(\varepsilon_n) \le n\alpha_n\varepsilon_n^2$, then for any $C > 1$ with $Cn\alpha_n\varepsilon_n^2 > \log 2$, there exist measurable sets $B_n \subset \mathbb B$ such that
$$\log N(3\varepsilon_n, B_n, \|\cdot\|) \le 6Cn\alpha_n\varepsilon_n^2, \qquad P(W \notin B_n) \le e^{-Cn\alpha_n\varepsilon_n^2}, \qquad P(\|W - \eta_0\| < 2\varepsilon_n) \ge e^{-n\alpha_n\varepsilon_n^2}.$$

Lemma 4.5 involves the Banach-space norm $\|\cdot\|$, which is related to statistically relevant norms and divergences in both Gaussian white noise and density estimation in van der Vaart and van Zanten (2008). We will shortly make this correspondence explicit in Propositions 4.7 and 4.8 below. However, given our interest in the precise role of the fractional parameter $\alpha_n$, we first study corresponding lower bounds for the contraction rate. For Gaussian process priors, this has been studied in Castillo (2008), where it is established that a lower bound on the concentration function in turn implies a lower bound on the contraction rate.

Lemma 4.6 (Lower bound for the contraction rate) Let $W$ be a mean-zero Gaussian random element in a separable Banach space $(\mathbb B, \|\cdot\|)$ with associated RKHS $(\mathbb H, \|\cdot\|_{\mathbb H})$, and suppose $\eta_0$ lies in $\overline{\mathbb H}$, the closure of $\mathbb H$ in $\mathbb B$. Suppose $\varepsilon_n \to 0$ and $0 < \alpha_n \le 1$ with $n\alpha_n\varepsilon_n^2 \to \infty$ satisfy $\Pi(B_n(\eta_0, \varepsilon_n)) \ge e^{-cn\alpha_n\varepsilon_n^2}$ for some $c > 0$. If $\delta_n \to 0$ satisfies $\varphi_{\eta_0}(\delta_n) \ge (2+c)\,n\alpha_n\varepsilon_n^2$, then as $n \to \infty$,
$$\Pi_{\alpha_n}(\eta : \|\eta - \eta_0\| \le \delta_n \mid Y^n) \to 0 \quad \text{in } P_0\text{-probability}.$$

Note that Lemma 4.6 yields a lower bound on the posterior contraction rate for the parameter $\eta$ to which the Gaussian process is assigned, and in the underlying Banach-space norm $\|\cdot\|$, which need not match the desired statistical distance. We now specialize the above results to our two concrete models.

Proposition 4.7 (Contraction rates in Gaussian white noise) Consider the Gaussian white noise model and let the prior on $f$ be a mean-zero Gaussian random element $W$ in $L^2[0,1]$ with associated RKHS $\mathbb H$. If the true parameter $f_0$ lies in the support of $W$ and $\varepsilon_n \to 0$ satisfies $\varphi_{f_0}(\varepsilon_n) \le n\alpha_n\varepsilon_n^2$, then for some $M > 0$ large enough,
$$\Pi_{\alpha_n}(f : \|f - f_0\|_2 > M\varepsilon_n \mid Y^n) \to 0 \quad \text{in } P_0\text{-probability}, \qquad n \to \infty.$$
Moreover, if $\varphi_{f_0}(\delta_n) \ge \tfrac94\,n\alpha_n\varepsilon_n^2$, then for sufficiently small $m > 0$ and as $n \to \infty$,
$$\Pi_{\alpha_n}(f : \|f - f_0\|_2 \le m\delta_n \mid Y^n) \to 0 \quad \text{in } P_0\text{-probability}.$$

In the white noise model, one can consider $W$ as a random element of $L^2[0,1]$, so that the norms for the upper and lower bounds in Proposition 4.7 match; as we will see, this is no longer the case in density estimation.
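Before specializing to density estimation, it is instructive to solve the rate equation of Lemma 4.5 explicitly in a model case. The short computation below is ours, using the two-term form of the concentration function that reappears in the proof of Corollary 4.9.

```latex
% Model case (computation ours): suppose, as in the proof of Corollary 4.9,
%   \varphi_{\eta_0}(\varepsilon) \asymp \varepsilon^{-1/\gamma}
%       + \varepsilon^{-(2\gamma - 2\beta + 1)/\beta}, \qquad \beta \le \gamma + 1/2 .
% Small-ball term:
%   \varepsilon_n^{-1/\gamma} \le n\alpha_n\varepsilon_n^2
%     \iff \varepsilon_n \ge (n\alpha_n)^{-\gamma/(1+2\gamma)} .
% Decentering term:
%   \varepsilon_n^{-(2\gamma-2\beta+1)/\beta} \le n\alpha_n\varepsilon_n^2
%     \iff \varepsilon_n \ge (n\alpha_n)^{-\beta/(1+2\gamma)} .
\[
  \varphi_{\eta_0}(\varepsilon_n) \le n\alpha_n\varepsilon_n^2
  \quad\text{for}\quad
  \varepsilon_n \asymp (n\alpha_n)^{-\frac{\beta\wedge\gamma}{1+2\gamma}},
\]
% i.e. the classical rate with n replaced by the effective sample size n\alpha_n.
```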
Proposition 4.8 (Contraction rates in density estimation) Consider density estimation on $[0,1]$ and assign to the density $f$ a prior of the form (21), where $W$ is a mean-zero Gaussian random element in $L^\infty[0,1]$ with associated RKHS $\mathbb H$. If the true parameter $\eta_0 = \log f_0$ lies in the support of $W$ and $\varepsilon_n \to 0$ satisfies $\varphi_{\eta_0}(\varepsilon_n) \le n\alpha_n\varepsilon_n^2$, then for $M > 0$ large enough, as $n \to \infty$,
$$\Pi_{\alpha_n}(f : \|f - f_0\|_1 > M\varepsilon_n \mid Y^n) \to 0 \quad \text{in } P_0\text{-probability}.$$
Moreover, there exists a finite constant $C_1 > 0$ such that if $\varphi_{\eta_0}(\delta_n) \ge C_1 n\alpha_n\varepsilon_n^2$, then for sufficiently small $m > 0$ and as $n \to \infty$,
$$\Pi_{\alpha_n}(f : \|f - f_0\|_\infty \le m\delta_n \mid Y^n) \to 0 \quad \text{in } P_0\text{-probability}.$$

One typically expects the rates in $L^1$ and $L^\infty$ to match up to a logarithmic factor in $n$, so $\varepsilon_n$ and $\delta_n$ in the last proposition should heuristically be of the same polynomial order. However, a lower bound in $L^\infty$ does not strictly imply one in the weaker $L^1$-norm, and hence there is a genuine mismatch here. We next apply the above results to the concrete examples of Gaussian priors considered above.

Corollary 4.9 Let $W$ be one of the mean-zero Gaussian processes described in Examples 1-4 with regularity parameter $\gamma > 0$, considered as a random element in $L^p[0,1]$ with associated concentration function $\varphi_{\eta_0}$. Then $\varepsilon_n \to 0$ satisfies $\varphi_{\eta_0}(\varepsilon_n) \le n\alpha_n\varepsilon_n^2$ in the following cases:

(i) infinite series prior (Example 1) with $p = 2$, $\eta_0 \in H^\beta$ and $\varepsilon_n \asymp (n\alpha_n)^{-\frac{\gamma\wedge\beta}{1+2\gamma}}$;
(ii) Matérn process (Example 2) with $p = \infty$, $\eta_0 \in C^\beta$ and $\varepsilon_n \asymp (n\alpha_n)^{-\frac{\gamma\wedge\beta}{1+2\gamma}}$;
(iii) rescaled squared exponential process (Example 3) with $p = \infty$, $\eta_0 \in C^\beta$ and $\varepsilon_n \asymp \big(\tfrac{n\alpha_n}{\log^2(n\alpha_n)}\big)^{-\frac{\gamma\wedge\beta}{1+2\gamma}}$;
(iv) Riemann-Liouville process (Example 4) with $p = \infty$, $\eta_0 \in C^\beta$ and
$$\varepsilon_n \asymp \begin{cases} (n\alpha_n)^{-\frac{\gamma\wedge\beta}{1+2\gamma}} & \text{if } \gamma \le \beta \text{ or } \gamma = \tfrac12 \text{ or } \gamma \notin \beta + \tfrac12 + \mathbb N, \\[2pt] \big(\tfrac{n\alpha_n}{\log(n\alpha_n)}\big)^{-\frac{\gamma\wedge\beta}{1+2\gamma}} & \text{otherwise}. \end{cases}$$

In particular, such $\varepsilon_n$ give a contraction rate for the $\alpha_n$-posterior distribution in $\|\cdot\|_2$-loss in Gaussian white noise (cases (i)-(iv)) or in $\|\cdot\|_1$-loss in density estimation (cases (ii)-(iv)).

In all cases, we recover the usual contraction rate with the sample size $n$ replaced by the effective sample size $n\alpha_n$, mirroring the parametric situation. A natural question is whether these rates are sharp, which can be investigated via Lemma 4.6 by lower bounding the concentration function $\varphi_{\eta_0}(\varepsilon_n)$. This is a more delicate issue for which less is known, but we consider two representative examples which can be treated as in Castillo (2008). The goal is to find $\delta_n$ as large as possible such that $E_{f_0}\Pi_{\alpha_n}(f : \|f - f_0\|_p \le m\delta_n \mid Y^n) \to 0$, and to evaluate the gap between $\delta_n$ and the rate $(n\alpha_n)^{-\frac{\gamma\wedge\beta}{1+2\gamma}}$ (possibly up to $\log(n\alpha_n)$-factors) from Corollary 4.9.

Infinite series prior (Example 1) with regularity $\gamma > 0$ and $p = 2$ in Gaussian white noise. If $\gamma \le \beta$ (undersmoothing case), then for any $f_0 \in H^\beta$, we may take $\delta_n \asymp (n\alpha_n)^{-\frac{\gamma}{1+2\gamma}}$. If $\gamma > \beta$ (oversmoothing case), then there exists $f_0 \in H^\beta$ such that for $t > 1 + \beta/2$, we may take $\delta_n \asymp (n\alpha_n)^{-\frac{\beta}{2\gamma+1}}(\log(n\alpha_n))^{-t}$.

Brownian motion released at zero in density estimation with $p = \infty$. Consider $W(x) = Z_0 + B_x$ for $B$ a standard Brownian motion, $Z_0 \sim N(0,1)$ independent, and the exponentiated prior (21). This corresponds to the Riemann-Liouville process (Example 4) with $\gamma = 1/2$, but with a slight correction to the polynomial term. If $f_0 \in C^\beta$ for $\beta \ge 1/2$ (undersmoothing case), then we may take $\delta_n \asymp (n\alpha_n)^{-1/4}$, which equals $(n\alpha_n)^{-\frac{\gamma}{1+2\gamma}}$ with $\gamma = 1/2$.

In these two examples, the upper and lower bounds match, possibly up to logarithmic factors, indicating that our results capture the correct dependence on $\alpha_n$ in the nonparametric contraction rate for the fractional posterior. This matches a similar conclusion in the parametric setting (Miller, 2021; Medina et al., 2022).
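The effective-sample-size scaling of Corollary 4.9 can be observed directly in the conjugate sequence model, where the $\alpha_n$-posterior is explicit. A minimal Python sketch follows (ours; $\beta = \gamma = 1$, so the predicted rate exponent is $1/3$; constants and the truncation level are illustrative).

```python
import numpy as np

def l2_contraction(n, alpha_n, beta, gamma, kmax=5_000, n_draws=500, seed=0):
    """Posterior draws of ||f - f0||_2 in the conjugate sequence model,
    prior f_k ~ N(0, k^(-1-2*gamma)), truth f_{0,k} = k^(-1/2-beta)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, kmax + 1, dtype=float)
    f0 = k ** (-0.5 - beta)
    lam = k ** (-1.0 - 2.0 * gamma)
    y = f0 + rng.standard_normal(kmax) / np.sqrt(n)
    post_mean = n * alpha_n * lam / (1.0 + n * alpha_n * lam) * y
    post_sd = np.sqrt(lam / (1.0 + n * alpha_n * lam))
    draws = post_mean + post_sd * rng.standard_normal((n_draws, kmax))
    return np.sqrt(((draws - f0) ** 2).sum(axis=1))

# the typical L2 error tracks the effective-sample-size rate (n*alpha_n)^(-1/3)
for a_n in (1.0, 0.1, 0.01):
    err = np.median(l2_contraction(10_000, a_n, beta=1.0, gamma=1.0))
    print(a_n, err, (10_000 * a_n) ** (-1 / 3))
```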
4.2 Supremum norm contraction rates in Gaussian white noise

The two general approaches to posterior contraction used above are known to yield suboptimal rates in losses, such as $L^\infty$, which are incompatible with the intrinsic distance that geometrizes the statistical model (e.g. the Hellinger distance in density estimation), see Hoffmann et al. (2015). An alternative method is to express such a loss in terms of multiple functionals, usually involving basis coefficients, and then apply tools from semiparametric BvM results uniformly over these functionals (Castillo, 2014). We follow the program of Castillo (2014) and show that this approach extends to the fractional posterior setting in Gaussian white noise. Let $(\psi_{lk})$ denote a boundary-corrected $S$-regular orthonormal wavelet basis of $L^2[0,1]$, see Härdle et al. (1998) for full details and definitions. Consider the Besov ball
$$B^\beta_\infty(R) := \Big\{f \in L^2[0,1] : \sup_{l \ge 0}\,\sup_{0 \le k \le 2^l-1} |\langle f, \psi_{lk}\rangle_2| \le R\,2^{-l(\beta+1/2)}\Big\}.$$
The space $B^\beta_\infty$ is equivalent to the usual Hölder space $C^\beta$ for non-integer $\beta$, while for integer $\beta$ it is slightly larger, satisfying the continuous embedding $C^\beta \subset B^\beta_\infty$. We consider a wavelet series prior of the form
$$f(x) = \sum_{l=0}^\infty \sum_{k=0}^{2^l-1} \sigma_l\,\zeta_{lk}\,\psi_{lk}(x), \qquad (33)$$
where $\zeta_{lk} \sim_{iid} \varphi$ for some density $\varphi$ on $\mathbb R$ and $\sigma_l > 0$ is a scaling factor.

Proposition 4.10 Let $f_0 \in B^\beta_\infty(R)$ for some $\beta, R > 0$, and consider the wavelet series prior (33) with
(i) $\varphi$ equal to the uniform Unif$[-B, B]$ density for some $B > R$ and $\sigma_l = 2^{-l(\beta+1/2)}$, or
(ii) $\varphi$ equal to a density that is positive on $[-1,1]$ and satisfies the tail condition
$$c_1 e^{-b_1|x|^{1+\delta}} \le \varphi(x) \le c_2 e^{-b_2|x|^{1+\delta}} \quad \text{for all } |x| \ge 1, \qquad (34)$$
for some $b_1, b_2, c_1, c_2, \delta > 0$, and $\sigma_l = 2^{-l(\beta+1/2)}(l+1)^{-\frac{1}{1+\delta}}$.
Then there exists $M > 0$ large enough such that
$$E_0 \int \|f - f_0\|_\infty\, d\Pi_{\alpha_n}(f \mid Y^n) \le M\Big(\frac{\log(n\alpha_n)}{n\alpha_n}\Big)^{\frac{\beta}{2\beta+1}}.$$

The conclusion of the proposition is in $E_0$-expectation, which is slightly stronger than the usual notion of a posterior contraction rate and readily implies the latter via Markov's inequality. Proposition 4.10 thus shows that contraction rates in stronger norms, such as the $L^\infty$-norm, satisfy the same heuristic message derived above, namely that nonparametric contraction rates use the effective sample size. Note that for $\varphi$ a $N(0,1)$ density, which is covered by the last result, the prior (33) reduces to a mean-zero Gaussian process with covariance kernel $K(x, y) = \sum_{l,k} \sigma_l^2\,\psi_{lk}(x)\psi_{lk}(y)$. The uniform use of the semiparametric tools developed here can also be used to establish full nonparametric BvM results in weaker topologies which permit estimation at rate $1/\sqrt{n\alpha_n}$ (Castillo and Nickl, 2014). We mention that such results can provide frequentist coverage guarantees for certain Bayesian credible sets for the full infinite-dimensional parameter as well, although we do not pursue such extensions here.

Acknowledgments

The authors would like to thank three reviewers for helpful comments, Surya Tokdar for providing early access to the sbde R-package, and the Imperial College London-CNRS PhD Joint Programme for funding to support this collaboration and travel between Sorbonne Université and Imperial College London. ALH is funded by a CNRS-Imperial College PhD grant. IC acknowledges funding from the Institut Universitaire de France and ANR grant project BACKUP ANR-23-CE40-0018-01.

Appendix A.
Proofs of Main Results A.1 Contraction Rates Proof of Theorem 4.1 By Lemma B.1, on a subset Cn of P0-probability at least 1 1 nε2n , for any measurable set A S, E0Παn(A|Y n) = E0 A pn η (Y n)αn pnη0(Y n)αn dΠ(η) R pnη (Y n)αn pnη0(Y n)αn dΠ(η) E0 A pn η (Y n)αn pnη0(Y n)αn dΠ(η) Π(Bn(η0, εn))e 2αnnε2n 1Cn + P0(Cc n) A R pn η(x)αnpn η0(x)1 αndµ(x)dΠ(η) Π(Bn(η0, εn))e 2αnnε2n + o(1), Semiparametric Inference Using Fractional Posteriors where the last equality follows from Fubini s theorem. Set An := η : Z pn η(x)αnpn η0(x)1 αndµ(x) e 4nαnε2 n = η : 1 n(1 αn) log Z pn η(x)αnpn η0(x)1 αndµ(x) 4 αnε2 n 1 αn n Dαn(pn η, pn η0) 4 αnε2 n 1 αn Substituting An into the second-last display and using the small-ball assumption (31) yields E0Παn(An|Y n) An e 4nαnε2 ndΠ(η) Π(Bn(η0, εn))e 2αnnε2n + o(1) e nαnε2 n + o(1) = o(1), since nαnε2 n . Proof of Theorem 4.3 Denote εn = εn εn and note that Assumption 1 of the theorem is also satisfied for the sequence εn. Then this assumption together with the testing condition imply that there exists M > 0 and tests ψn such that Eη0(ψn(Y n)) = o(1) and sup η Sn,d(η,η0) M εn Eη(1 ψn(Y n)) e (C+3)n ε2 n. Assumptions 2 and 3 and Lemma B.2 yield that Παn(Sc n|Y n) P0 0 and consequently, setting An := {η, d(η, η0) M εn}, Παn(An|Y n) = Παn(An Sn|Y n)ψn(Y n) + Παn(An Sn|Y n)(1 ψn(Y n)) + Παn(An Sc n|Y n) ψn(Y n) + Παn(An Sn|Y n)(1 ψn(Y n)) + o P (1) = Παn(An Sn|Y n)(1 ψn(Y n)) + o P (1). By Lemma B.1, for a subset Cn of P0-probability at least 1 1 n ε2n and arguing as in the proof of Theorem 4.1 just above, we have E0Παn(An Sn|Y n)(1 ψn(Y n)) E0 An Sn pn η (Y n)αn pnη0(Y n)αn dΠ(η) Π(Bn(η0, εn))e 2αnn ε2n (1 ψn(Y n))1Cn + P0(Cc n). Using Fubini s theorem and H older s inequality, the last display is bounded by R An Sn R pn η(x)αnpn η0(x)1 αn(1 ψn(x))dµ(x)dΠ(η) Π(Bn(η0, εn))e αn2n ε2n + P0(Cc n) An Sn R pn η(x)(1 ψn(x))dµ(x) αn R pn η0(x)dµ(x) 1 αn dΠ(η) e Cnαn ε2ne 2nαn ε2n + o(1), which is bounded by e(2+C)nαn ε2 n R An e (C+3)nαn ε2 ndΠ(η) + o(1) e nαn ε2 n + o(1) = o(1). Proof of Proposition 4.4 The proof is a direct application of Theorem 4.3. First, the testing condition (32) is satisfied in the density estimation model with d = 1. Then, let us verify the conditions 1, 2 and 3 for Sn = H1 Kn. For Condition 1, set εn = p Kn log(n)/n L Huillier, Travis, Castillo and Ray that satisfies Kn log(3K1/2 n /εn) nε2 n and thus 3K1/2 n /εn Kn e Dnε2 n for some D > 0. By a standard result on the ε-covering number of the unit ball B 2(0RKn, 1), for n large enough, it follows, N(εn, H1 Kn, 1) N(εn, S1 Kn, 1) N(εn, B 2(0RKn, 1), 1) N(εn/K1/2 n , B 2(0RKn, 1), 2) 3K1/2 n /εn Kn e Dnε2 n, and therefore εn satisfies Condition 1. For the random histogram prior, we have Π((H1 Kn)C) = 0 and so Condition 2 is clearly satisfied. Finally, by Lemma B.3, the sequence ε2 n = Kn log(nαn Kn)/nαn + K 2β n satisfies Π(Bn(f0, M εn)) e nαn(M εn)2 for some M > 0, and thus the result follows from Theorem 4.3. Proof of Lemma 4.5 The proof is a straightforward adaptation of the proof of Theorem 11.20 in Ghosal and van der Vaart (2017) to the αn posterior, and is hence omitted. Proof of Lemma 4.6 By Lemma I.28 of Ghosal and van der Vaart (2017), the concentration function satisfies ϕη0(ε) log Π( W η0 ε) ϕη0(ε/2) for any ε > 0. In particular, Π( W η0 δn) e ϕη0(δn) e (2+c)nαnε2 n, so that under the lemma hypotheses, Π( W η0 δn) Π(BKL(η0, εn)) e (2+c)nαnε2 n e cnαnε2n e 2nαnε2 n 0. The result then follows from Lemma B.2. 
Proof of Proposition 4.7 In Gaussian white noise, the testing condition (32) is satisfied by the likelihood ratio test with the distance d = 2 (Ghosal and van der Vaart, 2017, Lemma D.16), and hence it suffices to verify conditions (1)-(3) of Theorem 4.3 in order to apply that theorem. For εn satisfying ϕf0(εn) nαnε2 n, Lemma 4.5 gives sets Bn satisfying conditions (1)-(2). By Lemma 8.30 of Ghosal and van der Vaart (2017), the Kullback-Leibler neighbourhoods take the form Bn(f0, εn) = {f : f f0 2 εn} (not to be confused with the Bn from Lemma 4.5). But then Π( f f0 2 < 2εn) e nαnε2 n from the third part of Lemma 4.5, which verifies (3) for εn possibly a multiple of itself. The contraction upper bound thus follows from Theorem 4.3. For the lower bound, we apply Lemma 4.6 with c = 1/4, so that δn satisfying ϕf0(δn) 9 4nαnε2 n is a lower bound for the contraction rate. Proof of Proposition 4.8 In density estimation, the testing condition (32) is satisfied for the Hellinger distance d H (Ghosal and van der Vaart, 2017, Proposition D.8), and hence it again suffices to verify conditions (1)-(3) of Theorem 4.3. By Lemma 3.1 of van der Vaart and van Zanten (2008), the squared Hellinger distance, Kullback-Leibler divergence and Semiparametric Inference Using Fractional Posteriors its 2nd-variation V between exponentiated densities fw and fv of the form (21) are each bounded by a multiple of v w 2 as soon as v w D0 for some finite constant D0 < . Conditions (1)-(2) can thus be verified with d = , while for (3) it suffices to show Π( W log f0 εn) e Cnαnε2 n. These three conditions each follow from Lemma 4.5 for εn satisfying ϕlog f0(εn) nαnε2 n, so that we have contraction rate εn in Hellinger distance. Since the L1-distance is bounded by a multiple of the Hellinger distance, we get the same contraction rate in L1. For the lower bound, the proof is similar to the proof of Theorem 3 of Castillo (2008). Proof of Corollary 4.9 Case (i): infinite series. For ε > 0 small enough, the centered small ball probability satisfies ϕ0(ε) ε 1 γ (Lemma 11.47 in Ghosal and van der Vaart, 2017), while infh H: h η0 2<ε h 2 H ε 2γ 2β+1 β for β γ + 1/2 (the latter quantity is O(1) if β > γ + 1/2 since then η0 is in the RKHS of W and one may take h = η0). We thus have ϕη0(εn) ε 1/γ n + ε (2γ 2β+1)/β n , which can be checked is O(nαnε2 n) for εn (nαn) γ β 1+2γ . Case (ii): Mat ern. For ε > 0 small enough and η0 Cβ, we have ϕη0(ε) ε 1/γ + ε (2γ 2β+1)/β by Lemmas 11.36 and 11.37 of Ghosal and van der Vaart (2017). As in case (i), this is O(nαnε2 n) for εn (nαn) γ β Case (iii): squared exponential. Taking the length scale kn = nαn log2(nαn) 1 1+2γ , Lemma 2.2 and Theorem 2.4 of van der Vaart and van Zanten (2007) imply that for η0 Cβ, log 1 knε2n if kβ n εn. Then ϕη0(εn) nαnε2 n is satisfied for εn kβ n log(nαn) nkn , which has minimal solution εn nαn log2(nαn) Case (iv) Riemann-Liouville. For η0 Cβ, the concentration function satisfies (Theorem 4 of Castillo, 2008) β γ > β and ( γ = 1/2 or γ / β + 1/2 + N), β log(1/ε) otherwise. In the first two cases, ϕη0(εn) nαnε2 n is satisfied by εn = (nαn) γ β 1+2γ , while in the third case, ϕη0(εn) nαnε2 n for εn = nαn log(nαn) γ β A.2 Bernstein von Mises Results Proof of Theorem 2.2 In this proof, to avoid any possible confusion, we use the explicit notation o P0(1) for a term going to 0 in P0 probability (instead of the shorthand o P (1)). 
To show that n(ψ(η) ˆψ) converges in distribution (in P0 probability) to a N(0, V0) law, L Huillier, Travis, Castillo and Ray it suffices to do so for n(ψ(η) ˆψ)1An(η). Indeed, n(ψ(η) ˆψ) = n(ψ(η) ˆψ)1An(η)+ n(ψ(η) ˆψ)1Acn(η), and since by assumption Παn[Ac n | Y n] = o P0(1), for η Παn[ | Y n] the variable 1Acn(η) goes to 0 in probability, and so does n(ψ(η) ˆψ)1Acn(η) (the probability that it is non zero is Παn[Ac n | Y n]). Since convergence in distribution is implied by convergence of Laplace transforms (this is also true for convergence in distribution in P0 probability, see Lemma 1 of the supplement of Castillo and Rousseau, 2015 for details on this), it is enough to show, for any real t, that Eαn[e n(ψ(η) ˆψ)1An | Y n] goes to et2V0/2 in P0 probability. Since e n(ψ(η) ˆψ)1An = e n(ψ(η) ˆψ)1An + 1Acn, using again that Παn[Ac n | Y n] = o P0(1), it is enough to show that Eαn(et nαn(ψ(η) ˆψ)|Y n, An) := An et nαn(ψ(η) ˆψ)eαnℓn(η) αnℓn(ηt)eαnℓn(ηt)dΠ(η) R An eαnℓn(η)dΠ(η) An et nαn(ψ(η) ˆψ)eαnℓn(η) αnℓn(ηt)eαnℓn(ηt)dΠ(η) R eαnℓn(η)dΠ(η) Παn(An | Y n) 1 goes to et2V0/2 in P0 probability, where ηt = η tψ0/ nαn the path as in (4). Using the LAN expansion in Assumption 2.1 and the linearity of Wn, ℓn(η) ℓn(ηt) = n 2 η η0 2 L + n 2 ηt η0 2 L + n Wn(η ηt) + Rn(η, η0) Rn(ηt, η0) = t n αn ψ0, η η0 L + t2 2αn ψ0 2 L + t αn Wn(ψ0) + Rn(η, η0) Rn(ηt, η0), recalling that L is a norm induced by a Hilbert space. Using the definition (5) of ˆψ and the functional expansion in Assumption 2.1, t nαn(ψ(η) ˆψ) = t nαn ψ0, η η0 L t αn Wn(ψ0) + t nαnr(η, η0). Combining the last two displays thus gives t nαn(ψ(η) ˆψ) + αnℓn(η) αnℓn(ηt) = αnℓn(ηt) + t2 ψ0 2 L 2 + t nαnr(η, η0) + αn(Rn(η, η0) Rn(ηt, η0)) | {z } Rem(η,η0) where supη An |Rem(η, η0)| = o P0(1) by assumption. Substituting this into the first display of the proof gives Eαn(et nαn(ψ(η) ˆψ)|Y n, An) = eo P0(1)+t2||ψ0||2 L/2 An eαnℓn(ηt)dΠ(η) R eαnℓn(η)dΠ(η) . Since the last ratio equals 1 + o P0(1) by assumption, the last display goes to et2V0/2 in P0 probability, which concludes the proof. Semiparametric Inference Using Fractional Posteriors Proof of Theorem 2.4 We proceed by verifying the assumptions of Theorem 2.2 for the parameter η = log f. We first need to verify Assumption 2.1. As in the discussion preceding the statement of Theorem 2.4, we have the LAN and functional expansions given by: ℓn(η) ℓn(η0) = n 2 η η0 2 L + n Wn(η η0) + Rn(η, η0) ψ(f) ψ(f0) = η η0, ψf0 L + B(f, f0) + r(f, f0), where B(f, f0) = R h η η0 f f0 i ψf0f0, so that r(f, f0) = B(f, f0) + r(f, f0). With ft as in the statement of Theorem 2.4 and ηt = log ft, Rn(η, η0) Rn(ηt, η0) = t n αn η η0, ψf0 L t2 2αn ψf0 2 L + n log F(e t ψf0/ nαn). Expanding the last term, we have for f An { f f0 1 ϵn}, n log F(e t ψf0/ nαn) = n log Z f ψf0 + t2 Z f ψ2 f0 + o t2 ψ2 f0 nαn = n log 1 t nαn η η0, ψf0 L t nαn B(f, f0)+ 2nαn ψf0 2 L + t2 2nαn (F F0)( ψ2 f0) + O((nαn) 3/2) = t n αn η η0, ψf0 L t n αn B(f, f0) + t2 2αn ψf0 2 L + o(1), since (F F0)( ψ2 f0) ψf0 2 f f0 1 εn on An. Hence we have Rn(η, η0) Rn(ηt, η0) = t n αn B(f, f0) + o(1), and the condition on remainder terms in Assumption 2.1 reduces to sup f An | nαnr(f, f0)| = o P (1), which is satisfied by assumption. The result then follows from Theorem 2.2. Proof of Proposition 2.5 To prove Proposition 2.5, we use Lemma A.1 and Lemma A.2 stated below. Lemma A.1 is proved in Section B and the proof is very similar to the one of Theorem 2.4. 
The main differences with Theorem 2.4 are that the change of variables condition is stated in term of the projection of ψ and the posterior concentration is around the projection of f0. For a random histogram prior, these two changes turn out to be useful when one wants to give sufficient conditions for the change of variables condition to be satisfied. Indeed, this is is done in Lemma A.2 which is also proved in Section B. Lemma A.1 Recall that ˆψ[Kn] = ψ(f0) + 1 n Pn i=1 ψ[Kn](Yi). Suppose f0 is bounded and Παn(An|Y n) := Παn({f H1 Kn, f f0,Kn 1 εn}|Y n) = 1 + o P (1), (37) L Huillier, Travis, Castillo and Ray for a sequence εn = o(1). Set ft = fe t ψ[Kn] nαn /F(e t ψ[Kn] nαn ) and suppose An eαnln(ft)dΠ(f) R eαnln(f)dΠ(f) = 1 + o P (1). (38) Then the αn-posterior distribution of nαn(ψ(f) ˆψ[Kn]) converges weakly to a Gaussian distribution with mean 0 and variance V = R f0 ψ2 f0. Lemma A.2 Let Π be the random histogram prior (12) with k = Kn and weights satisfying (13). Suppose Παn( An|Y n) := Παn({f H1 Kn, f f0,[Kn] 1 εn}|Y n) = 1 + o P (1), (39) for a sequence εn = o(1). Then there exists εn = o(1) a positive sequence (possibly bigger than εn), such that Παn(An|Y n) := Παn({f H1 Kn, f f0,[Kn] 1 εn}|Y n) = 1 + o P (1), (40) An eαnln(ft)dΠ(f) R eαnln(f)dΠ(f) = 1 + o P (1). (41) We can combine these two results to prove Proposition 2.5. Indeed, from assumptions (13) and (14), using Lemma A.2, we know that there exists a positive sequence εn decreasing to 0 satisfying (40) and (41). Then we deduce from Lemma A.1 that the posterior distribution of nαn(ψ(f) ˆψ[Kn]) converges weakly to a Gaussian distribution with mean 0 and variance V = R f0 ψ2 f0. Finally, assumption (15) implies that nαn(ψ(f) ˆψ) converges weakly to a Gaussian distribution with mean 0 and variance V = R f0 ψ2 f0. Proof of Corollary 2.6 This is a direct application of Proposition 2.5. Using the assumptions made on Kn, the weights and f0, and Proposition 4.4, we deduce that there exists εn 0 satisfying (14). Let us now consider the bias term nαn( ˆψ[Kn] ˆψ). Recall that from Lemma B.5, nαn( ˆψ ˆψ[Kn]) = nαn( F0( ψ[Kn]) + o P (1/ n)) = nαn F0( ψ[Kn]) + o P (1). Using the definition of ψ[Kn], F0( ψ[Kn]) = Z 1 0 f0 ψ[Kn] = Z 1 0 f0(a[Kn] a) = Z 1 0 (f0,[Kn] f0)(a[Kn] a), so that by the H older regularity of f0 and a, nαn|F0( ψ[Kn])| nαn a[Kn] a f0,[Kn] f0 nαn K γ β n = o(1) Semiparametric Inference Using Fractional Posteriors by assumption (16). Hence the assumptions of Proposition 2.5 are satisfied, which yields the result. Proof of Proposition 2.7 Recall from Castillo and Rousseau (2015) p.2371, that the assumptions made on a, Kn, and f0 allow us to bound the bias term F0( ψ[Kn]) as follows 3 K (γ+1) n F0( ψ[Kn]) K (γ+1) n n (γ+1) We first prove the result regarding the full posterior. Since, f0 C1([0, 1]) and is bounded away from 0, n1/3 2 Kn n1/3 and hence Kn = o(n/ log(n)), n b δi,n = n b 1, we deduce from Proposition 4.4 that there exists εn 0 satisfying (14). Combining this latter result with the fact that PKn i=1 δi,n = Knn b n1/3 b = o( n), we can use Proposition 2.5 to deduce that the posterior distribution of n(ψ(f) ˆψ[Kn]) converges weakly to the N(0, V0) distribution in P0-probability. Moreover, (42) implies | n F0( ψ[Kn])| c > 0 since γ 1/2 and even | n F0( ψ[Kn])| if γ < 1/2 . For the result regarding the αn-posterior, the proof is similar. 
Since, f0 C1([0, 1]) bounded away from 0, Kn n1/3 = o(n1 x/ log(n1 x)) since x < 2/3 hence Kn = o(nαn/ log(nαn)), (nαn) b δi,n = n b 1 for some b > 0, we deduce from Proposition 4.4 that there exists εn 0 satisfying (14). Combining this latter result with the fact that PKn i=1 δi,n = Knn b n1/3 b = o( nαn) since b > 1/6, we can use Proposition 2.5 to deduce that the posterior distribution of nαn(ψ(f) ˆψ[Kn]) converges weakly to the N(0, V0) distribution in P0-probability. Finally, by (42), nαn F0( ψ[Kn]) = o(1) since x > (1 2γ)/3. Thus, by Lemma B.5, it follows that nαn( ˆψ ˆψ[Kn]) = o P (1). Therefore, we deduce that the αn-posterior distribution of nαn(ψ(f) ˆψ) converges weakly to the N(0, V0) distribution in P0-probability. Proof of Theorem 2.9 We will verify the conditions of Theorem 2.3, for which we need to construct suitable sets An satisfying Assumption 2.1 and the change of measure condition (7). Under the theorem hypothesis that ϕf0(εn) nαnε2 n, Proposition 4.7 implies that the posterior contracts about f0 at rate εn in 2, i.e. Παn(Bn|Y ) P0 1 for Bn = {f : f f0 2 Mεn} with M > 0 large enough. Turning to condition (7), we follow Castillo (2012a) and first approximate the perturbation ft = f tψ0 nαn by an element of the RKHS and then apply the Cameron-Martin Theorem. To this end, let ψn H satisfy (20). Define the following isometry associated to the Gaussian process W: UW : Vect {t K( , t) : t R} L2(Ω) i=1 ai K( , ti) 7 i=1 ai Wti =: UW (η), and since any h H is the limit of a sequence Ppn i=1 ai,n K( , ti,n), UW can be extended to an isometry UW : H L2(Ω). Then UW (h) is the L2-limit of the sequence Ppn i=1 ai,n Wti, so that it is a Gaussian random variable with mean 0 and variance h 2 H. Recalling that L Huillier, Travis, Castillo and Ray f = W is a Gaussian process under the prior, the usual Gaussian tail bound implies Π(W : |UW (ψn)| M0 nαnεn ψn H) 2e M2 0 nαnε2 n/2, (43) so that the posterior probability of the set in the last display tends to zero in P0-probability by Lemma B.2 for M0 > 0 large enough. Together with the contraction result, this shows that the sets An = {w : |Uw(ψn)| M0 nαnεn ψn H} Bn satisfy Π(An|Y ) P0 1 as n . Since An Bn and using the assumptions of the present theorem, the sets An satisfy Assumption 2.1. It thus remains to establish the condition (7). Define the approximate perturbation fn = f tψn nαn , which we will now show satisfies sup f Bn |αn(ℓn(fn) ℓn(ft))| = o P (1). (44) Indeed, using the LAN expansion for the Gaussian white noise model, under P0, αn(ℓn(fn) ℓn(ft)) = t nαn 0 (f f0)(ψn ψ0) t2 2 ( ψn 2 2 ψ0 2 2)+ t nαn Wn(ψ0 ψn), where we recall Wn(g) N(0, g 2 2) for any g L2. By Cauchy-Schwarz, the first term is bounded by t nαn f f0 2 ψn ψ0 2 t nαnζnϵn = o(1) by assumption (20) for f Bn. The absolute value of the second term equals 2 | ψn ψ0, ψn + ψ0 2| t2 2 ψn ψ0 2 ψn + ψ0 2 t2 2 ζn(2 ψ0 2 + ζn) = o(1), again by assumption (20). The third term has distribution N 0, t2αn ψn ψ0 2 2/n , which is o P (1) since its variance tends to zero as n . Together, these three bounds establish (44). A version of the Cameron-Martin theorem (Castillo, 2012b, Lemma 17) states that for all Φ : B R measurable and for any g, h H and ρ > 0, E(1{|UW (g)| ρ}Φ(W h)) = E(1{|UW (g)+ g,h H| ρ}Φ(W)e UW ( h) h 2 H/2). 
Using (44) and the last display with ht = tψn/ nαn and ρt = M0 nαnεn ψn H, the quantity in (7) equals An eαnℓn(ft)dΠ(f) R eαnℓn(f)dΠ(f) = Bn,t 1{|Uw(ψn)+ ψn,ht H| ρt}eαnℓn(w)e Uw( ht) ht 2 H/2dΠ(w) R eαnℓn(f)dΠ(f) eo P (1), where Bn,t = Bn ht = {w : w + tψn/ nαn f0 2 Mεn}. For w in the domain of the top integral, using also (20), |Uw( ht) ht 2 H/2| = t nαn t nαn ρt + t2 2nαn ψn 2 H t M0 nαnεnζn + t2 Semiparametric Inference Using Fractional Posteriors Setting An,t = {w : |Uw(ψn) + ψn, ht H| ρt} Bn,t, the ratio of integrals thus equals An,t eαnℓn(w)dΠ(w) R eαnℓn(f)dΠ(f) eo P (1) = Παn(An,t|Y )eo P (1). It thus remains to show Παn(An,t|Y ) = 1 + o P (1). Since Ac n,t = {w : |Uw(ψn)+ ψn, ht H| > M0 nαnεn ψn H} {w : w+tψn/ nαn f0 2 > Mεn}, it suffices to consider the posterior probability of each of the last sets. Since w+tψn/ nαn w 2 ψn 2/ nαn (1 + ζn)/ nαn, the second set is contained in {w : w f0 2 > Mεn C/ nαn}, which has posterior probability o P (1) by Proposition 4.7, possibly after replacing εn by a multiple of itself. For the first set, note that | ψn, ht H| = t ψn 2 H/ nαn is of strictly smaller order than nαnεn ψn H if and only if ψn H = o(nαnεn). By (20), it suffices that ζn = o( nαnεn), which holds since ζn 0 while nαnε2 n . Thus the first set is contained in {w : |Uw(ψn)| > (M0/2) nαnεn ψn H} for n large enough. Arguing as in (43) and using Lemma B.2, the posterior probability of this set is thus o P (1). This shows that Π(Ac n,t|Y ) = o P (1) as required. Proof of Theorem 2.10 The proof is similar to the proof of Theorem 2.9, but with a few minor differences. For the LAN expansion, we have under P0, αn(ℓn(ηn) ℓn(ηt)) = t Gn( ψf0 ψn) + t n Z (f0 fη)( ψf0 ψn) + o(1), where Gn(g) = 1 n Pn i=1(g(Yi) E0(g(Yi))), so that G( ψf0 ψn) = o P (1). We consider sets Bn defined in terms of the 1-norm rather than 2-norm, so that we may use our contraction results for density estimation. In the last display, we thus use H older s inequality t n f0 fη 1 ψf0 ψn instead of Cauchy-Schwarz, which requires the slightly stronger assumption involving the L -norm ψf0 ψn to show that this tends to 0. Proof of Corollary 2.11 We apply Theorem 2.9 in Gaussian white noise and Theorem 2.10 in density estimation. In both cases, the required functional expansion holds by the linearity of ψ(f), so that it remains to verify (19) and that one can suitably approximate the representers ψ0 = a or ψf0 = a R 1 0 f0a by elements of the RKHS. By Corollary 4.9, in each case εn = (nαn) γ β 2γ+1 satisfies the condition (19) on the concentration function, possibly up to a log(nαn)-factor that does not affect our results here. Next, one can show that in Examples 2-4 (see the proof of Theorem 4 in Castillo, 2008 for the Riemann-Liouville process, the proof of Lemma 11.37 in Ghosal and van der Vaart, 2017 for the Mat ern process, and the proof of Lemma 2.2 in van der Vaart and van Zanten, 2007 for the rescaled square exponential process), for an appropriate kernel smoother φ and sequence σn, L Huillier, Travis, Castillo and Ray satisfies ψn H, ψn a σµ n, and ψn 2 H σ 2γ 1+2µ n . Setting σn = ζ1/µ n , we obtain ζn = (nαn) µ 2γ+1 as a suitable choice to satisfy the bounds on ψn. It thus remains to show nαnϵnζn 0, for which a sufficient condition is γ β > 1 2 + (γ µ). The final part of the condition comes from the fact that we need ϵn 0, which is satisfied if γ β > 0. 
For the infinite series prior in Gaussian white noise, one instead uses the truncated series ψn = PJn k=1 a, φk 2φk H, for which ψn a 2 2 = P k>Jn k2µ 2µ| a, φk 2|2 J 2µ n a 2 Hµ and ψn 2 H J2γ+1 2µ n . Taking Jn (nαn)1/(2γ+1) and ζn (nαn) µ/(2γ) as above, we recover the same conditions as in Examples 2-4. Proof of Lemma 2.12 By conjugacy of the αn-posterior of f, the αn posterior distribution of ψ(f) is nαnλk 1 + nαnλk ψk Yk, λk 1 + nαnλk ψ2 k As in Knapik et al. (2011), it thus suffices to show that nαnλk 1 + nαnλk ψkf0,k ψkf0,k nαnλk 1 + nαnλk ψk λk 1 + nαnλk ψ2 k is bounded by a multiple of ε2 n. We have, nαnλk 1 + nαnλk ψkf0,k ψkf0,k ψkf0,k 1 + nαnλk (1 + nαnk 1 2γ)2 f0 2 β (li) 2 µ (nαn) 2β+2µ 1+2γ 2 ε2 n, where we have used Lemma 8.2 in Knapik et al. (2011) to deduce the second last inequality. For the second term, nαnλk 1 + nαnλk ψk ψ2 knα2 nλ2 i (1 + nαnλk)2 = nα2 n (1 + nαnk 1 2γ)2 l 2 µ nαn2(nαn) 2+4γ+2µ 1+2γ 2 αn(nαn) 1+2γ+2µ 1+2γ 1 ε2 n, where we used Lemma 8.1 of Knapik et al. (2011) for the first inequality. Finally, λk 1 + nαnλk ψ2 k = 1 + nαnk 1 2γ l 2 µ(nαn) 1+2γ+2µ 1+2γ 1 ε2 n where we again invoke Lemma 8.1 of Knapik et al. (2011) for the first inequality. Semiparametric Inference Using Fractional Posteriors A.3 Credible Regions Proof of Theorem 3.1 Using (25) and standard results recalled in Lemma B.6, it follows that for all δ (0, 1), nαn(a Y n,δ ˆψ) = V qδ + o P (1) and hence we have the expansion of the quantile a Y n,δ = ˆψ + V qδ nαn + o P ( 1 nαn ). (45) Combining this expansion and assumption (29), we obtain αn(a Y n,δ ψ) = V qδ/ n + o P (1/ n). Hence we can expand the shift-and-rescale set as Jαn = αn(a Y n, δ 2 ψ) + ψ, αn(a Y n,1 δ 2 / n + o P (1/ n), ˆψ + 2 / n + o P ( 1 n) i . This last expansion together with assumption (26) yield the first conclusion of Theorem 3.1. Let us move to the case αn = α (0, 1] is fixed. By (45), the posterior median equals 2 nαn + o P (1/ nαn) = ˆψ + o P ( 1 nαn ) = ˆψ + o P ( 1 n) 2 = 0. Hence (29) is satisfied and the result follows. Proof of Proposition 3.2 For ψ the posterior mean/median, define Tn = n( ψ ˆψ). We need to show that |Tn| = o P (1) to satisfy the assumption (29) of Theorem 3.1. We have 1 1 + nαnλk akf0,k 1 1 + nαnλk akϵk =: tn,1 tn,2. The second term is Gaussian with mean 0 and variance P k=1 1 (1+nαnλk)2 a2 k (nαn) ( 2µ 1+2γ 2) by Lemma 8.1 of Knapik et al. (2011). Thus, |tn,2| = o P (1) since µ, γ > 0. Turning to tn,1, set k = (nαn) 1 1+2γ . For k k , we have (nαn)k 1 2γ < 1 + (nαn)k 1 2γ 2(nαn)k 1 2γ, and for k > k , we have 1 < 1 + (nαn)k 1 2γ < 2. Hence, we can write 1 + nαnk 1 2γ n nαn k=k +1 k 1 (β+µ) k=1 k (β+µ 2γ) + n(nαn) β+µ It thus suffices to study when this quantity is o(1). The second term in the last display is o(1) if and only if αn n 1+2γ 2β+2µ 1 =: ωn, while the first term has three cases. (1) β + µ > 1 + 2γ: the first sum in (46) is summable even for k = , and hence |tn,1| 1 nαn + + n(nαn) β+µ 1+2γ , which is o(1) if and only if αn max(n 1/2, ωn) = n 1/2, i.e. nαn . L Huillier, Travis, Castillo and Ray (2) β + µ = 1 + 2γ: note that ωn = n 1/2, while the first term in (46) equals 1 nαn Pk k=1 k 1 1 nαn log k 1 nαn log(nαn). This is o(1) if and only if αn n 1/2 log n, (3) β + µ < 1 + 2γ: the first term in (46) is of size 1 nαn (k )1+2γ β µ n(nαn) β+µ 1+2γ , which is exactly the same order as the second term in (46). Thus |tn,1| = o(1) if and only if αn ωn. But our results are restricted to the regime 0 < αn 1 and hence we require that ωn 0 to have a valid choice satisfying 0 < ωn αn 1. 
One can then check that ωn 0 if and only if 1 2 + γ < β + µ, which determines the lower bound in this range. For such a choice, |tn,1| = o(1) if and only if ω 1 n αn = n1 1+2δ 2(β+µ) αn . A.4 Supremum norm contraction rates Proof of Proposition 4.10 We focus on the case (ii) for brevity, the case (i) being similar (though easier, see also Castillo, 2014). Set Ln := nαn log(nαn) log(2)(2β+1 . For all sequences (flk), denote f Ln := PLn l=0 P k flkψlk and f LC n := P l>Ln P k flkψlk. Also denote ˆf Ln := PLn l=0 P k Ylkψlk. We have E0( Z f f0 dΠαn(f|Y n)) E0( Z f Ln ˆf Ln dΠαn(f|Y n)) | {z } (a) + E0( ˆf Ln f Ln 0 ) | {z } (b) + E0( Z f Lc n dΠαn(f|Y n)) | {z } (c) + f Lc n 0 | {z } (d) Term (d) Using the assumptions made on the coefficients (f0,lk) and the localisation property of the wavelet basis (ψlk) that P k |ψlk(x)| 2l/2 for all x [0, 1], l>Ln max k |f0,lk| X l>Ln 2 l( 1 2 +β)2l/2 2 βLn. (48) Term (b) Using the localisation property of the basis (ψlk)lk, it follows ˆf Ln f Ln 0 = k |ψlk| 1 n l=0 max k |εlk| 2l/2. Then a standard result about the maximum of n gaussian variables gives that E0( ˆf Ln f Ln 0 ) 1 n l=0 E0( max 2l 1 k 0 |εlk|)2l/2 1 n log(2l+1)2l/2 Ln n 2 Ln Semiparametric Inference Using Fractional Posteriors Term (a) Let t > 0. Using the localisation property of the basis (ψlk)lk and Jensen s inequality, we have Z f Ln ˆf Ln dΠαn(f|Y n) = 1 nαn l=0 2l/2 Z max k | nαn(flk Ylk)|dΠαn(f|Y n) Z et| nαn(flk Ylk)|dΠαn(f|Y n) Let t R, we want to bound R et nαn(flk Ylk)dΠαn(f|Y n) uniformly over l Ln and k = 0, . . . , 2l 1. By definition of the αn-posterior distribution Z et nαn(flk Ylk)dΠαn(f|Y n) = R et nαn(u Ylk)e nαn 2 (u Ylk)2 1 2 (u Ylk)2 1 R et(u αnεlk)e 1 2 (u αnεlk)2ϕ( 1 σl ( u nαn + f0,lk))du R 2 (u αnεlk)2ϕ( 1 σl ( u nαn + f0,lk))du . (51) Then let us notice that for B > R, if x [ B(l+1)µ; B(l+1)µ], then ϕ(x) c1e b1B(1+δ)(l+1) Ce cl and 1[ B(l+1)µ;B(l+1)µ]( 1 σl ( u nαn + f0,lk)) 1[ log(nαn)(B R); log(nαn)(B R)](u) 1[ 1;1](u). Combining this remark with (51), it follows, Z et nαn(flk Ylk)dΠαn(f|Y n) e t2 2 ecl R 1 1 e 1 2 (u αnεlk)2du e t2 2 ecl R 1 1 e 1 2 (u εlk)2du . (52) Combining (50) and (52), we obtain Z f Ln ˆf Ln dΠαn(f|Y n) 1 nαn 2 ecl R 1 1 e 1 2 (u εlk)2du Taking the E0-expectation and using Jensen s inequality, we get Z f Ln ˆf Ln dΠαn(f|Y n) 1 nαn l=0 2l/2(log(2l+1ecl C) Setting t = p 2 log(2l+1ecl C), we obtain Z f Ln ˆf Ln dΠαn(f|Y n) 1 nαn 2 log(2l+1ecl C) Ln nαn 2 Ln L Huillier, Travis, Castillo and Ray Term (c) Using the localisation property of the basis and Jensen s inequality, we have Z f Lc n dΠαn(f|Y n) X l>Ln 2l/2 1 Z et|flk|dΠαn(f|Y n)). (54) Let t R, we have Z etflkdΠαn(f|Y n) = 2 (u Ylk)2 1 σl )du R e nαn 2 (u Ylk)2 1 R et( u nαn +f0,lk)e u2 2 +u αnεlk 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du R e u2 2 +u αnεlk 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du . First, we bound from below the denominator. Denote A := {u : 1 σl ( u nαn + f0,lk) 1} and µ(A) := R A 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du = R 1 1 ϕ(u)du. Using Jensen s inequality with the exponential function, we obtain Dlk := Z e u2 2 +u αnεlk 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du 2 +u αnεlk 1 nαnσlµ(A)ϕ( 1 σl ( u n + f0,lk))du 2 +u αnεlk 1 nαnσlµ(A)ϕ( 1 σl ( u nαn +f0,lk))du. Denote ζl = R A u 1 nαnσlµ(A)ϕ( 1 σl ( u nαn + f0,lk))du. Dlk µ(A)e 1 2 supu A u2+ αnεlkζl µ(A)e Cnαn(σ2 l +f2 0,lk)+ αnεlkζl, (55) for some constant C > 0. 
Now split the integral of the numerator as follows Z et( u nαn +f0,lk)e u2 2 +u αnεlk 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du A et( u nαn +f0,lk)e u2 2 +u αnεlk 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du | {z } :=N1 lk(t) AC et( u nαn +f0,lk)e u2 2 +u αnεlk 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du | {z } :=N2 lk(t) Semiparametric Inference Using Fractional Posteriors Using (55) and Fubini s theorem, it follows E0 N1 lk(t) Dlk e Cnαn(σ2 l +f2 0,lk) Z A et( u nαn +f0,lk)e u2 2 E0(e αnεlk(u ζl)) | {z } =e αn(u ζl)2 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du e Cnαn(σ2 l +f2 0,lk)+ αnζ2 l 2 Z A et( u nαn +f0,lk)e (1 αn) u2 2 e uαnζl 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du e Cnαn(σ2 l +f2 0,lk)+ ζ2 l 2 Z t( u nαn +f0,lk) e|uζl| 1 nαnσl ϕ( 1 σl ( u nαn + f0,lk))du e Cnαn(σ2 l +f2 0,lk)+ ζ2 l 2 e|t|σl+|ζl| nαn(σl+|f0,lk|). Since |ζl| supu A |u| nαn(σl + |f0,lk|), we have E0 N1 lk(t) Dlk e Cnαn(σ2 l +f2 0,lk)+|t|σl, (56) for some constant C > 0. On the other hand, the change of variables u = nαn(σly +f0,lk) and (55) give N2 lk(t) Dlk e Cnαn(σ2 l +f2 0,lk) Z [ 1;1]C etuσle nαn 2 (σlu f0,lk)2+ αnεlk( nαn(σlu f0,lk) ζl)ϕ(u)du. Therefore, using Fubini s theorem, we get E0 N2 lk(t) Dlk e Cnαn(σ2 l +f2 0,lk) Z [ 1;1]C etuσle nαn 2 (σlu f0,lk)2 E0(eεlk αn( nαn(σlu f0,lk) ζl)) | {z } e αn( nαn(σlu f0,lk) ζl)2 e Cnαn(σ2 l +f2 0,lk)+ αnζ2 l 2 Z [ 1;1]C etuσle (1 αn)( nαn 2 (σlu f0,lk)2)e αn nαnζl(σlu f0,lk)du e Cnαn(σ2 l +f2 0,lk)+ ζ2 l 2 Z [ 1;1]C etuσle αn nαnζl(σlu f0,lk)ϕ(u)du e Cnαn(σ2 l +f2 0,lk)+ ζ2 l 2 +αn nαn|ζlf0,lk| Z [ 1;1]C eu(tσl αn nαnζlσl)ϕ(u)du e Cnαn(σ2 l +f2 0,lk) Z [ 1;1]C eu(tσl αn nαnζlσl)ϕ(u)du, where the constant C may change for line to line. Using the tail behavior of ϕ, one can bound its Laplace tranform and we get E0 N2 lk(t) Dlk e Cnαn(σ2 l +f2 0,lk)e C(|t|σl+αn nαn|ζl|σl) δ+1 δ e Cnαn(σ2 l +f2 0,lk)+C(|t|σl+ nαn|ζl|σl) δ+1 Combining (56) and (57), we obtain Z etflkdΠαn(f|Y n) e C(nαn(σ2 l +f2 0,lk)+|t|σl+(|t|σl+ nαn|ζl|σl) δ+1 L Huillier, Travis, Castillo and Ray Combining (54) and (58), and denoting φl = R2 l( 1 2 +β), t > 0 we have Z f Lc n dΠαn(f|Y n) X l>Ln 2l/2 1 t log(2l+1e C(nαn(σ2 l +φ2 l )+|t|σl+(|t|σl+nαn|σl+φl|σl) δ+1 Using the fact that for l > Ln, nαnφl nαnφLn p log(nαn Ln l and nαnφLnσLn log(nαn)(Ln + 1)µ log(nαn) δ 1+δ L δ 1+δ n l δ 1+δ , we deduce that for any t > 0, Z f Lc n dΠαn(f|Y n) X l>Ln 2l/2 1 t log(2l+1e C(l+tσl+(tσl+l δ δ+1 ) δ+1 l>Ln 2l/2 1 t (l + tσl + (tσl + l δ 1+δ ) δ+1 Choosing t = l δ δ+1 σ 1 l , we obtain Z f Lc n dΠαn(f|Y n) X l>Ln 2l/2σll δ δ+1 (l + l δ δ+1 + (l δ δ+1 + l δ 1+δ ) δ+1 l>Ln 2l/2σll δ δ+1 (l + l δ δ+1 ) X l>Ln 2l/2σll δ δ+1 l l>Ln 2l/2σll 1 δ+1 X l>Ln 2l/22 l( 1 2 +β) 2 βLn. (60) Conclusion. Combining (49), (48), (53) and (60), one gets, as desired, E0( Z f f0 dΠαn(f|Y n)) 1 nαn 2 + 2 βLn log(nαn) Appendix B. Ancillary Results B.1 Contraction Rates Lemma B.1 For any distribution Π on S, any C, ε > 0 and 0 < α 1, with P0-probability at least 1 1 C2nε2 , we have Z pnη0(Y n)α dΠ(η) Π(Bn(η0, ε))e α(C+1)nε2. Proof Suppose Π(Bn(η0, ε)) > 0 (otherwise the result is immediate), and denote by Π = Π( Bn(η0,ε)) Π(Bn(η0,ε)) the normalized prior to Bn(η0, ε). Now let us bound from below pnη0(Y n)α dΠ(η) Z pnη0(Y n)α dΠ(η) = Π(Bn(η0, ε)) Z pn η(Y n)α pnη0(Y n)α d Π(η). (61) Semiparametric Inference Using Fractional Posteriors Since Π is a probability measure on S, Jensen s inequality applied to the logarithm gives, log Z pn η(Y n)α pnη0(Y n)α d Π(η) α Z log pn η(Y n) pnη0(Y n) Consider now the random variable Z := R log pn η (Y n) pnη0(Y n) d Π(η). 
Lemma B.2 Let $A_n$ be measurable sets, $0 < \alpha_n \le 1$ and $\varepsilon_n$ a non-negative sequence such that $n\alpha_n\varepsilon_n^2 \to \infty$. If
$$\frac{\Pi(A_n)}{\Pi\big(B_n(\eta_0,\varepsilon_n)\big)\, e^{-2n\alpha_n\varepsilon_n^2}} = o(1),$$
then $\Pi^{\alpha_n}(A_n|Y^n) \to 0$ in $P_0$-probability.

Proof Applying Hölder's inequality to the right-hand side of (35) implies
$$E_0\, \Pi^{\alpha_n}(A_n|Y^n) \le \frac{\int_{A_n} \big(\int p^n_\eta(x)\, d\mu_n(x)\big)^{\alpha_n}\, \big(\int p^n_{\eta_0}(x)\, d\mu_n(x)\big)^{1-\alpha_n}\, d\Pi(\eta)}{\Pi\big(B_n(\eta_0,\varepsilon_n)\big)\, e^{-2\alpha_n n\varepsilon_n^2}} + o(1) = \frac{\Pi(A_n)}{\Pi\big(B_n(\eta_0,\varepsilon_n)\big)\, e^{-2\alpha_n n\varepsilon_n^2}} + o(1) = o(1).$$

Lemma B.3 Consider density estimation on $[0,1]$ with true density $f_0 \in C^\beta([0,1])$ for some $\beta \in (0,1]$, bounded away from $0$. Let $\Pi = \Pi_n$ denote the histogram prior (12) satisfying $K_n = o(n\alpha_n/\log(n\alpha_n))$ and, for all $i \in \{1,\dots,K_n\}$, $(n\alpha_n)^{-b} \le \delta_{i,n} \le 1$ for some $b>0$. Then the sequence
$$\varepsilon_n^2 = K_n\log(n\alpha_n K_n)/(n\alpha_n) + K_n^{-2\beta}$$
satisfies $\Pi(B_n(f_0, M\varepsilon_n)) \ge e^{-n\alpha_n(M\varepsilon_n)^2}$ for some $M>0$.

Proof Using that the Kullback-Leibler divergence and its second-order variation tensorize in density estimation, we may write
$$B_n(f_0,\varepsilon) = B_1(f_0,\varepsilon) = \{f \in \mathcal F : K(f_0,f) \le \varepsilon^2,\ V(f_0,f) \le \varepsilon^2\}. \qquad (62)$$
Let $\rho_n^2 = \log(n\alpha_n K_n)/(n\alpha_n K_n)$. Since $K_n = o(n\alpha_n/\log(n\alpha_n))$, it holds that $(\rho_n K_n)^2 = K_n\log(n\alpha_n K_n)/(n\alpha_n) = o(1)$, and thus $\rho_n \le K_n^{-1}$ for $n$ large enough. This bound and the assumption $\delta_{i,n} \le 1$, together with Lemma B.4, give that there exist positive constants $C$ and $c$ such that for all integers $n$,
$$\Pi\big(f \in \mathcal H^1_{K_n} : \|f - f_{0,K_n}\|_1 \le 2\rho_n\big) \ge C e^{-cK_n\log(1/\rho_n)} \prod_{i=1}^{K_n} \delta_{i,n}. \qquad (63)$$
Using basic properties of histograms, we also have
$$\Pi\big(f \in \mathcal H^1_{K_n} : \|f - f_{0,K_n}\|_\infty \le 2K_n\rho_n\big) \ge \Pi\big(f \in \mathcal H^1_{K_n} : \|f - f_{0,K_n}\|_1 \le 2\rho_n\big). \qquad (64)$$
Since $f_0$ satisfies $m \le f_0 \le M$ for some $M > m > 0$, we also have $m \le f_{0,K_n} \le M$ for all $n$. Now let $f \in \mathcal H^1_{K_n}$ be such that $\|f - f_{0,K_n}\|_\infty \le 2K_n\rho_n$. Since $K_n\rho_n \to 0$, for $n$ large enough, $m/2 < f < 2M$. Since $f$ and $f_0$ are bounded away from zero and infinity, using that $\log(1+x) \le x$ and that $\int_0^1 (f_0 - f) = 0$,
$$K(f_0,f) = \int_0^1 f_0 \log\Big(1 + \frac{f_0-f}{f}\Big) \le \int_0^1 \frac{f_0-f}{f}\,(f_0-f+f) = \int_0^1 \frac{(f_0-f)^2}{f} \le \frac{2}{m}\,\|f - f_0\|_2^2.$$
Also, since $x \mapsto \log x$ is $\frac1r$-Lipschitz on $[r,\infty)$,
$$V(f_0,f) \le \int_0^1 |\log f_0 - \log f|^2\, f_0 \le \frac{4}{m^2}\int_0^1 |f_0-f|^2\, f_0 \lesssim \|f - f_0\|_2^2.$$
Moreover, for $f_0 \in C^\beta([0,1])$ and any $f \in \mathcal H^1_{K_n}$ such that $\|f - f_{0,K_n}\|_\infty \le 2K_n\rho_n$,
$$\|f - f_0\|_2^2 \le 2\|f - f_{0,K_n}\|_2^2 + 2\|f_0 - f_{0,K_n}\|_2^2 \le 8(K_n\rho_n)^2 + 2K_n^{-2\beta} \lesssim \varepsilon_n^2.$$
Combining the last three displays thus implies that $K(f_0,f) \le D\varepsilon_n^2$ and $V(f_0,f) \le D\varepsilon_n^2$ for some constant $D = D(f_0) > 0$. Together with (62)-(64), we obtain
$$\Pi\big(B_n(f_0, \sqrt D\, \varepsilon_n)\big) = \Pi\big(\{f : K(f_0,f) \le D\varepsilon_n^2,\ V(f_0,f) \le D\varepsilon_n^2\}\big) \ge \Pi\big(f \in \mathcal H^1_{K_n} : \|f - f_{0,K_n}\|_1 \le 2\rho_n\big) \ge Ce^{-cK_n\log(1/\rho_n)} \prod_{i=1}^{K_n}\delta_{i,n}. \qquad (65)$$
Using the assumption on the weights $(\delta_{i,n})$ and the definition of $\rho_n$ yields that
$$\Pi\big(B_n(f_0, \sqrt D\, \varepsilon_n)\big) \ge Ce^{-cK_n\log(1/\rho_n)}\, e^{-bK_n\log(n\alpha_n)} \ge Ce^{-cK_n\log(1/\rho_n)},$$
where the constants $C$ and $c$ may change from line to line. Finally, since the sequence $\rho_n^2$ satisfies $\rho_n^{-2}\log(1/\rho_n) \le n\alpha_n K_n$ and thus $K_n\log(1/\rho_n) \le n\alpha_n(K_n\rho_n)^2$, it follows that
$$\Pi\big(B_n(f_0, \sqrt D\, \varepsilon_n)\big) \ge Ce^{-cK_n\log(1/\rho_n)} \ge Ce^{-cn\alpha_n(K_n\rho_n)^2} \ge Ce^{-cn\alpha_n\varepsilon_n^2} = Ce^{-n\alpha_n(\sqrt c\, \varepsilon_n)^2}.$$
Denoting $D' := \max(\sqrt D, \sqrt c) + 1$, for $n$ large enough we have $\Pi(B_n(f_0, D'\varepsilon_n)) \ge e^{-n\alpha_n(D'\varepsilon_n)^2}$.
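The two terms in $\varepsilon_n^2$ pull in opposite directions: more bins reduce the approximation error $K_n^{-2\beta}$ but cost prior mass through $K_n\log(n\alpha_n K_n)/(n\alpha_n)$. The short computation below (an illustration only, not part of the proof; the values of $n$, $\alpha_n$, $\beta$ and the grid are hypothetical) locates the balancing $K_n$ numerically and compares the resulting $\varepsilon_n$ with the benchmark $(\log(n\alpha_n)/(n\alpha_n))^{\beta/(2\beta+1)}$, which matches it up to constants:

```python
import numpy as np

def eps2(K, n, alpha, beta):
    # The two competing terms of epsilon_n^2 in Lemma B.3.
    return K * np.log(n * alpha * K) / (n * alpha) + K ** (-2.0 * beta)

n, alpha, beta = 10**6, 0.5, 1.0             # hypothetical values
K_grid = np.arange(2, 5000)
K_star = K_grid[np.argmin(eps2(K_grid, n, alpha, beta))]

benchmark = (np.log(n * alpha) / (n * alpha)) ** (beta / (2 * beta + 1))
print("balancing K_n          :", K_star)
print("eps_n at the optimum   :", np.sqrt(eps2(K_star, n, alpha, beta)))
print("(log(na)/na)^(b/(2b+1)):", benchmark)
```

Note the role of the effective sample size: only the product $n\alpha_n$ enters, so tempering with small $\alpha_n$ forces a coarser optimal histogram and a slower rate.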
Lemma B.4 Let $X_1,\dots,X_K$ be distributed according to the Dirichlet distribution on the $K$-simplex with parameters $\delta = (\delta_1,\dots,\delta_K)$, where $0 < \delta_i \le 1$ for all $i$. Let $x_0 = (x_{10},\dots,x_{K0})$ be any point on the $K$-simplex. There exist positive constants $c, C$, independent of $K$, $\delta$ and $x_0$, such that, for $\varepsilon \le K^{-1}$,
$$P\Big(\sum_{i=1}^K |X_i - x_{i0}| \le 2\varepsilon\Big) \ge Ce^{-cK\log(1/\varepsilon)} \prod_{i=1}^K \delta_i.$$

Proof The proof is the same as that of Lemma 6.1 in Ghosal et al. (2000), except that one keeps track of the dependence on the Dirichlet parameters.

B.2 Bernstein–von Mises Results

Proof of Lemma A.1 Let $f \in A_n$. Writing $F(g) := \int_0^1 fg$, we first have
$$\alpha_n\ell_n(f_t) = \alpha_n\sum_{i=1}^n\Big[\log f(Y_i) - \frac{t\tilde\psi_{[K_n]}(Y_i)}{\sqrt{n\alpha_n}} - \log F\big(e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\big)\Big] = \alpha_n\ell_n(f) - t\sqrt{n\alpha_n}\,\frac1n\sum_{i=1}^n \tilde\psi_{[K_n]}(Y_i) - n\alpha_n\log F\big(e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\big). \qquad (66)$$
Let us expand the term $\log F(e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}})$. Throughout the calculations below, one can keep track of the uniformity of the remainder terms and check that the remainder in the final expansion is uniform over $A_n$; the fact that $\tilde\psi$ is bounded (and so is $\tilde\psi_{[K_n]}$) ensures this uniformity. By expanding the exponential and then the logarithm around 1,
$$\log F\big(e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\big) = \log \int_0^1 f\, e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}} = \log \int_0^1 f\Big(1 - \frac{t\tilde\psi_{[K_n]}}{\sqrt{n\alpha_n}} + \frac{t^2\tilde\psi_{[K_n]}^2}{2n\alpha_n} + o\big(\tfrac{1}{n\alpha_n}\big)\Big)$$
$$= \log\Big(1 - \frac{t}{\sqrt{n\alpha_n}}\int_0^1 f\tilde\psi_{[K_n]} + \frac{t^2}{2n\alpha_n}\int_0^1 f\tilde\psi_{[K_n]}^2 + o\big(\tfrac{1}{n\alpha_n}\big)\Big) = -\frac{t}{\sqrt{n\alpha_n}}\int_0^1 f\tilde\psi_{[K_n]} + \frac{t^2}{2n\alpha_n}\int_0^1 f\tilde\psi_{[K_n]}^2 - \frac{t^2}{2n\alpha_n}\Big(\int_0^1 f\tilde\psi_{[K_n]}\Big)^2 + o\big(\tfrac{1}{n\alpha_n}\big).$$
Since over $A_n$ the density $f$ is a histogram of size $K_n$, we have $\int_0^1 f\tilde\psi_{[K_n]} = \int_0^1 f\tilde\psi_{f_0}$, and we deduce
$$\log F\big(e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\big) = -\frac{t}{\sqrt{n\alpha_n}}\int_0^1 f\tilde\psi_{f_0} + \frac{t^2}{2n\alpha_n}\Big[\int_0^1 f\tilde\psi_{[K_n]}^2 - \Big(\int_0^1 f\tilde\psi_{[K_n]}\Big)^2\Big] + o\big(\tfrac{1}{n\alpha_n}\big).$$
Then, the facts that $\|f - f_{0,K_n}\|_1 \le \varepsilon_n$ over $A_n$ and that $\tilde\psi_{[K_n]}$ is bounded imply
$$\log F\big(e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\big) = -\frac{t}{\sqrt{n\alpha_n}}\int_0^1 f\tilde\psi_{f_0} + \frac{t^2}{2n\alpha_n}\Big[\int_0^1 f_{0,K_n}\tilde\psi_{[K_n]}^2 - \Big(\int_0^1 f_{0,K_n}\tilde\psi_{[K_n]}\Big)^2\Big] + o\big(\tfrac{1}{n\alpha_n}\big) = -\frac{t}{\sqrt{n\alpha_n}}\int_0^1 f\tilde\psi_{f_0} + \frac{t^2}{2n\alpha_n}\,V_{K_n} + o\big(\tfrac{1}{n\alpha_n}\big). \qquad (67)$$
Thus, combining (66) and (67), we have
$$\alpha_n\ell_n(f_t) = \alpha_n\ell_n(f) - t\sqrt{n\alpha_n}\,\frac1n\sum_{i=1}^n\tilde\psi_{[K_n]}(Y_i) + t\sqrt{n\alpha_n}\int_0^1 f\tilde\psi_{f_0} - \frac{t^2}{2}V_{K_n} + o(1) = \alpha_n\ell_n(f) - t\sqrt{n\alpha_n}\Big(\frac1n\sum_{i=1}^n\tilde\psi_{[K_n]}(Y_i) - \psi(f) + \psi(f_0)\Big) - \frac{t^2}{2}V_{K_n} + o(1).$$
By rearranging and using the definition of $\hat\psi_{[K_n]}$,
$$\alpha_n\ell_n(f) + t\sqrt{n\alpha_n}\,\big(\psi(f) - \hat\psi_{[K_n]}\big) = \alpha_n\ell_n(f_t) + \frac{t^2}{2}V_{K_n} + o(1). \qquad (68)$$
Let us show that $V_{K_n} \to \int f_0\tilde\psi^2$. Since $K_n \to \infty$, $\int(\tilde\psi_{[K_n]} - \tilde\psi)^2 = o(1)$, and since $f_0$ is bounded it follows that $\int f_0(\tilde\psi_{[K_n]} - \tilde\psi)^2 = o(1)$. Hence $\int f_0\tilde\psi_{[K_n]}^2 - \int f_0\tilde\psi^2 = o(1)$, and thus $\int f_{0,K_n}\tilde\psi_{[K_n]}^2 - \int f_0\tilde\psi^2 = o(1)$. Moreover, $|\int f_0\tilde\psi_{[K_n]}| = |\int f_0(\tilde\psi_{[K_n]} - \tilde\psi)| \le \|f_0\|_2\,\|\tilde\psi_{[K_n]} - \tilde\psi\|_2 = o(1)$. Finally, $V_{K_n} = \int f_0\tilde\psi^2 + o(1)$. Using this result together with (68) and Assumption (38), it follows that
$$E^{\alpha_n}\big(e^{t\sqrt{n\alpha_n}(\psi(f) - \hat\psi_{[K_n]})}\,\big|\, Y^n, A_n\big) = \frac{\int_{A_n} e^{t\sqrt{n\alpha_n}(\psi(f) - \hat\psi_{[K_n]})}\, e^{\alpha_n\ell_n(f)}\, d\Pi(f)}{\int_{A_n} e^{\alpha_n\ell_n(f)}\, d\Pi(f)} = \frac{\int_{A_n} e^{\alpha_n\ell_n(f_t) + \frac{t^2}{2}V_{K_n} + o(1)}\, d\Pi(f)}{\int_{A_n} e^{\alpha_n\ell_n(f)}\, d\Pi(f)}$$
$$= e^{\frac{t^2}{2}V_{K_n}}\,(1+o(1))\, \frac{\int_{A_n} e^{\alpha_n\ell_n(f_t)}\, d\Pi(f)}{\int_{A_n} e^{\alpha_n\ell_n(f)}\, d\Pi(f)} = e^{\frac{t^2}{2}F_0(\tilde\psi^2_{f_0})}\,(1+o_P(1)).$$
The last estimate is for the restricted distribution $\Pi^{\alpha_n}(\cdot|Y^n, A_n)$, but Assumption (37) implies that the unrestricted version also follows, and this proves Lemma A.1.
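The key step (67) is simply a second-order cumulant expansion: for bounded $g$ and small $\epsilon$, $\log\int f e^{-\epsilon g} = -\epsilon\int fg + \frac{\epsilon^2}{2}\big[\int fg^2 - (\int fg)^2\big] + O(\epsilon^3)$, the cumulant generating function of $g(Y)$ for $Y \sim f$. A quick symbolic check of this identity on a hypothetical two-bin histogram (SymPy; not part of the paper):

```python
import sympy as sp

eps, w = sp.symbols('epsilon w', positive=True)
a, b = sp.symbols('a b', real=True)

# A two-bin histogram: mass w where psi = a, mass 1 - w where psi = b.
F = w * sp.exp(-eps * a) + (1 - w) * sp.exp(-eps * b)   # F(e^{-eps * psi})
lhs = sp.series(sp.log(F), eps, 0, 3).removeO()

m1 = w * a + (1 - w) * b            # int f psi
m2 = w * a**2 + (1 - w) * b**2      # int f psi^2
rhs = -eps * m1 + eps**2 / 2 * (m2 - m1**2)

print(sp.simplify(sp.expand(lhs - rhs)))   # 0: the expansions agree to O(eps^3)
```

With $\epsilon = t/\sqrt{n\alpha_n}$, the quadratic coefficient is exactly the variance term that becomes $V_{K_n}$ in (67).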
Proof of Lemma A.2 Set $\bar\varepsilon_n = \varepsilon_n + (e^{2|t|\|\tilde\psi\|_\infty/\sqrt{n\alpha_n}} - 1)$, and let $\bar A_n$ denote the set $A_n$ with $\varepsilon_n$ replaced by $\bar\varepsilon_n$. First, $\bar\varepsilon_n \to 0$ and, since $\bar\varepsilon_n \ge \varepsilon_n$, we have $\Pi^{\alpha_n}(\bar A_n|Y^n) = 1 + o_P(1)$.

Now let us show the convergence (41). For $k \ge 1$, let $U_k$ denote the set of $(\omega_1,\dots,\omega_{k-1}) \in (0,1)^{k-1}$ with $\sum_{j=1}^{k-1}\omega_j < 1$; throughout the proof we use the notation $\omega_k = 1 - \sum_{j=1}^{k-1}\omega_j$. Let us denote by $H$ the map
$$H : U_k \to \mathcal H^1_k, \qquad (\omega_1,\dots,\omega_{k-1}) \mapsto \Big(x \mapsto k\sum_{j=1}^k \omega_j 1_{I_j}(x)\Big).$$
By definition of the prior distribution, we have
$$\int_{A_n} e^{\alpha_n\ell_n(f_t)}\, d\Pi(f) = \int_{\mathcal H^1_{K_n}} 1_{f \in A_n}\, e^{\alpha_n\ell_n(f_t)}\, d\Pi(f) \qquad (69)$$
$$= \int_{U_{K_n-1}} 1_{H(\omega) \in A_n}\, \exp\bigg(\alpha_n\ell_n\Big(H(\omega)\, e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\Big/\int_0^1 H(\omega)\, e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\Big)\bigg)\, \frac{1}{B(\delta)}\prod_{i=1}^{K_n}\omega_i^{\delta_{i,n}-1}\, d\omega_1\dots d\omega_{K_n-1}, \qquad (70)$$
where $B(\delta)$ denotes the Dirichlet normalizing constant. For an integer $n$ and $j \in \{1,\dots,K_n\}$, denote $\gamma_j = e^{t\tilde\psi_j/\sqrt{n\alpha_n}}$ with $\tilde\psi_j = K_n\int_{I_j}\tilde\psi$. For $k$ an integer and $x \in (0,+\infty)^k$, let us denote by $S_x$ the map
$$S_x : U_k \to (0,+\infty), \qquad (\omega_1,\dots,\omega_{k-1}) \mapsto \sum_{j=1}^k \omega_j x_j,$$
and by $\phi_x$ the map
$$\phi_x : U_k \to U_k, \qquad (\omega_1,\dots,\omega_{k-1}) \mapsto \Big(\frac{\omega_1 x_1}{S_x(\omega)},\dots,\frac{\omega_{k-1}x_{k-1}}{S_x(\omega)}\Big).$$
This mapping is well defined, and note that
$$H(\omega_1,\dots,\omega_{K_n-1})\, e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}}\Big/\int_0^1 H(\omega_1,\dots,\omega_{K_n-1})\, e^{-t\tilde\psi_{[K_n]}/\sqrt{n\alpha_n}} = H\big(\phi_\gamma^{-1}(\omega_1,\dots,\omega_{K_n-1})\big). \qquad (71)$$
Moreover, one can show that for all integers $k$ and all $x \in (0,+\infty)^k$, $\phi_x$ is bijective and its inverse is $\phi_{x^{-1}}$. From Lemma 5 in the supplemental article of Castillo and Rousseau (2015), the mappings $\phi_x$ and $\phi_{x^{-1}}$ are $C^1$, and the determinant of the Jacobian matrix of the map $\phi_x$ is given by
$$\det\big(D\phi_x(\omega_1,\dots,\omega_{k-1})\big) = \frac{1}{S_x(\omega)^k}\prod_{i=1}^k x_i. \qquad (72)$$
Let us combine (69) and (71) and then make the change of variables $(\xi_1,\dots,\xi_{K_n-1}) \mapsto \phi_\gamma(\xi_1,\dots,\xi_{K_n-1})$ in (69); using $1 - \sum_{i=1}^{K_n-1}\frac{\gamma_i\xi_i}{S_\gamma(\xi)} = \frac{\gamma_{K_n}\xi_{K_n}}{S_\gamma(\xi)}$, it follows that
$$\int_{A_n} e^{\alpha_n\ell_n(f_t)}\, d\Pi(f) = \int_{U_{K_n-1}} 1_{H(\phi_\gamma(\xi)) \in A_n}\, e^{\alpha_n\ell_n(H(\xi))}\, \frac{1}{B(\delta)}\prod_{i=1}^{K_n}\Big(\frac{\gamma_i\xi_i}{S_\gamma(\xi)}\Big)^{\delta_{i,n}-1}\, \frac{1}{S_\gamma(\xi)^{K_n}}\prod_{i=1}^{K_n}\gamma_i\; d\xi_1\dots d\xi_{K_n-1}$$
$$= \int_{U_{K_n-1}} 1_{H(\phi_\gamma(\xi)) \in A_n}\, e^{\alpha_n\ell_n(H(\xi))}\, \frac{1}{B(\delta)}\; \underbrace{\prod_{i=1}^{K_n}\gamma_i^{\delta_{i,n}}}_{(*)}\; \underbrace{S_\gamma(\xi)^{-\sum_{i=1}^{K_n}\delta_{i,n}}}_{(**)}\; \prod_{i=1}^{K_n}\xi_i^{\delta_{i,n}-1}\, d\xi_i. \qquad (73)$$
For the term $(*)$, which does not depend on $\xi$, we have
$$\prod_{j=1}^{K_n}\gamma_j^{\delta_{j,n}} = \prod_{j=1}^{K_n} e^{t\tilde\psi_j\delta_{j,n}/\sqrt{n\alpha_n}} = e^{t\sum_{j=1}^{K_n}\tilde\psi_j\delta_{j,n}/\sqrt{n\alpha_n}}, \qquad e^{-\frac{|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}\sum_{j=1}^{K_n}\delta_{j,n}} \le \prod_{j=1}^{K_n}\gamma_j^{\delta_{j,n}} \le e^{\frac{|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}\sum_{j=1}^{K_n}\delta_{j,n}},$$
so that $\prod_{j=1}^{K_n}\gamma_j^{\delta_{j,n}} = 1 + o(1)$ using the condition (13). As for the term $(**)$, for $\xi \in U_{K_n-1}$ we have $S_\gamma(\xi) = \sum_{j=1}^{K_n}\gamma_j\xi_j = \sum_{j=1}^{K_n} e^{t\tilde\psi_j/\sqrt{n\alpha_n}}\xi_j$, and thus, since $\sum_{j=1}^{K_n}\xi_j = 1$,
$$e^{-\frac{|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}}\sum_{j=1}^{K_n}\xi_j \le S_\gamma(\xi) \le e^{\frac{|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}}\sum_{j=1}^{K_n}\xi_j, \qquad (74)$$
$$e^{-\frac{|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}\sum_{i=1}^{K_n}\delta_{i,n}} \le S_\gamma(\xi)^{-\sum_{i=1}^{K_n}\delta_{i,n}} \le e^{\frac{|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}\sum_{i=1}^{K_n}\delta_{i,n}}. \qquad (75)$$
Using Assumption (13) gives that term $(**)$ is $1 + o(1)$ uniformly over $U_{K_n-1}$. By combining (73) and the results on the terms $(*)$ and $(**)$, we obtain
$$\int_{A_n} e^{\alpha_n\ell_n(f_t)}\, d\Pi(f) = (1+o(1))\int_{U_{K_n-1}} 1_{H(\phi_\gamma(\xi)) \in A_n}\, e^{\alpha_n\ell_n(H(\xi))}\, \frac{1}{B(\delta)}\prod_{i=1}^{K_n}\xi_i^{\delta_{i,n}-1}\, d\xi_1\dots d\xi_{K_n-1}. \qquad (76)$$
Next, let us show that we have the inclusion
$$\big\{\xi \in U_{K_n-1} : \|H(\xi) - f_{0,K_n}\|_1 \le \varepsilon_n\big\} \subset \big\{\xi \in U_{K_n-1} : \|H(\phi_\gamma(\xi)) - f_{0,K_n}\|_1 \le \bar\varepsilon_n\big\}. \qquad (77)$$
For all $n$, denote by $(\xi_1^0,\dots,\xi_{K_n-1}^0)$ the element of $U_{K_n-1}$ such that $H(\xi_1^0,\dots,\xi_{K_n-1}^0) = f_{0,K_n}$. Let $(\xi_1,\dots,\xi_{K_n-1}) \in U_{K_n-1}$ be such that $\|H(\xi) - f_{0,K_n}\|_1 \le \varepsilon_n$, that is, $\sum_{i=1}^{K_n}|\xi_i - \xi_i^0| \le \varepsilon_n$. Then we have
$$\|H(\phi_\gamma(\xi)) - f_{0,K_n}\|_1 = \sum_{i=1}^{K_n}\Big|\frac{\gamma_i\xi_i}{S_\gamma(\xi)} - \xi_i^0\Big| \le \sum_{i=1}^{K_n}|\xi_i - \xi_i^0| + \sum_{i=1}^{K_n}\xi_i\Big|\frac{\gamma_i}{S_\gamma(\xi)} - 1\Big|.$$
By (74), it follows that for all $i \in \{1,\dots,K_n\}$,
$$e^{-\frac{2|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}} \le \frac{\gamma_i}{S_\gamma(\xi)} \le e^{\frac{2|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}}, \qquad \text{so that} \qquad \Big|\frac{\gamma_i}{S_\gamma(\xi)} - 1\Big| \le e^{\frac{2|t|\|\tilde\psi\|_\infty}{\sqrt{n\alpha_n}}} - 1.$$
Hence $\|H(\phi_\gamma(\xi)) - f_{0,K_n}\|_1 \le \varepsilon_n + (e^{2|t|\|\tilde\psi\|_\infty/\sqrt{n\alpha_n}} - 1) = \bar\varepsilon_n$. Thus we have the inclusion (77). Finally, combining (76), applied with $\bar A_n$ in place of $A_n$, and (77), we obtain
$$(1+o_P(1))\int_{U_{K_n-1}} 1_{H(\xi) \in A_n}\, e^{\alpha_n\ell_n(H(\xi))}\, \frac{1}{B(\delta)}\prod_{i=1}^{K_n}\xi_i^{\delta_{i,n}-1}\, d\xi_1\dots d\xi_{K_n-1} \le \int_{\bar A_n} e^{\alpha_n\ell_n(f_t)}\, d\Pi(f) \le (1+o_P(1))\int e^{\alpha_n\ell_n(f)}\, d\Pi(f),$$
so that
$$(1+o_P(1))\,\Pi^{\alpha_n}(A_n|Y^n) \le \frac{\int_{\bar A_n} e^{\alpha_n\ell_n(f_t)}\, d\Pi(f)}{\int e^{\alpha_n\ell_n(f)}\, d\Pi(f)} \le 1 + o_P(1).$$
By assumption (39), $\Pi^{\alpha_n}(A_n|Y^n) = 1 + o_P(1)$ and the result follows.

Lemma B.5 The following expansion holds for $\hat\psi$:
$$\hat\psi - \hat\psi_{[K_n]} = -F_0(\tilde\psi_{[K_n]}) + o_P(1/\sqrt n).$$

Proof By definition,
$$\hat\psi - \hat\psi_{[K_n]} = \psi(f_0) + \frac1n\sum_{i=1}^n\tilde\psi(Y_i) - \psi(f_0) - \frac1n\sum_{i=1}^n\tilde\psi_{[K_n]}(Y_i) = \frac1n\sum_{i=1}^n\big(\tilde\psi(Y_i) - \tilde\psi_{[K_n]}(Y_i)\big) = \frac1n\sum_{i=1}^n\big(\tilde\psi(Y_i) - \tilde\psi_{[K_n]}(Y_i) + F_0(\tilde\psi_{[K_n]})\big) - F_0(\tilde\psi_{[K_n]}). \qquad (78)$$
Moreover, using that $E_0(\tilde\psi(Y_1) - \tilde\psi_{[K_n]}(Y_1)) = -F_0(\tilde\psi_{[K_n]})$ (recall that $F_0\tilde\psi = 0$),
$$E_0\bigg(\Big(\frac1n\sum_{i=1}^n\big[\tilde\psi(Y_i) - \tilde\psi_{[K_n]}(Y_i) + F_0(\tilde\psi_{[K_n]})\big]\Big)^2\bigg) = \frac1n\, E_0\Big(\big(\tilde\psi(Y_1) - \tilde\psi_{[K_n]}(Y_1) + F_0(\tilde\psi_{[K_n]})\big)^2\Big) \le \frac1n\, E_0\Big(\big(\tilde\psi(Y_1) - \tilde\psi_{[K_n]}(Y_1)\big)^2\Big) \le \frac1n\int\big(\tilde\psi - \tilde\psi_{[K_n]}\big)^2 f_0 \le \frac{\|f_0\|_\infty}{n}\int\big(\tilde\psi - \tilde\psi_{[K_n]}\big)^2 = \frac{o(1)}{n},$$
where we used the assumption $K_n \to \infty$ in the last step. Hence $\frac1n\sum_{i=1}^n\big[\tilde\psi(Y_i) - \tilde\psi_{[K_n]}(Y_i) + F_0(\tilde\psi_{[K_n]})\big] = o_P(1/\sqrt n)$. Combining this with (78) yields the result.
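Before turning to credible regions, the Bernstein–von Mises statement underlying this subsection can be checked numerically. For the histogram prior with Dirichlet weights and a functional $\psi(f) = \int_0^1 \psi f$ with $\psi$ piecewise constant on the bins, one can check that the $\alpha_n$-posterior on the weights is again Dirichlet, with parameters $\delta_{j,n} + \alpha_n N_j$ where $N_j$ is the count of observations in bin $j$, so exact fractional-posterior draws are available. The following sketch (a toy illustration with hypothetical choices of $f_0$, $\psi$, $n$ and $\alpha_n$; not code from the paper) compares $\sqrt{n\alpha_n}\,(\psi(f) - \hat\psi)$ under the fractional posterior with its theoretical $N(0, F_0\tilde\psi^2_{f_0})$ limit:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, K = 5000, 0.5, 10              # hypothetical sample size, temperature, bins
delta = np.full(K, 0.1)                  # Dirichlet prior weights (toy choice)
psi_vals = np.sin(2 * np.pi * (np.arange(K) + 0.5) / K)  # psi, constant on each bin

# True density f0: a fixed histogram, bounded away from zero.
w0 = rng.dirichlet(np.full(K, 5.0))
Y_bins = rng.choice(K, size=n, p=w0)     # bin memberships of Y_1, ..., Y_n
N = np.bincount(Y_bins, minlength=K)

# Fractional posterior on the weights: Dirichlet(delta_j + alpha * N_j).
W = rng.dirichlet(delta + alpha * N, size=20000)
psi_f = W @ psi_vals                     # psi(f) = sum_j w_j psi_j
psi_hat = psi_vals[Y_bins].mean()        # efficient centering (1/n) sum_i psi(Y_i)

draws = np.sqrt(n * alpha) * (psi_f - psi_hat)
V = w0 @ psi_vals**2 - (w0 @ psi_vals) ** 2   # F_0(tilde-psi^2) = Var_{f0} psi(Y)
print("posterior mean / std of draws:", draws.mean(), draws.std())
print("BvM limit N(0, V), sqrt(V)   :", np.sqrt(V))
```

The standard deviation of the draws should be close to $\sqrt V$ regardless of $\alpha_n$: the tempering enlarges the posterior spread by exactly the factor $1/\sqrt{\alpha_n}$ that the $\sqrt{n\alpha_n}$ scaling removes, in line with the discussion of inflated credible sets in the introduction.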
B.3 Credible regions

Lemma B.6 Let $(Q_{Y^n})$ be a sequence of random distributions on $\mathbb R$, $(u_n)$ a positive sequence, $(Y_n)$ a sequence of real random variables and $V$ a positive constant. Let $\delta \in (0,1)$ and denote by $a_{Y^n,\delta}$ the random $\delta$-quantile of $Q_{Y^n}$. If
$$\bar Q_{Y^n} := u_n\big(Q_{Y^n} - Y_n\big) \xrightarrow{\;\mathcal L\;} N(0,V), \qquad (80)$$
then $u_n(a_{Y^n,\delta} - Y_n) \to \sqrt V\,\Phi^{-1}(\delta)$ in probability, where $\Phi$ denotes the standard normal distribution function.

Proof By Lemma 2 in the supplement of Castillo and Rousseau (2015), (80) implies that
$$\sup_{s \in \mathbb R}\big|\bar Q_{Y^n}\big((-\infty,s]\big) - N(0,V)\big((-\infty,s]\big)\big| \xrightarrow{\;P\;} 0.$$
Using Lemma B.7 and the fact that $u_n(a_{Y^n,\delta} - Y_n)$ is the $\delta$-quantile of $\bar Q_{Y^n}$, we deduce that $u_n(a_{Y^n,\delta} - Y_n)$ converges in probability to the $\delta$-quantile of $N(0,V)$, that is, to $\sqrt V\,\Phi^{-1}(\delta)$.

Lemma B.7 Let $p \in (0,1)$. Let $(F_n)$ be a sequence of random cumulative distribution functions and $(q^n_p)$ the (random) sequence of its $p$-quantiles. Let $F$ be a fixed, continuous and increasing cumulative distribution function and let $q_p$ be its $p$-quantile. If $\sup_{s\in\mathbb R}|F_n(s) - F(s)| \to 0$ in probability as $n \to \infty$, then $|q^n_p - q_p| \to 0$ in probability as $n \to \infty$.

Proof For arbitrary $\rho > 0$, we show that $P(|q_p - q^n_p| \le \rho) \to 1$. Since $F$ is increasing and continuous, we have $F(q_p - \rho) < F(q_p) = p < F(q_p + \rho)$. Set $\varepsilon = \min\big(F(q_p+\rho) - p,\ p - F(q_p-\rho)\big)/2$. On the event $\{\sup_{s\in\mathbb R}|F_n(s) - F(s)| \le \varepsilon\}$, it follows that
$$F_n(q_p+\rho) \ge F(q_p+\rho) - \varepsilon \ge F(q_p+\rho) - \frac{F(q_p+\rho) - p}{2} = p + \frac{F(q_p+\rho) - p}{2} > p.$$
By definition of the quantile $q^n_p$, this implies $q^n_p \le q_p + \rho$. Similarly,
$$F_n(q_p-\rho) \le F(q_p-\rho) + \varepsilon \le F(q_p-\rho) + \frac{p - F(q_p-\rho)}{2} = p - \frac{p - F(q_p-\rho)}{2} < p.$$
Hence $q_p - \rho < q^n_p$, and thus $\{\sup_{s\in\mathbb R}|F_n(s) - F(s)| \le \varepsilon\} \subset \{|q^n_p - q_p| \le \rho\}$. Let $\delta > 0$. Since $\sup_{s\in\mathbb R}|F_n(s) - F(s)| = o_P(1)$, there exists $N_0$ such that for all $n \ge N_0$, $P(\sup_{s\in\mathbb R}|F_n(s) - F(s)| \le \varepsilon) \ge 1 - \delta$. Hence one deduces that for all $n \ge N_0$,
$$1 - \delta \le P\Big(\sup_{s\in\mathbb R}|F_n(s) - F(s)| \le \varepsilon\Big) \le P\big(|q^n_p - q_p| \le \rho\big).$$

References

P. Alquier. User-friendly introduction to PAC-Bayes bounds. Foundations and Trends in Machine Learning, to appear, 2024.

P. Alquier and J. Ridgway. Concentration of tempered posteriors and of their variational approximations. Ann. Statist., 48(3):1475–1497, 2020.

P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res., 17:Paper No. 239, 41, 2016.

A. Barron, M. J. Schervish, and L. Wasserman. The consistency of posterior distributions in nonparametric problems. Ann. Statist., 27(2):536–561, 1999.

A. Bhattacharya, D. Pati, and Y. Yang. Bayesian fractional posteriors. Ann. Statist., 47(1):39–66, 2019.

P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78(5):1103–1130, 2016.

C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. In NIPS Workshop on Learning Disentangled Representations. arXiv, 2017.

I. Castillo. Lower bounds for posterior rates with Gaussian process priors. Electronic Journal of Statistics, 2:1281–1299, 2008.

I. Castillo. Semiparametric Bernstein–von Mises theorem and bias, illustrated with Gaussian process priors. Sankhya: The Indian Journal of Statistics, Series A, 74(2):194–221, 2012a.

I. Castillo. A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probability Theory and Related Fields, 152(1-2):53–99, 2012b.

I. Castillo. On Bayesian supremum norm contraction rates. Ann. Statist., 42(5):2058–2091, 2014.

I. Castillo and R. Nickl. On the Bernstein–von Mises phenomenon for nonparametric Bayes procedures. Ann. Statist., 42(5):1941–1969, 2014.

I. Castillo and J. Rousseau. A Bernstein–von Mises theorem for smooth functionals in semiparametric models. Ann. Statist., 43(6), Dec 2015.
I. Castillo and S. van der Pas. Multiscale Bayesian survival analysis. Ann. Statist., 49(6):3559–3582, 2021.

O. Catoni. Statistical learning theory and stochastic optimization. Lecture Notes in Mathematics, volume 1851. Springer-Verlag, 2004.

O. Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Lecture Notes-Monograph Series, 56:i–163, 2007.

R. M. Dudley. Real analysis and probability, volume 74 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002. ISBN 0-521-00754-2. Revised reprint of the 1989 original.

N. Friel and A. N. Pettitt. Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(3):589–607, 2008.

C. J. Geyer and E. A. Thompson. Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90(431):909–920, 1995.

S. Ghosal and A. van der Vaart. Convergence rates of posterior distributions for noniid observations. Ann. Statist., 35(1):192–223, 2007.

S. Ghosal and A. van der Vaart. Fundamentals of nonparametric Bayesian inference. Cambridge Series in Statistical and Probabilistic Mathematics, 44. Cambridge University Press, Cambridge, 2017. ISBN 9781139029834.

S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.

P. Grünwald. Safe probability. J. Statist. Plann. Inference, 195:47–63, 2018.

P. Grünwald and T. van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal., 12(4):1069–1103, 2017.

P. D. Grünwald and N. A. Mehta. Fast rates for general unbounded loss functions: from ERM to generalized Bayes. J. Mach. Learn. Res., 21:Paper No. 56, 80, 2020.

P. Grünwald. The Safe Bayesian: Learning the learning rate via the mixability gap. Lecture Notes in Comput. Sci., 7568:169–183, 2012.

W. Härdle, G. Kerkyacharian, D. Picard, and A. Tsybakov. Wavelets, approximation, and statistical applications, volume 129 of Lecture Notes in Statistics. Springer-Verlag, New York, 1998. ISBN 0-387-98453-4.

M. Hoffmann, J. Rousseau, and J. Schmidt-Hieber. On adaptive posterior concentration rates. Ann. Statist., 43(5):2259–2295, 2015.

C. C. Holmes and S. G. Walker. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.

C.-W. Huang, S. Tan, A. Lacoste, and A. C. Courville. Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems, volume 31, 2018.

W. Jiang and M. A. Tanner. Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Statist., 36(5):2207–2231, 2008.

I. Johnstone. Gaussian estimation: Sequence and wavelet models. Available from https://imjohnstone.su.domains/GE_09_16_19.pdf, 2019.

B. T. Knapik, A. W. van der Vaart, and J. H. van Zanten. Bayesian inverse problems with Gaussian priors. Ann. Statist., 39(5), Oct 2011.

W. Kruijer and A. van der Vaart. Analyzing posteriors by the information inequality. In From probability to statistics and back: high-dimensional models and processes, volume 9 of Inst. Math. Stat. (IMS) Collect., pages 227–240. Inst. Math. Statist., Beachwood, OH, 2013.

S. P. Lyddon, C. C. Holmes, and S. G. Walker. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2):465–478, 2019.
R. Martin and Y. Tang. Empirical priors for prediction in sparse high-dimensional linear regression. J. Mach. Learn. Res., 21:Paper No. 144, 30, 2020.

M. A. Medina, J. L. M. Olea, C. Rush, and A. Velez. On the robustness to misspecification of α-posteriors and their variational approximations. Journal of Machine Learning Research, 23(147):1–51, 2022.

J. W. Miller. Asymptotic normality, concentration, and coverage of generalized posteriors. J. Mach. Learn. Res., 22:Paper No. 168, 53, 2021.

J. W. Miller and D. B. Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.

R. Nickl. Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger equation. J. Eur. Math. Soc. (JEMS), 22(8):2697–2750, 2020.

R. Nickl. Bayesian Non-linear Statistical Inverse Problems. ETH Lecture Notes, 2022.

R. Nickl and K. Ray. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. Ann. Statist., 48(3):1383–1408, 2020.

R. Nickl and J. Söhl. Bernstein–von Mises theorems for statistical inverse problems II: compound Poisson processes. Electronic Journal of Statistics, 13(2):3513–3571, 2019.

B. Ning and I. Castillo. Bayesian multiscale analysis of the Cox model. To appear in Bernoulli, 2024.

A. O'Hagan. Fractional Bayes factors for model comparison. J. Roy. Statist. Soc. Ser. B, 57(1):99–138, 1995. With discussion and a reply by the author.

C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 978-0-262-18253-9.

K. Ray and A. van der Vaart. Semiparametric Bayesian causal inference. Ann. Statist., 48(5):2999–3020, 2020.

M. Reiß. Asymptotic equivalence for nonparametric regression with multivariate and random design. Ann. Statist., 36(4):1957–1982, 2008.

N. Syring and R. Martin. Calibrating general posterior credible regions. Biometrika, 106(2):479–486, 2019.

B. Szabó and H. van Zanten. An asymptotic analysis of distributed nonparametric methods. J. Mach. Learn. Res., 20:Paper No. 87, 30, 2019.

S. T. Tokdar, S. Jiang, and E. L. Cunningham. Heavy-tailed density estimation. Journal of the American Statistical Association, 0(0):1–13, 2022.

A. van der Vaart and H. van Zanten. Bayesian inference with rescaled Gaussian process priors. Electronic Journal of Statistics, 1, 2007.

A. van der Vaart and H. van Zanten. Information rates of nonparametric Gaussian process methods. J. Mach. Learn. Res., 12:2095–2119, 2011.

A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998. ISBN 0-521-49603-9; 0-521-78450-6.

A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist., 36(3):1435–1463, 2008.

T. van Erven and P. Harremoës. Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

S. Walker and N. L. Hjort. On Bayesian consistency. J. R. Stat. Soc. Ser. B Stat. Methodol., 63(4):811–821, 2001.

P.-S. Wu and R. Martin. A comparison of learning rate selection methods in generalized Bayesian inference. Bayesian Anal., 18(1):105–132, 2023.

T. Zhang. From ε-entropy to KL-entropy: analysis of minimum information complexity density estimation. Ann. Statist., 34(5):2180–2210, 2006.