# Nonparametric Involutive Markov Chain Monte Carlo

Carol Mak (1)   Fabian Zaiser (1)   Luke Ong (1)

(1) Department of Computer Science, University of Oxford, United Kingdom. Correspondence to: Carol Mak.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

A challenging problem in probabilistic programming is to develop inference algorithms that work for arbitrary programs in a universal probabilistic programming language (PPL). We present the nonparametric involutive Markov chain Monte Carlo (NP-iMCMC) algorithm as a method for constructing MCMC inference algorithms for nonparametric models expressible in universal PPLs. Building on the unifying involutive MCMC framework, and by providing a general procedure for driving state movement between dimensions, we show that NP-iMCMC can generalise numerous existing iMCMC algorithms to work on nonparametric models. We prove the correctness of the NP-iMCMC sampler. Our empirical study shows that the existing strengths of several iMCMC algorithms carry over to their nonparametric extensions. Applying our method to the recently proposed Nonparametric HMC, an instance of (Multiple Step) NP-iMCMC, we have constructed several nonparametric extensions (all of which are new) that exhibit significant performance improvements.

1. Introduction

Universal probabilistic programming (Goodman et al., 2008) is the idea of writing probabilistic models in a Turing-complete programming language. A universal probabilistic programming language (PPL) can express all computable probabilistic models (Vákár et al., 2019), using only a handful of basic programming constructs such as branching and recursion. In particular, nonparametric models, where the number of random variables is not determined a priori and is possibly unbounded, can be described naturally in a universal PPL. In programming language terms, this means the number of sample statements is unknown prior to execution. On the one hand, such programs can describe probabilistic models with an unknown number of components, such as Bayesian nonparametric models (Richardson & Green, 1997), variable selection in regression (Ratner, 2010), and models for signal processing (Murray et al., 2018). On the other hand, there are models defined on infinite-dimensional spaces, such as probabilistic context-free grammars (Manning & Schütze, 1999), birth-death models of evolution (Kudlicka et al., 2019) and statistical phylogenetics (Ronquist et al., 2021).

However, since universal PPLs are expressively complete, it is challenging to design and implement inference engines that work for arbitrary programs written in them. The parameter space of a nonparametric model is a disjoint union of spaces of varying dimensions. To approximate the posterior distribution via a Markov chain Monte Carlo (MCMC) algorithm, say, the transition kernel has to switch between (a possibly unbounded number of) states of different dimensions, and to do so reasonably efficiently. This explains why providing theoretical guarantees for MCMC algorithms that work for universal PPLs (Wingate et al., 2011; Wood et al., 2014; Tolpin et al., 2015; Hur et al., 2015; Mak et al., 2021b) is very challenging. For instance, the original version of Lightweight MH (Wingate et al., 2011) was incorrect (Kiselyov, 2016).
In fact, most applications requiring Bayesian inference rely on custom MCMC kernels, which are error-prone and time-consuming to design and build.

Contributions. We introduce Nonparametric Involutive MCMC (NP-iMCMC) for designing MCMC samplers for universal PPLs. It is an extension of the involutive MCMC (iMCMC) framework (Neklyudov et al., 2020; Cusumano-Towner et al., 2020) to densities arising from nonparametric models (for background on both, see Sec. 2). We explain how NP-iMCMC moves between dimensions and how a large class of existing iMCMC samplers can be extended for universal PPLs (Sec. 3). We also discuss necessary assumptions and prove its correctness. Furthermore, there are general transformations and combinations of NP-iMCMC to derive more powerful samplers systematically (Sec. 4), for example by making them nonreversible to reduce mixing time. Finally, our experimental results show that our method yields significant performance improvements over existing general MCMC approaches (Sec. 5). All missing proofs are presented in the appendix.

Notation. We write N_n(x, Σ) for the x-mean, Σ-covariance n-dimensional Gaussian, whose pdf is ϕ_n(x, Σ). For the standard Gaussian N_n(0, I), we abbreviate these to N_n and ϕ_n; in case n = 1, we simply write N and ϕ. Given measurable spaces (X, Σ_X) and (Y, Σ_Y), we write K : X ⇝ Y to mean a kernel of type K : X × Σ_Y → [0, ∞). We say that K is a probability kernel if for all x ∈ X, K(x, ·) : Σ_Y → [0, ∞) is a probability measure. We write pdfK(x, y) for the density of y ∈ Y under the measure K(x, ·), assuming a derivative w.r.t. some reference measure exists. Unless otherwise specified, the real space R is endowed with the Borel measurable sets B and the standard Gaussian measure N; the boolean space 2 := {T, F} is endowed with the discrete measurable sets Σ_2 := P(2) and the measure µ_2 which assigns either boolean the probability 0.5. We write x_{1..n} to mean the n-long prefix of the sequence x. For any real-valued function f : X → R, we define its support as Supp(f) := {x ∈ X | f(x) > 0}.

2. Background

2.1. Involutive MCMC

Given a target density ρ on a measure space (X, Σ_X, µ_X), the iMCMC algorithm generates a Markov chain of samples {x^(i)}_{i ∈ N} by proposing the next sample x using the current sample x0, in three steps:

1. v0 ∼ K(x0, ·): sample a value v0 on an auxiliary measure space (Y, Σ_Y, µ_Y) from an auxiliary kernel K : X ⇝ Y applied to the current sample x0.

2. (x, v) := Φ(x0, v0): compute the new state (x, v) by applying an involution (i.e. Φ = Φ^{-1}) Φ : X × Y → X × Y to (x0, v0).

3. Accept the proposed sample x as the next sample with probability given by the acceptance ratio

$$\min\Big\{1,\ \frac{\rho(x)\,\mathrm{pdf}K(x, v)}{\rho(x_0)\,\mathrm{pdf}K(x_0, v_0)}\,\big|\det \nabla\Phi(x_0, v_0)\big|\Big\};$$

otherwise reject the proposal x and repeat x0.

Figure 1. iMCMC Algorithm

Our sampler is built on the recently introduced involutive Markov chain Monte Carlo method (Neklyudov et al., 2020; Cusumano-Towner et al., 2020), a unifying framework for MCMC algorithms. Completely specified by a target density ρ, an (auxiliary) kernel K and an involution Φ, the iMCMC algorithm (Fig. 1) is conceptually simple. Yet it is remarkably expressive, describing many existing MCMC samplers, including Metropolis-Hastings (MH) (Metropolis et al., 1953; Hastings, 1970) with the swap involution Φ(x, v) := (v, x) and the proposal distribution as its auxiliary kernel K; as well as Gibbs (Geman & Geman, 1984), Hamiltonian Monte Carlo (HMC) (Neal, 2011) and Reversible Jump MCMC (RJMCMC) (Green, 1995).
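To make Fig. 1 concrete, the following sketch (our own illustration, not code from the paper; all function names such as `imcmc_step` and `log_target` are assumptions) implements one iMCMC transition for a fixed-dimensional target. Instantiating it with a Gaussian proposal kernel and the swap involution Φ(x, v) = (v, x) recovers random-walk Metropolis-Hastings.

```python
import numpy as np

def imcmc_step(x0, log_target, sample_aux, log_aux_pdf, involution, log_abs_det_jac, rng):
    """One involutive-MCMC transition (Fig. 1) for a fixed-dimensional target."""
    v0 = sample_aux(x0, rng)                      # Step 1: v0 ~ K(x0, .)
    x, v = involution(x0, v0)                     # Step 2: apply the involution Phi
    log_ratio = (log_target(x) + log_aux_pdf(x, v)
                 - log_target(x0) - log_aux_pdf(x0, v0)
                 + log_abs_det_jac(x0, v0))       # Step 3: acceptance ratio, in log space
    return x if np.log(rng.uniform()) < log_ratio else x0

# Metropolis-Hastings as an instance: Gaussian proposal kernel + swap involution.
rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * np.sum((x - 3.0) ** 2)                  # unnormalised N(3, I)
sample_aux = lambda x, rng: x + 0.5 * rng.standard_normal(x.shape)    # v0 ~ N(x, 0.25 I)
log_aux_pdf = lambda x, v: -0.5 * np.sum((v - x) ** 2) / 0.25         # up to a constant
swap = lambda x, v: (v, x)
unit_jac = lambda x, v: 0.0                                           # |det grad(swap)| = 1

samples = [np.zeros(2)]
for _ in range(1000):
    samples.append(imcmc_step(samples[-1], log_target, sample_aux,
                              log_aux_pdf, swap, unit_jac, rng))
```

Because the Gaussian proposal is symmetric, the auxiliary-density terms cancel and the acceptance ratio reduces to the familiar MH ratio ρ(x)/ρ(x0).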
Thanks to its schematic nature and generality, we find iMCMC an ideal basis for constructing our nonparametric sampler, NP-iMCMC, for (arbitrary) probabilistic programs. We stress that NP-iMCMC is applicable to any target density function that is tree representable.

2.2. Tree representable functions

As is standard in probabilistic programming, our sampler finds the posterior of a program M by taking as the target density a map w which, given an execution trace, runs M on the sampled values specified by the trace and returns the weight of such a run. Hence the support of w is the set of traces on which M terminates. This density w must satisfy the prefix property (Mak et al., 2021b): for every trace, there is at most one prefix with strictly positive density. Such functions are called tree representable as they can be presented as a computation tree. We shall see how our sampler exploits this property to jump across dimensions in Sec. 3.

Formally the trace space T is the disjoint union ⋃_{n ∈ N} R^n, endowed with the σ-algebra Σ_T := {⋃_{n ∈ N} X_n | X_n ∈ B^n} and the standard Gaussian (of varying dimensions) as measure, µ_T(⋃_{n ∈ N} X_n) := Σ_{n ∈ N} N_n(X_n). We present traces as lists, e.g. [0.2, 3.1, 2.8] and []. Thus the prefix property is expressible as: for all traces t ∈ T, there is at most one k ≤ |t| s.t. the prefix t_{1..k} is in Supp(w). Note that the prefix property is satisfied by any density w : T → [0, ∞) induced by a probabilistic program (Prop. A.6), so this is a mild restriction.

Listing 1. Infinite Gaussian mixture model

    K = floor(abs(sample(normal(0, 1))))
    for i in range(K):
        xs[i] = sample(normal(0, 1))
    for d in data:
        observe d from mixture([normal(x, 1) for x in xs])
    return K

Example 1. Consider the classic nonparametric infinite Gaussian mixture model (GMM), which infers the number of Gaussian components from a data set. It is describable as a program (Listing 1), where there is a mixture of K Gaussian distributions such that the i-th Gaussian has mean xs[i] and unit variance. As K is not pre-determined, the possible number of components is unbounded, rendering the model nonparametric. Given a trace [3.4, 1.2, 1.0, 0.5], the program describes a mixture of three Gaussians centred at 1.2, 1.0 and 0.5, and it computes the likelihood of generating the set D of data from such a mixture. The program has density w : T → [0, ∞) (w.r.t. the trace measure µ_T) with w(t) defined as

$$w(t) = \begin{cases} \prod_{d \in D} \frac{1}{\lfloor |t_1| \rfloor} \sum_{i=1}^{\lfloor |t_1| \rfloor} \varphi(d \mid t_{1+i}, 1) & \text{if } |t| - 1 = \lfloor |t_1| \rfloor \\ 0 & \text{otherwise,} \end{cases}$$

where φ(· | µ, σ²) denotes the Gaussian pdf with mean µ and variance σ². We can check that the density w is tree representable.
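As an illustration of the prefix property, the following sketch (ours, not from the paper; names such as `gmm_trace_density` and the stand-in `data` set are assumptions) evaluates a trace density for the infinite GMM of Listing 1 by simply running the model on the trace. It returns 0 unless the trace has exactly the length the program consumes, so at most one prefix of any trace can have positive density.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def gmm_trace_density(trace, data):
    """Weight w(t) of the infinite GMM (Listing 1) on a trace t.

    The first draw determines K = floor(|t_1|); the program then consumes
    exactly K further draws (the component means) and scores the data against
    the resulting mixture.  Traces of any other length get weight 0, so at
    most one prefix of a trace lies in Supp(w): the prefix property.
    """
    if not trace:
        return 0.0
    K = math.floor(abs(trace[0]))
    if K == 0 or len(trace) - 1 != K:
        return 0.0                      # the program does not terminate on this trace
    means = trace[1:]
    weight = 1.0
    for d in data:                      # likelihood of the data under the mixture
        weight *= sum(normal_pdf(d, m) for m in means) / K
    return weight

data = [0.1, 2.3, 1.9]                                 # hypothetical observations
print(gmm_trace_density([3.4, 1.2, 1.0, 0.5], data))   # K = 3, three means: positive weight
print(gmm_trace_density([3.4, 1.2, 1.0], data))        # wrong length: weight 0
```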
3. Nonparametric involutive MCMC

3.1. Example: infinite GMM

Consider how a sample for the infinite GMM (Ex. 1) can be generated using a nonparametric variant of Metropolis-Hastings (MH), an instance of iMCMC. Suppose the current sample is x0 := [3.4, 1.2, 1.0, 0.5], and the initial auxiliary variable v0 takes the value [4.3, 3.4, 0.1, 1.4], a sample from the stock Gaussian N_4. Then, by application of the swap involution to (x0, v0), the proposed state (x, v) becomes ([4.3, 3.4, 0.1, 1.4], [3.4, 1.2, 1.0, 0.5]).

A problem arises if we simply propose x as the next sample: it describes a mixture of four Gaussians (notice K has value 4), but only three means are provided, viz. 3.4, 0.1, 1.4. Hence the program does not terminate on the trace specified by x, i.e. x is not in the support of w, the model's density.

The key idea of NP-iMCMC is to extend the initial state (x0, v0) to (x0 ++ [e], v0 ++ [p]), where ++ denotes trace concatenation and e, p are random draws from the stock Gaussian N. Say 0.7 and 0.3 are the values drawn; the initial state then becomes (x0, v0) = ([3.4, 1.2, 1.0, 0.5, 0.7], [4.3, 3.4, 0.1, 1.4, 0.3]), and the proposed state (x, v) becomes ([4.3, 3.4, 0.1, 1.4, 0.3], [3.4, 1.2, 1.0, 0.5, 0.7]). Now the program does terminate on a trace specified by the proposed sample x = [4.3, 3.4, 0.1, 1.4, 0.3]; equivalently, x ∈ Supp(w). If this were not the case, this process of extending the initial state by incrementing the dimension could be repeated until termination happens. For an almost surely (a.s.) terminating program, this process a.s. yields a proposed sample. Finally, we calculate the acceptance ratio for x ∈ Supp(w) from the initial sample x0_{1..4} ∈ Supp(w) as

$$\min\Big\{1,\ \frac{w(x)\,\varphi_5(x)\,\varphi_5(v)}{w({x_0}_{1..4})\,\varphi_5(x_0)\,\varphi_5(v_0)}\Big\}.$$
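A small numeric check of this walk-through (our own sketch; the condensed density `w`, the hypothetical `data`, and the variable names are assumptions, not the authors' code) shows why the un-extended proposal fails and the extended one succeeds:

```python
import math

def w(trace, data):
    """Condensed infinite-GMM trace density (same model as the sketch in Sec. 2.2)."""
    K = math.floor(abs(trace[0])) if trace else 0
    if K == 0 or len(trace) - 1 != K:
        return 0.0
    return math.prod(sum(math.exp(-0.5 * (d - m) ** 2) / math.sqrt(2 * math.pi)
                         for m in trace[1:]) / K for d in data)

data = [0.1, 2.3, 1.9]                      # hypothetical observations
x = [4.3, 3.4, 0.1, 1.4]                    # swapped proposal: K = 4 but only 3 means follow
print(w(x, data) > 0)                       # False -- x is not in Supp(w)
x_ext = [4.3, 3.4, 0.1, 1.4, 0.3]           # extend (x0, v0) by (e, p) = (0.7, 0.3), swap again
print(w(x_ext, data) > 0)                   # True  -- the program terminates on this trace
```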
3.2. State space, target density and assumptions

Fix a parameter (measure) space (X, Σ_X, µ_X), which is (intuitively) the product of the respective measure spaces of the distributions of the random variables of the model in question, and assume an auxiliary (probability) space (Y, Σ_Y, µ_Y). For simplicity, we assume in this paper that both X and Y are R, and that µ_X and µ_Y have derivatives ϕ_X and ϕ_Y w.r.t. the Lebesgue measure. (In App. B.1, we consider a more general case where X is set to be R × 2.) Note that it follows from our assumption that X^n × Y^n is a smooth manifold for each n. (Notation: for any probability space (X, Σ_X, µ_X) such that µ_X has derivative ϕ_X w.r.t. the Lebesgue measure, X^n is the Cartesian product of n copies of X; Σ_{X^n} is the σ-algebra generated by subsets of the form ∏_{i=1}^n V_i where V_i ∈ Σ_X; and µ_{X^n} is the product of n copies of µ_X, which has derivative ϕ_{X^n} w.r.t. the Lebesgue measure. Note that (X^n, Σ_{X^n}, µ_{X^n}) is a probability space.)

Now a state is a pair of parameter and auxiliary variables of equal dimension. Formally the state space S := ⋃_{n ∈ N} (X^n × Y^n) is endowed with the σ-algebra Σ_S := σ{X_n × V_n | X_n ∈ Σ_{X^n}, V_n ∈ Σ_{Y^n}, n ∈ N} and measure

$$\mu_S(S) := \sum_{n \in N} \int_{Y^n} \mu_{X^n}(\{x \in X^n \mid (x, v) \in S\}) \, \mu_{Y^n}(dv).$$

Besides the target density function w, our algorithm NP-iMCMC requires two additional inputs: auxiliary kernels (as an additional source of randomness) and involutions (to traverse the state space). Next we present what we assume about the three inputs and discuss some relevant properties.

Target density function. We only target densities w : T → [0, ∞) that are tree representable, where T := ⋃_{n ∈ N} R^n. Moreover, we assume two common features of real-world probabilistic programs:

(V1) w is integrable, i.e. Z := ∫_T w dµ_T < ∞ (otherwise, the inference problem is undefined);

(V2) w is almost surely terminating (AST), i.e. µ_T({t ∈ T | w(t) > 0}) = 1 (otherwise, the loop (Step 3) of the NP-iMCMC algorithm may not terminate a.s.). If a program does not terminate on a trace t, the density w(t) is defined to be zero.

Auxiliary kernel. We assume, for each dimension n ∈ N, an auxiliary (probability) kernel K^(n) : X^n ⇝ Y^n with density function pdfK^(n) : X^n × Y^n → [0, ∞) (assuming a derivative w.r.t. µ_{X^n × Y^n} exists).

Involution. We assume, for each dimension n ∈ N, a differentiable endofunction Φ^(n) on X^n × Y^n which is involutive, i.e. Φ^(n) = (Φ^(n))^{-1}, and satisfies the projection commutation property:

(V3) For all (x, v) ∈ S where |x| = m, if x_{1..n} ∈ Supp(w) for some n, then for all k = n, ..., m,
take_k(Φ^(m)(x, v)) = Φ^(k)(take_k(x, v)),
where take_ℓ is the projection that takes a state (x, v) and returns the state (x_{1..ℓ}, v_{1..ℓ}) with the first ℓ coordinates of each component. (Otherwise, the sample-component x of the proposal state tested in Step 3 may not be an extension of the sample-component of the preceding proposal state.)

3.3. Algorithm

Given a probabilistic program M with density function w on the trace space T, a set {K^(n) : X^n ⇝ Y^n}_n of auxiliary kernels and a set {Φ^(n) : X^n × Y^n → X^n × Y^n}_n of involutions satisfying V1 to V3, we present the NP-iMCMC algorithm in Fig. 2. NP-iMCMC generates a Markov chain by proposing the next sample x using the current sample x0 as follows:

1. v0 ∼ K^(k0)(x0, ·): sample a value v0 on the auxiliary space Y^{k0} from the auxiliary kernel K^(k0) : X^{k0} ⇝ Y^{k0} applied to the current sample x0, where k0 = |x0|.

2. (x, v) := Φ^(n)(x0, v0): compute the proposal state (x, v) by applying the involution Φ^(n) on X^n × Y^n to the initial state (x0, v0), where n = |x0|.

3. Test whether x_{1..k} ∈ Supp(w) for some k. (Equivalently: test whether the program M terminates on the trace specified by the sample-component x of the proposal state, or one of its prefixes.) If so, proceed to the next step; otherwise (x0, v0) ← (x0 ++ [e], v0 ++ [p]): extend the initial state to (x0 ++ [e], v0 ++ [p]), where e and p are samples drawn from µ_X and µ_Y, and go to Step 2.

4. Accept x_{1..k} as the next sample with probability

$$\min\Big\{1,\ \frac{w(x_{1..k})\,\mathrm{pdf}K^{(k)}(x_{1..k}, v_{1..k})}{w({x_0}_{1..k_0})\,\mathrm{pdf}K^{(k_0)}({x_0}_{1..k_0}, {v_0}_{1..k_0})} \cdot \frac{\phi_{X^n \times Y^n}(x, v)}{\phi_{X^n \times Y^n}(x_0, v_0)} \cdot \big|\det \nabla\Phi^{(n)}(x_0, v_0)\big|\Big\}$$

where n = |x0|; otherwise reject the proposal and repeat x0_{1..k0}.

Figure 2. NP-iMCMC Algorithm

The heart of NP-iMCMC is Step 3, which can drive a state across dimensions. Step 3 first checks whether x_{1..k} ∈ Supp(w) for some k = 1, ..., k0 (i.e. whether the program M terminates on the trace specified by some prefix of x). If so, the proposal state is set to (x_{1..k}, v_{1..k}), and the state moves from dimension k0 to k. Otherwise, Step 3 repeatedly extends the initial state (x0, v0) to, say, (x0 ++ y0, v0 ++ u0), and computes the new proposal state (x ++ y, v ++ u) by Step 2, until the program M terminates on the trace specified by x ++ y. Then the proposal state becomes (x ++ y, v ++ u), and the state moves from dimension k0 to dimension k0 + |y|.

Remark 3.1. (i) The projection commutation property V3 ensures that the new proposal state computed using (x0 ++ y0, v0 ++ u0) in Step 3 is of the form (x ++ y, v ++ u) where (x, v) = Φ^(|x0|)(x0, v0). (ii) V2, a.s. termination of the program M, ensures that the method of computing a proposal state in Step 3 almost surely finds a proposal sample x such that M terminates on a trace specified by x. (iii) The prefix property of the target density w ensures that any proper extension of the current sample x0 (of length k0) has zero density, i.e. w(x0 ++ y) = 0 for all y ≠ []. Hence only the weight of the current sample x0_{1..k0} ∈ Supp(w) is accounted for in Step 4, even when x0 is extended. (iv) If the program M is parametric, thus inducing a target density w on a fixed-dimensional space, then the NP-iMCMC sampler coincides with the iMCMC sampler.
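The following sketch (our own illustration of Fig. 2 with hypothetical names, not the authors' implementation) instantiates NP-iMCMC with the independent standard-normal auxiliary kernel and the swap involution, i.e. the NP-MH sampler of Sec. 3.1. The target `w` can be any tree-representable trace density, e.g. the infinite-GMM density sketched in Sec. 2.2.

```python
import numpy as np

def supported_prefix_len(x, w):
    """Length of the unique prefix of x in Supp(w), or None (prefix property)."""
    for k in range(1, len(x) + 1):
        if w(x[:k]) > 0:
            return k
    return None

def phi(t):
    """Joint standard-normal density of a trace."""
    t = np.asarray(t, dtype=float)
    return float(np.exp(-0.5 * np.sum(t ** 2)) / (2 * np.pi) ** (len(t) / 2))

def np_imcmc_step(x0, w, rng):
    """One NP-iMCMC transition (Fig. 2) with K^(n) = N_n and the swap involution."""
    k0 = len(x0)
    x0, v0 = list(x0), list(rng.standard_normal(k0))    # Step 1: v0 ~ K^(k0)(x0, .)
    while True:
        x, v = v0[:], x0[:]                             # Step 2: swap involution
        k = supported_prefix_len(x, w)                  # Step 3: does M terminate on a prefix?
        if k is not None:
            break
        x0.append(float(rng.standard_normal()))         # otherwise extend the initial state ...
        v0.append(float(rng.standard_normal()))         # ... and go back to Step 2
    # Step 4: here pdfK^(k)(x_{1..k}, v_{1..k}) = phi(v[:k]), |det grad(swap)| = 1, and the
    # phi(x, v) / phi(x0, v0) factor cancels because the swap only permutes coordinates.
    ratio = (w(x[:k]) * phi(v[:k])) / (w(x0[:k0]) * phi(v0[:k0]))
    return x[:k] if rng.uniform() < min(1.0, ratio) else x0[:k0]
```

Iterating `np_imcmc_step` with, say, `lambda t: gmm_trace_density(t, data)` from the earlier sketch produces a chain over traces of varying length.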
Using NP-iMCMC (Fig. 2), we can formally present the Nonparametric Metropolis-Hastings (NP-MH) sampler which was introduced in Sec. 3.1. See App. E.1 for details.

3.4. Generalisations

In the interest of clarity, we have presented a version of NP-iMCMC in deliberately purified form. Here we discuss three generalisations of the NP-iMCMC sampler.

Hybrid state space. Many PPLs provide continuous and discrete samplers. The positions of discrete and continuous random variables in an execution trace may vary because of branching. We get around this problem by defining the parameter space X to be the product space of R and 2 := {F, T}. Each value t_i in a trace t is paired with a randomly drawn partner t' of the other type to make a pair (t_i, t') (or (t', t_i)). Hence the same idea of jumping across dimensions can be applied to the state space ⋃_{n ∈ N} X^n × Y^n. The resulting algorithm is called the Hybrid NP-iMCMC sampler. (See App. B for more details.)

Computationally heavy involutions. Step 3 of the NP-iMCMC sampler may seem inefficient. While it terminates almost surely (thanks to V2), the expected number of iterations may be infinite. This is especially bad if the involution is computationally expensive, such as the leapfrog integrator in HMC, which requires gradient information about the target density function. This can be worked around if, for each n ∈ N, there is an inexpensive slice function s^(n) : X^n × Y^n → X × Y where s^(n)(x, v) = (drop_{n-1} ∘ Φ^(n))(x, v) whenever (x, v) is an n-dimensional state such that x_{1..k} ∈ Supp(w) for some k < n, and drop_ℓ is the projection that takes a state (x, v) and returns the state (x_{ℓ+1..|x|}, v_{ℓ+1..|x|}) with the first ℓ coordinates of each component dropped. Then the new proposal state in Step 3 can be computed by applying the function s^(n) to the recently extended initial state (x0, v0), i.e. (x, v) ← (x ++ [e'], v ++ [p']) where (e', p') = s^(n)(x0, v0), instead. (See App. D.2 for more details.)

Multiple step NP-iMCMC. Suppose the involution is a composition of bijective endofunctions, i.e. Φ^(n) := f^(n)_L ∘ ··· ∘ f^(n)_2 ∘ f^(n)_1, and each family {f^(n)_ℓ}_n satisfies the projection commutation property and has a slice function s^(n)_ℓ. A new state can then be computed by applying the endofunctions to the initial state one by one (instead of in one go, as in Steps 2 and 3): for each ℓ = 1, ..., L,

1. Compute the intermediate state (x_ℓ, v_ℓ) by applying f^(n)_ℓ to (x_{ℓ-1}, v_{ℓ-1}), where n = |x_{ℓ-1}|.

2. Test whether (x_ℓ)_{1..k} is in Supp(w) for some k. If so, proceed to the next ℓ; otherwise extend the initial state (x_0, v_0) with samples drawn from µ_X and µ_Y, extend the intermediate states (x_i, v_i), for i = 1, ..., ℓ, with the result of s^(n)_i(x_{i-1}, v_{i-1}) where n = |x_{i-1}|, and go to 2.

The resulting algorithm is called the Multiple Step NP-iMCMC sampler. (See App. D.3 for more details.) This approach was adopted in the recently proposed Nonparametric HMC (Mak et al., 2021b). (See App. E.3 for more details.)

3.5. Correctness

The NP-iMCMC algorithm is correct in the sense that the invariant distribution of the Markov chain generated by iterating the algorithm in Fig. 2 coincides with the target distribution ν : A ↦ (1/Z) ∫_A w dµ_T, with normalising constant Z. We present an outline proof here; see App. B.4 for a full proof for the Hybrid NP-iMCMC algorithm, a generalisation of NP-iMCMC. Note that we cannot reduce NP-iMCMC to iMCMC, i.e. the NP-iMCMC sampler cannot be formulated as an instance of the iMCMC sampler with an involution on the whole state space S.
This is because the dimension of involution depends on the values of the random samples drawn in Step 3. Instead, we define a helper algorithm (App. B.4.2), which induces a Markov chain on states and does not change the dimension of the involution. This algorithm first extends the initial state to find the smallest N such that the program M terminates with a trace specified by some prefix of the sample-component of the resulting state (x, v) after applying the involution Φ(N). Then, it performs the involution Φ(N) as per the standard i MCMC sampler. Hence all stochastic primitives are executed outside of the involution, and the involution has a fixed dimension. We identify the state distribution (App. B.4.2), and show that the Markov chain generated by the auxiliary algorithm has the state distribution as its invariant distribution (Lem. B.14). We then deduce that its marginalised chain is identical to that generated by Hybrid NP-i MCMC; and Hybrid NP-i MCMC has the target distribution ν as its invariant distribution (Lem. B.17). Since Hybrid NPi MCMC is a generalisation of NP-i MCMC (Fig. 2), we have the following corollary. Corollary 3.2 (Invariant). If all inputs satisfy V1 to 3 then ν is the invariant distribution of the Markov chain generated by iterating the algorithm described in Fig. 2. 4. Transforming NP-i MCMC samplers The strength of the i MCMC framework lies in its flexibility, which makes it a useful tool capable of expressing important ideas in existing MCMC algorithms as tricks , namely state-dependent mixture (Trick 1 and 2 in (Neklyudov et al., 2020)), smart involutions (Trick 3 and 4), and smart compositions (Trick 5 and 6). In each of these tricks, the auxiliary kernel and involution take special forms to equip the resulting sampler with desirable properties such as higher acceptance ratio and better mixing times. This enables a make to order approach in the design of novel MCMC samplers. A natural question is whether there are similar tricks for the NP-i MCMC framework. In this section, we examine the tricks discussed in (Neklyudov et al., 2020), giving requirements for and showing via examples how one can design novel NP-i MCMC samplers with bespoke properties by suitable applications of these tricks to simple NP-i MCMC samplers. Similar applications can be made to the generalisations of NP-i MCMC such as Hybrid NP-i MCMC (App. C) and Multiple NP-i MCMC (App. D.3.4). Throughout this section, we consider samplers for a program M expressed in a universal PPL which has target density function w that is integrable (V1) and almost surely terminating (V2). 4.1. State-dependent mixture Suppose we want a sampler that chooses a suitable NPi MCMC sampler depending on the current sample. This might be beneficial for models that are modular, and where there is already a good sampler for each module. We can form a state-dependent mixture of a family {ιm}m M of Nonparametric Involutive Markov Chain Monte Carlo NP-i MCMC samplers4 which runs ιm with a weight depending on the current sample. See App. C.1 for details of the algorithm. Remark 4.1. This corresponds to Tricks 1 and 2 discussed in (Neklyudov et al., 2020) which generalises the Mixture Proposal MCMC and Sample-Adaptive MCMC samplers. 4.2. Auxiliary direction Suppose we want to use sophisticated bijective but noninvolutive endofunctions f (n) on Xn Yn to better explore the parameter space and return proposals with a high acceptance ratio. 
Assuming both families {f (n)}n and {f (n) 1}n satisfy the projection commutation property (V3), we can construct an NP-i MCMC sampler with auxiliary direction, which samples a direction d D := {+, } with equal probability; and generates the next sample by running Step 1 to 4 of the NP-i MCMC sampler using {f (n)}n to suggest the proposal sample if d is sampled to be +; otherwise {f (n) 1}n is used. See App. C.2 for details of the algorithm. Notice that since the distribution of the direction variable d is the discrete uniform distribution, we do not need to alter the acceptance ratio in Step 4. Example 2 (NP-HMC). We can formulate the recently proposed Nonparametric Hamiltonian Monte Carlo sampler in (Mak et al., 2021b) using the (Multiple Step) NP-i MCMC framework with auxiliary direction, in which case the sophisticated non-involutive endofunction is the leapfrog method L. (See App. E.3 for details.) Remark 4.2. This corresponds to Trick 3 described in (Neklyudov et al., 2020). Trick 4 from (Neklyudov et al., 2020) cannot be applied in our framework because the projection commutation property (V3) is not closed under composition. 4.3. Persistence Suppose we want a nonreversible sampler, so as to obtain better mixing times. A typical way of achieving nonreversibility from an originally reversible MCMC sampler is to reuse the value for a variable (that is previously resampled in the original reversible sampler) in the next iteration if the proposed sample is accepted. In this way, the value of such a variable is allowed to persist, making the sampler nonreversible. A key observation made by Neklyudov et al. (2020) is that the composition of reversible i MCMC samplers can 4We treat ιm as a piece of computer code that changes the sample via the NP-i MCMC method described in Fig. 2. yield a nonreversible sampler. Two systematic techniques to achieve nonreversibility are persistent direction (Trick 5) and an auxiliary kernel (Trick 6). We present similar approaches for NP-i MCMC samplers. Suppose there is an NP-i MCMC sampler that uses the auxiliary direction as described in Sec. 4.2, i.e. there is a noninvolutive bijective endofunction f (n) on Xn Yn for each n N such that {f (n)}n and {f (n) 1}n satisfy the projection commutation property (V3). In addition, assume there are two distinct families of auxiliary kernels, namely {K(n)+}n and {K(n) }n. The corresponding NP-i MCMC sampler with persistence proposes the next sample by running Step 1 to 3 of the NP-i MCMC sampler with {K(n)+}n and {f (n)}n if d is sampled to be +; otherwise {K(n) }n and {f (n) 1}n are used; accepts the proposed sample with probability indicated in Step 4 of the NP-i MCMC sampler; otherwise repeats the current sample and flips the direction d. See App. C.3 for details of the algorithm. The family of kernels and maps indeed persist across multiple iterations if the proposals of these iterations are accepted. The intuitive idea behind this is that if a family of kernels and maps perform well (proposals are accepted) in the current part of the sample space, we should keep it, and otherwise switch to its inverse. Remark 4.3. This corresponds to Tricks 5 and 6 described in (Neklyudov et al., 2020), which can be found in nonreversible MCMC sampler like the Generalised HMC algorithm (Horowitz, 1991), the Look Ahead HMC sampler (Sohl-Dickstein et al., 2014; Campos & Sanz-Serna, 2015) and Lifted MH (Turitsyn et al., 2011). Example 3 (NP-HMC with Persistence). 
The nonreversible HMC sampler in (Horowitz, 1991) uses persistence and, in addition, (partially) reuses the momentum vector from the previous iteration. As shown in (Neklyudov et al., 2020), it can be viewed as a composition of iMCMC kernels. Using the method indicated above, we can also add persistence to NP-HMC. (See App. E.4 for details.)

Example 4 (NP-Lookahead-HMC). Look Ahead HMC (Sohl-Dickstein et al., 2014; Campos & Sanz-Serna, 2015) can be seen as an HMC sampler with persistence that generates a new state with a varying number of leapfrog steps, depending on the value of the auxiliary variable. Similarly, we can construct an NP-HMC sampler with persistence that varies the number of leapfrog steps. (See App. E.5 for more details.)

5. Experiments

5.1. Nonparametric Metropolis-Hastings

We first implemented two simple instances of the NP-iMCMC sampler, namely NP-MH (App. E.1) and NP-MH with Persistence (App. E.2), in the Turing language (Ge et al., 2018). (The code to reproduce the Turing experiments is available at https://github.com/cmaarkol/nonparametricmh. Turing's SMC implementation is nondeterministic (even with a fixed random seed), so its results may vary somewhat, but everything else is exactly reproducible.) We compared them with Turing's built-in Sequential Monte Carlo (SMC) algorithm on an infinite Gaussian mixture model (GMM) where the number of mixture components is drawn from a normal distribution. Posterior inference is performed on 30 data points generated from a ground truth with three components. The results of ten runs with 5000 iterations each (Fig. 3) suggest that the NP-iMCMC samplers work well.

[Figure 3. Histogram of the number of components for the infinite GMM, comparing NP-MH, NP-MH-P and SMC; the correct posterior is 3.]

5.2. Nonparametric Hamiltonian Monte Carlo

Secondly, we consider Nonparametric HMC (Mak et al., 2021b), mentioned in Ex. 2 before. We have seen how the techniques from Sec. 4 can yield nonreversible versions of NP-iMCMC inference algorithms. Here, we look at nonparametric versions of two extensions described in (Neklyudov et al., 2020): persistence (Ex. 3) and lookahead (Ex. 4). Persistence means that the previous momentum vector is reused in the next iteration. It is parametrised by α ∈ [0, 1], where α = 1 means no persistence (standard HMC) and α = 0 means full persistence (no randomness added to the momentum vector). Lookahead HMC is parametrised by K ≥ 0, which is the number of extra iterations ("look ahead") to try before rejecting a proposed sample (so K = 0 corresponds to standard HMC). Detailed descriptions of these algorithms, and how they fit into the (Multiple Step) NP-iMCMC framework, can be found in App. E.4 and E.5.

Table 1. Geometric distribution example: total variation difference from the ground truth, averaged over 10 runs, with standard deviation. Each run: 10^3 samples, L leapfrog steps, step size ϵ = 0.1, persistence parameter α ∈ {0.5, 0.1}.

    persistence         TVD from ground truth
    L = 5               0.0524 ± 0.0069
    L = 5, α = 0.5      0.0464 ± 0.0074
    L = 5, α = 0.1      0.0461 ± 0.0083
    L = 2               0.0768 ± 0.0181
    L = 2, α = 0.5      0.0570 ± 0.0115
    L = 2, α = 0.1      0.0534 ± 0.0058

We evaluate these extensions of NP-HMC on the benchmarks from (Mak et al., 2021b): a model for the geometric distribution, a model involving a random walk, and an unbounded Gaussian mixture model.
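As a sketch of what the persistence parameter α does (our own illustration; the exact convention used in App. E.4 may differ, so treat the scaling below as an assumption), a partial momentum refresh replaces the full resampling of the auxiliary momentum at the start of each HMC iteration:

```python
import numpy as np

def refresh_momentum(p_old, alpha, rng):
    """Partial momentum refresh used by persistent (generalised) HMC.

    alpha = 1 fully resamples the momentum (standard HMC);
    alpha = 0 keeps the previous momentum unchanged (full persistence).
    The mixture is scaled so that p remains standard-normal distributed.
    """
    noise = rng.standard_normal(p_old.shape)
    return np.sqrt(1.0 - alpha) * p_old + np.sqrt(alpha) * noise

rng = np.random.default_rng(0)
p = rng.standard_normal(3)
p = refresh_momentum(p, alpha=0.5, rng=rng)   # partially persistent momentum
```

On rejection, the persistent sampler additionally flips the direction d, as described in Sec. 4.3.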
Note that, similarly to (Mak et al., 2021b), we actually work with a discontinuous version of NP-HMC, called NP-DHMC, which is a nonparametric extension of discontinuous HMC (Nishimura et al., 2020). (The source code is available at https://github.com/fzaiser/nonparametric-hmc.) The discontinuous version can handle the discontinuities arising from the jumps between dimensions more efficiently. We do not discuss it in this paper due to lack of space; however, the modifications necessary to this discontinuous version are the same as for the standard NP-HMC. Mak et al. (2021b) demonstrated the usefulness of NP-DHMC and how it can obtain better results than other general-purpose inference algorithms like Lightweight Metropolis-Hastings and Random-walk Lightweight Metropolis-Hastings. Here, we focus on the benefits of nonreversible versions of NP-DHMC, which were derived using the (Multiple Step) NP-iMCMC framework.

Geometric distribution. The geometric distribution benchmark from (Mak et al., 2021b) illustrates the usefulness of persistence: we ran NP-DHMC for a step count L ∈ {2, 5} with and without persistence. As can be seen in Table 1, persistence usually decreases the distance from the ground truth. In fact, the configuration L = 2, α = 0.1 is almost as good as L = 5 without persistence, despite taking 2.5 times less computing time.

Random walk. The next benchmark from (Mak et al., 2021b) models a random walk and observes the distance travelled. Fig. 4 shows the effective sample size (ESS) in terms of the number of samples drawn, comparing versions of NP-DHMC with persistence (α = 0.5) and look-ahead (K ∈ {1, 2}). We can see again that persistence is clearly advantageous. Look-ahead (K ∈ {1, 2}) seems to give an additional boost on top. We ran all these versions with the same computation time budget, which is why the lines for K = 1, 2 are cut off before the others.

[Figure 4. ESS for the random walk example in terms of number of samples, computed from 10 runs. Each run: 10^3 samples with L = 5 leapfrog steps of size ϵ = 0.1, persistence parameter α ∈ {0.5, 0.1}, and look-ahead K ∈ {1, 2}. Methods compared: NP-DHMC, NP-Lookahead-DHMC (K = 1, 2), and their persistent variants (α = 0.1).]

Unbounded Gaussian mixture model. Next, we consider a Gaussian mixture model where the number of mixture components is drawn from a Poisson prior. Inference is performed on a training data set generated from a mixture of 9 components (the ground truth). We then compute the log pointwise predictive density (LPPD) on a test data set drawn from the same distribution as the training data. The LPPD is shown in Fig. 5 in terms of the number of samples. Note that, again, all versions were run with the same computation budget, which is why some of the lines are cut off early. Despite this, we can see that the versions with lookahead (K ∈ {1, 2}) converge more quickly than the versions without lookahead. Persistent direction (α = 0.5) also seems to have a (smaller) benefit.

Dirichlet process mixture model. Finally, we consider a Gaussian mixture whose weights are drawn from a Dirichlet process. The rest of the setup is the same as for the Poisson prior, and the results are shown in Fig. 6. The version with persistence is worse at the start but obtains a better LPPD at the end. Look-ahead (K ∈ {1, 2}) yields a small additional boost in the LPPD.
It should be noted that the variance over the 10 runs is larger in this example than in the previous benchmarks, so the conclusion of this benchmark is less clear-cut. 200 400 600 800 1000 number of samples log pointwise predictive density NP-DHMC NP-Lookahead-DHMC (K=1) NP-Lookahead-DHMC (K=2) NP-DHMC pers. (α=0.5) NP-Lookahead-DHMC pers. (α=0.5, K=1) NP-Lookahead-DHMC pers. (α=0.5, K=2) ground truth Figure 5. Gaussian mixture with Poisson prior: LPPD in terms of number of samples, averaged over 10 runs. The shaded area is one standard deviation. Each run: 103 samples with L = 25 leapfrog steps of size ϵ = 0.05, persistence parameter α = 0.5, and look-ahead K {1, 2}. 6. Related work and Conclusion Involutive MCMC and its instances The involutive MCMC framework (Neklyudov et al., 2020; Cusumano Towner et al., 2019; Matheos et al., 2020) can in principle be used for nonparametric models by setting X := S n N Xn n N Yn in Fig. 1 and defining an auxiliary kernel on X Y := S n N Yn an involution on X Y := S n N Xn Yn. For instance, Reversible Jump MCMC (Green, 1995) is an instance of i MCMC that works for the infinite GMM model, with the split-merge proposal (Richardson & Green, 1997) specifying when and how states can jump across dimensions. However, designing appropriate auxiliary kernels and involutions that enable the extension of an i MCMC sampler to nonparametric models remains challenging and model specific. By contrast, NPi MCMC only requires the specification of involutions on the finite-dimensional space Xn Yn; moreover, it provides a general procedure (via Step 3) that drives state movement between dimensions. For designers of nonparametric samplers who do not care to custom build trans-dimensional methods, we contend that NP-i MCMC is their method of choice. The performance of NP-i MCMC and i MCMC depends on the complexity of the respective auxiliary kernels, involutions and the model in question. Take i GMM for example. RJMCMC with the split-merge proposal which computes the weight, mean, and variance of the new component(s) would be slower than NP-MH, an instance of NP-i MCMC with a computationally light involution (a swap), but more efficient than NP-HMC, an instance of (Multiple Step) NPi MCMC with the computationally heavy leapfrog integrator as involution. Nonparametric Involutive Markov Chain Monte Carlo 20 40 60 80 100 120 140 number of samples log pointwise predictive density NP-DHMC NP-Lookahead-DHMC (K=1) NP-Lookahead-DHMC (K=2) NP-DHMC pers. (α=0.5) NP-Lookahead-DHMC pers. (α=0.5, K=1) NP-Lookahead-DHMC pers. (α=0.5, K=2) ground truth Figure 6. Dirichlet process mixture: LPPD in terms of number of samples, averaged over 10 runs. The shaded area is one standard deviation. Each run: 150 samples with L = 20 leapfrog steps of size ϵ = 0.05, persistence parameter α = 0.5, and look-ahead K {1, 2}. Trans-dimensional samplers A standard MCMC algorithm for universal PPLs is the Lightweight Metropolis Hastings algorithm (LMH) (Yang et al., 2014; Tolpin et al., 2015; Ritchie et al., 2016). Widely implemented in several universal PPLs (Anglican, Venture, Gen, and Web PPL), LMH performs single-site updates on the current sample and re-executes the program from the resampling point. Divide, Conquer, and Combine (DCC) (Zhou et al., 2020) is an inference algorithm that is applicable to probabilistic programs that use branching and recursion. 
A hybrid algorithm, DCC solves the problem of designing a proposal that can efficiently transition between configurations by performing local inferences on submodels, and returning an appropriately weighted combination of the respective samples. Mak et al. (2021b) have recently introduced Nonparametric Hamiltonian Monte Carlo (NP-HMC), which generalises HMC to nonparametric models. As we ve seen, NP-HMC is an instantiation of (Multiple Step) NP-i MCMC. Conclusion We have introduced the nonparametric involutive MCMC algorithm as a general framework for designing MCMC algorithms for models expressible in a universal PPL, and provided a correctness proof. To demonstrate the relative ease of make-to-order design of nonparametric extensions of existing MCMC algorithms, we have constructed several new algorithms, and demonstrated empirically that the expected features and statistical properties are preserved. Acknowledgements We thank the reviewers for their insightful feedback and pointing out important related work. We are grateful to Maria Craciun who gave detailed comments on an early draft, and to Hugo Paquet and Dominik Wagner for their helpful comments and advice. We gratefully acknowledge support from the EPSRC and the Croucher Foundation. Nonparametric Involutive Markov Chain Monte Carlo Borgström, J., Lago, U. D., Gordon, A. D., and Szymczak, M. A lambda-calculus foundation for universal probabilistic programming. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming (ICFP 2016), pp. 33 46, 2016. Campos, C. M. and Sanz-Serna, J. M. Extra chance generalized hybrid Monte Carlo. Journal of Computational Physics, 281:365 374, 2015. Culpepper, R. and Cobb, A. Contextual equivalence for probabilistic programs with continuous random variables and scoring. In Yang, H. (ed.), Proceedings of the 26th European Symposium on Programming (ESOP 2017), Held as Part of the European Joint Conferences on Theory and Practice of Software (ETAPS 2017), volume 10201 of Lecture Notes in Computer Science, pp. 368 392. Springer, 2017. Cusumano-Towner, M., Lew, A. K., and Mansinghka, V. K. Automating involutive MCMC using probabilistic and differentiable programming, 2020. Cusumano-Towner, M. F., Saad, F. A., Lew, A. K., and Mansinghka, V. K. Gen: a general-purpose probabilistic programming system with programmable inference. In Mc Kinley, K. S. and Fisher, K. (eds.), Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019), pp. 221 236. ACM, 2019. Danos, V. and Ehrhard, T. Probabilistic coherence spaces as a model of higher-order probabilistic computation. Information and Computation, 209(6):966 991, 2011. Devroye, L. Discrete univariate distributions. In Non Uniform Random Variate Generation, chapter 10, pp. 485 553. Springer-Verlag, New York, NJ, USA, 1986. Ehrhard, T., Tasson, C., and Pagani, M. Probabilistic coherence spaces are fully abstract for probabilistic PCF. In Jagannathan, S. and Sewell, P. (eds.), Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2014), pp. 309 320. ACM, 2014. Ehrhard, T., Pagani, M., and Tasson, C. Measurable cones and stable, measurable functions: a model for probabilistic higher-order programming. Proceedings of the ACM on Programming Languages, 2(POPL):59:1 59:28, 2018. Ge, H., Xu, K., and Ghahramani, Z. Turing: Composable inference for probabilistic programming. In Storkey, A. J. and Pérez-Cruz, F. 
(eds.), Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), volume 84 of Proceedings of Machine Learning Research, pp. 1682 1690. PMLR, 2018. Geman, S. and Geman, D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, PAMI-6(6):721 741, 1984. Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K., and Tenenbaum, J. B. Church: a language for generative models. In Mc Allester, D. A. and Myllymäki, P. (eds.), Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence (UAI 2008), pp. 220 229. AUAI Press, 2008. Green, P. J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711 732, 12 1995. Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1): 97 109, 04 1970. Horowitz, A. M. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247 252, 1991. ISSN 0370-2693. Hur, C., Nori, A. V., Rajamani, S. K., and Samuel, S. A provably correct sampler for probabilistic programs. In Proceedings of the 35th IARCS Annual Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2015), pp. 475 488, 2015. Kiselyov, O. Problems of the lightweight implementation of probabilistic programming. In Proceedings of Workshop on Probabilistic Programming Semantics, 2016. Kudlicka, J., Murray, L. M., Ronquist, F., and Schön, T. B. Probabilistic programming for birth-death models of evolution using an alive particle filter with delayed sampling. In Globerson, A. and Silva, R. (eds.), Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI 2019), volume 115 of Proceedings of Machine Learning Research, pp. 679 689. AUAI Press, 2019. Mak, C., Ong, C. L., Paquet, H., and Wagner, D. Densities of almost surely terminating probabilistic programs are differentiable almost everywhere. In Yoshida, N. (ed.), Proceedings of the 30th European Symposium on Programming (ESOP 2021), Held as Part of the European Joint Conferences on Theory and Practice of Software (ETAPS 2021), volume 12648 of Lecture Notes in Computer Science, pp. 432 461. Springer, 2021a. Mak, C., Zaiser, F., and Ong, L. Nonparametric Hamiltonian Monte Carlo. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning (ICML 2021), volume 139 of Proceedings of Machine Learning Research, pp. 7336 7347. PMLR, 2021b. Nonparametric Involutive Markov Chain Monte Carlo Manning, C. and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press. Cambridge, MA, May 1999. Matheos, G., Lew, A. K., Ghavamizadeh, M., Russell, S., Cusumano-Towner, M., and Mansinghka, V. K. Transforming worlds: Automated involutive MCMC for openuniverse probabilistic models. In the 3rd Symposium on Advances in Approximate Bayesian Inference, pp. 1 37, 2020. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087 1092, 1953. Murray, L. M., Lundén, D., Kudlicka, J., Broman, D., and Schön, T. B. Delayed sampling and automatic raoblackwellization of probabilistic programs. In Storkey, A. J. and Pérez-Cruz, F. 
(eds.), Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), volume 84 of Proceedings of Machine Learning Research, pp. 1037 1046. PMLR, 2018. Neal, R. M. MCMC using Hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte Carlo, chapter 5. Chapman & Hall CRC Press, 2011. Neklyudov, K., Welling, M., Egorov, E., and Vetrov, D. P. Involutive MCMC: a unifying framework. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pp. 7273 7282. PMLR, 2020. Nishimura, A., Dunson, D. B., and Lu, J. Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods. Biometrika, 107(2):365 380, 2020. Ratner, B. Variable selection methods in regression: Ignorable problem, outing notable solution. Journal of Targeting, Measurement and Analysis for Marketing, 18: 65 75, 2010. Richardson, S. and Green, P. J. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731 792, 1997. Ritchie, D., Stuhlmüller, A., and Goodman, N. D. C3: lightweight incrementalized MCMC for probabilistic programs using continuations and callsite caching. In Gretton, A. and Robert, C. C. (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), volume 51 of JMLR Workshop and Conference Proceedings, pp. 28 37. JMLR.org, 2016. Ronquist, F., Kudlicka, J., Senderov, V., Borgström, J., Lartillot, N., Lundén, D., Murray, L., Schön, T. B., and Broman, D. Universal probabilistic programming offers a powerful approach to statistical phylogenetics. Communications Biology, 4(244):2399 3642, 2021. Scibior, A., Kammar, O., Vákár, M., Staton, S., Yang, H., Cai, Y., Ostermann, K., Moss, S. K., Heunen, C., and Ghahramani, Z. Denotational validation of higher-order Bayesian inference. Proceedings of the ACM on Programming Languages, 2(POPL):60:1 60:29, 2018. Scott, D. S. A type-theoretical alternative to ISWIM, CUCH, OWHY. Theoretical Computer Science, 121(1&2):411 440, 1993. Sieber, K. Relating full abstraction results for different programming languages. pp. 373 387, 1990. Sohl-Dickstein, J., Mudigonda, M., and De Weese, M. R. Hamiltonian Monte Carlo without detailed balance. In Proceedings of the 31th International Conference on Machine Learning (ICML 2014), volume 32 of JMLR Workshop and Conference Proceedings, pp. 719 726. JMLR.org, 2014. Staton, S. Commutative semantics for probabilistic programming. In Yang, H. (ed.), Proceedings of the 26th European Symposium on Programming, (ESOP 2017), Held as Part of the European Joint Conferences on Theory and Practice of Software (ETAPS 2017), volume 10201 of Lecture Notes in Computer Science, pp. 855 879. Springer, 2017. Staton, S., Yang, H., Wood, F. D., Heunen, C., and Kammar, O. Semantics for probabilistic programming: higherorder functions, continuous distributions, and soft constraints. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2016), pp. 525 534, 2016. Tolpin, D., van de Meent, J.-W., Paige, B., and Wood, F. Output-sensitive adaptive metropolis-hastings for probabilistic programs. In Appice, A., Rodrigues, P. P., Santos Costa, V., Gama, J., Jorge, A., and Soares, C. (eds.), Machine Learning and Knowledge Discovery in Databases, pp. 311 326, Cham, 2015. 
Springer International Publishing. ISBN 978-3-319-23525-7. Turitsyn, K. S., Chertkov, M., and Vucelja, M. Irreversible Monte Carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, 240(4):410 414, 2011. ISSN 0167-2789. Nonparametric Involutive Markov Chain Monte Carlo Vákár, M., Kammar, O., and Staton, S. A domain theory for statistical probabilistic programming. Proceedings of the ACM on Programming Languages, 3(POPL):36:1 36:29, 2019. Wand, M., Culpepper, R., Giannakopoulos, T., and Cobb, A. Contextual equivalence for a probabilistic language with continuous random variables and recursion. Proceedings of the ACM on Programming Languages, 2(ICFP):87:1 87:30, 2018. Wingate, D., Stuhlmüller, A., and Goodman, N. D. Lightweight implementations of probabilistic programming languages via transformational compilation. In Gordon, G. J., Dunson, D. B., and Dudík, M. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), volume 15 of JMLR Proceedings, pp. 770 778. JMLR.org, 2011. Wood, F. D., van de Meent, J., and Mansinghka, V. A new approach to probabilistic programming inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS 2014), volume 33 of JMLR Workshop and Conference Proceedings, pp. 1024 1032. JMLR.org, 2014. Yang, L., Hanrahan, P., and Goodman, N. D. Generating efficient MCMC kernels from probabilistic programs. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS 2014), volume 33 of JMLR Workshop and Conference Proceedings, pp. 1068 1076. JMLR.org, 2014. Zhou, Y., Gram-Hansen, B. J., Kohn, T., Rainforth, T., Yang, H., and Wood, F. LF-PPL: A low-level first order probabilistic programming language for non-differentiable models. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), volume 89 of Proceedings of Machine Learning Research, pp. 148 157. PMLR, 2019. Zhou, Y., Yang, H., Teh, Y. W., and Rainforth, T. Divide, conquer, and combine: a new inference strategy for probabilistic programs with stochastic support. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pp. 11534 11545. PMLR, 2020. Nonparametric Hamiltonian Monte Carlo (Appendix) A. Statistical PCF In this section, we present a functional probabilistic programming language (PPL) with (stochastic) branching and recursion, and its operational semantics. We also define what it means for a program to be almost surely terminating and integrable. We conclude the section by showing that a broad class of programs satisfies the assumptions for the NP-i MCMC inference algorithm described in Sec. 3. A.1. Syntax Statistical PCF (SPCF) is a statistical probabilistic extension of the call-by-value PCF (Scott, 1993; Sieber, 1990) with the reals and Booleans as the ground types. The terms and part of the typing system of SPCF are presented in Fig. 7. SPCF has three probabilistic constructs: (1) The continuous sampler normal draws from the standard Gaussian distribution N with mean 0 and variance 1. (2) The discrete sampler coin is a fair coin (formally coin draws from the Bernoulli distribution Bern(0.5) with probability 0.5). 
(3) The scoring construct score(M) enables conditioning on observed data by multiplying the weight of the current execution with the real number denoted by M. Remark A.1 (Continuous Sampler). The continuous sampler in most PPLs (Culpepper & Cobb, 2017; Wand et al., 2018; Ehrhard et al., 2018; Vákár et al., 2019; Mak et al., 2021a) draw from the standard uniform distribution U with endpoints 0 and 1. However, we decided against U since its support is not the whole of R, which is a common target space for inference algorithms (e.g. Hamiltonian Monte Carlo (HMC) inference algorithm). Instead our continuous sampler draws from the standard normal distribution N which has the whole of R as its support. This design choice does not restrict nor extend our language as we will see in Ex. 6. Remark A.2 (Discrete Sampler). Like (Danos & Ehrhard, 2011; Scibior et al., 2018), we choose the fair coin as our discrete sampler for its simplicity. However, as shown in Ex. 7, this is not limiting. (Ehrhard et al., 2014), for example, samples from the discrete uniform distribution. Following the convention, the set of all terms is denoted as Λ with meta-variables M, N, L, the set of free variables of a term M is denoted as FV(M) and the set of all closed terms is denoted as Λ0. In the interest of readability, we sometimes use pseudocode in the style of ML (e.g. Ex. 5) to express SPCF terms. Example 5. let rec f x = if coin then f(x+normal)else x in f 0 is a simple program which keeps tossing a coin and sampling from the normal distribution until the first coin failure, upon which it returns the sum of samples from the normal distribution. A.2. Primitive Functions Primitive functions play an important role in the expressiveness of SPCF. To be concise, we only consider partial, measurable functions of types Rn 2m R or Rn 2m 2 for some n, m N. Examples of these primitives include addition +, division /, comparison < and equality =. As we will see in Ex. 6 and 7, it is important that the cumulative distribution functions (cdf) and probability density functions (pdf) of distributions are amongst the primitives in F. However, we do not require all measurable functions to be primitives, unlike (Staton et al., 2016; Staton, 2017). Example 6. (1) Let cdfnormal be the cdf of the standard normal distribution. Then, the standard uniform distribution with endpoints 0 and 1 can be described as uniform = cdfnormal(normal) . (2) Any distribution with an inverse cdf f in the set of primitives can be described as f(uniform) . For instance, the Nonparametric Involutive Markov Chain Monte Carlo Types (typically denoted σ, τ) and terms (typically M, N, L): σ, τ ::= R | B | σ τ M, N, L ::= r | a | f(M1, . . . , Mℓ) (Constants and functions) | y | λy.M | M N (Higher-order) | if(L, M, N) | YM (Branching and recursion) | normal | coin | score(M) (Probabilistic) Typing system: a 2 Γ a : B {Γ Mi : R}n i=1 {Γ Nj : B}m j=1 f : Rn 2m G Γ f(M1, . . . , Mn, N1, . . . , Nm) : ( R if G = R B if G = 2 Γ L : B Γ M : σ Γ N : σ Γ if(L, M, N) : σ Γ M : (σ τ) (σ τ) Γ normal : R Γ coin : B Γ M : R Γ score(M) : R Figure 7. Syntax of SPCF, where r, q, p R, a, b 2, x, y, z are variables, and f, g, h ranges over a set F of primitive functions. inverse cdf of the exponential distribution (with rate 1) is f(p) := ln(1 p) and hence -ln(1-uniform) describes the distribution. (3) The Poisson distribution can be specified using the uniform distribution ((Devroye, 1986)) as follows. 
1 Poi(rate) = let p = exp(-rate) in 2 let rec f x p s = if s < uniform then f (x+1) (p*rate/x) (s+p) else x 3 in f 0 p p Example 7. It might be beneficial for some inference algorithm if discrete distributions are specified using discrete random variables. Hence, we show how different discrete distributions can be specified by our discrete sampler coin . (1) The Bernoulli distribution with probability p [0, 1] D, where D := { n 2m | n, m N} is the set of all Dyadic numbers, can be specified by 1 let rec bern(p) = if p = 0 then False else 2 if p = 1 then True else 3 if p < 0.5 then 4 if coin then bern(2*p) else False 5 else 6 if coin then True else bern(2*(p-0.5)) (2) The geometric distribution with rate p [0, 1] D can be specified by 1 geo(p) = count = 0; while bern(1-p): count += 1; return count (3) The binomial distribution with n N trails and probability p [0, 1] D can be specified by bin(n,p) = sum([1 for i in range(n)if bern(p)]) . (4) Let pdf Pio and pdfgeo be the pdfs of the Poisson and geometric distributions respectively. Then, the Poisson distribution can be described by Nonparametric Involutive Markov Chain Monte Carlo 1 Poi(rate) = n = geo(0.5); 2 score(pdf Poi(rate,n)/pdfgeo(0.5,n)); 3 return n A.3. Church Encodings We can represent pairs and lists in SPCF using Church encoding as follows: Pair(σ, τ) := σ τ (σ τ R) R List(σ) := (σ R R) (R R) M, N λz.z M N [M1, . . . , Mℓ] λfx.f M1(f M2 . . . (f Mℓ0)) Moreover standard primitives on pairs and lists, such as projection , len , append and sum , can be defined easily. A.4. Operational Semantics A.4.1. TRACE SPACE Since normal samples from the standard normal distribution N and coin from the Bernoulli distribution Bern(0.5), the sample space of SPCF is the union of the measurable spaces of R and 2. Formally it is the measurable space with set Ω:= R 2, σ-algebra ΣΩ:= {V W | V B, W Σ2} and measure µΩ(V W) := N(V ) + Bern(0.5)(W). We denote the product of n copies of the sample space as (Ωn, ΣΩn, µΩn) and call it the n-dimensional sample space. A trace is a record of the values sampled in the course of an execution of a SPCF term. Hence, the trace space is the union of sample spaces of varying dimension. Formally it is the measurable space with set T := S n N Ωn, σalgebra ΣT := {S n N Un | Un ΣΩn} and measure µT(S n N Un) = P n N µΩn(Un). We present traces as lists, e.g. [ 0.2, T, T, 3.1, F] and []. Remark A.3. Another way of recording the sampled value in a run of a SPCF term is to have separate records for the values of the continuous and discrete samples. In this case, the trace space will be the set S m N 2m. We find separating the continuous and discrete samples unnecessarily complex for our purposes and hence follow the more conventional definition of trace space. A.4.2. SMALL-STEP REDUCTION The small-step reduction of SPCF terms can be seen as a rewrite system of configurations, which are triples of the form M, w, t where M is a closed SPCF term, w > 0 is a weight, and t T a trace, as defined in Fig. 8. In the rule for normal, a random value r R is generated and recorded in the trace, while the weight remains unchanged: even though the program samples from a normal distribution, the weight does not factor in Gaussian densities as they are already accounted for by µT. Similarly, in the rule for coin, a random boolean a 2 is sampled and recorded in the trace with an unchanged weight. 
In the rule for score(r), the current weight is multiplied by r R: typically this reflects the likelihood of the current execution given some observed data. Similar to (Borgström et al., 2016) we reduce terms which cannot be reduced in a reasonable way (i.e. scoring with nonpositive constants or evaluating functions outside their domain) to fail. We write + for the transitive closure and for the reflexive and transitive closure of the small-step reduction. A.4.3. VALUE AND WEIGHT FUNCTIONS Following (Borgström et al., 2016), we view the set Λ of all SPCF terms as S n,m N(SKn,m Rn 2m) where SKn,m is the set of SPCF terms with exactly n real-valued and m boolean-valued place-holders. The measurable space of terms is equipped with the σ-algebra ΣΛ that is the Borel algebra of the countable disjoint union topology of the product topology of the discrete topology on SKn,m, the standard topology on Rn and the discrete topology on 2m. Similarly the subspace Λ0 v of closed values inherits the Borel algebra on Λ. Let M be a closed SPCF term. Its value function value M : T Λ0 v { } returns, given a trace, the output value of the program, if the program terminates in a value. Its weight function weight M : T [0, ) returns the final weight of the Nonparametric Involutive Markov Chain Monte Carlo Values (typically denoted V ), redexes (typically R) and evaluation contexts (typically E): V ::= r | a | λy.M R ::= f(c1, . . . , cℓ) | (λy.M) V | if(a, M, N) | Y(λy.M) | normal | coin | score(r) E ::= [] | E M | (λy.M) E | if(E, M, N) | f(c1, . . . , ci 1, E, Mi+1, . . . , Mℓ) | YE Redex contractions: f(c1, . . . , cℓ), w, t ( f(c1, . . . , cℓ), w, t if (c1, . . . , cℓ) Dom(f), fail otherwise (λy.M) V, w, t M[V/y], w, t if(a, M, N), w, t ( M, w, t if a, N, w, t otherwise Y(λy.M), w, t λz.M[Y(λy.M)/y] z, w, t (for fresh variable z) normal, w, t r, w, t ++ [r] (for some r R) coin, w, t a, w, t ++ [a] (for some a 2) score(r), w, t ( r, r w, t if r > 0, fail otherwise. Evaluation contexts: R, w, t , w , t E[R], w, t E[ ], w , t R, w, t fail E[R], w, t fail Figure 8. Small-step reduction of SPCF, where r, q, p R, a, b 2, c R 2, x, y, z are variables, and f, g, h ranges over the set F of primitive functions. corresponding execution. Formally: value M(t) := ( V if M, 1, [] V, w, t otherwise. weight M(t) := ( w if M, 1, [] V, w, t 0 otherwise. It follows readily from (Borgström et al., 2016) that the functions value M and weight M are measurable. Finally, every closed SPCF term M has an associated value measure M on Λ0 v given by M : ΣΛ0v [0, ) value M 1(U) weight M dµT Remark A.4. A trace is in the support of the weight function if and only if the value function returns a (closed) value when given this trace. i.e. Supp(weight M) = value M 1(Λ0) for all closed SPCF term M. Remark A.5. The weight function defined here is the density of the target distribution from which an inference algorithm typically samples. In this work, we call it the weight function when considering semantics following (Culpepper & Cobb, 2017; Vákár et al., 2019; Mak et al., 2021a), and call it density function when discussing inference algorithms following (Zhou et al., 2019; 2020; Cusumano-Towner et al., 2020). A.5. Tree Representable Functions We consider a necessary condition for the weight function of closed SPCF terms which would help us in designing inference algorithms for them. Note that not every function of type T [0, ) makes sense as a weight function. Consider the program Nonparametric Involutive Markov Chain Monte Carlo [t1] ? 
Supp1(w) [t1, t2] ? Supp2(w) w([t1, t2]) t3 [t1, t2, t3] ? Supp3(w) w([t1, t2, t3]) ... Figure 9. Program tree of a tree representable function w let rec f x = if coin then f(x+normal)else x in f 0 in Ex. 5. This program executes successfully with the trace [T, 0.5, F]. This immediately tells us that upon sampling T and 0.5, there must be a sample following them, and this third sample must be a boolean. In other words, the program does not terminate with any proper prefix of [T, 0.5, F] such as [T, 0.5], nor any traces of the form [T, 0.5, r] for r R. Hence, we consider measurable functions w : T [0, ) satisfying prefix property: whenever t Suppn(w)7 then for all k < n, we have t1...k Suppk(w); and type property: whenever t Suppn(w) then for all k < n and for all t Ω\ Type(tk+1)8 we have t1...k ++ [t] Suppk+1(w). They are called tree representable (TR) functions (Mak et al., 2021b) because any such function w can be represented as a (possibly) infinite but finitely branching tree, which we call program tree. This is exemplified in Fig. 9 (left), where a hexagon node denotes an element of the input of type Ω; a triangular node gives the condition for t Suppn(w) (with the left, but not the right, child satisfying the condition); and a leaf node gives the result of the function on that branch. Any branch (i.e. path from root to leaf) in a program tree of w represents a set of finite sequences [t1, . . . , tn] in Supp(w). In fact, every program tree of a TR function w specifies a countable partition of Supp(w) via its branches. The prefix property guarantees that for each TR function w, there are program trees of the form in Fig. 9 representing w. The program tree of M is depicted in Fig. 9 (right), where a circular node denotes a real-valued input and a squared node denotes a boolean-valued input. The following proposition ties SPCF terms and TR functions together. Proposition A.6. Every closed SPCF term has a tree representable weight function. We will see in Sec. 3 how the TR functions, in particular the prefix property, is instrumental in the design of the inference algorithm. 7Suppn(w) := Supp(w) Ωn for all n N. 8The type Type(t) of a sample t Ωis R if t R and is 2 if t 2. Nonparametric Involutive Markov Chain Monte Carlo A.6. Almost Sure Termination and Integrability Definition A.7. We say a SPCF term M terminates almost surely if M is closed and µT({t T | V, w . M, 1, [] V, w, t }) = 1. We denote the set of terminating traces as Tter := {t T | V, w . M, 1, [] V, w, t }. Remark A.8. The set of traces on which a closed SPCF term M terminates, i.e. {t T | V, w . M, 1, [] V, w, t }, can be understood as the support of its weight function Supp(weight M), or as discussed in Rem. A.4, the traces on which the value function returns a value, i.e. value M 1(Λ0 v). Hence, M almost surely terminates if and only if µT(Supp(weight M)) = µT(value M 1(Λ0 v)) = 1. Definition A.9. Following (Mak et al., 2021a), we say a trace t T is maximal w.r.t. a closed term M if there exists a term N, weight w where M, 1, [] N, w, t and for all t T \ {[]} and all terms N , N, w, t N , w , t ++ t . We denote the set of maximal traces as Tmax. Proposition A.10 ((Mak et al., 2021a), Lemma 9). A closed term M is almost surely terminating if µT(Tmax \ Tter) = 0. Proposition A.11. The value measure M of a closed almost surely terminating SPCF term M which does not contain score( ) as a subterm is probabilistic. Definition A.12. We say a SPCF term M is integrable if M is closed and its value measure is finite, i.e. 
⟦M⟧(Λ⁰_v) < ∞.
Proposition A.13. An integrable term has an integrable weight function.
Example 8. We now look at a few examples showing that almost-sure termination and integrability identify two distinct sets of SPCF terms.
(1) The term M1 defined as let rec f x = if coin then f (x+1) else x in score(2**(f 0)) almost surely terminates, since it only diverges on the infinite trace [F, F, . . . ], which has zero probability. However, it is not integrable, as the value measure applied to all closed values,
⟦M1⟧(Λ⁰_v) = ∫_{{[T,...,T,F]}} weight_M1 dµ_T = Σ_{n=0}^∞ ∫_{{[T]^n ++ [F]}} weight_M1 dµ_{2^{n+1}}⁹ = Σ_{n=0}^∞ (1/2)^{n+1} · 2^n = Σ_{n=0}^∞ 1/2,
is infinite.
(2) Consider the term M2 defined as if coin then Y (lambda x:x) 0 else 1 . Since it reduces to a diverging term, namely Y (lambda x:x) 0 , with non-zero probability, it does not terminate almost surely. However, it is integrable, since ⟦M2⟧(Λ⁰_v) = ∫_{{[F]}} weight_M2 dµ_T = 1/2 < ∞.
(3) The term M3 defined as if(coin, M1, M2) is neither almost surely terminating nor integrable, since M1 is not integrable and M2 is not almost surely terminating.
(4) All terms considered previously in Ex. 5 to 7 are both almost surely terminating and integrable.
⁹We write [x]^n for the list that contains n copies of x.

B. Hybrid Nonparametric Involutive MCMC and its Correctness
In this section, we present the Hybrid Nonparametric Involutive Markov chain Monte Carlo (Hybrid NP-iMCMC) algorithm, an inference algorithm that simulates the probabilistic model specified by a given SPCF program which may contain both discrete and continuous samplers. To start, we detail the Hybrid NP-iMCMC inference algorithm: its state space, the conditions on its inputs and the steps used to generate the next sample; and we study how the sampler moves between states of varying dimensions and returns new samples of a nonparametric probabilistic program. We then give an implementation of Hybrid NP-iMCMC in SPCF and demonstrate how the Hybrid NP-iMCMC method extends the MH sampler. Last but not least, we conclude with a discussion of the correctness of Hybrid NP-iMCMC.

B.1. State Spaces
A state in the Hybrid NP-iMCMC algorithm is a pair (x, v) of parameter and auxiliary variables of equal dimension (but not necessarily equal length). The parameter variable x is used to store traces and the auxiliary variable v is used to record randomness. Both variables are vectors of entropies, i.e. Real-Boolean pairs. This section gives the formal definitions of the entropy, parameter and auxiliary variables and of the state, in preparation for the discussion of the Hybrid NP-iMCMC sampler.

B.1.1. ENTROPY SPACE
As shown in App. A, the reduction of a SPCF program is determined by the input trace t ∈ T := ⋃_{n∈N} (R ⊎ 2)^n, a record of the values drawn in a particular run of the program. Hence, in order to simulate a probabilistic model described by a SPCF program, the Hybrid NP-iMCMC sampler should generate Markov chains on the trace space. However, traversing the trace space is a delicate business, because the positions and numbers of discrete and continuous values in a trace given by a SPCF program may vary. (Consider if coin: normal else: coin .) Instead, we pair each value t_i in a trace t with a random value t′ of the other type to make a Real-Boolean pair (t_i, t′) (or (t′, t_i)). For instance, the trace [T, −3.1] can be made into a Real-Boolean vector [(1.5, T), (−3.1, T)] with randomly drawn values 1.5 and T.
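As a concrete reading of this pairing (the initialisation step of the sampler), here is a minimal Python sketch; to_entropy_vector is an illustrative name, and traces are modelled as Python lists whose entries are bool or float.

```python
import random

def to_entropy_vector(trace):
    # Pair each sampled value with a freshly drawn value of the *other* type,
    # giving a vector of Real-Boolean entropies.
    entropies = []
    for t in trace:
        if isinstance(t, bool):
            entropies.append((random.gauss(0.0, 1.0), t))   # boolean sample: add a fresh normal draw
        else:
            entropies.append((t, random.random() < 0.5))    # real sample: add a fresh fair coin
    return entropies

print(to_entropy_vector([True, -3.1]))   # e.g. [(1.5, True), (-3.1, True)]
```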
Now, the position of discrete and continuous random variables does not matter and the number of discrete and continuous random variables are fixed in each vector. We call a Real-Boolean pair an entropy and define the entropy space E to be the product space R 2 of the Borel measurable space and the Boolean measurable space, equipped with the σ-algebra ΣE := σ {R B | R B, B Σ2} , and the product measure µE := N µ2 where µ2 := Bern(0.5). Note the Radon-Nikodym derivative ϕE of µE can be defined as ϕE(r, a) := 1 2ϕ(r). A n-length entropy vector is then a vector of n entropies, formally an element in the product measurable space (En, ΣEn). We write |x| to mean the length of the entropy vector x. As mentioned earlier, the parameter variable of a state is an entropy vector that stores traces. Hence, it would be useless if a unique trace cannot be restored from an entropy vector. We found that such a recovery is possible if the trace is in the support of a tree representable function. Say we would like to recover the trace ˆt that is used to form the entropy vector x by pairing each value in the trace with a random value of the other type. First we realise that traces can be made by selecting either the Real or Boolean component of each pair in a prefix of x. For example, traces like [], [T], [ 0.2], [T, 2.9] and [ 0.2, T, F] can be made from the entropy vector [( 0.2, T), (2.9, T), (1.3, F)]. We call these traces instances of the entropy vector. Formally, a trace t T is an instance of an entropy vector x En if |t| n and ti {r, a | (r, a) = xi} for all i = 1, . . . , |t|. We denote the set of all instances of x as instance(x) T. Then, the trace ˆt must be an instance of x. Moreover, if we can further assume that ˆt is in the support of a tree representable function, then Prop. B.1 says we can uniquely identify ˆt amongst all instances of x. Proposition B.1. There is at most one (unique) trace that is both an instance of an entropy vector and in the support of a tree representable function. Finally, we consider differentiability on the multi-dimensional entropy space. We say a function f : Ek1 Ek2 is differentiable almost everywhere if for all i 2k1, j 2k2, the partial function fi j : Rk1 Rk2 where fi j(r) = q f(zip(r, i)) = (zip(q, j))10 10We write zip(ℓ1, ℓ2) to be the n-length vector [(ℓ1 1, ℓ2 1), (ℓ1 2, ℓ2 2), . . . , (ℓ1 n, ℓ2 n)] (L1 L2)n for any vectors ℓ1 Ln1 1 and ℓ2 Ln2 2 with n := min{n1, n2}. Nonparametric Involutive Markov Chain Monte Carlo is differentiable almost everywhere on its domain Dom(fi j) := {r Rk1 | q Rk2 . f(zip(r, i)) = (zip(q, j))}. The Jacobian of f on (zip(r, i)) is given by fi j(r), if it exists. B.1.2. PARAMETER SPACE A parameter variable x of dimension n is an entropy vector of length ιX(n) where ιX : N N is a strictly monotone map. For instance, the parameter variable x := [( 0.2, T), (2.9, T), (1.3, F)] is of dimension two if ιX(n) := n + 1, and dimension three if ιX(n) := n. We write dim(x) to mean the dimension of x and |x| to mean the length of x. Hence, dim(x) |x| and ιX(dim(x)) = |x|. We extend the notion of dimension to traces and say a trace t T has dimension n if |t| = ιX(n). Importantly, we assume that every trace in the support of w has a dimension (w.r.t. ιX), i.e. Supp(w) = S n N SuppιX(n)(w). Formally, the n-dimensional parameter space (X(n), ΣX(n)) is the product of ιX(n) copies of the entropy space (E, ΣE) and the base measure µX(n) on X(n) is the product of ιX(n) copies of the entropy measure µE with the Radon-Nikodym derivative ϕX(n). 
For ease of reference, we write (X, ΣX, µX) for the one-dimensional parameter space. B.1.3. AUXILIARY SPACE Similarly, an auxiliary variable v of dimension n is an entropy vector of length ιY(n) where ιY : N N is a strictly monotone map. The n-dimensional auxiliary space (Y(n), ΣY(n)) is the product of ιY(n) copies of the entropy space (E, ΣE) and the base measure µY(n) on Y(n) is the product of ιY(n) copies of the entropy measure µE with the Radon-Nikodym derivative ϕY(n). For ease of reference, we write (Y, ΣY, µY) for the one-dimensional auxiliary space. B.1.4. STATE SPACE A state is a pair of equal dimension but not necessarily equal length parameter and auxiliary variable. For instance with ιX(n) := n + 1 and ιY(n) := n, the parameter variable x := [( 0.2, T), (2.9, T), (1.3, F)] and the auxiliary variable v := [(1.5, T), ( 2.1, F)] are both of dimension two and (x, v) is a two-dimensional state. Formally, the state space S is the list measurable space of the product of parameter and auxiliary spaces of equal dimension, i.e. S := S n N(X(n) Y(n)), equipped with the σ-algebra ΣS := σ{Xn Vn | Xn ΣX(n), Vn ΣY(n), n N} and measure µS(S) := P Y(n) µX(n)({x X(n) | (x, v) S}) µY(n)(dv). We write S(n) for the set consisting of all n-dimensional states. We extend the notion of instances to states and say a trace t is an instance of a state (x, v) if it is an instance of the parameter component x. The distinction between dimension and length in parameter and auxiliary variables gives us the necessary pliancy to discuss techniques for further extension of the Hybrid NP-i MCMC sampler in App. C. Before that, we present the inputs to the Hybrid NP-i MCMC sampler. B.2. Inputs of Hybrid NP-i MCMC Algorithm Besides the target density function, the Hybrid NP-i MCMC sampler, like i MCMC, introduces randomness via auxiliary kernels and moves around the state space via involutions in order to propose the next sample. We now examine each of these inputs closely. B.2.1. TARGET DENSITY FUNCTION Similar to other inference algorithms for probabilistic programming, the Hybrid NP-i MCMC sampler takes the weight function w : T [0, ) as the target density function. Recall w(t) gives the weight of a particular run of the given probabilistic program indicated by the trace t. By Prop. A.6, the weight function w is tree representable. For the sampler to work properly, we also require weight function w to satisfy the following assumptions. (H1) w is integrable, i.e. R T w dµT =: Z < (otherwise, the inference problem is undefined). (H2) w is almost surely terminating (AST), i.e. µT({t T | w(t) > 0}) = 1 (otherwise, the loop in the Hybrid NP-i MCMC algorithm may not terminate almost surely). Nonparametric Involutive Markov Chain Monte Carlo Virtually all useful probabilistic models can be specified by SPCF programs with densities satisfying H1 and 2. Exceptions are models that are not normalizable or diverge with non-zero probability. (See App. A.6 for more details.) B.2.2. AUXILIARY KERNELS To introduce randomness, the Hybrid NP-i MCMC sampler takes, for each n N, a probability auxiliary kernel K(n) : X(n) Y(n) which gives a probability distribution K(n)(x, ) on Y(n) for each n-dimensional parameter variable x. We assume each auxiliary kernel K(n) has a probability density function (pdf) pdf K(n) : X(n) Y(n) [0, ) w.r.t. µY(n). B.2.3. 
INVOLUTIONS To move around the state space S, the Hybrid NP-i MCMC sampler takes, for each n N, an endofunction Φ(n) on X(n) Y(n) that is both involutive and differentiable almost everywhere. We require the set {Φ(n)}n of involutions to satisfy the projection commutation property: (H3) For all (x, v) S where dim(x) = m, if SuppιX(n)(w) instance(x) = for some n, then for all k = n, . . . , m, takek(Φ(m)(x, v)) = Φ(k)(takek(x, v)) where takek is the projection that given a state (x, v), takes the first ιX(k) coordinates of x and the first ιY(k) coordinates of v and forms a k-dimensional state. The projection commutation property ensures that the order of applying a projection and an involution to a state (which has an instance in the support of the target density function) does not matter. B.3. The Hybrid NP-i MCMC Algorithm After identifying the state space and the necessary conditions on the inputs of the Hybrid NP-i MCMC sampler, we have enough foundation to describe the algorithm. Given a SPCF program M with weight function w on the trace space T, the Hybrid Nonparametric Involutive Markov chain Monte Carlo (Hybrid NP-i MCMC) algorithm generates a Markov chain on T as follows. Given a current sample t0 of dimension k0 (i.e. |t0| = ιX(k0)), 1. (Initialisation Step) Form a k0-dimensional parameter variable x0 X(k0) by pairing each value t0i in t0 with a randomly drawn value t of the other type to make a pair (t0i, t) or (t, t0i) in the entropy space E. Note that t0 is the unique instance of x0 that is in the support of w. 2. (Stochastic Step) Introduce randomness to the sampler by drawing a k0-dimensional value v0 Y(k0) from the probability measure K(k0)(x0, ). 3. (Deterministic Step) Move around the n-dimensional state space X(n) Y(n) and compute the new state (x, v) by applying the involution Φ(n) to the initial state (x0, v0) where n = dim (x0) = dim (v0). 4. (Extend Step) Test whether any instance t of x is in the support of w. If so, proceed to the next step with t as the proposed sample; otherwise (i) Extend the n-dimensional initial state to a state (x0 ++ y0, v0 ++ u0) of dimension n + 1 where y0 and u0 are values drawn randomly from µEιX(n+1) ιX(n) and µEιY(n+1) ιY(n) respectively, (ii) Go to Step 3 with the initial state (x0, v0) replaced by (x0 ++ y0, v0 ++ u0). 5. (Accept/reject Step) Accept the proposed sample t as the next sample with probability min 1; w(t) pdf K(k)(takek(x, v)) ϕX(n)(x) ϕY(n)(v) w(t0) pdf K(k0)(takek0(x0, v0)) ϕX(n)(x0) ϕY(n)(v0) |det( Φ(n)(x0, v0))| (1) where n = dim (x0) = dim (v0), k is the dimension of t and k0 is the dimension of t0; otherwise reject the proposal and repeat t0. Nonparametric Involutive Markov Chain Monte Carlo Remark B.2. The integrable assumption on the target density (H1) ensures the inference problem is well-defined. The almost surely terminating assumption on the target density (H2) guarantees that the Hybrid NP-i MCMC sampler almost surely terminates. (See App. B.4.1 for a concrete proof.) The projection commutation property on the involutions (H3) allows us to define the invariant distribution B.3.1. MOVEMENT BETWEEN SAMPLES OF VARYING DIMENSIONS All MCMC samplers that simulate a nonparametric model must decide how to move between samples of varying dimensions. We now discuss how the Hybrid NP-i MCMC sampler as given in App. B.3 achieves this. Form initial and new states in the same dimension Say the current sample t0 has a dimension of k0. 
Step 1 to 3 form a k0-dimensional initial state (x0, v0) and a new k0-dimensional state (x, v). Move between dimensions The novelty of Hybrid NP-i MCMC is its ability to generate a proposed sample t in the support of the target density w which may not be of same dimension as t0. This is achieved by Step 4. Propose a sample of a lower dimension Step 4 first checks whether any instance of the parameter-component x X(k0) of the new state (computed in Step 3) is in the support of w. If so, we proceed to Step 5 with that instance, say t, as the proposed sample. Say the dimension of t is k. Then, we must have k k0 as the instance t T of a k0-dimensional parameter x X(k0) must have a dimension that is lower than or equals to k0. Hence, the dimension of the proposed sample t is lower than or equals to the current sample t0. Propose a sample of a higher dimension Otherwise (i.e. none of the instances of x X(k0) is in the support of w) Step 4 extends the initial state (x0, v0) X(k0) Y(k0) to (x0 ++ y0, v0 ++ u0) X(k0+1) Y(k0+1); and computes a new (k0 + 1)-dimensional state (x ++ y, v ++ u) X(k0+1) Y(k0+1) (via Step 3). This process of incrementing the dimensions of both the initial and new states is repeated until an instance t of the new state, say of dimension n, is in the support of w. At which point, the proposed sample is set to be t. Say the dimension of t is k. Then, we must have k > k0 as t is not an instance of the k0-dimensional parameter x X(k0) but one of x ++ y X(n). Hence, the dimension of the proposed sample t is higher than the current sample t0. Accept or reject the proposed sample Say the proposed sample t is of dimension k. With the probability given in Equation (1), Step 5 accepts t as the next sample and Hybrid NP-i MCMC updates the current sample t0 of dimension k0 to a sample t of dimension k. Otherwise, the current sample t0 is repeated and the dimension remains unchanged. B.3.2. HYBRID NP-IMCMC IS A GENERALISATION OF NP-IMCMC Given a target density w on S n N Rn, we can set the entropy space E to be R and the index maps ιX and ιY to be identities. Then, the n-dimensional parameter space X(n) := Rn, the n-dimensional auxiliary space Y(n) := Rn and the state space S := S n N(X(n) Y(n)) = S n N(Rn Rn) of the Hybrid NP-i MCMC sampler matches with those given in Sec. 3.2 for the NP-i MCMC sampler. An instance t is then a prefix x1..k of a parameter variable x. Moreover, the assumptions H1 to 3 on the inputs of Hybrid NP-i MCMC are identical to those V1 to 3 on the inputs of NP-i MCMC. Hence the Hybrid NP-i MCMC algorithm (App. B.3) is a generalisation of the NP-i MCMC sampler (Fig. 2). B.3.3. PSEUDOCODE OF HYBRID NP-IMCMC ALGORITHM We implement the Hybrid NP-i MCMC algorithm in the flexible and expressive SPCF language explored in App. A. The NPi MCMC function in Listing 2 is an implementation of the Hybrid NP-i MCMC algorithm in SPCF. We assume that the following SPCF types and terms exist. For each n N, the SPCF types T , X[n] and Y[n] implements T, X(n) and Y(n) respectively; the SPCF term w of type T -> R implements the target density w; for each n N, the SPCF terms auxkernel[n] of type X[n] -> Y[n] implements the auxiliary kernel K(n) : X(n) Y(n); pdfauxkernel[n] of type X[n]*Y[n] -> R implements the probability density function pdf K(n) : X(n) Y(n) R of the auxiliary kernel; involution[n] of type X[n]*Y[n] -> X[n]*Y[n] implements the involution Φ(n) on X(n) Y(n); and Nonparametric Involutive Markov Chain Monte Carlo Listing 2. 
Pseudocode of the Hybrid NP-i MCMC algorithm 1 def NPi MCMC(t0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 v0 = auxkernel[k0](x0) # stochastic step 5 (x,v) = involution[k0](x0,v0) # deterministic step 6 n = k0 # extend step 7 while not intersect(instance(x),support(w)): 8 x0 = x0 + [(normal, coin)]*(index X(n+1)-index X(n)) 9 v0 = v0 + [(normal, coin)]*(index Y(n+1)-index Y(n)) 10 n = n + 1 11 (x,v) = involution[n](x0,v0) 12 t = intersect(instance(x),support(w))[0] # accept/reject step 13 k = dim(t) 14 return t if uniform < min{1, w(t)/w(t0) * pdfauxkernel[k](proj((x,v),k))/ 15 pdfauxkernel[k0](proj((x0,v0),k0)) * 16 pdfpar[n](x)/pdfpar[n](x0) * 17 pdfaux[n](v)/pdfaux[n](v0) * 18 absdetjacinv[n](x0,v0)} 19 else t0 absdetjacinv[n] of type X[n]*Y[n] -> R implements the absolute value of the Jacobian determinant of Φ(n). We further assume that the following primitives are implemented: dim returns the dimension of a given trace; index X and index Y implements the maps ιY and ιX respectively; pdfpar[n] implements the derivative ϕX(n) of the n-dimensional parameter space X(n); pdfaux[n] implements the derivative ϕY(n) of the n-dimensional auxiliary space Y(n); instance returns a list of all instances of a given entropy vector; support returns a list of traces in the support of a given function; and proj implements the projection function where proj((x,v),k)=(x[:index X(k)],v[:index Y(k)]) . B.4. Correctness The Hyrbid Nonparametric Involutive Markov chain Monte Carlo (Hyrbid NP-i MCMC) algorithm is presented in App. B.3 for the simulation of probabilistic models specified by probabilistic programs. We justify this by proving that the Markov chain generated by iterating the Hybrid NP-i MCMC algorithm preserves the target distribution, specified by ν : ΣT [0, ) U w dµT where Z := Z as long as the target density function w (given by the weight function of the probabilistic program) is integrable (H1) and almost surely terminating (H2); with a probability kernel K(n) : X(n) Y(n) and an endofunction Φ(n) on X(n) Y(n) that is involutive and differentiable almost everywhere for each n N such that {Φ(n)}n satisfies the projection commutation property (H3). Throughout this chapter, we assume the assumptions stated above, and prove the followings. 1. The Hybrid NP-i MCMC sampler almost surely returns a sample for the simulation (Lem. B.4). 2. The state movement in the Hybrid NP-i MCMC sampler preserves a distribution on the states (Lem. B.14). 3. The marginalisation of the state distribution which the state movement of Hybrid NP-i MCMC preserves coincides with the target distribution (Lem. B.17). Nonparametric Involutive Markov Chain Monte Carlo B.4.1. ALMOST SURE TERMINATION In Rem. B.2, we asserted that the almost surely terminating assumption (H2) on the target density guarantees that the Hybrid NP-i MCMC algorithm (App. B.3) almost surely terminates. We justify this claim here. Step 3 in the Hybrid NP-i MCMC algorithm (App. B.3) repeats itself if the sample-component x of the new state (x, v) (computed by applying the involution Φ(n) on the extended initial state (x0, v0)) does not have an instance in the support of w. This loop halts almost surely if the measure of {(x0, v0) S | (x, v) = Φ(n)(x0, v0) and instance(x) Supp(w) = } tends to zero as the dimension n tends to infinity. 
Since Φ(n) is invertible and |det Φ(n)(x0, v0)| > 0 for all n N and (x0, v0) S, µS({(x0, v0) S | (x, v) = Φ(n)(x0, v0) and instance(x) Supp(w) = }) = Φ(n) µS({(x, v) S | instance(x) Supp(w) = }) < µS({(x, v) S | instance(x) Supp(w) = }) = µX(n)({x X(n) | instance(x) Supp(w) = }). Thus it is enough to show that the measure of a n-dimensional parameter variable not having any instances in the support of w tends to zero as the dimension n tends to infinity, i.e. µX(n)({x X(n) | instance(x) Supp(w) = }) 0 as n . We start with the following proposition which shows that the chance of a n-dimensional parameter variable having some instances in the support of w is the same as the chance of w terminating before n reduction steps. Proposition B.3. µX(n)({x X(n) | instance(x) Supp(w) = }) = µT(Sn i=1 SuppιX(i)(w)) for all n N and all tree representable function w. Proof. Let n N and w be a tree representable function. For each i n, we unpack the set {x X(i) | instance(x) A = } of i-dimensional parameter variables that has an instance in the set A ΣΩιX(i) of traces of length ιX(i) where Ω:= R 2. Write π : {1, . . . , ιX(i)} {R, 2} for the measurable space π(1) π(2) π(ιX(i)) with a probability measure µπ := µΩιX(i) on π; π 1 for the inverse measurable space of π, i.e. π 1(j) := Ω\ π(j) for all j ιX(i); and S for the set of all such measurable spaces. Then, for any i-dimensional parameter variable x, t instance(x) A if and only if there is some π S where t A π and x zip(A π, π 1). Hence, {x X(i) | instance(x) A = } can be written as S π S zip(A π, π 1). Moreover µX(i)(zip(A π, π 1)) = µπ(A π) µπ 1(π 1) = µπ(A π). Consider the case where A := SuppιX(i)(w). Then, we have {x X(i) | instance(x) SuppιX(i)(w) = } = [ π S zip(SuppιX(i)(w) π, π 1). We first show that this is a disjoint union, i.e. for all π S, zip(SuppιX(i)(w) π, π 1) are disjoint. Let x zip(SuppιX(i)(w) π1, π1 1) zip(SuppιX(i)(w) π2, π2 1) where π1, π2 S. Then, at least one instance t1 of x is in SuppιX(i)(w) π1 and similarly at least one instance t2 of x is in SuppιX(i)(w) π2. By Prop. B.1, t1 = t2 and hence π1 = π2. Since zip(SuppιX(i)(w) π, π 1) are disjoint for all π S, we have µX(i)({x X(i) | instance(x) SuppιX(i)(w) = }) = µX(i)( [ π S zip(SuppιX(i)(w) π, π 1)) π S µX(i)(zip(SuppιX(i)(w) π, π 1)) = X π S µπ(SuppιX(i)(w) π) π S µΩιX(i)(SuppιX(i)(w) π) = µΩιX(i)(SuppιX(i)(w)) = µT(SuppιX(i)(w)). Nonparametric Involutive Markov Chain Monte Carlo Finally, {x X(n) | instance(x) Supp(w) = } is equal to Sn i=1{x X(i) | instance(x) SuppιX(i)(w) = } EιX(n) ιX(i) and hence µX(n)({x X(n) | instance(x) Supp(w) = }) i=1 {x X(i) | instance(x) SuppιX(i)(w) = } EιX(n) ιX(i)) i=1 µX(i)({x X(i) | instance(x) SuppιX(i)(w) = }) i=1 µT(SuppιX(i)(w)) i=1 SuppιX(i)(w)) Prop. B.3 links the termination of the Hybrid NP-i MCMC sampler with that of the target density function w. Hence by assuming that w terminates almost surely (H2), we can deduce that the Hybrid NP-i MCMC algorithm (App. B.3) almost surely terminates. Lemma B.4 (Almost Sure Termination). Assuming H 2, the Hybrid NP-i MCMC algorithm (App. B.3) almost surely terminates. Proof. Since Φ(n) is invertible for all n N, and w almost surely terminates (H2), i.e. limm µT(Sm j=1 Suppj(w)) = 1, we deduce from Prop. B.3 that µS({(x0, v0) S | (x, v) = Φ(n)(x0, v0) and instance(x) Supp(w) = }) < µX(n)({x X(n) | instance(x) Supp(w) = }) = µX(n)(X(n) \ {x X(n) | instance(x) Supp(w) = }) = 1 µX(n)({x X(n) | instance(x) Supp(w) = }) i=1 SuppιX(i)(w)) (Prop. B.3) 1 1 = 0 as n . 
(H2) So the probability of satisfying the condition of the loop in Step 3 of Hybrid NP-i MCMC sampler tends to zero as the dimension n tends to infinity, making the Hybrid NP-i MCMC sampler (App. B.3) almost surely terminating. B.4.2. INVARIANT STATE DISTRIBUTION After ensuring the Hybrid NP-i MCMC sampler (App. B.3) almost always returns a sample (Lem. B.4), we identify the distribution on the states and show that it is invariant against the movement between states of varying dimensions in Hybrid NP-i MCMC. State Distribution Recall a state is an equal dimension parameter-auxiliary pair. We define the state distribution π on the state space S := S n N(X(n) Y(n)) to be a distribution with density ζ (with respect to µS) given by 1 Z w(t) pdf K(k)(takek(x, v)) if (x, v) Svalid and t instance(x) Supp(w) has dimension k 0 otherwise where Z := R T w dµT (which exists by H1) and Svalid is the subset of S consisting of all valid states. Nonparametric Involutive Markov Chain Monte Carlo Remark B.5. If there is some trace in instance(x) Supp(w) for a parameter variable x, by Prop. B.1 this trace t is unique and hence x represents a sample of the target distribution. We say a n-dimensional state (x, v) is valid if (i) instance(x) Supp(w) = , and (ii) (y, u) = Φ(n)(x, v) implies instance(y) Supp(w) = , and (iii) takek(x, v) Svalid for all k < n. Intuitively, valid states are the states which, when transformed by the involution Φ(n), the instance of the parametercomponent of which does not fall beyond the support of w. We write Svalid n := Svalid (X(n) Y(n)) to denote the the set of all n-dimensional valid states. The following proposition shows that involutions preserve the validity of states. Proposition B.6. Assuming H 3, the involution Φ(n) sends Svalid n to Svalid n for all n N. i.e. If (x, v) Svalid n , then (y, u) = Φ(n)(x, v) Svalid n . Proof. Let (x, v) Svalid n and (y, u) = Φ(n)(x, v). We prove (y, u) Svalid n by induction on n N. Let n = 1. As Φ(1) is involutive and (x, v) is a valid state, (i) instance(y) Supp(w) = and (ii) (x, v) = Φ(1)(y, u) and instance(x) Supp(w) = . (iii) holds trivially and hence (y, u) Svalid 1 . Assume for all m < n, (z, w) Svalid m implies (z , w ) = Φ(m)(z, w) Svalid m . Similar to the base case, (i) and (ii) hold as Φ(n) is involutive and (x, v) is a valid state. Assume for contradiction that (iii) does not hold, i.e. there is k < n where takek(y, u) Svalid k . As instance(takek(y)) Supp(w) = , by H3 and the inductive hypothesis, takek(x, v) = takek(Φ(n)(y, u)) = Φ(k)(takek(y, u)) Svalid k which contradicts with the fact that (x, v) is a valid state. We can partition the set Svalid of valid states. Let (x, v) be a m-dimensional valid state. The parameter variable x can be written as zip(t1, t2)++y where t1 instance(x) Supp(w) is of dimension k0, t2 is a trace where zip(t1, t2) = takek0(x), and y := dropk0(x) where dropk drops the first ιX(k) components of the input parameter. Similarly, the auxiliary variable v can be written as v1 ++ v2 where v1 := takek0(v) and v2 := dropk0(v) where dropk drops the first ιY(k) components of the input parameter. 
Hence, we have m=1 {(zip(t1, t2) ++ y, v1 ++ v2) Svalid m | t1 SuppιX(k0)(w), t2 T, y EιX(m) ιX(k0), v1 Y(k0), v2 EιY(m) ιY(k0)} and the state distribution π on the measurable set S ΣS can be written as EιY(m) ιY(k0) EιX(m) ιX(k0) SuppιX(k0)(w) [(zip(t1, t2) ++ y, v1 ++ v2) S Svalid m ] 1 Z w(t1) pdf K(k0)(zip(t1, t2), v1) µT(dt1) µT(dt2) µEιX(m) ιX(k0)(dy) µY(k0)(dv1) µEιY(m) ιY(k0)(dv2) Nonparametric Involutive Markov Chain Monte Carlo We can now show that the state distribution π is indeed a probability measure and the set of valid states almost surely covers all states w.r.t. the state distribution. Proposition B.7. Assuming H1, 1. π(S) = 1; and 2. π(S \ Sn k=1 Svalid k ) 0 as n . Proof. 1. Consider the set Svalid with the partition discussed above. = π(Svalid) EιY(m) ιY(k0) EιX(m) ιX(k0) SuppιX(k0)(w) [(zip(t1, t2) ++ y, v1 ++ v2) Svalid m ] 1 Z w(t1) pdf K(k0)(zip(t1, t2), v1) µT(dt1) µT(dt2) µEιX(m) ιX(k0)(dy) µY(k0)(dv1) µEιY(m) ιY(k0)(dv2) SuppιX(k0)(w) 1 Z w(t1) pdf K(k0)(zip(t1, t2), v1) Eℓ2 [(zip(t1, t2) ++ y, v1 ++ v2) Svalid] µEℓ2(dy) µEℓ1(dv2) µT(dt1) µT(dt2) µY(k0)(dv1) SuppιX(k0)(w) 1 Z w(t1) Z Y(k0) pdf K(k0)(zip(t1, t2), v1) µY(k0)(dv1) µT(dt1) µT(dt2) SuppιX(k0)(w) 1 Z w(t1) µT(dt1) µT(dt2) 1 Z w(t1) µT(dt1) = 1 2. Since π is a probability distribution and π(S \ Svalid) = 0, the series P n=1 π(Svalid n ) which equals π(S n=1 Svalid n ) = π(Svalid) = 1 must converge. Hence π(S\Sn k=1 Svalid k ) = π(Svalid \Sn k=1 Svalid k ) = P i=n+1 π(Svalid i ) 0 as n . Equivalent Program Though the Hybrid NP-i MCMC algorithm (App. B.3) traverses state, it takes and returns samples on the trace space T. Hence instead of asking whether the state distribution π is invariant against the Hybrid NP-i MCMC sampler directly, we consider a similar program which takes and returns states and prove the state distribution π is invariant w.r.t. this program. Consider the program e NPi MCMC in Listing 3. It is similar to NPi MCMC (Listing 2) syntactically except it takes and returns states instead of traces, and has two additional lines (Lines 2 and 13). Hence, it is easy to deduce from Lem. B.4 that e NPi MCMC almost surely terminates. In e NPi MCMC , we group the commands differently and into two groups: Line 2-12 An initial valid state (x0,v0) is constructed so that x0 and x* have the same instance in the support of w . Line 14-22 A proposed state (x,v) is computed and accepted/rejected. Nonparametric Involutive Markov Chain Monte Carlo Listing 3. Code for the equivalent program 1 def e NPi MCMC(x*,v*): 2 t0 = intersect(instance(x*),support(w))[0] # find a valid state 3 k0 = dim(t0) 4 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 5 v0 = auxkernel[k0](x0) 6 (x,v) = involution[k0](x0,v0) 7 n = k0 8 while not intersect(instance(x),support(w)): 9 x0 = x0 + [(normal, coin)]*(index X(n+1)-index X(n)) 10 v0 = v0 + [(normal, coin)]*(index Y(n+1)-index Y(n)) 11 n = n + 1 12 (x,v) = involution[n](x0,v0) 13 (x,v) = involution[n](x0,v0) # accept/reject proposed state 14 t = intersect(instance(x),support(w))[0] 15 k = dim(t) 16 return (x,v) if uniform < min{1, w(t)/w(t0) * 17 pdfauxkernel[k](proj((x,v),k))/ 18 pdfauxkernel[k0](proj((x0,v0),k0)) * 19 pdfpar[n](x)/pdfpar[n](x0) * 20 pdfaux[n](v)/pdfaux[n](v0) * 21 absdetjacinv[n](x0,v0)} 22 else (x0,v0) Invariant Distribution Take a SPCF program M of type List(X*Y)-> List(X*Y) where the SPCF types X and Y implements the parameter space X and auxiliary space Y respectively. 
We define the transition kernel of M to be the kernel T M : S S where T M (s, S) := Z 1(S ) weight M(s) dµT = M(s) (S ) where s implements the state s and S is the set consisting of SPCF terms that implements states in S. Intuitively, T M (s, S) gives the probability that the term M returns a state in S given the current state s. Proposition B.8. Let M be a SPCF term of type List(X*Y)-> List(X*Y) . If M(s) does not contain any scoring subterm and almost surely terminates for all SPCF terms s , then its transition kernel T M is probabilistic. Proof. Since the term M(s) does not contain score( ) and terminates almost surely, by Prop. A.11 its value measure must be probabilistic. Hence T M ( s , S) = M(s) (S ) = M(s) (Λ0 v) = 1. We say a distribution µ on states S is invariant w.r.t. a almost surely terminating SPCF program M of type List(X*Y)-> List(X*Y) if µ is not altered after applying M , formally R S T M (s, S) µ(ds) = µ(S). We now prove that e NPi MCMC preserves the state distribution π stated in App. B.4.2 by considering the transition kernels given by the two steps in e NPi MCMC given in App. B.4.2: find a valid state (Lines 2-12) and accept/reject the computed proposed state (Lines 13-22). Finding a Valid State Assuming the initial state (x*,v*) is valid, e NPi MCMC (Lines 2-12) aims to construct a valid state (x0,v0) where x* and x0 share the same instance t0 that is in the support of the density w . To do this, it first finds the instance t0 of x* which is in the support of w (Line 2). Say the dimension of t0 is k0 (Line 3). It then forms a k0 -dimensional state (x0,v0) by sampling partners t for each value in the trace t0 Nonparametric Involutive Markov Chain Monte Carlo to form a k0 -dimensional parameter variable x0 (Line 4); and drawing a k0 -dimensional auxiliary variable from auxkernel[k0](x0) (Line 5). Say v is the auxiliary value drawn. Then, the k0 -dimensional state can be written as (zip(t0,t),v) . Note that the k0 -dimensional state (zip(t0,t),v) might not be valid. In which case, it repeatedly appends zip(t0,t) and v with entropies (normal, coin) until the resulting state is valid (Lines 6-12). Say y and u are the entropy vectors drawn for the parameter and auxiliary variables respectively. Then the resulting state can be written as (zip(t0,t)+y, v+u) . The transition kernel of Lines 2-12 can be expressed T1((x , v ), S) EιY(n) ιY(k0) EιX(n) ιX(k0) T [(zip(t0, t) ++ y, v ++ u) S Svalid n ] pdf K(k0)(zip(t0, t), v) µT(dt) µY(k0)(dv) µEιX(n) ιX(k0)(dy) µEιY(n) ιY(k0)(du) if (x , v ) Svalid and t0 instance(x ) Supp(w) has some dimension k0 N; and 0 otherwise. Remark B.9. Recall zip(ℓ1, ℓ2) := [(ℓ1 1, ℓ2 1), (ℓ1 2, ℓ2 2), . . . , (ℓ1 n, ℓ2 n)] (L1 L2)n for any vectors ℓ1 Ln1 1 and ℓ2 Ln2 2 with n := min{n1, n2}. Here we extend the definition to lists ℓ1, ℓ2 (L1 L2)n such that either (ℓ1 i, ℓ2 i) or (ℓ2 i, ℓ1 i) is in L1 L2 for all i = 1, . . . , n. Then, we write zip(ℓ1, ℓ2) for the list of pairs in L1 L2. Proposition B.10. Assuming H2, T1((x0, v0), Svalid) = 1 for all (x0, v0) Svalid. Proof. Since Lines 2-12 in e NPi MCMC can be described by a closed SPCF term that does not contain score( ) and terminates almost surely. By Prop. B.8, its transition kernel is probabilistic. Moreover, as this term always return a valid state, we have T1((x0, v0), Svalid) = T1((x0, v0), S) = 1. Proposition B.11. Assuming H1 and 2, the state distribution π is invariant against Lines 2-12 in e NPi MCMC . Proof. 
We aim to show: R S T1((x , v ), S) π(d(x , v )) = π(S) for any measurable set S ΣS. (Changes are highlighted for readability.) T1((x , v ), S) π(d(x , v )) = { T1((x , v ), S) = 0 for all (x , v ) Svalid } Z Svalid T1( (x , v ) , S) π(d(x , v )) = { Writing (x , v ) as (zip(t1, t2) ++ y, v1 ++ v2) Svalid m where t1 SuppιX(k0)(w), t2 T, y EιX(m) ιX(k0), v1 Y(k0), v2 EιY(m) ιY(k0), m, k0 N } EιY(m) ιY(k0) EιX(m) ιX(k0) SuppιX(k0)(w) T1((zip(t1, t2) ++ y, v1 ++ v2), S) [(zip(t1, t2) ++ y, v1 ++ v2) Svalid m ] 1 Z w(t1) pdf K(k0)(zip(t1, t2), v1) µT(dt1) µT(dt2) µEιX(m) ιX(k0)(dy) µY(k0)(dv1) µEιY(m) ιY(k0)(dv2) = { Definition of T1 on (zip(t1, t2) ++ y, v1 ++ v2) Svalid where t1 SuppιX(k0)(w) } Nonparametric Involutive Markov Chain Monte Carlo EιY(m) ιY(k0) EιX(m) ιX(k0) SuppιX(k0)(w) X EιY(n) ιY(k0) EιX(n) ιX(k0) [(zip(t1, t ) ++ y , v ++ u ) S Svalid n ] pdf K(k0)(zip(t1, t ), v ) µT(dt ) µY(k0)(dv ) µEιX(n) ιX(k0)(dy ) µEιY(n) ιY(k0)(du ) [(zip(t1, t2) ++ y, v1 ++ v2) Svalid m ] 1 Z w(t1) pdf K(k0)(zip(t1, t2), v1) µT(dt1) µT(dt2) µEιX(m) ιX(k0)(dy) µY(k0)(dv1) µEιY(m) ιY(k0)(dv2) = { Tonelli s theorem as all measurable functions are non-negative } P k0=1 P n=1 R EιY(n) ιY(k0) R EιX(n) ιX(k0) R SuppιX(k0)(w) X EιY(m) ιY(k0) EιX(m) ιX(k0) [(zip(t1, t2) ++ y, v1 ++ v2) Svalid m ] pdf K(k0)(zip(t1, t2), v1) µT(dt2) µY(k0)(dv1) µEιX(m) ιX(k0)(dy) µEιY(m) ιY(k0)(dv2) [(zip(t1, t ) ++ y , v ++ u ) S Svalid n ] 1 Z w(t1) pdf K(k0)(zip(t1, t ), v ) µT(dt1) µT(dt ) µEιX(n) ιX(k0)(dy ) µY(k0)(dv ) µEιY(n) ιY(k0)(du ) = { Definition of T1 on (zip(t1, t2) ++ y, v1 ++ v2) Svalid where t1 SuppιX(k0)(w) } EιY(n) ιY(k0) EιX(n) ιX(k0) SuppιX(k0)(w) T1((zip(t1, t2) ++ y, v1 ++ v2), Svalid) [(zip(t1, t ) ++ y , v ++ u ) S Svalid n ] 1 Z w(t1) pdf K(k0)(zip(t1, t ), v ) µT(dt1) µT(dt ) µEιX(n) ιX(k0)(dy ) µY(k0)(dv ) µEιY(n) ιY(k0)(du ) = { By Prop. B.10, T1((zip(t1, t2) ++ y, v1 ++ v2), Svalid) = 1 } EιY(n) ιY(k0) EιX(n) ιX(k0) SuppιX(k0)(w) [(zip(t1, t ) ++ y , v ++ u ) S Svalid n ] 1 Z w(t1) pdf K(k0)(zip(t1, t ), v ) µT(dt1) µT(dt ) µEιX(n) ιX(k0)(dy ) µY(k0)(dv ) µEιY(n) ιY(k0)(du ) = { Writing (x , v ) S Svalid n as (zip(t1, t ) ++ y , v ++ u ) where t1 SuppιX(k0)(w), t T, y EιX(m) ιX(k0), v Y(k0), u EιY(m) ιY(k0), n, k0 N } Accept/Reject Proposed State After constructing a valid state (x0,v0) , say of dimension n , e NPi MCMC traverses the state space via involution[n] to obtain a proposal state (x,v) (Line 13). By Prop. B.6, (x,v) must also be a n -dimensional valid state. Say it has an instance t of dimension k in the support of w , then (Line 14-22) (x,v) is Nonparametric Involutive Markov Chain Monte Carlo accepted with probability α(x0, v0) := min n 1, w(t) pdf K(k)(takek(x, v)) ϕX(n)(x) ϕY(n)(v) w(t0) pdf K(k0)(takek0(x0, v0)) ϕX(n)(x0) ϕY(n)(v0) |det Φ(n)(x0, v0) | o = min n 1, ζ(x, v) ϕX(n)(x) ϕY(n)(v) ζ(x0, v0) ϕX(n)(x0) ϕY(n)(v0) |det Φ(n)(x0, v0) | o . The transition kernel for Line 13-22 can be expressed as T2(s, S) := α(s) [Φ(n)(s) S] + (1 α(s)) [s S] if s Svalid n for some n N; and 0 otherwise. To show that the state distribution π is invariant against T2, we consider a partition of the set of valid states. Let S(n) ij be the set of n-dimensional valid states where i is the list of boolean values in all s S(n) ij and Φ(n) maps s to a (valid) state with boolean values given by the list j. Note that both lists i, j of booleans must be of length n := ιX(n) + ιY(n). Formally, S(n) ij := {s Svalid n | s = zip(r, i) and Φ(n)(s) = s = zip(q, j) for some r, q R n}. 
Then, the set Svalid of valid states can written as S{S(n) ij | i, j 2 n and n N}. Proposition B.12. Assuming H1 to 3, for n N, s Svalid n and s = Φ(n)(s), we have α(s ) ζ (s ) |det Φ(n)(s) | = α(s) ζ (s) where ζ (z, w) := ζ(z, w) ϕX(m)(z) ϕY(m)(w) for any (z, w) Sm. Proof. Let s S(n) ij where there are r, q R n, i, j 2 n such that s = zip(r, i) and s := Φ(n)(s) = zip(q, j). Hence, taking the Jacobian determinant on both sides of the equation Φ(n) ji Φ(n) ij = id gives us |det Φ(n)(s ) | = |det Φ(n) ji (q) | = 1 |det Φ(n) ij (r) | = 1 |det ( Φ(n)(s))|. (2) Moreover we can write the acceptance ratio in terms of ζ as α(s ) = min{1, ζ (Φ(n)(s )) ζ (s ) |det Φ(n)(s ) |} for any s Sm. Hence given s = Φ(n)(s), we have α(s ) ζ (s ) |det Φ(n)(s) | ζ (s) ζ (s ) |det Φ(n)(s ) | ζ (s ) |det ( Φ(n)(s))| if ζ (s) ζ (s ) |det Φ(n)(s ) | < 1 ζ (s ) |det Φ(n)(s) | otherwise (s = Φ(n)(s )) ζ (s) if ζ (s ) ζ (s) |det ( Φ(n)(s))| > 1 ζ (s) |det ( Φ(n)(s))| ζ (s) otherwise (By Equation (2)) = α(s) ζ (s) Proposition B.13. Assuming H1 to 3, the state distribution π is invariant against Line 13-22 in e NPi MCMC . Nonparametric Involutive Markov Chain Monte Carlo Proof. We aim to show: R S T2(s, S) π(ds) = π(S) for all S ΣS. Let s be a n-dimensional valid state and S ΣS. Then we can write T2(s, S) as [s S] + [Φ(n)(s) S] α(s) [s S] α(s). Hence, it is enough to show that the integral of the second and third terms over all valid states are the same, i.e. Z Svalid [Φ(n)(s) S] α(s) π(ds) = Z Svalid [s S] α(s) π(ds) First we consider the valid states in S(n) ij where n N, i, j 2 n and n := ιX(n) + ιY(n). These are n-dimensional valid states with boolean values given by i and are mapped by Φ(n) to valid states with boolean values given by j. Then we have zip( , j) 1(S(n) ji ) = Φ(n) ij zip( , i) 1(S(n) ij ) where zip( , j) : R n E n is a measurable function. Writing ζ (z, w) for ζ(z, w) ϕX(m)(z) ϕY(m)(w) for any (z, w) Sm, we have Z S(n) ji [s S] α(s) π(ds) S(n) ji [s S] α(s) ζ (s) µE n(ds) (Definition of π) zip( ,j) 1(S(n) ji ) [zip(r, j) S] α(zip(r, j)) ζ (zip(r, j)) µR n(dr) (zip( , j) µR n = µE n on S(n) ji ) zip( ,i) 1(S(n) ij ) [zip(Φ(n) ij (q), j) S] α(zip(Φ(n) ij (q), j)) ζ (zip(Φ(n) ij (q), j)) |det Φ(n) ij (q)| µR n(dq) (Change of variable where r = Φ(n) ij (q)) zip( ,i) 1(S(n) ij ) [Φ(n)(zip(q, i)) S] α(zip(q, i)) ζ (zip(q, i)) µR n(dq) (Prop. B.12 as Φ(n)(zip(q, i)) = zip(Φ(n) ij (q), j) for (zip(q, i)) S(n) ij ) S(n) ij [Φ(n)(s) S] α(s) ζ (s) µE n(ds) (zip( , i) µR n = µE n on S(n) ij ) S(n) ij [Φ(n)(s) S] α(s) π(ds) Recall the set Svalid of all valid states can be written as S{S(n) ij | i, j 2 n and n N}. Hence, we conclude our proof with Svalid [Φ(n)(s) S] α(s) π(ds) = S(n) ij [Φ(n)(s) S] α(s) π(ds) S(n) ji [s S] α(s) π(ds) = Z Svalid [s S] α(s) π(ds). Since the transition kernel of e NPi MCMC is the composition of T1 and T2 and both T1 and T2 are invariant against π (Propositions B.11 and B.13), we deduce that e NPi MCMC preserves the state distribution π. Lemma B.14 (State Invariant). π is the invariant distribution of the Markov chain generated by iterating e NPi MCMC . B.4.3. MARGINALISED MARKOV CHAINS As discussed above, the Markov chain {si}i N generated by iterating e NPi MCMC (which has invariant distribution π (Lem. B.14)) has elements on the state space S and not the trace space T. 
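The marginalisation used below takes a valid state to the unique instance of its parameter component that lies in Supp(w); the next sentence defines it formally. A minimal Python sketch, assuming a hypothetical helper in_support standing in for membership in Supp(w) and entropy vectors modelled as lists of (real, boolean) pairs:

```python
from itertools import product

def instances(x):
    # Enumerate the instances of an entropy vector x (App. B.1.1): all traces
    # obtained by choosing the real or the boolean component of each entry of
    # some prefix of x, including the empty trace.
    yield []
    for k in range(1, len(x) + 1):
        for t in product(*x[:k]):
            yield list(t)

def marginalise(state, in_support):
    # By Prop. B.1 a valid state carries exactly one instance in Supp(w); return it.
    x, v = state
    matches = [t for t in instances(x) if in_support(t)]
    assert len(matches) == 1
    return matches[0]
```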
The chain we are in fact interested in is the marginalised chain {m(si)}i N where the measurable function m : Svalid T takes a valid state s = (x, v) and returns the instance of the parameter variable x that is in the support of the target density function w. Nonparametric Involutive Markov Chain Monte Carlo In this section we show that this marginalised chain simulates the target distribution ν. Let T NPi MCMC : T T be a kernel T NPi MCMC (t, A) := ( T e NPi MCMC (s, m 1(A)) if t Supp(w) and s m 1({t}) 0 otherwise. Comparing the commands of NPi MCMC and e NPi MCMC in Listings 2 and 3, we claim that T NPi MCMC is the transition kernel of NPi MCMC . Proposition B.15. We consider some basic properties of T NPi MCMC . 1. T NPi MCMC is well-defined. 2. T e NPi MCMC (s, m 1(A)) = T NPi MCMC (m(s), A) for all s Svalid and A ΣT. Proof. 1. Let t Supp(w) and A ΣT. Say s, s m 1({t}). Since only the instance of the input state matters in e NPi MCMC (Listing 3), the value of T NPi MCMC (t, A) given by s and s are the same, i.e. T e NPi MCMC (s, m 1(A)) = T e NPi MCMC (s , m 1(A)). 2. Let s Svalid and A ΣT. Then, T NPi MCMC (m(s), A) = T e NPi MCMC (s , m 1(A)) for some s m 1({m(s)}). Since s m 1({m(s)}), we have T e NPi MCMC (s , m 1(A)) = T e NPi MCMC (s, m 1(A)). To show T NPi MCMC preserves the target distribution, we consider a distribution πn on each of the n-dimensional state space S(n) := X(n) Y(n) with density ζn (w.r.t. µS(n)) given by ζn(x, v) := 1 Zn w(t) pdf K(k)(takek(x, v)) if t instance(x) Supp(w) has dimension k n 0 otherwise where Zn := R T [|t| ιX(n)] w(t) µT(dt). Notice that Zn ζn and Z ζ are the same, except on non-valid states. The following proposition shows how the state distribution π can be represented using πn. Proposition B.16. Let n N. 1. πn is a probability measure. 2. For k n, Zk πk = Zn takek πn on Svalid k . 3. Let g(n) : S(n) Sn k=1 Svalid k be the partial measurable function that returns the projection of the input state that is valid, if it exists. Formally, g(n)(s) = takek(s) if takek(s) Svalid k . Then Z π = Zn g(n) πn on Sn k=1 Svalid k . Proof. 1. Consider πn(S(n)), S(n) [t instance(x)] [|t| = ιX(k)] 1 Zn w(t) pdf K(k)(takek(x, v)) µS(n)(d(x, v)) SuppιX(k)(w) Y(k) 1 Zn w(t) pdf K(k)(zip(t, t ), v )) µY(k)(dv ) µT(dt ) µT(dt) SuppιX(k)(w) 1 Zn w(t) µT(dt ) µT(dt) (K(k) is a probability kernel) T [|t| ιX(n)] 1 Zn w(t) µT(dt) = 1 Nonparametric Involutive Markov Chain Monte Carlo 2. Let S ΣS where S Svalid k . Hence Zk ρk(s) = Zk ρk(s ) if s S and s = takek(s ). Then, Zn (takek πn)(S) = Zn πn(takek 1(S)) S(n) [takek(s ) S] Zn ζn(s ) µS(n)(d(s )) S(k) [(s) S] Zk ζk(s) µS(k)(d(s)) 3. Let S ΣS where S Sn k=1 Svalid k . Then, Z ρ(s) = Zk ρk(s) for all s S Svalid k . S [s Svalid] Z ζ(s) µS(ds) S [s Svalid k ] Z ζ(s) µS(k)(ds) S [s Svalid k ] Zk ζk(s) µS(k)(ds) k=1 Zk πk(S Svalid k ) k=1 takek πn(S Svalid k ) (i) k=1 {s S(n) | takek(s) S Svalid k }) = Zn g(n) πn(S). Lemma B.17 (Invariant). Assuming H1 to 3, ν is the invariant distribution of the Markov chain generated by iterating the Hyrbid NP-i MCMC algorithm (App. B.3). Proof. Assuming (1) ν = m π on T and (2) µT = m µS on Supp(w), we have for any A ΣT, ν(A) = m π(A) (1) S T e NPi MCMC (s, m 1(A)) µS(ds) (Lem. B.14) Svalid T e NPi MCMC (s, m 1(A)) µS(ds) Svalid T NPi MCMC (m(s), A) µS(ds) (Prop. B.15.ii) Supp(w) T NPi MCMC (t, A) m µS(dt) Supp(w) T NPi MCMC (t, A) µT(dt) (2) T T NPi MCMC (t, A) µT(dt). It is enough to show (1) and (2). Nonparametric Involutive Markov Chain Monte Carlo 1. 
Let A ΣT where A SuppιX(n)(w) and δ > 0. Then partitioning m 1(A) using Svalid k , we have for sufficiently large m, k=1 m 1(A) Svalid k k=m+1 m 1(A) Svalid k k=1 m 1(A) Svalid k + δ (Prop. B.15.iii, Prop. B.16.ii) Z πm({(zip(t, t ) ++ y, v) | t A, t T, y EιX(m) ιX(n), v Y(m)}) + δ EιX(m) ιX(n) w(t) Y(m) pdf K(n)(taken(zip(t, t ) ++ y, v)) µY(m)(dv) µEιX(m) ιX(n)(dy)µT(dt )µT(dt) + δ A w(t) µT(dt) + δ (K(n) is a probability kernel) = ν(A) + δ. For any measurable set A ΣT, we have m π(A) = m π(A Supp(w)) = P n=1 m π(A SuppιX(n)(w)) P n=1 ν(A SuppιX(n)(w)) = ν(A Supp(w)) = ν(A). Since both ν and π are probability distributions, we also have ν(A) = 1 ν(T \ A) 1 m π(T \ A) = 1 (1 m π(A)) = m π(A). Hence m π = ν on T. 2. Similarly, let A ΣT where A SuppιX(n)(w) and δ > 0. Then by Prop. B.15.iii, for sufficiently large m, we must have µS(S k=m+1 Svalid k ) = µS(Svalid \ Svalid m) < δ. Hence, k=1 m 1(A) Svalid k k=m+1 m 1(A) Svalid k k=1 µS(k)(m 1(A) Svalid k ) + δ k=1 µS(m)({(x, v) S(m) | takek(x, v) m 1(A) Svalid k }) + δ k=1 {(x, v) S(m) | takek(x, v) m 1(A) Svalid k }) + δ µS(m)({(zip(t, t ) ++ y, v) | t A, t T, y EιX(m) ιX(n), v Y(m)}) + δ = µT(A) + δ. Then the proof proceeds as in (1). Note that since w almost surely terminating (H 2), m µS(Supp(w)) = µT(Supp(w)) = 1 B.4.4. CORRECTNESS OF NP-IMCMC The correctness of the NP-i MCMC sampler (Fig. 2) can be deduce from Lem. B.17 and the fact that Hyrbid NP-i MCMC is a generalisation of NP-i MCMC Corollary B.18 (Invariant). If all inputs satisfy V1 to 3 then ν is the invariant distribution of the Markov chain generated by iterating the algorithm described in Fig. 2. Nonparametric Involutive Markov Chain Monte Carlo C. Transforming Nonparametric Involutive MCMC In this section, we discuss how the techniques discussed in (Neklyudov et al., 2020) can be applied to the Hybrid NP-i MCMC sampler presented in App. B. Hence instances of the Hybrid NP-i MCMC sampler, such as NP-MH and NP-HMC, can be extended using these techniques to become more flexible and efficient. We assume the input target density function w : T [0, ) is tree representable, integrable (H 1) and almost surely terminating (H2). C.1. State-dependent Hybrid NP-i MCMC Mixture Say we want to use multiple Hybrid NP-i MCMC samplers to simulate the posterior given by the target density function w. The following technique allows us to mix Hybrid NP-i MCMC samplers in such a way that the resulting sampler still preserves the posterior. Given a collection of Hybrid NP-i MCMC samplers, indexed by m Eα, for some α N, each with auxiliary kernels {K(n) m : X(n) Y(n)}n N and involutions {Φ(n)m : X(n) Y(n) X(n) Y(n)}n N satisfying the projection commutation property (H3), the State-dependent Hybrid NP-i MCMC Mixture sampler determines which Hybrid NP-i MCMC sampler to use by drawing an indicator m Eα from a probability measure KM(x0, ) where KM : S n N X(n) Em is a probability kernel and x0 is the entropy vector constructed from the current sample t0 at the initialisation step (Step 1 of Hybrid NP-i MCMC). Then, using the m-indexed Hybrid NP-i MCMC sampler, a proposal t is generated and accepted with a modified probability that includes the probability of picking m, namely min 1; w(t) pdf K(k) m(takek(x, v)) ϕX(n)(x) ϕY(n)(v) w(t0) pdf K(k0)m(takek0(x0, v0)) ϕX(n)(x0) ϕY(n)(v0) pdf KM(x01..k0, m) pdf KM(x1..k, m) |det( Φ(n) m(x0, v0))| where (x0, v0) is the (possibly extended) initial state, (x, v) is the new state, n = dim (x0) = dim (v0), k0 is the dimension of t0 (i.e. 
|t0| = ιX(k0)) and k is the dimension of t (i.e. |t| = ιX(k)). Pseudocode This sampler can be implemented in SPCF as the Mixture NPi MCMC function in Listing 4. (Terms specific to this technique are highlighted.) We assume the following SPCF terms exists: mixkernel of type List(X) -> (R*B) l implements the mixture kernel KM : S n N X(n) Eα; pdfmixkernel of type List(X)*(R*B) l -> R implements the probability density function pdf KM : S n N X(n) Eα R ; and for each m Eα and n N, auxkernel[n][m] im- plements the auxiliary kernel K(n) m ; pdfauxkernel[n][m] and implements the pdf pdf K(n) m; involution[n][m] implements the involution Φ(n)m; and absdetjacinv[n][m] implements the absolute value of the Jacobian determinant of Φ(n)m. Correctness Similar to the correctness arguments in (Neklyudov et al., 2020), we show that the State-dependent Hybrid NP-i MCMC Mixture sampler is correct by formulating Mixture NPi MCMC as an instance of NPi MCMC (Listing 2). This means specifying auxkernel[n] and involution[n] in NPi MCMC and arguing that the resulting NPi MCMC function is equivalent to Mixture NPi MCMC . The SPCF terms mixauxkernel[n] and mixinvolution[n] given in Listing 5 should suffice. The auxiliary space is expanded to embed the indicator m in such a way that the auxiliary variable mixv is in the space Eα Y(n) where its first ℓ-th components mixv[:l] gives m and the rest mixv[l:] gives v . Since the auxiliary space is expanded to include the indicator, the maps mixindex X and mixindex Y and the projection mixproj are modified accordingly. To see how the NPi MCMC function with auxkernel[n] replaced by mixauxkernel[n] and involution[n] replaced by mixinvolution[n] is equivalent to Mixture NPi MCMC , we onyl need to consider the probability density of mixauxkernel[k] at mixproj((x,mixv),k) . pdfmixauxkernel[k](x[:mixindex X(k)], mixv[:mixindex Y(k)]) = pdfmixauxkernel[k](x[:index X(k)], mixv[:l+index Y(k)]) = pdfmixkernel(x[:index X(k)], mixv[:l]) * pdfauxkernel[k][mixv[:l]](x[:index X(k)], mixv[l:l+index Y(k)]) Nonparametric Involutive Markov Chain Monte Carlo Listing 4. Pseudocode of the State-dependent Hybrid NP-i MCMC Mixture algorithm 1 def Mixture NPi MCMC(t0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 m = mixkernel(x0) # mixture step 5 v0 = auxkernel[k0][m](x0) # stochastic step 6 (x,v) = involution[k0][m](x0,v0) # deterministic step 7 n = k0 # extend step 8 while not intersect(instance(x),support(w)): 9 x0 = x0 + [(normal, coin)]*(index X(n+1)-index X(n)) 10 v0 = v0 + [(normal, coin)]*(index Y(n+1)-index Y(n)) 11 n = n + 1 12 (x,v) = involution[n][m](x0,v0) 13 t = intersect(instance(x),support(w))[0] # accept/reject step 14 k = dim(t) 15 return t if uniform < min{1, w(t)/w(t0) * 16 pdfauxkernel[k][m](proj((x,v),k))/ 17 pdfauxkernel[k0][m](proj((x0,v0),k0)) * 18 pdfpar[n](x)/pdfpar[n](x0) * 19 pdfaux[n](v)/pdfaux[n](v0) * 20 pdfmixkernel(proj(x,k),m)/ 21 pdfmixkernel(proj(x0,k0),m) * 22 absdetjacinv[n][m](x0,v0)} 23 else t0 Listing 5. 
Pseudocode for the correctness of the State-dependent Hybrid NP-i MCMC Mixture algorithm 1 def mixauxkernel[n](x0) 2 m = mixkernel(x0) 3 v0 = auxkernel[n][m](x0) 4 return m + v0 5 6 def mixinvolution[n](x0,mixv0) 7 m = mixv0[:l] 8 v0 = mixv0[l:] 9 (x,v) = involution[n][m](x0,v0) 10 return (x,m + v) 11 12 def mixindex X(n): return index X(n) 13 def mixindex Y(n): return l + index Y(n) 14 def mixproj((x,v),k): return (x[:mixindex X(k)],v[:mixindex Y(k)]) Nonparametric Involutive Markov Chain Monte Carlo = pdfmixkernel(x[:index X(k)], m) * pdfauxkernel[k][m](x[:index X(k)], v[:index Y(k)]) = pdfmixkernel(proj(x,k),m) * pdfauxkernel[k][m](proj((x,v),k)) where m = mixv[:l] and v = mixv[l:] . This shows that the acceptance probability in NPi MCMC is identical to that in Mixture NPi MCMC and hence the two algorithms are equivalent. C.2. Direction Hybrid NP-i MCMC Algorithm Sometimes it is difficult to specify involutions that explores the model fully. The following technique tells us that bijections are good enough. Given endofunctions f (n) on X(n) Y(n) that are differentiable almost everywhere and bijective for each n N such that the sets {f (n)}n and {f (n) 1}n satisfy the projection commutative property (H3), the Direction Hybrid NP-i MCMC algorithm randomly use either f (n) or f (n) 1 to move around the state space and proposes a new sample. Pseudocode This sampler can be expressed in SPCF as the Direction NPi MCMC function in Listing 6. (Terms specific to this technique are highlighted.) We assume for each n N and d 2, there is a SPCF term bijection[n][d] where bijection[n][True] implements the bijection f (n) and bijection[n][False] implements the inverse f (n) 1 and SPCF term absdetjacbij[n][d] that implements the absolute value of the Jacobian determinant of f (n) if d = True and the inverse f (n) 1 otherwise. Correctness We show that Direction NPi MCMC can be formulated as an instance of NPi MCMC (Listing 2) with a specification of auxkernel[n] and involution[n] . The SPCF terms dirauxkernel[n] and dirinvolution[n] in Listing 7 would work. The auxiliary space is expanded to include the direction variable d0 so that the auxiliary variable dirv0 is in the space E Y(n) where the Booleancomponent dirv0[0][1] of its first coordinate gives d0 and the second to last coordinates dirv0[1:] gives v0 . (Note the value of dirv0[0][0] is redundant and is only used to make dirv0[0] an entropy.) Since the auxiliary space is expanded, the maps dirindex X and dirindex Y and the projection dirproj are modified accordingly. To see how the NPi MCMC function with auxkernel[n] replaced by dirauxkernel[n] and involution[n] replaced by dirinvolution[n] is equivalent to Direction NPi MCMC , we first consider the density of dirauxkernel[k0] at dirproj((x0,dirv0),k0) . pdfdirauxkernel[k0](x0[:dirindex X(k0)], dirv0[:dirindex Y(k0)]) = pdfdirauxkernel[k0](x0[:index X(k0)], dirv0[:1+index Y(k0)]) = pdfcoin(dirv0[0][1]) * pdfnormal(dirv0[0][0]) * pdfauxkernel[k0](x0[:index X(k0)], dirv0[1:1+index Y(k0)]) = 0.5 * pdfnormal(dirv0[0][0]) * pdfauxkernel[k0](proj((x0,v0),k0)) where v0 = dirv0[1:] . A similar argument can be made for pdfdirauxkernel[k](dirproj((x,dirv),k)) , which makes the acceptance probability in NPi MCMC identical to that in Direction NPi MCMC . Moreover, writing d0 for dirv0[0][1] , the absolute value of the Jacobian determinant of dirinvolution[n] at (x0,dirv0) is absdetjacbij[n][d0](x0,v0) . Most importantly, dirinvolution[n] is now involutive. Hence, NPi MCMC is the same as Direction NPi MCMC . 
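To make the involutivity underlying this argument concrete, the following is a minimal Python sketch (not SPCF; the toy shear bijection, the fixed dimension and all names are our own illustrative choices, not part of the paper's implementation). As in dirinvolution, the map applies either f or its inverse according to the Boolean direction stored in the first auxiliary coordinate and then flips that direction, so applying it twice returns the original state.

# Minimal sketch: a direction-augmented bijection is an involution.
# Toy bijection on (x, v) pairs of equal-length float lists (illustrative only).

def f(x, v):            # shear map: x := x + v
    return [xi + vi for xi, vi in zip(x, v)], list(v)

def f_inv(x, v):        # its inverse: x := x - v
    return [xi - vi for xi, vi in zip(x, v)], list(v)

def dir_involution(x, dirv):
    (e, d), v = dirv[0], dirv[1:]        # d is the direction bit, e a dummy entropy value
    x_new, v_new = f(x, v) if d else f_inv(x, v)
    return x_new, [(e, not d)] + v_new   # flip the direction, keep the dummy value

if __name__ == "__main__":
    state = ([0.5, -1.2], [(0.3, True), 2.0, 0.7])   # (x, dirv) with direction True
    once = dir_involution(*state)
    twice = dir_involution(*once)
    assert twice[0] == state[0] and twice[1] == state[1], "not involutive"
    print("dir_involution o dir_involution = id on this state")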
C.3. Persistent Hybrid NP-i MCMC Algorithm It is known that irreversible transition kernels (those that does not satisfy detailed balance) have better mixing times, i.e. converges more quickly to the target distribution, compared to reversible ones. The following technique gives us a method to transform Hybrid NP-i MCMC algorithms to irreversible ones that still preserves the target distribution. The key is to compose the Hybrid NP-i MCMC sampler with a transition kernel so that the resulting algorithm does not satisfy detailed Nonparametric Involutive Markov Chain Monte Carlo Listing 6. Pseudocode of the Direction Hybrid NP-i MCMC algorithm 1 def Direction NPi MCMC(t0): 2 d0 = coin # direction step 3 k0 = dim(t0) # initialisation step 4 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 5 v0 = auxkernel[k0](x0) # stochastic step 6 (x,v) = bijection[k0][d0](x0,v0) # deterministic step 7 n = k0 # extend step 8 while not intersect(instance(x),support(w)): 9 x0 = x0 + [(normal, coin)]*(index X(n+1)-index X(n)) 10 v0 = v0 + [(normal, coin)]*(index Y(n+1)-index Y(n)) 11 n = n + 1 12 (x,v) = bijection[n][d0](x0,v0) 13 t = intersect(instance(x),support(w))[0] # accept/reject step 14 k = dim(t) 15 return t if uniform < min{1, w(t)/w(t0) * 16 pdfauxkernel[k](proj((x,v),k))/ 17 pdfauxkernel[k0](proj((x0,v0),k0)) * 18 pdfpar[n](x)/pdfpar[n](x0) * 19 pdfaux[n](v)/pdfaux[n](v0) * 20 absdetjacbij[n][d0](x0,v0)} 21 else t0 Listing 7. Pseudocode for dirauxkernel and dirinvolution 1 def dirauxkernel[n](x0) 2 d0 = coin 3 v0 = auxkernel[n](x0) 4 return [(normal, d0)] + v0 5 6 def dirinvolution[n](x0,dirv0) 7 d0 = dirv0[0][1] 8 v0 = dirv0[1:] 9 (x,v) = bijection[n][d0](x0,v0) 10 d = not d0 11 return (x, [(dirv0[0][0],d)] + v) 12 13 def dirindex X(n): return index X(n) 14 def dirindex Y(n): return 1+index Y(n) 15 def dirproj((x,v),k): return (x[:dirindex X(k)], v[:dirindex Y(k)]) Nonparametric Involutive Markov Chain Monte Carlo The Persistent Hybrid NP-i MCMC algorithm is a MCMC algorithm similar to the Direction Hybrid NP-i MCMC sampler in which the direction variable is used to determine auxiliary kernels ({K(n) 1 : X(n) Y(n)}n or {K(n) 2 : X(n) Y(n)}n) and bijections ({f (n) : X(n) Y(n) X(n) Y(n)}n or {f (n) 1 : X(n) Y(n) X(n) Y(n)}n) being used. The difference is that Persistent Hybrid NP-i MCMC keeps track of the direction (instead of sampling a fresh one in each iteration) and flips it strategically to make the resulting algorithm irreversible. Pseudocode This sampler can be expressed in SPCF as Persistent NPi MCMC in Listing 8. (Terms specific to this technique are highlighted.) In addition to the SPCF terms in Direction NPi MCMC , we assume there is a SPCF term auxkernel[n][d] such that auxkernel[n][True] implements the auxiliary kernel K(n) 1 : X(n) Y(n) and pdfauxkernel[n][True] its pdf pdf K(n) 1 and similarly for auxkernel[n][False] and pdfauxkernel[n][False] . Note that Persistent NPi MCMC updates samples on the space X(n) 2, which can easily be marginalised to X(n) by taking the first ιX(n) components. Correctness We show that Persistent NPi MCMC can be formulated as a composition of two instances of NPi MCMC (Listing 2). Consider the NPi MCMC with auxiliary kernel perauxkernel[n] and involution perinvolution[n] in Listing 9. In this case, the parameter space is expanded to include the direction variable so that a parameter variable perx is on the space E X(n) where perx[0][1] gives d and perx[1:] gives x . 
Since the parameter space is expanded, the maps perindex X and perindex Y and projection perproj are modified accordingly. Again, we first consider the density of perauxkernel[k0] at perproj((perx0,v0),k0) . pdfperauxkernel[k0](perx0[:perindex X(k0)], v0[:perindex Y(k0)]) = pdfauxkernel[k0][perx[0][1]](perx0[1:1+index X(k0)], v0[:index Y(k0)]) = pdfauxkernel[k0][d0](x0[:index X(k0)], v0[:index Y(k0)]) = pdfauxkernel[k0][d0](proj((x0,v0),k0)) where d0 = perx0[0][1] and x0 = perx0[1:] . A similar argument can be made for pdfperauxkernel[k](perproj((perx,v),k)) . Moreover, the absolute value of the Jacobian determinant of perinvolution[n] at (perx0,v0) is absdetjacbij[n][d0](x0,v0) . Hence, the acceptance probability in NPi MCMC is identical to that in Persistent NPi MCMC . The NPi MCMC function with auxkernel[n] replaced by perauxkernel[n] and involution[n] replaced by perinvolution[n] is almost equivalent to Persistent NPi MCMC , except NPi MCMC induces a transition kernel on E X(n) whereas Persistent NPi MCMC induces a transition kernel on 2 X(n); and when the proposal t is accepted, NPi MCMC returns d whereas Persistent NPi MCMC returns not d . These differences can be reconciled by composing NPi MCMC with flipdir , which is an instance of NPi MCMC which skips the stochastic step and has an involution that flips the direction variable stored in perx0[0][1] . The composition generates a Markov chain on E X(n) and marginalising it to a Markov chain on 2 X(n) gives us the same result as Persistent NPi MCMC . Nonparametric Involutive Markov Chain Monte Carlo Listing 8. Pseudocode of the Persistent Hybrid NP-i MCMC algorithm 1 def Persistent NPi MCMC(t0,d0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 v0 = auxkernel[k0][d0](x0) # stochastic step 5 (x,v) = bijection[k0][d0](x0,v0) # deterministic step 6 n = k0 # extend step 7 while not intersect(instance(x),support(w)): 8 x0 = x0 + [(normal, coin)]*(index X(n+1)-index X(n)) 9 v0 = v0 + [(normal, coin)]*(index Y(n+1)-index Y(n)) 10 n = n + 1 11 (x,v) = bijection[n][d0](x0,v0) 12 d = not d0 13 t = intersect(instance(x),support(w))[0] # accept/reject step 14 k = dim(t) 15 return (t, not d) if uniform < min{1, w(t)/w(t0) * 16 pdfauxkernel[k][d](proj((x,v),k))/ 17 pdfauxkernel[k0][d0](proj((x0,v0),k0)) * 18 pdfpar[n](x)/pdfpar[n](x0) * 19 pdfaux[n](v)/pdfaux[n](v0) * 20 absdetjacbij[n][d0](x0,v0)} 21 else (t0, d) Listing 9. Pseudocode for perauxkernel and perinvolution 1 def perauxkernel[n](perx0) 2 d0 = perx0[0][1] 3 x0 = perx0[1:] 4 v0 = auxkernel[n][d0](x0) 5 return v0 6 7 def perinvolution[n](perx0,v0) 8 d0 = perx0[0][1] 9 x0 = perx0[1:] 10 (x,v) = bijection[n][d0](x0,v0) 11 d = not d0 12 return ([(perx0[0][0],d)] + x, v) 13 14 def perindex X(n): return 1+index X(n) 15 def perindex Y(n): return index Y(n) 16 def perproj((x,v),k): return (x[:perindex X(k)],v[:perindex Y(k)]) 17 18 def flipdir(perx0): 19 d0 = perx0[0][1] 20 perx0[0][1] = not d0 21 return perx0 Nonparametric Involutive Markov Chain Monte Carlo D. Multiple Step Nonparametric Involutive MCMC In this section, we study the Multiple Step NP-i MCMC sampler, a generalisation of the Hybrid NP-i MCMC sampler (App. B.3) (and also of NP-i MCMC (Fig. 2)), where the involution is applied multiple times to generate a proposed state. D.1. Motivation Step 4 in the Hyrbid NP-i MCMC sampler may seem inefficient. 
While Step 4 terminates almost surely (thanks to H2), whenever the dimension of the state changes, the algorithm has to re-run the involution from scratch (Step 4.ii), so the expected number of iterations may be infinite. To remedy this problem, we introduce two new concepts: (i) the slice function, which can make the re-runs (Step 4.ii) quicker; and (ii) the Multiple Step NP-i MCMC sampler, a generalisation of Hybrid NP-i MCMC, which uses a list of bijections to move around the state space.

D.2. Slice function

For each dimension n ∈ N, we call the measurable function s^(n) : S^(n) → E^(ι_X(n) − ι_X(n−1)) × E^(ι_Y(n) − ι_Y(n−1)) a slice of the endofunction Φ^(n) on S^(n) if it captures the movement of the n-dimensional states whose instance has dimension lower than n. Formally, this means s^(n)(x, v) = (drop_{n−1} ∘ Φ^(n))(x, v) whenever t ∈ instance(x) ∩ Supp(w) and |t| < ι_X(n). Note that we can always define a slice of Φ^(n) by setting s^(n) := drop_{n−1} ∘ Φ^(n).

With a slice function s^(n) defined for each involution Φ^(n), Step 4.ii in the Hybrid NP-i MCMC algorithm (App. B.3):

(Step 4.ii) Move around the (n+1)-dimensional state space X^(n+1) × Y^(n+1) and compute the new state by applying the involution Φ^(n+1) to the initial state (x0 ++ y0, v0 ++ u0);

can be replaced by the following Step 4.ii′:

(Step 4.ii′) Replace and extend the n-dimensional new state from (x, v) to a state (x ++ y, v ++ u) of dimension n+1, where (y, u) is the result of s^(n+1)(x0 ++ y0, v0 ++ u0).

By H3, the first n components of the new (n+1)-dimensional state Φ^(n+1)(x0 ++ y0, v0 ++ u0) are
take_n(Φ^(n+1)(x0 ++ y0, v0 ++ u0)) = Φ^(n)(take_n(x0 ++ y0, v0 ++ u0)) = Φ^(n)(x0, v0) = (x, v),
and by the definition of a slice, the (n+1)-th component of the new state is
drop_n(Φ^(n+1)(x0 ++ y0, v0 ++ u0)) = s^(n+1)(x0 ++ y0, v0 ++ u0).
Hence the new states computed by Step 4.ii and Step 4.ii′ are the same.

The slice function s^(n) is useful when the involution is computationally expensive but has a lightweight slice. After Step 4.ii is replaced by Step 4.ii′, the Hybrid NP-i MCMC sampler need only run the involution once (Step 3), and any subsequent re-runs (Step 4) can be performed by the slice function. If the slice function s^(n) is implemented as slice[n] in SPCF, Line 11 in NPi MCMC can be changed from
(x,v) = involution[n](x0,v0)
to
(x',v') = slice[n](x0,v0); (x,v) = (x + x', v + v')

D.2.1. EXAMPLE (HMC)

The momentum update is the most computationally heavy component of the HMC sampler, so it is useful that it has a lightweight slice function. In the setting of Hybrid NP-i MCMC, we assume the trace space T is the list measurable space over the real measurable space R. Then the n-dimensional momentum update φ^M_k is an endofunction on R^n × R^n defined as φ^M_k(q, p) := (q, p − k∇U(q)), where U(q) := −log max{w(t) | t ∈ instance(q)} is the n-dimensional potential energy. Given an n-dimensional state (q, p) where t ∈ instance(q) ∩ Supp(w) has dimension lower than n, the gradient of the potential energy U at q w.r.t. the n-th coordinate is zero. Hence
(drop_{n−1} ∘ φ^M_k)(q, p) = drop_{n−1}(q, p − k∇U(q)) = (q_n, p_n),
and the slice of the momentum update φ^M_k is simply the projection drop_{n−1}(q, p) := (q_n, p_n).

However, not all 2L momentum updates in the re-runs of the leapfrog function L can be replaced by this slice drop_{n−1}. This is because when the dimension increments to, say, n+1, only the extended initial state (x0 ++ y0, v0 ++ u0), and not the intermediate states, is guaranteed to have an instance of dimension lower than n+1.

D.3.
Multiple Step NP-i MCMC Say the involution of a Hybrid NP-i MCMC sampler is comprised of a list of bijective endofunctions on S(n), namely Φ(n) := f (n) L f (n) 2 f (n) 1 . To compute the new state, we can either apply the involution Φ(n) to the initial state (x0, v0) in one go and check whether the result (x, v) has an instance in the support of w, or for each ℓ= 1, . . . , L, apply the endofunction f (n) ℓ to (xℓ 1, vℓ 1) and (immediately) check whether the intermediate state (xℓ, vℓ) has an instance in the support of w. The Hybrid NP-i MCMC sampler presented in App. B.3 takes the first option as it is conceptually simpler. However, the second option is just as valid and more importantly give us the requirements needed to replace each endofunction by its slice in any subsequent re-runs . D.3.1. THE MULTIPLE STEP NP-IMCMC ALGORITHM Assume the target density w satisfies V1 and 2; and for each n N, there is a probability kernel K(n) and a list of L bijective endofunctions {f (n) ℓ : S(n) S(n) | ℓ= 1, . . . , L}n such that for each ℓ, {f (n) ℓ }n satisfies the projection commutation property (V3) and for each n N, their composition f (n) L f (n) 1 is involutive. Let s(n) ℓ be a slice of the endofunction f (n) ℓ . Given a SPCF program M with weight function w on the trace space, the Multiple Step NP-i MCMC sampler generates a Markov chain as follows. Given a current sample t0 of dimension k0, 1. (Initialisation Step) Form a k0-dimensional parameter variable x0 X(k0) by pairing each value t0i in t0 with a randomly drawn value t of the other type to make a pair (t0i, t) or (t, t0i) in the entropy space E. 2. (Stochastic Step) Introduce randomness to the sampler by drawing a k0-dimensional value v0 Y(k0) from the probability measure K(k0)(x0, ). 3. (Multiple Step) Initialise ℓ= 1. If ℓ= L, proceed to Step 4 with t as the proposed sample; otherwise 3.1. (Deterministic Step) Compute the ℓ-th state (xℓ, vℓ) by applying the endofunction f (n) ℓ to (xℓ 1, vℓ 1) where n = dim (xℓ 1). 3.2. (Extend Step) Test whether any instance t of xℓis in the support of w. If so, go to Step 3 with an incremented ℓ; otherwise (none of the instances of xℓis in the support of w), Nonparametric Involutive Markov Chain Monte Carlo 3.2.i Extend and replace the n-dimensional initial state from (x0, v0) to a state (x0 ++y0, v0 ++u0) of dimension n + 1 where y0 and u0 are values drawn randomly from µEιX(n+1) ιX(n) and µEιY(n+1) ιY(n) respectively. 3.2.ii For each i = 1, . . . , ℓ, extend and replace the n-dimensional i-th intermediate state from (xi, vi) to a state (xi ++ yi, vi ++ ui) of dimension n + 1 where (yi, ui) is the result of s(n+1) i (xi 1, vi 1). 3.2.iii Go to Step 3.2 with the extended n + 1-dimensional states (xi, vi) for i = 0, . . . , ℓ. 4. (Accept/reject Step) Accept the proposed sample t as the next sample with probability min 1; w(t) pdf K(k)(takek(x L, v L)) ϕX(n)(x L) ϕY(n)(v L) w(t0) pdf K(k0)(takek0(x0, v0)) ϕX(n)(x0) ϕY(n)(v0) ℓ=1 |det( f (n) ℓ (xℓ 1, vℓ 1))| where n = dim (x0) = dim (v0), k is the dimension of t and k0 is the dimension of t0; otherwise reject the proposal and repeat t0. Unlike in Hybrid NP-i MCMC, the Multiple Step NP-i MCMC sampler computes the intermediate states {(xℓ, vℓ)}ℓ=1,...,L one-by-one, making sure in Step 3.2 that each of these state (xℓ, vℓ) has an instance in the support of w. Hence when the dimension is incremented from n to n + 1 we can use the slice functions to extend intermediate states to states of dimension n + 1. Remark D.1. 
The Multiple Step NP-i MCMC sampler can be seen as a generalisation of Hybrid NP-i MCMC (and hence a generalisation of NP-i MCMC (App. B.3.2)) as we can recover Hybrid NP-i MCMC by setting L to one and taking the involution Φ(n) as the only endofunction in Multiple Step NP-i MCMC. D.3.2. PSEUDOCODE OF MULTIPLE STEP NP-IMCMC ALGORITHM Listing 10 gives a SPCF implementation of Multiple Step NP-i MCMC as the function Multistep NPi MCMC with target density w ; auxiliary kernel auxkernel[n] and its density pdfauxkernel[n] and L number of endofunctions f[n][l] ( l ranges from 1 to L ) for each dimension n with slice slice[n][l] and the absolute value of its Jacobian determinant absdetjacf[n][l] ; parameter and auxiliary index maps index X and index Y and projection D.3.3. CORRECTNESS OF MULTIPLE STEP NP-IMCMC ALGORITHM The Multiple Step NP-i MCMC sampler cannot be formulated as an instance of Hybrid NP-i MCMC and requires a separate proof. Nonetheless, the arguments are similar. Prop. B.3 tells us that as long as w almost surely terminates (V2), the measure of a n-dimensional parameter variable not having any instances in the support of w tends to zero as the dimension n tends to infinity. As f (n) ℓ is bijective (and hence invertible), the Multiple Step NP-i MCMC sampler almost surely satisfies the condition in the loop in Step 3.2 and hence almost surely terminates. Next, we identify the state distribution of Multiple Step NP-i MCMC. We say a n-dimensional state (x, v) is valid if (i) For all ℓ= 1, . . . , L, instance(xℓ) Supp(w) = where (x0, v0) := (x, v) and (xℓ, vℓ) := f (n) ℓ (xℓ 1, vℓ 1); and (ii) For all ℓ= 1, . . . , L, instance(yℓ) Supp(w) = where (y0, u0) := (x, v) and (y L ℓ+1, u L ℓ+1) := 1(y L ℓ, u L ℓ); and (iii) For all k < n, takek(x, v) is not a valid state. Then, we can define the state distribution and show that the state movement in Multiple Step NP-i MCMC is invariant against this distribution. Finally, we conclude by Lem. B.17 that the Multiple Step NP-i MCMC sampler is correct. D.3.4. TRANSFORMING MULTIPLE STEP NP-IMCMC SAMPLER Recall we discussed three techniques in App. C, each when applied to the Hybrid NP-i MCMC sampler, improve its flexibility and/or efficiency. We now see how these techniques can be applied to Multiple Step NP-i MCMC. Nonparametric Involutive Markov Chain Monte Carlo Listing 10. Code for Multiple Step NP-i MCMC 1 def Multistep NPi MCMC(t0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 v0 = auxkernel[k0](x0) # stochastic step 5 6 # start of multiple step 7 n = k0 8 (x[0],v[0]) = (x0,v0) 9 for l in range(1,L+1): 10 (x[l],v[l]) = f[n][l](x[l-1],v[l-1]) # deterministic step 11 while not intersect(instance(x[l]),support(w)): # extend step 12 x[0] = x[0] + [(normal, coin)]*(index X(n+1)-index X(n)) 13 v[0] = v[0] + [(normal, coin)]*(index Y(n+1)-index Y(n)) 14 for i in range(1,l+1): 15 (y,u) = slice[n+1][i](x[i-1],v[i-1]) 16 (x[i],v[i]) = (x[i]+y, v[i]+u) 17 n = n + 1 18 (x0,v0) = (x[0],v[0]) 19 (x,v) = (x[L],v[L]) 20 # end of multiple step 21 22 t = intersect(instance(x),support(w))[0] # accept/reject step 23 k = dim(t) 24 return t if uniform < min{1,w(t)/w(t0) * 25 pdfauxkernel[k](proj((x,v),k))/ 26 pdfauxkernel[k0](proj((x0,v0),k0)) * 27 pdfpar[n](x)/pdfpar[n](x0) * 28 pdfaux[n](v)/pdfaux[n](v0) * 29 product([absdetjacf[n][l](x[l-1],v[l-1]) for l in range(1,L+1)]) } 30 else t0 D.3.5. 
STATE-DEPENDENT MULTIPLE STEP NP-IMCMC MIXTURE This technique allows us to mix Multiple Step NP-i MCMC samplers in such a way that the resulting sampler still preserves the posterior. Given a collection of Multiple Step NP-i MCMC samplers, indexed by m Eα for some α N, the State-dependent Multiple Step NP-i MCMC Mixture sampler draws an indicator m Eα from a probability measure KM(x0, ) on Eα where KM : S n N X(n) Eα is a probability kernel and x0 is the parameter variable constructed from the current sample t0 in Step 1. A proposal t is then generated by running Steps 2 and 3 of the m-indexed Multiple Step NP-i MCMC sampler, and is accepted with a modified probability that includes the probability of picking m. Pseudocode Listing 11 gives the SPCF implementation of this sampler as the Mixture MSNPi MCMC function. (Terms specific to this technique are highlighted.) We assume the SPCF term mixkernel implements the mixture kernel KM; pdfmixkernel implements the probability density function pdf KM; and for each m Eα and n N, auxkernel[n][m] implements the auxiliary kernel and pdfauxkernel[n][m] implements its density; f[n][l][m] implements the endofunction slice[n][l][m] implements its slice and absdetjacf[n][l][m] implements the absolute value of the Jacobian determinant of the endofunction of the m -indexed Multiple Step NP-i MCMC sampler. Correctness Mixture MSNPi MCMC can be formulated as an instance of Multistep NPi MCMC with auxiliary kernel mixauxkernel[n] and its density mixpdfauxkernel[n] and L number of endofunctions mixf[n][l] ( l ranges from 1 to L ) for each dimension n with slice mixslice[n][l] and the absolute value of its Jacobian determinant absdetjacmixf[n][l] ; parameter and auxiliary index maps mixindex X and mixindex Y and projection mixproj given in Listing 12. Nonparametric Involutive Markov Chain Monte Carlo Listing 11. Pseudocode of the State-dependent Multiple Step NP-i MCMC Mixture algorithm 1 def Mixture MSNPi MCMC(t0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 m = mixkernel(x0) # mixture step 5 v0 = auxkernel[k0][m](x0) # stochastic step 6 7 # start of multiple step 8 n = k0 9 (x[0],v[0]) = (x0,v0) 10 for l in range(1,L+1): 11 (x[l],v[l]) = f[n][l][m](x[l-1],v[l-1]) # deterministic step 12 while not intersect(instance(x[l]),support(w)): # extend step 13 x[0] = x[0] + [(normal, coin)]*(index X(n+1)-index X(n)) 14 v[0] = v[0] + [(normal, coin)]*(index Y(n+1)-index Y(n)) 15 for i in range(1,l+1): 16 (y,u) = slice[n+1][i][m](x[i-1],v[i-1]) 17 (x[i],v[i]) = (x[i]+y, v[i]+u) 18 n = n + 1 19 (x0,v0) = (x[0],v[0]) 20 (x,v) = (x[L],v[L]) 21 # end of multiple step 22 23 t = intersect(instance(x),support(w))[0] # accept/reject step 24 k = dim(t) 25 return t if uniform < min{1, w(t)/w(t0) * 26 pdfauxkernel[k][m](proj((x,v),k))/ 27 pdfauxkernel[k0][m](proj((x0,v0),k0)) * 28 pdfpar[n](x)/pdfpar[n](x0) * 29 pdfaux[n](v)/pdfaux[n](v0) * 30 pdfmixkernel(proj(x,k),m)/ 31 pdfmixkernel(proj(x0,k0),m) * 32 product([absdetjacf[n][m](x[l-1],v[l-1]) for l in range (1,L+1)])} 33 else t0 Listing 12. 
Pseudocode for the correctness of State-dependent Multiple Step NP-i MCMC Mixture 1 def mixauxkernel[n](x0): 2 m = mixkernel(x0) 3 v0 = auxkernel[n][m](x0) 4 return m + v0 5 6 def mixf[n][l](x0,mixv0): 7 m = mixv0[:a] 8 v0 = mixv0[a:] 9 (x,v) = f[n][l][m](x0,v0) 10 return (x,m + v) 11 12 def mixslice[n][l](x0,mixv0): 13 m = mixv0[:a] 14 v0 = mixv0[a:] 15 (y,u) = slice[n][l][m](x0,v0) 16 return (y,u) 17 18 def mixindex X(n): return index X(n) 19 def mixindex Y(n): return a + index Y(n) 20 def mixproj((x,v),k): return (x[:mixindex X(k)],v[:mixindex Y(k)]) Nonparametric Involutive Markov Chain Monte Carlo D.3.6. DIRECTION MULTIPLE STEP NP-IMCMC This technique allows us to relax the assumption that the composition f (n) L f (n) 2 f (n) 1 is involutive. Assume for ℓ= 1, . . . , L, both sets {f (n) ℓ }n and {f (n) ℓ 1}n satisfy the projection commutation property (V3), the Direction Multiple Step NP-i MCMC sampler randomly employ either f (n) L f (n) 2 f (n) 1 or f (n) 1 1 to move around the n-dimensional state space and proposes a new sample. Pseudocode Listing 13 gives the SPCF implementation of this sampler as Direction MSNPi MCMC function. (Terms specific to this technique are highlighted.) We assume for each n N and d 2, the SPCF term f[n][l][True] implements the endofunction f (n) ℓ and f[n][l][False] implements the inverse f (n) L ℓ+1 1; slice[n][l][True] implements the slice of f (n) ℓ and slice[n][l][False] implements the slice of f (n) L ℓ+1 1; and absdetjacf[n][l][True] implements the absolute value of the Jacobian determinant of f (n) ℓ and absdetjacf[n][l][False] implements that of f (n) L ℓ+1. Correctness Direction MSNPi MCMC can be formulated as an instance of Multistep NPi MCMC with auxiliary kernel dirauxkernel[n] and its density pdfdirauxkernel[n] and dir L number of endofunctions dirf[n][l] ( l ranges from 1 to dir L ) for each dimension n with slice dirslice[n][l] and the absolute value of its Jacobian determinant absdetjacf[n][l] ; parameter and auxiliary index maps dirindex X and dirindex Y and projection dirproj given in Listing 14. Note the dirf[n] function denotes the composition that flips the direction after applying the endofunctions f (n) ℓ for ℓ= 1, . . . , L with an inverse the flips the direction and then apply the endofunctions f (n) L ℓ+1 for ℓ= 1, . . . , L. D.3.7. PERSISTENT MULTIPLE STEP NP-IMCMC ALGORITHM This technique gives us a method to construct irreversible Multiple Step NP-i MCMC samplers. The key is to persist the direction from a previous iteration. The Persistent Multiple Step NP-i MCMC sampler keeps trace of a direction variable d0 2 (instead of sampling a fresh one at the start) and use it to determine the auxiliary kernel (K(n) T : X(n) Y(n) or K(n) F : X(n) Y(n)) and list of endofunctions (f (n) L f (n) 1 or f (n) 1 1 ) employed. This direction variable is flipped strategically to make the resulting algorithm irreversible. Pseudocode Listing 15 gives the SPCF implementation of this sampler as the function Persistent MSNPi MCMC . (Terms specific to this technique are highlighted.) In addition to the SPCF terms in Direction MSNPi MCMC , the SPCF term auxkernel[n][True] implements the auxiliary kernel K(n) T and pdfauxkernel[n][True] implements its density pdf K(n) T and auxkernel[n][False] implements the auxiliary kernel K(n) F and pdfauxkernel[n][False] implements its density pdf K(n) F. Note that Persistent MSNPi MCMC updates samples on the space X(n) 2, which can easily be marginalised to X(n) by taking the first ιX(n) components. 
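To make the direction bookkeeping of the persistent construction concrete, here is a minimal Python sketch (illustrative only; the toy proposal and its acceptance probability are made up) of how the direction is threaded through successive iterations: it is kept when a proposal is accepted and flipped when it is rejected, mirroring the return values (t, not d) and (t0, d) of Persistent MSNPi MCMC with d = not d0.

# Minimal sketch (illustrative, not the paper's code) of how the persistent
# direction is threaded through iterations: the direction is kept when a
# proposal is accepted and flipped when it is rejected.
import random

def persistent_chain(step, t0, iters, seed=0):
    """step(t, d) -> (t_proposed, accept_prob) is a direction-indexed proposal."""
    random.seed(seed)
    t, d = t0, True
    chain = []
    for _ in range(iters):
        t_prop, accept_prob = step(t, d)
        if random.random() < accept_prob:
            t = t_prop            # accept: keep the direction
        else:
            d = not d             # reject: flip the direction
        chain.append((t, d))
    return chain

# toy directional random walk on the reals (purely illustrative)
def toy_step(t, d):
    proposal = t + (0.5 if d else -0.5) * random.random()
    return proposal, 0.7          # a made-up acceptance probability

print(persistent_chain(toy_step, 0.0, 5)[:3])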
Correctness Consider the Multistep NPi MCMC function with auxiliary kernel perauxkernel[n] and its density pdfperauxkernel[n] and per L number of endofunctions perf[n][l] ( l ranges from 1 to per L ) for each dimension n with slice perslice[n][l] and the absolute value of its Jacobian determinant absdetjacperf[n][l] ; parameter and auxiliary index maps perindex X and perindex Y and projection perproj given in Listing 16. The Multistep NPi MCMC function with the primitives indicated in Listing 16 is almost equivalent to Persistent MSNPi MCMC , except Multistep NPi MCMC induces a transition kernel on E X(n) whereas Persistent MSNPi MCMC induces a transition kernel on 2 X(n); and when the proposal t is accepted, Multistep NPi MCMC returns d whereas Persistent MSNPi MCMC returns not d . By composing Multistep NPi MCMC with flipdir which flips the direction and marginalising the Markov chain Nonparametric Involutive Markov Chain Monte Carlo Listing 13. Pseudocode of the Direction Multiple Step NP-i MCMC algorithm 1 def Direction MSNPi MCMC(t0): 2 d0 = coin # direction step 3 k0 = dim(t0) # initialisation step 4 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 5 v0 = auxkernel[k0](x0) # stochastic step 6 n = k0 # multiple step 7 (x[0],v[0]) = (x0,v0) 8 for l in range(1,L+1): 9 (x[l],v[l]) = f[n][l][d0](x[l-1],v[l-1]) # deterministic step 10 while not intersect(instance(x[l]),support(w)): # extend step 11 x[0] = x[0] + [(normal, coin)]*(index X(n+1)-index X(n)) 12 v[0] = v[0] + [(normal, coin)]*(index Y(n+1)-index Y(n)) 13 for i in range(1,l+1): 14 (y,u) = slice[n+1][i][d0](x[i-1],v[i-1]) 15 (x[i],v[i]) = (x[i]+y, v[i]+u) 16 n = n + 1 17 (x0,v0) = (x[0],v[0]) 18 (x,v) = (x[L],v[L]) 19 d = not d0 # flip direction (not used) 20 t = intersect(instance(x),support(w))[0] # accept/reject step 21 k = dim(t) 22 return t if uniform < min{1, w(t)/w(t0) * 23 pdfauxkernel[k](proj((x,v),k))/ 24 pdfauxkernel[k0](proj((x0,v0),k0)) * 25 pdfpar[n](x)/pdfpar[n](x0) * 26 pdfaux[n](v)/pdfaux[n](v0) * 27 product([absdetjacf[n][l][d0](x[l-1],v[l-1]) for l in range(1,L+1)])} 28 else t0 Listing 14. Pseudocode for the correctness of Direction Multiple Step NP-i MCMC 1 def dirauxkernel[n](x0): 2 d0 = coin 3 v0 = auxkernel[n](x0) 4 return [(normal, d0)] + v0 5 6 def dirf[n][l](x,dirv): 7 d = dirv[0][1] 8 v = dirv[1:] 9 if l == dir L: return (x,[(dirv[0][0],not d)] + v) 10 else: 11 (x,v) = f[n][l][d](x,v) 12 return (x, [(dirv[0][0],d)] + v) 13 14 def dirslice[n][l](x,dirv): 15 d = dirv[0][1] 16 v = dirv[1:] 17 return slice[n][l][d](x,v) 18 19 dir L = L+1 20 def dirindex X(n): return index X(n) 21 def dirindex Y(n): return 1+index Y(n) 22 def dirproj((x,v),k): return (x[:dirindex X(k)], v[:dirindex Y(k)]) Nonparametric Involutive Markov Chain Monte Carlo Listing 15. 
Pseudocode of the Persistent Multiple Step NP-i MCMC algorithm 1 def Persistent MSNPi MCMC(t0,d0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 v0 = auxkernel[k0][d0](x0) # stochastic step 5 n = k0 # multiple step 6 (x[0],v[0]) = (x0,v0) 7 for l in range(1,L+1): 8 (x[l],v[l]) = f[n][l][d0](x[l-1],v[l-1]) # deterministic step 9 while not intersect(instance(x[l]),support(w)): # extend step 10 x[0] = x[0] + [(normal, coin)]*(index X(n+1)-index X(n)) 11 v[0] = v[0] + [(normal, coin)]*(index Y(n+1)-index Y(n)) 12 for i in range(1,l+1): 13 (y,u) = slice[n+1][i][d0](x[i-1],v[i-1]) 14 (x[i],v[i]) = (x[i]+y, v[i]+u) 15 n = n + 1 16 (x0,v0) = (x[0],v[0]) 17 (x,v) = (x[L],v[L]) 18 d = not d0 # flip direction 19 t = intersect(instance(x),support(w))[0] # accept/reject step 20 k = dim(t) 21 return (t, not d) if uniform < min{1, w(t)/w(t0) * 22 pdfauxkernel[k][d](proj((x,v),k))/ 23 pdfauxkernel[k0][d0](proj((x0,v0),k0)) * 24 pdfpar[n](x)/pdfpar[n](x0) * 25 pdfaux[n](v)/pdfaux[n](v0) * 26 product([absdetjacf[n][l][d0](x[l-1],v[l-1]) for l in range(1,L +1)])} 27 else (t0, d) Listing 16. Pseudocode for the correctness of Persistent Multiple Step NP-i MCMC 1 def perauxkernel[n](perx0) 2 d0 = perx0[0][1]; x0 = perx0[1:] 3 v0 = auxkernel[n][d0](x0) 4 return v0 5 6 def perf[n][l](perx,v) 7 d = perx[0][1]; x = perx[1:] 8 if l == per L: return ([(perx[0][0],not d)] + x, v) 9 else: 10 (x,v) = f[n][l][d](x,v) 11 return ([(perx[0][0],d)] + x, v) 12 13 def perslice[n][l](perx,v) 14 d = perx[0][1]; x = perx[1:] 15 return slice[n][l][d](x,v) 16 17 per L = L+1 18 def perindex X(n): return 1+index X(n) 19 def perindex Y(n): return index Y(n) 20 def perproj((x,v),k): return (x[:perindex X(k)],v[:perindex Y(k)]) 21 22 def flipdir(perx0,v0): 23 perx0[0][1] = not perx0[0][1] 24 return (perx0,v0) Nonparametric Involutive Markov Chain Monte Carlo generated by the composition from E X(n) to 2 X(n), we get Persistent MSNPi MCMC . Nonparametric Involutive Markov Chain Monte Carlo E. Examples of Nonparametric Involutive MCMC In this section, we design novel nonparametric samplers using the Hybrid NP-i MCMC method described in App. B or the Multiple Step NP-i MCMC method described in App. D. We assume the target density function w on the trace space T is tree representable and satisfies H1 and 2. Specifications of the auxiliary kernels and involutions are given for each sampler. E.1. Nonparametric Metropolis-Hastings As discussed in Sec. 2.1, the standard MH sampler can be seen as an instance of the i MCMC sampler with the proposal distribution q as the auxiliary kernel and a swap function as the involution. Suppose a proposal kernel q(n) : En En exists for each dimension n N. Setting both ιX and ιY to be identities (which means X(n) = Y(n) = En for all n N), the Hybrid NP-i MCMC method (App. B.3) gives an nonparametric extension of the MH sampler. Listing 17. Pseudocode of the NP-MH algorithm 1 def NPMH(t0): 2 k0 = dim(t0) # initialisation step 3 x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0] 4 v0 = q[k0](x0) # stochastic step 5 (x,v) = (v0,x0) # deterministic step 6 while not intersect(instance(x),support(w)): # extend step 7 x0 = x0 + [(normal, coin)] 8 v0 = v0 + [(normal, coin)] 9 (x,v) = (v0,x0) 10 t = intersect(instance(x),support(w))[0] # accept/reject step 11 k = dim(t) 12 return t if uniform < min{1, w(t)/w(t0) * pdfq[k](proj((x,v),k))/pdfq[k0](proj((x0,v0) ,k0))} 13 else t0 The NPMH function in Listing 17 is a SPCF implementation of this sampler. 
It can be seen as an instance of the NPi MCMC function with auxkernel[n] replaced by the proposal distribution q[n] , pdfauxkernel[n] replaced by the pdf of the proposal distribution pdfq[n] , involution[n] replaced by a swap function, and index X and index Y replaced by identities, alongside a simplified acceptance ratio: since (x, v) = (v0, x0) and ϕ_{X(n)} = ϕ_{Y(n)}, we have
[ϕ_{X(n)}(x) / ϕ_{X(n)}(x0)] · [ϕ_{Y(n)}(v) / ϕ_{Y(n)}(v0)] · |det ∇Φ^(n)(x0, v0)| = 1.

E.2. Nonparametric Metropolis-Hastings with Persistence

Following the persistence technique described in App. C.3 for Hybrid NP-i MCMC, we can form a nonreversible variant of the NP-MH sampler described in App. E.1. We call the resulting algorithm the Nonparametric Metropolis-Hastings with Persistence (NP-MH-P) sampler.

Suppose a proposal kernel q^(n) : E^n ⇝ E^n exists for each dimension n ∈ N. As in NP-MH, both ι_X and ι_Y are set to be identities (which means X^(n) = Y^(n) = E^n for all n ∈ N). Following (Turitsyn et al., 2011), given a parameter variable x ∈ E^n, we can partition the auxiliary space E^n into two sets U_{x,+} := {v ∈ E^n | η(v) ≥ η(x)} and U_{x,−} := {v ∈ E^n | η(v) < η(x)}, where η : E^n → R is some measurable function, and form two kernels K^(n)_+ and K^(n)_− from q^(n) defined as
K^(n)_+(x, V) := q^(n)(V ∩ U_{x,+}) / q^(n)(U_{x,+})  and  K^(n)_−(x, V) := q^(n)(V ∩ U_{x,−}) / q^(n)(U_{x,−}).  (3)
Using the Persistent (Hybrid) NP-i MCMC sampler as described in App. C.3, a nonreversible variant of NP-MH can be formed.

Listing 18. Pseudocode of the NP-MH with Persistence algorithm
1 def NPMHw P(t0,d0):
2   k0 = dim(t0) # initialisation step
3   x0 = [(e, coin) if Type(e) in R else (normal, e) for e in t0]
4   v0 = auxkernel[k0][d0](x0) # stochastic step
5   (x,v) = (v0,x0) # deterministic step
6   n = k0 # extend step
7   while not intersect(instance(x),support(w)):
8     x0 = x0 + [(normal, coin)]
9     v0 = v0 + [(normal, coin)]
10    n = n + 1
11    (x,v) = (v0,x0)
12  d = not d0
13  t = intersect(instance(x),support(w))[0] # accept/reject step
14  k = dim(t)
15  return (t, not d) if uniform < min{1, w(t)/w(t0) *
16    pdfauxkernel[k][d](proj((x,v),k))/
17    pdfauxkernel[k0][d0](proj((x0,v0),k0))}
18  else (t0, d)

The NPMHw P function in Listing 18 is an SPCF implementation of this sampler. It can be seen as an instance of the Persistent NPi MCMC function (Listing 8) with auxkernel[n][True] implementing K^(n)_+ and auxkernel[n][False] implementing K^(n)_−. Note how the direction d0 (and hence the family of auxiliary kernels) is persisted if the proposal t is accepted.

E.3. Nonparametric Hamiltonian Monte Carlo

Nonparametric Hamiltonian Monte Carlo (NP-HMC) is an MCMC sampler introduced by Mak et al. (2021b) for probabilistic programming. Here we show that it is an instance of the Direction Multiple Step NP-i MCMC sampler (App. D.3.6).

Typically, the Hamiltonian Monte Carlo (HMC) sampler takes a target density on R^n and proposes a new state by simulating L leapfrog steps: L := (φ^M_{ε/2} ∘ φ^P_ε ∘ φ^M_{ε/2})^L, where φ^M_ε(x, v) := (x, v − ε∇U(x)) and φ^P_ε(x, v) := (x + εv, v) are the momentum and position updates with step size ε respectively. Notice that the momentum and position updates satisfy the projection commutation property (H3); have inverses (φ^M_ε)^{−1} = M ∘ φ^M_ε ∘ M and (φ^P_ε)^{−1} = M ∘ φ^P_ε ∘ M, where M(x, v) := (x, −v); and have slices drop_{n−1} (for φ^M_{ε/2}; see App. D.2.1 for more details) and (x, v) ↦ (x_n + εv_n, v_n) (for φ^P_ε) respectively. Moreover, the absolute values of the Jacobian determinants of both updates are |det ∇φ^M_ε(x, v)| = |det ∇φ^P_ε(x, v)| = 1.
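The inversion identities above can be checked numerically. Below is a minimal numpy sketch (illustrative only; the quadratic potential U(x) = ‖x‖²/2 and all names are our own assumptions, not the paper's code) verifying that conjugating each update by the momentum flip M(x, v) = (x, −v) inverts it.

# Minimal sketch (numpy; the quadratic potential is an illustrative assumption)
# checking (phi^M_eps)^-1 = M o phi^M_eps o M and (phi^P_eps)^-1 = M o phi^P_eps o M,
# where M(x, v) = (x, -v).
import numpy as np

eps = 0.1
grad_U = lambda x: x                      # U(x) = ||x||^2 / 2, so grad U(x) = x

def phi_M(x, v): return x, v - eps * grad_U(x)   # momentum update
def phi_P(x, v): return x + eps * v, v           # position update
def M(x, v):     return x, -v                    # momentum flip

rng = np.random.default_rng(0)
x, v = rng.normal(size=3), rng.normal(size=3)

for phi in (phi_M, phi_P):
    x1, v1 = phi(x, v)                    # one step forward
    x2, v2 = M(*phi(*M(x1, v1)))          # conjugate by M: should undo the step
    assert np.allclose(x2, x) and np.allclose(v2, v)
print("M o phi o M inverts both leapfrog updates")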
Given a target density w on S n N Rn, the HMC sampler can be extended using the Direction Multiple Step NP-i MCMC sampler. Given an input sample t0 Rk0, a k0-dimensional initial state (x0, v0) is formed where x0 := t0 and v0 drawn from K(n)(x, ) := Nn. A direction variable d0 is drawn to determine whether the leapfrog steps L or its inverse L 1 is performed on the initial state (x0, v0), one update at a time, extending the dimension as required. Say the initial state is extended to a n-dimensional state (x0, v0) and is traversed to the n-dimensional new state (x , v ) which has an instance t in the support of w. t is returned with probability min 1; w(t) ϕn(x ) ϕn(v ) w(t0) ϕn(x0) ϕn(v0) Pseudocode of NP-HMC Listing 19 gives the SPCF implementations leapfrog[n][m] and leapfrogslice[n][m] , where leapfrog[n][m][True] and leapfrogslice[n][m][True] return the m -th endofunction and its slice in the composition (φM ϵ/2 φP ϵ φM ϵ/2)L of 3L updates respectively; and similarly, leapfrog[n][m][False] and leapfrogslice[n][m][False] return the m -th endofunction and its slice in Nonparametric Involutive Markov Chain Monte Carlo Listing 19. Pseudocode of the leapfrog step and its slice in NP-HMC 1 def leapfrog[n][m][d0](x,v): 2 if d0: 3 if m % 3 == 0 or m % 3 == 2: 4 return (x, v-ep/2*grad(U)(x)) # half momentum update 5 else: return (x+ep*v, v) # full position update 6 else: 7 if m % 3 == 0 or m % 3 == 2: 8 return (x, v+ep/2*grad(U)(x)) # inverse of half momentum update 9 else: return (x-ep*v, v) # inverse of full position update 10 11 def leapfrogslice[n][m][d0](x,v): 12 if d0: 13 if m % 3 == 0 or m % 3 == 2: 14 return (x[-1], v[-1]) # slice of half momentum update 15 else: return (x[-1]+ep*v[-1], v[-1]) # slice of full position update 16 else: 17 if m % 3 == 0 or m % 3 == 2: 18 return (x[-1], v[-1]) # slice of inverse of half momentum 19 else: return (x[-1]-ep*v[-1], v[-1]) # slice of inverse of full position Listing 20. Pseudocode for NP-HMC 1 def NPHMC(t0): 2 d0 = coin # direction step 3 k0 = dim(t0) # initialisation step 4 x0 = t0 5 v0 = [normal]*k0 # stochastic step 6 # start of multiple step 7 n = k0 8 (x[0],v[0]) = (x0,v0) 9 for m in range(1,3L+1): 10 (x[m],v[m]) = leapfrog[n][m][d0](x[m-1],v[m-1]) # deterministic step 11 while not intersect(instance(x[m]),support(w)): # extend step 12 x[0] = x[0] + [normal] 13 v[0] = v[0] + [normal] 14 for i in range(1,m+1): 15 (y,u) = leapfrogslice[n][m][d0](x[m-1],v[m-1]) 16 (x[i],v[i]) = (x[i]+y, v[i]+u) 17 n = n + 1 18 (x0,v0) = (x[0],v[0]) 19 (x,v) = (x[3L],v[3L]) 20 d = d0 21 # end of multiple step 22 t = intersect(instance(x),support(w))[0] # accept/reject step 23 k = dim(t) 24 return t if uniform < min{1,w(t)/w(t0) * pdfnormal[n](x)/pdfnormal[n](x0) * 25 pdfnormal[n](v)/pdfnormal[n](v0)} 26 else t0 Nonparametric Involutive Markov Chain Monte Carlo Listing 21. 
Pseudocode for the NP-HMC with Persistence algorithm 1 NPHMCw Persistent((x0,v0),d0) = Persist Mom(Corrupt Mom((x0,v0),d0)) 2 3 def HMCw(x,v): return w(x) 4 5 def Corrupt Mom((x0,v0),d0): 6 u = [normal(v0[i]*sqrt(1-alpha 2), alpha 2) for i in range(len(v0))] 7 return ((x0,u),d0) 8 9 def Persist Mom((x0,v0),d0): 10 k0 = dim(x0) # initialisation step 11 # start of multiple step 12 n = k0 13 (x[0],v[0]) = (x0,v0) 14 for m in range(1,3L+1): 15 (x[m],v[m]) = leapfrog[n][m][d0](x[m-1],v[m-1]) # deterministic step 16 while not intersect(instance(x[m],v[m]),support(HMCw)): # extend step 17 x[0] = x[0] + [normal] 18 v[0] = v[0] + [normal] 19 for i in range(1,m+1): 20 (y,u) = leapfrogslice[n+1][i][d0](x[i-1],v[i-1]) 21 (x[i],v[i]) = (x[i]+y, v[i]+u) 22 n = n + 1 23 d = not d0 # flip direction 24 # end of multiple step 25 (x,v) = intersect(instance(x[3L],v[3L]),support(HMCw))[0] # accept/reject step 26 return ((x,v), not d) if uniform < min{1, HMCw(x,v)/HMCw(x0,v0) * 27 pdfnormal[n](x[3L])/pdfnormal[n](x[0]) * 28 pdfnormal[n](v[3L])/pdfnormal[n](v[0]) } 29 else ((x0,v0), d) (φM ϵ/2 1 φP ϵ 1 φM ϵ/2 1)L. Listing 20 gives the SPCF implementation NPHMC of the NP-HMC sampler as an instance of the Direction Multiple Step NP-i MCMC sampler. Importantly, the expensive leapfrog[n][m] function is called once for each m ranging from 1 to 3L the lightweight leapfrogslice is called in any subsequent re-runs. Correctness Since both φM ϵ/2 and φP ϵ are bijective and satisfies the projection commutation property (H3), the correctness of NP-HMC is implied by the correctness of Direction Multiple Step NP-i MCMC sampler. E.4. Nonparametric Hamiltonian Monte Carlo with Persistence With the catalogue of techniques explored in App. D.3.4, novel irreversible variants of the NP-HMC algorithm can be formed. Here we focus on the NP-HMC with Persistence algorithm which can be seen as a nonparametric extension of the Generalised HMC algorithm (Horowitz, 1991). E.4.1. GENERALISED HMC Horowitz (1991) made two changes to the conventional HMC algorithm in order to generate an irreversible Markov chain on Rn Rn and improve its performance. 1. A corrupted momentum is used to move round the state space. 2. The direction is persisted if the proposal is accepted; otherwise it is negated. The resulting sampler is called the Generalised HMC algorithm as it is a generalisation of the typical HMC sampler. Nonparametric Involutive Markov Chain Monte Carlo As shown in (Neklyudov et al., 2020), the Generalised HMC algorithm can be presented as a composition of an i MCMC algorithm that corrupts the momentum and a Persistent i MCMC algorithm that uses Hamiltonian dynamics to find a new state with a persisting direction. We consider a similar approach in our construction of a nonparametric extension of Generalised HMC. E.4.2. NP-HMC WITH PERSISTENCE State Density Let the state (x, v) Rn Rn has density w (x, v) := w(x) w.r.t. the normal distribution N2n. It is clear that this density w is integrable (H1) and almost surely terminating (H2). By setting the parameter index map to ιX(n) := 2n and parameter space X(n) := Rn Rn, the state (x, v) of length 2n is a n-dimensional parameter variable. HMCw in Listing 21 is a SPCF implementation of w . Corrupt Momentum Given the current state (x0, v0) Rn Rn with direction d0 2, a new momentum is drawn from the distribution Nn(v0 1 α2, α2) for a hyper-parameter α [0, 1). 
This can be presented in the NP-i MCMC format with the auxiliary variable u sampled from Nn(v0 1 α2, α2) and the swap (((x0, v0), d0), u) 7 (((x0, u), d0), v0) as the involution. Since the new state (x0, u) always have an instance in the support of w , and the acceptance ratio is min n 1, w (x0, u) ϕ2n(x0, u) pdf2(d0) ϕn(v0 | u 1 α2, α2) w (x0, v0) ϕ2n(x0, v0) pdf2(d0) ϕn(u | v0 the extend step (Step 3) and the accept/reject step (Step 4) of the NP-i MCMC sampler can both be skipped. This results in a sampler that has the SPCF implementation Corrupt Mom in Listing 21. Persist Momentum We consider the Persistent Multiple Step NP-i MCMC algorithm (App. D.3.7) with the target density w as follows. Given a k0-dimensional parameter (x0, v0) X(n) := Rn Rn and direction d0 2, a dummy auxiliary variable u Y(n) := Rn is sampled from K(n)((x0, v0), ) := Nn to form an initial state ((x0, v0), u). Depending on the direction d0, either (φM ϵ/2 id Rn) (φP ϵ id Rn) (φM ϵ/2 id Rn) L or its inverse is performed on ((x0, v0), u), one update at a time, extending the dimension as required. Say the initial state is extended to a n-dimensional ((x 0, v 0), u ) and is traversed to the n-dimensional new state ((x , v ), u ). Then, the instance (x, v) Supp(w ) of the n-dimensional parameter (x , v ) is returned alongside the direction variable d0 with probability min 1; w (x, v) ϕn(x ) ϕn(v ) w (x0, v0) ϕn(x 0) ϕn(v 0) Note that the auxiliary variable u has no effect on the sampler. Hence, Listing 21 gives a SPCF implementation Persist Mom where the stochastic step (Step 2) is skipped. NP-HMC with Persistence Composing the samplers which corrupts and persists the momentum gives us the NPHMC with Persistence algorithm, which is an nonparametric extension of Generalised HMC. Listing 21 gives the SPCF implementation NPHMCw Persistent by composing Corrupt Mom and Persist Mom . E.5. Nonparametric Look Ahead Hamiltonian Monte Carlo Last but not least, we extend the Look Ahead HMC algorithm (Sohl-Dickstein et al., 2014), which is equivalent to the Extra Chance Generalised HMC algorithm (Campos & Sanz-Serna, 2015). E.5.1. LOOK AHEAD HMC The Look Ahead HMC sampler modifies the Generalised HMC algorithm by performing extra leapfrog steps when the proposal state is rejected. This has the effect of increasing the acceptance rate for each proposal. To see Look Ahead HMC as an instance of Persistent i MCMC, we consider the involution Φ on Rn Rn [0, 1) 2 given Nonparametric Involutive Markov Chain Monte Carlo Listing 22. Pseudocode for the NP Look Ahead HMC algorithm 1 NPLook Ahead HMC((x0,v0),d0) = Extra Leapfrog(Corrupt Mom((x0,v0),d0))) 2 3 def Extra Leapfrog((x0,v0),d0): 4 k0 = dim(x0) # initialisation step 5 u = uniform # stochastic step 6 # start of multiple step 7 n = k0 8 m = 0 9 (x[m],v[m]) = (x0,v0) 10 11 stop = False 12 while not stop: 13 j = 1 14 M = j*3*L 15 # perform a set of leapfrog steps, i.e. 
to compute (x[i],v[i]) for i in range(m,M) 16 while m < M+1: 17 (x[m],v[m]) = leapfrog[n][m][d0](x[m-1],v[m-1]) # deterministic step 18 while not intersect(instance(x[m],v[m]),support(HMCw)): # extend step 19 x[0] = x[0] + [normal] 20 v[0] = v[0] + [normal] 21 for i in range(1,m+1): 22 (y,u) = leapfrogslice[n+1][i][d0](x[i-1],v[i-1]) 23 (x[i],v[i]) = (x[i]+y, v[i]+u) 24 n = n + 1 25 m = m + 1 26 (x,v) = intersect(instance(x[M],v[M]),support(HMCw))[0] 27 if u > min{1,HMCw(x,v)/HMCw(x0,v0) * 28 pdfnormal[n](x[M])/pdfnormal[n](x[0]) * 29 pdfnormal[n](v[M])/pdfnormal[n](v[0]) }: 30 if j <= J: 31 # perform an extra set of leapfrog steps 32 j = j + 1 33 else: 34 # no leapfrog steps is performed 35 (x,v) = (x0,v0) 36 stop = True 37 d = d0 38 else: 39 # enough leapfrog steps are performed 40 stop = True 41 d = not d0 42 # end of multiple step 43 return ((x,v), not d) Nonparametric Involutive Markov Chain Monte Carlo 0 σ1 σ3 σ2 σ4 1 (L1(x, v), u σ1 , F) (L2(x, v), u (L4(x, v), u (L0(x, v), u (L3(x, v), u Figure 10. Result of Φ(x, v, u, T) for varying u [0, 1]. Φ(x, v, u, T) := (Lj(x, v), u σj , F) if max{σi | i < j} u < min{1, σj} (x, v, u, T) if max{σj | j J} u Φ(x, v, u, F) := (L j(x, v), u σ j , T) if max{σ i | i < j} u < min{1, σ j} (x, v, u, F) if max{σ j | j J} u σj := ζ(Lj(x, v)) ζ(x, v) , σ j := ζ(L j(x, v)) ζ(x, v) , L j := (L 1)j and ζ is the state density in HMC. Note that in the involution, u determines how many sets (j) of leapfrog steps are to be performed. Say the direction is T. If the values of min{1, σj} for j = 1, . . . , J are marked on the unit interval [0, 1], then the probability that Lj is performed can be represented by the distance between min{1, σj} and the highest of σi for i < j, if it is non-negative. App. E.5.1 gives an example of the result of Φ(x, v, u, T) for varying u [0, 1]. The Look Ahead HMC sampler can be formulated as a Persistent i MCMC sampler with the auxiliary kernel K((x, v), ) := U[0, 1) and above involution Φ. Note that the sampler always accept the proposal since for u [max{σi | i < j}, min{1, σj}) with j {1, . . . , J}, the acceptance ratio is min{1, ζ(Lj(x, v)) ζ(x, v) |det Φ(x, v, u, T)|} = min{1, σj |(det Lj(x, v)) 1 and for u [max{σj | j J}, 1], the acceptance ratio is also 1. A similar argument can be made when the direction is F. E.5.2. NP LOOK AHEAD HMC Extra Leapfrog Similar to the NP-HMC with Persistence, we consider the Persistent Multiple Step i MCMC algorithm (App. D.3.7) that applies a random number of leapfrog function (or its inverse) to the current state with the target density w (x, v) := w(x). Given a k0-dimensional parameter (x0, v0) Rn Rn and direction d0 2, a random variable u [0, 1) and a dummy auxiliary variable u0 Rn are sampled from the uniform distribution U(0, 1) and K(n)((x0, v0), ) := Nn respectively to form an initial state ((x0, v0), (u, u0)). If the direction d0 is T and u [max{σi | i < j}, min{1, σj}) for some j > 0 where σj := max{w (t, u) | t instance(Lj(x, v))} ϕ2n(Lj(x, v)) max{w (t, u) | t instance((x, v))} ϕ2n(x, v) , leapfrog steps (id Rn Rn ( 1 σj ) id Rn) (Lj id[0,1) Rn) are performed on ((x0, v0), (u, u0)), one update at a time, extending the dimension as required with a flipped direction F. Otherwise, u max{σj | j J} and no leapfrog steps is performed; ((x0, v0), (u, u0)) is returned with the direction T remains unchanged. The treatment when d0 is F is similar. 
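To make the role of u concrete, here is a minimal Python sketch (illustrative; the σ values are hard-coded stand-ins for the density ratios ζ(L^j(x, v))/ζ(x, v)) of the index selection performed by the involution Φ above: u picks the smallest j whose window [max{σ_i | i < j}, min{1, σ_j}) contains it, and falls back to the identity (j = 0) when u is at least every σ_j.

# Minimal sketch of the "extra chance" selection in the involution Phi above:
# u in [0, 1) picks the smallest j with max{sigma_i : i < j} <= u < min(1, sigma_j);
# if no window contains u, no leapfrog block is applied (j = 0). Sigmas are illustrative.

def select_extra_chance(u, sigmas):
    lower = 0.0                               # max over the empty set is 0
    for j, sigma in enumerate(sigmas, start=1):
        if lower <= u < min(1.0, sigma):
            return j                          # apply L^j and rescale u to u / sigma
        lower = max(lower, sigma)
    return 0                                  # no move; u and the state are unchanged

if __name__ == "__main__":
    sigmas = [0.3, 0.7, 0.5]                  # sigma_1, ..., sigma_J (made up)
    for u in (0.1, 0.45, 0.65, 0.9):
        print(u, "->", select_extra_chance(u, sigmas))
    # 0.1 -> 1, 0.45 -> 2, 0.65 -> 2, 0.9 -> 0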
Say the initial state with direction d0 is extended to an n-dimensional state ((x′0, v′0), (u, u′0)) and traversed to the n-dimensional new state ((x′, v′), (u′, u′0)) with direction d. The instance (x, v) ∈ Supp(w′) of the n-dimensional parameter (x′, v′) is returned alongside the flipped direction not d with probability
min{1; [w′(x, v) ϕ_{2n}(x′, v′) pdf_{U(0,1)}(u′) ϕ_n(u′0)] / [w′(x0, v0) ϕ_{2n}(x′0, v′0) pdf_{U(0,1)}(u) ϕ_n(u′0)] · |1/σ_j| · |det ∇L^j(x0, v0)|} = 1
if j > 0. Otherwise (j = 0), the acceptance ratio is also 1. Note that the auxiliary variable u0 has no effect on the sampler. Hence, Listing 22 gives an SPCF implementation Extra Leapfrog in which the sampling of the auxiliary variable u0 is skipped.

NP Look Ahead HMC Combining Extra Leapfrog with Corrupt Mom , the NPLook Ahead HMC function in Listing 22 implements the NP Look Ahead HMC sampler.
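As a final illustration of the Corrupt Mom component used in this composition, the following minimal numpy check (our own sketch, not the paper's code base) confirms that the partial momentum refresh v ↦ v√(1−α²) + αξ with ξ ~ N(0, I) leaves the standard Gaussian momentum distribution invariant, which is consistent with the observation in App. E.4.2 that the corruption step always accepts.

# Minimal numpy check (illustrative, not from the paper's code base): the partial
# momentum refresh v' = v * sqrt(1 - alpha^2) + alpha * xi with xi ~ N(0, I)
# preserves N(0, I), since Var(v') = (1 - alpha^2) + alpha^2 = 1 and the mean stays 0.
import numpy as np

alpha = 0.4
rng = np.random.default_rng(1)
v = rng.normal(size=100_000)                       # v ~ N(0, 1)
v_new = v * np.sqrt(1 - alpha**2) + alpha * rng.normal(size=v.shape)

print("mean", round(v_new.mean(), 3))              # close to 0
print("var ", round(v_new.var(), 3))               # close to 1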