# differentiable_algorithm_for_marginalising_changepoints__5c9e6a62.pdf

The Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-20)

Differentiable Algorithm for Marginalising Changepoints

Hyoungjin Lim, Gwonsoo Che, Wonyeol Lee, Hongseok Yang School of Computing KAIST, South Korea {lmkmkr, gche, wonyeol, hongseok.yang}@kaist.ac.kr

We present an algorithm for marginalising changepoints in time-series models that assume a ﬁxed number of unknown changepoints. Our algorithm is differentiable with respect to its inputs, which are the values of latent random variables other than changepoints. Also, it runs in time O(mn) where n is the number of time steps and m the number of changepoints, an improvement over a naive marginalisation method with O(nm) time complexity. We derive the algorithm by identifying quantities related to this marginalisation problem, showing that these quantities satisfy recursive relationships, and transforming the relationships to an algorithm via dynamic programming. Since our algorithm is differentiable, it can be applied to convert a model non-differentiable due to changepoints to a differentiable one, so that the resulting models can be analysed using gradient-based inference or learning techniques. We empirically show the effectiveness of our algorithm in this application by tackling the posterior inference problem on synthetic and real-world data.

1 Introduction Time-series data from, for instance, econometrics, medical science, and political science (Erdman and Emerson 2008; Lio and Vannucci 2000; Spokoiny and others 2009; Haynes, Eckley, and Fearnhead 2017; Reeves et al. 2007; Lung-Yut-Fong, L evy-Leduc, and Capp e 2012) often show abrupt regime shifts, so that analysing those data commonly requires reasoning about the moments of these shifts, called changepoints. Two popular reasoning tasks are inferring the number of changepoints and detecting the speciﬁc values or distributions of the changepoints. Information found from these tasks enables the use of different statistical models for different segments of the data, identiﬁed by changepoints, which leads to accurate analysis of the data. However, due to the discrete nature of changepoints, developing efﬁcient algorithms for the tasks is tricky, and often requires an insight into the structure of a class of models used. In the paper, we study the problem of marginalising changepoints, which has been under-explored compared with the two tasks mentioned above. We present a differentiable algorithm for marginalising changepoints for a class

Copyright c 2020, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

of time-series models that assume a ﬁxed number of changepoints. Our algorithm runs in O(mn) time where m is the number of changepoints and n the number of time steps. We do not know of any O(mn)-time algorithm that directly solves this changepoint-marginalisation problem. The class of models handled by our algorithm is broad, including non Markovian time-series models. Our marginalisation algorithm is differentiable with respect to its inputs, which enables the use of gradient-based algorithms for posterior inference and parameter learning on changepoint models. Since changepoints are discrete, gradient-based algorithms cannot be applied to these models, unless changepoints are marginalised out. In fact, marginalising discrete variables, such as changepoints, is a trick commonly adopted by the users of the Hamiltonian Monte Carlo algorithm or its variant (Stan Development Team 2018). Our algorithm makes the trick a viable option for changepoint models. Its O(mn) time complexity ensures low marginalisation overhead. Its differentiability implies that the gradients of marginalised terms can be computed by off-the-shelf automated differentiation tools (Paszke et al. 2017; Abadi et al. 2016). In the paper, we demonstrate these beneﬁts of our algorithm for posterior inference. The key insight of our algorithm is that the likelihood of latent variables with changepoints marginalised out can be expressed in terms of quantities that satisfy recursive relationships. The algorithm employs dynamic programming to compute these quantities efﬁciently. Its O(mn) time complexity comes from this dynamic programming scheme, and its differentiability comes from the fact that dynamic programming uses only differentiable operations. In our experiments with an inference problem, the algorithm outperforms existing alternatives. The rest of the paper is organised as follows. In 2, we describe our algorithm and its theoretical properties, and in 3, we explain how this algorithm can be used to learn model parameters from given data. In 4, we describe our experiments where we apply the algorithm to a posterior-inference problem. In 5, we put our results in the context of existing work on changepoint models and differentiable algorithms, and conclude the paper.

2 Marginalisation Algorithm Let n, m be positive integers with n m, and R+ be the set of positive real numbers. We consider a probabilistic model for n-step time-series data with m+1 changepoints, which has the following form. Let X Rk and Z Rl.

x1:n X n data points over n time steps. w1:n Rn + wt expresses a relative chance of the step t becoming a changepoint. wn = 1.

z1:m Zm latent parameters deciding the distribution of the data points x1:n.

τ0:m Nm+1 changepoints. 0 = τ0 < τ1 < < τm = n.

P(τ0:m | w1:n) 1

i=1 wτi where W =

P(x1:n, z1:m, τ0:m | w1:n)

P(z1:m)P(τ0:m | w1:n)

(j=τi 1+1) P(xj | x1:j 1, zi).

For simplicity, we assume for now that w1:n is ﬁxed and its normalising constant W is known. In 2.1, we will show how the assumption can be removed safely. Our goal is to ﬁnd an efﬁcient algorithm for computing the likelihood of the data x1:n for the latent z1:m, which involves marginalising the changepoints τ0:m as shown below:

P(x1:n | z1:m, w1:n) =

τ0:m P(x1:n, τ0:m | z1:m, w1:n). (1)

Note that summing the terms in (1) naively is not a viable option because the number of the terms grows exponentially in the number of changepoints (i.e., O(nm 1)). Our algorithm computes the sum in (1) in O(mn) time. Two key ideas behind the algorithm are to rewrite the sum in terms of recursively expressible quantities, and to compute these quantities efﬁciently using dynamic programming. For integers k, t with 1 k < m and m 1 t < n, let

Tk,t {τ0:m | τ0:m changepoints, τm 1 = t, and 1 + τi = τi+1 for all i with k i < m 1},

L0,t 0, Lk,t

τ0:m Tk,t P(x1:n, τ0:m | z1:m, w1:n),

Rm,t 1, b(k, t) t ((m 1) k),

P(xb(j,t)+1 | x1:b(j,t), zj) wb(j,t)+1 P(xb(j,t)+1 | x1:b(j,t), zj+1) wb(j,t) .

The ﬁrst Tk,t consists of changepoints τ0:m such that τm 1 ends at t and (τk, τk+1, . . . , τm 1) are consecutive. The next Lk,t selects the summands in (1) whose changepoints τ0:m are in Tk,t. It then sums the selected terms. The b(k, t) computes the value of the k-th changepoint for 1 k < m when the changepoints τk, τk+1, . . . , τm 1 are consecutive and the value of τm 1 is t. The last Rk,t is the ratio of the probabilities of the segment xb(k,t)+1:b(m 1,t)+1

ܮ ǡ௧ ଵ ǡ௧ ଵ ݐ

Figure 1: Visualisation of a case split used in Theorem 2.

(= xb(k,t)+1:t+1) and the changepoints τk:m 1 under two different assumptions. The numerator assumes that τj = b(j, t) + 1 for all k j < m, whereas the denominator assumes that τj = b(j, t) for all those j. A good heuristic is to view Rk,t as the change in the probability of the segment xb(k,t)+1:t+1 and the changepoints τk:m 1 when those changepoints are shifted one step to the right. The next three results formally state what we have promised: the Lk,t can be used to express the sum in (1), and the Lk,t and the Rk,t satisfy recursive relationships.

Proposition 1.

τ0:m P(x1:n, τ0:m | z1:m, w1:n) in (1) is equal to n 1 t=m 1 Lm 1,t.

Proof. Because {Tm 1,t | m 1 t < n} is a partition of the set of all changepoints τ0:m.

Thus, we can marginalise changepoints by computing n 1 t=m 1 Lm 1,t. This begs the question of how to compute the Lm 1,t s. The next two results give an answer.

Theorem 2. For all k, t with 1 k < m and m t < n,

Lk,m 1 = P(x1:n | z1:m, τ0:m=(0, 1, . . . , m 1, n)) P(τ0:m=(0, 1, . . . , m 1, n) | w1:n) , Lk,t = Lk 1,t + Lk,t 1 Rk,t 1 .

Figure 1 visualises a case split used in the second equation. Lk,t is the quantity about changepoints with τk:m 1 = (b(k, t), . . . , t). The ﬁgure shows that such changepoints can be partitioned into those with τk 1 = b(k 1, t) and the rest. The ﬁrst summand Lk 1,t computes the contribution of the changepoints in the ﬁrst partition, and the other summand of the equation that of the changepoints in the second partition.

Proof. By deﬁnition, Tk,m 1 is the singleton set {τ0:m | τ0:m = (0, 1, . . . , m 1, n)}. The ﬁrst equality in the theorem follows from this and the deﬁnition of Lk,m 1. For the second equality, consider k, t that satisfy the condition in the theorem. We will prove the following two equations:

Lk,t 1 Rk,t 1 =

τ0:m Tk,t τk 1 =b(k 1,t)

P(x1:n, τ0:m | z1:m, w1:n) , (2)

τ0:m Tk,t τk 1=b(k 1,t)

P(x1:n, τ0:m | z1:m, w1:n) . (3)

The desired conclusion follows from these two equations:

Lk 1,t + Lk,t 1 Rk,t 1

τ0:m Tk,t P(x1:n, τ0:m | z1:m, w1:n) = Lk,t .

Equation (3) holds since Tk 1,t = {τ0:m Tk,t | τk 1 = b(k 1, t)}. Equation (2) is proved as follows. Let

Tk,t {τ0:m Tk,t|τk 1 = b(k 1, t)},

τ 0:m (τ0:k 1, τk 1, ..., τm 1 1, n) for τ0:m Tk,t.

Then, {τ 0:m|τ0:m Tk,t} = Tk,t 1. For every τ0:m Tk,t,

P(x1:n, τ0:m|z1:m, w1:n) P(x1:n, τ 0:m|z1:m, w1:n)

= P(τ0:m|w1:n)

P(τ 0:m|w1:n) P(x1:n|τ0:m, z1:m)

P(x1:n|τ 0:m, z1:m)

P(xτi|x1:τi 1, zi) P(xτ i+1|x1:τ i, zi+1)

wb(i,t) wb(i,t) 1

P(xb(i,t)|x1:b(i,t) 1, zi) P(xb(i,t)|x1:b(i,t) 1, zi+1)

P(x1:n, τ0:m | z1:m, w1:n)

P(x1:n, τ 0:m | z1:m, w1:n) Rk,t 1

τ0:m Tk,t 1 P(x1:n, τ0:m | z1:m, w1:n) Rk,t 1

= Lk,t 1 Rk,t 1.

Proposition 3. For all k, t with 1 k < m and m 1 t < n,

Rk,t = Rk+1,t P(xb(k,t)+1 | x1:b(k,t), zk) wb(k,t)+1

P(xb(k,t)+1 | x1:b(k,t), zk+1) wb(k,t) .

Proof. Immediate from the deﬁnition of Rk,t.

When combined with the idea of dynamic programming, the recursive formulas in the propositions and the theorem give rise to an O(mn) algorithm for marginalising changepoints. We spell out this algorithm in Algorithm 1. Theorem 4. Algorithm 1 computes P(x1:n | z1:m, w1:n), where m is the number of changepoints and n is the number of steps in given data. Moreover, the algorithm runs in O(mn) time, if computing P(xj | x1:j 1, zi) for all 1 i m and 2 j n takes O(mn) time (i.e., P(xj | x1:j 1, zi) can be computed in O(1) amortised time).

Algorithm 1 Algorithm for marginalising changepoints.

Input: (i) integer m; (ii) weights w1:n with wn = 1; (iii) normalising constant W for P(τ0:m|w1:n); (iv) latent variables z1:m; (v) time-series data x1:n with 1 m n Output: likelihood P(x1:n | z1:m, w1:n) where changepoints τ0:m are marginalised 1: for t m 1 to n 1 do 2: L0,t 0; Rm,t 1

3: L1,m 1 P(x1:n|z1:m, τ0:m=(0, 1, . . . , m 1, n)) P(τ0:m=(0, 1, . . . , m 1, n)|w1:n) 4: for k 2 to m 1 do 5: Lk,m 1 L1,m 1 6: Pi,j P(xj|x1:j 1, zi) for 1 i m and 2 j n 7: for t m 1 to n 1 do 8: for k m 1 downto 1 do 9: Rk,t Rk+1,t Pb(k,t)+1,k wb(k,t)+1 Pb(k,t)+1,k+1 wb(k,t) 10: for t m to n 1 do 11: for k 1 to m 1 do 12: Lk,t Lk 1,t + Lk,t 1 Rk,t 1 13: L 0 14: for t m 1 to n 1 do 15: L L + Lm 1,t return L

Proof. The correctness follows from Propositions 1 and 3 and Theorem 2. We analyse the run time as follows. The line 3 computes the RHS in O(n). The line 6 runs in O(mn) by the assumption. In the rest of the algorithm, nested loops and other loops iterate O(mn) times, and each line inside the loops runs in O(1). So, the algorithm runs in O(mn).

Theorem 5. When P(xj | x1:j 1, zi) is differentiable with respect to x1:j and zi, the result of Algorithm 1 is also differentiable with respect to x1:n and z1:m, and can be computed by applying automated differentiation to the algorithm.

Proof. When P(xj | x1:j 1, zi) is differentiable with respect to x1:j and zi, the likelihood P(x1:n | z1:m, w1:n) is differentiable with respect to x1:n and z1:m. So, the correctness of Algorithm 1 in Theorem 4 implies the claimed differentiability. The other claim about the use of automated differentiation holds because Algorithm 1 does not use any non-differentiable operations such as if statements.

2.1 Computation of normalising constant W So far we have assumed that weights w1:n are ﬁxed and the normalising constant W for P(τ0:m | w1:n) is known. We now discharge the assumption. We present an algorithm for computing W for given w1:n. The algorithm uses dynamic programming, runs in O(mn) time, and is differentiable: the gradient of W with respect to w1:n can be computed by applying automated differentiation to the algorithm. For all k and t with 1 k < m and 0 t < n, let Sk,t

τ0:k, τk t k i=1 wτi and S0,t 1. Note that W = Sm 1,n 1. So, it sufﬁces to design an algorithm for computing Sk,t. The next proposition describes how to do it.

Proposition 6. For all k, t with 1 k < m and k t < n, we have Sk,t = Sk,t 1 + Sk 1,t 1 wt and Sk,k 1 = 0. The recurrence relation for Sk,t in Proposition 6 yields a dynamic-programming algorithm for computing W that runs in O(mn) time. The standard implementation of the algorithm does not use any non-differentiable operations. So, its gradient can be computed by automated differentiation. With this result at hand, we remove the assumption that weights w1:n are ﬁxed and the normalising constant W for P(τ0:m | w1:n) is known a priori. Algorithm 1 no longer receives W as its input. It instead uses the algorithm described above and computes W from given w1:n before starting line 1. Since the computation of W takes O(mn) time and can be differentiated by automated differentiation, all the aforementioned results on Algorithm 1 (Theorems 4 and 5) still hold, and can be extended to cover the differentiability of Algorithm 1 with respect to w1:n.

3 Learning Model Parameters Our algorithm can extend the scope of gradient-based methods for posterior inference and model learning such that they apply to changepoint models despite their nondifferentiability. In this section, we explain the modellearning application. We consider state-space models with changepoints that use neural networks. The goal is to learn appropriate neural-network parameters from given data. We consider a special case of the model described in 2 that satisﬁes the following conditions. 1. The latent parameter zi at i {1, . . . , m} has the ﬁxed value ei in {0, 1}m that has 1 at the i-th position and 0 everywhere else. Formally, this means that the prior P(z1:m) is the Dirac distribution at (e1, e2, . . . , em). 2. The random variable xj at j {1, . . . , n} consists of two parts, x S j XS for the latent state and x O j XO for the observed value. Thus, xj = (x S j , x O j ) and X = XS XO.

3. The probability distribution Pφ(xj | x1:j 1, zi) is parameterised by φ Rp for some p, and has the form

Pφ(xj | x1:j 1, zi) = Pφ(x O j | x S j , zi)Pφ(x S j | x S 1:j 1, zi).

Typically, Pφ is deﬁned using a neural network, and φ denotes the weights of the network. When the model satisﬁes these conditions, we have

Pφ(x1:n | w1:n) =

z1:m Pφ(x1:n, z1:m | w1:n)

z1:m P(z1:m)Pφ(x1:n | z1:m, w1:n)

= Pφ(x1:n | (z1:m = e1:m), w1:n). (4)

By the learning of model parameters, we mean the problem of ﬁnding φ for given observations x O 1:n that makes the log probability of the observations log Pφ(x O 1:n | w1:n) large. A popular approach (Kingma and Welling 2014) is to maximise a lower bound of this log probability, called ELBO, approximately using a version of gradient ascent:

ELBOφ,θ EQθ(x S 1:n|x O 1:n)

log Pφ(x S 1:n, x O 1:n|w1:n) Qθ(x S 1:n|x O 1:n)

where Qθ is an approximating distribution for the posterior Pφ(x S 1:n | x O 1:n, w1:n) and θ Rq denotes the parameters of this distribution, typically the weights of a neural network used to implement Qθ. Our marginalisation algorithm makes it possible to optimise ELBOθ,φ in (5) by an efﬁcient stochastic gradientascent method based on the so called reparameterisation trick (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014; Kucukelbir et al. 2017). Here we use our algorithm with ﬁxed m, w1:n after setting z1:m to e1:m.1 So, only the x1:n = (x S 1:n, x O 1:n) part of its input may vary. To emphasise this, we write ALG(φ, x S 1:n, x O 1:n) for the result of the algorithm. Also, we make the usual assumption of the reparameterisation trick: there are a θ-independent distribution Q(ϵ) and a differentiable function Tθ(ϵ, x O 1:n) such that Tθ(ϵ, x O 1:n) for ϵ Q(ϵ) is distributed according to Qθ(x S 1:n | x O 1:n). The next theorem shows that the gradient of ELBO can be estimated by computing the gradient through the execution of our algorithm via automated differentiation.

Theorem 7. If Pφ(xj | x1:j 1, zi) is differentiable with respect to x1:j and φ, so is ALG(φ, x S 1:n, x O 1:n). In that case, the gradient can be computed by automated differentiation.

Proof. When Pφ(xj | x1:j 1, zi) is differentiable with respect to x1:j and φ, the likelihood Pφ(x1:n | z1:m, w1:n) is differentiable with respect to x1:n and φ. Thus, the correctness of Algorithm 1 in Theorem 4 implies the claimed differentiability. The other claim about the use of automated differentiation comes from the fact that Algorithm 1 does not use any non-differentiable operations.

Theorem 8. When Pφ(xj | x1:j 1, zi) is differentiable with respect to x1:j and φ for all j and i,

MRep φ,θ log ALG(φ, Tθ(ϵ, x O 1:n), x O 1:n) Qθ(Tθ(ϵ, x O 1:n) | x O 1:n) for ϵ Q(ϵ)

is an unbiased estimate for φ,θELBOφ,θ, and can be computed via automated differentiation.

Proof. Pφ(x S 1:n, x O 1:n|w1:n) = Pφ(x1:n|(z1:m=e1:m), w1:n) by (4). The RHS of the equation equals ALG(φ, x S 1:n, x O 1:n) by Theorem 4 and the deﬁnition of ALG. So, ELBOφ,θ =

EQθ(x S 1:n|x O 1:n) log ALG(φ,x S 1:n,x O 1:n) Qθ(x S 1:n | x O 1:n)

. The usual unbiasedness argument of the reparameterisation trick and the differentiability of ALG(φ, x S 1:n, x O 1:n) with respect to x1:n and φ (Theorem 7) give the claimed conclusion.

4 Experimental Evaluation

As mentioned, another important application of our algorithm is posterior inference. In this section, we report the ﬁndings from our experiments with this application, which show the beneﬁts of having an efﬁcient differentiable marginalisation algorithm for posterior inference.

1W is computed by the extension of our algorithm in 2.1. In fact, using the same extension, we can even treat w1:n as a part of φ, and learn appropriate values for w1:n.

1XPEHU RI FKDQJHSRLQWV

+0& QDLYH +0& RXUV

(a) Time per sample vs. m when n = 50.

1XPEHU RI WLPH VWHSV

+0& QDLYH +0& RXUV

(b) Time per sample vs. n when m = 2.

Figure 2: Computation time of HMCnaive and HMCours for different m and n. The xand y-axes are in log-scale.

Hamiltonian Monte Carlo (HMC) (Duane et al. 1987; Neal 2011) is one of the most effective algorithms for sampling from posterior distributions, especially on high dimensional spaces. However, it cannot be applied to models with changepoints directly. This is because HMC requires that a model have a differentiable density, but changepoint models do not meet this requirement due to discrete changepoints. One way of addressing this non-differentiability issue is to use our algorithm and marginalise changepoints. Since our algorithm is differentiable, the resulting models have differentiable densities, and we can analyse their posteriors using HMC. We tested this approach experimentally, aiming at answering the following questions:

RQ1 (Speed): How fast is our marginalisation algorithm when used for HMC?

RQ2 (Sample quality): Is HMC with our marginalisation algorithm better at generating good posterior samples than other Markov Chain Monte Carlo (MCMC) algorithms that do not use gradients nor marginalisation?

We evaluated different inference algorithms on synthetic and real-world data for X = R. The synthetic data were generated as follows: we ﬁxed parameters (n, m , μ 1:m , σ 1:m , τ 0:m ), and then sampled each data point xj (1 j n) in the i-th segment (i.e., τ i 1 < j τ i ) independently from a Gaussian distribution with mean μ i and standard deviation σ i . The changepoint model for analysing the synthetic data is: m = m , Z = R R+, w1:n = (1, . . . , 1), P(zi = (μi, σi)) = Normal(μi|5, 10) Log Normal(σi|0, 2), and P(xj|x1:j 1, zi = (μi, σi)) = Normal(xj|μi, σi). For the real-world application, we used well-log data (Fearnhead 2006), whose data points represent some physical quantity measured by a probe diving in a wellbore. We took a part of the well-log data by removing outliers and choosing 1000 consecutive data points (Figure 3b). The changepoint model for the well-log data is the same as the above except: m = 13 and P(zi = (μi, σi)) = Normal(μi|120000, 20000) Log Normal(σi|8.5, 0.5). Our goal is to infer the posterior distribution of latent parameters z1:m and changepoints τ0:m. For this task, we compared four posterior-inference algorithms: HMCnaive, HMCours, IPMCMC, and LMH. HMCnaive (resp. HMCours) generates samples as follows: it forms a probabilistic model P(x1:n, z1:m | w1:n) where τ0:m are marginalised out by a naive marginalisation scheme (resp. by our marginali-

(a) Synthetic data.

(b) Part of well-log data.

Figure 3: Synthetic and real-world data for RQ2. The x-axis represents time steps.

sation algorithm); samples z1:m from P(z1:m | w1:n, x1,n) by running HMC on P(x1:n, z1:m | w1:n); ﬁnally samples τ0:m from P(τ0:m | w1:n, x1:n, z1:m) using dynamic programming. IPMCMC and LMH jointly sample z1:m and τ0:m from P(z1:m, τ0:m | w1:n, x1:n) by running the variants of the Metropolis-Hastings algorithm called interacting particle MCMC (IPMCMC) (Rainforth et al. 2016) and lightweight Metropolis-Hastings (LMH) (Wingate, Stuhlm uller, and Goodman 2011), respectively. IPMCMC and LMH are applicable to models with discrete or nondifferentiable random variables. They neither exploit gradients nor marginalise out any random variables. For HMCnaive and HMCours, we used the No-U-Turn Sampler (NUTS) (Hoffman and Gelman 2014) in Py Stan (Carpenter et al. 2017) with default hyper-parameters, except for adapt delta = 0.95. For IPMCMC and LMH, we used the implementations in Anglican (Wood, van de Meent, and Mansinghka 2014; Tolpin et al. 2016) with default hyper-parameters, except for the following IPMCMC setup: number-of-nodes = 8 for both the synthetic and well-log data, and pool = 8 for the well-log data. For RQ1, we compared the time taken to generate a single posterior sample by HMCnaive and HMCours. For RQ2, we compared the quality of posterior samples from HMCours, IPMCMC, and LMH, by means of the following quantities: estimates of the ﬁrst and second moments, the Gelman Rubin convergence statistic ( R) (Gelman, Rubin, and others 1992; Brooks and Gelman 1998), and effective sample size (ESS). The experiments were performed on a Ubuntu 16.04 machine with Intel i7-7700 CPU with 16GB of memory.

Results for RQ1. We measured the average time per sample taken by HMCnaive and HMCours for different numbers of changepoints and time steps: for ﬁxed n = 50, we varied m = m from 1 to 5, and for ﬁxed m = m = 2, we varied n {200, 400, 600, 800}. The details of the parameter values we used appear in the extended version of the paper on ar Xiv. We ran ﬁve independent runs of the NUTS algorithm, and averaged the time spent without burn-in samples. Figures 2a and 2b show how the time depends on m and n, respectively, in the two approaches. In the log-log plots, log(time) of HMCours is linear in both log m and log n, due

0 6000 12000 18000 24000 30000 Number of samples (HMCours)

0 50000 100000 150000 200000 250000 Number of samples (IPMCMC)

0 40000 80000 120000 160000 200000 Number of samples (LMH)

Figure 4: Convergence plots for estimating the sum of the ﬁrst moments for synthetic data, with HMCours (red), IPMCMC (blue), and LMH (green). Each x-axis represents the number of samples generated by the corresponding procedure, and the y-axis denotes estimated values. Each darker line shows the mean value at each point, and the corresponding error band around it shows the standard deviation.

to its time complexity O(mn). On the other hand, log(time) of HMCnaive is exponential in log m, and linear in log n yet with a slope nearly two times larger than that for HMCours, because of its time complexity O(nm). Overall, the results show that HMCnaive quickly becomes infeasible as the number of changepoints or time steps grows, but HMCours avoids such an upsurge by virtue of having the linear relationship between the two varying factors and the time per sample.

Results for RQ2. Figure 3 shows the synthetic and realworld data used in answering RQ2. The synthetic data was generated with parameters n = 300, m = 6, μ 1:m = (10, 2, 10, 2, 10, 2), σ 1:m = (1.8, 1.1, 1.7, 1.5, 1.2, 1.3), and τ 0:m = (0, 50, 100, 150, 200, 250, 300). For each chain of HMCours, we generated 30K samples with random initialisation (when possible) after burning in 1K samples. We computed R and ESS for each latent parameter and changepoint using three chains, and repeated this ﬁve times as the R and ESS results varied across different runs. We also estimated the sum of the ﬁrst moments of (z1:m, τ1:m 1) and that of the second moments of them using the same 15 chains.2 The same setup was applied to IPMCMC and LMH except the following: since they sample faster than HMCours, we let IPMCMC and LMH generate 270K (resp. 1855K) and 200K (resp. 1750K) samples, respectively, for synthetic data (resp. well-log data) so that every run of them spends more time than the corresponding slowest HMCours run. We ﬁrst discuss results on synthetic data. Figure 4 shows the estimates for the sum of the ﬁrst moments by HMCours, IPMCMC, and LMH. HMCours shows a gradual trend towards convergence, while IPMCMC and LMH exhibit substantial variation across runs without convergence. We ob-

2Concretely, we estimated EP (z1:m,τ1:m | x1:n) m i=1 μi + σi+τi n and EP (z1:m,τ1:m | x1:n) m i=1 μ2 i +σ2 i +τ 2 i n2 .

Table 1: The ranges of the time (sec) taken by the three approaches and the ranges of the estimates computed by them, for synthetic data. For the estimated sum of the ﬁrst/second moments (i.e., the third/fourth row), we computed the values at 510.8 sec (the minimum time taken among all the runs) from each Markov chain, assuming that generating each sample (in a chain) took an equal amount of time.

HMCours IPMCMC LMH Time [510.8, 1043.9] [1053.2, 1170.6] [1092.2, 1105.2] 1st [746.9, 794.2] [651.3, 993.8] [415.4, 1164.0]

2nd [1.28E+05, 1.39E+05] [1.12E+05, 1.89E+05] [3.56E+04, 2.62E+05]

tained similar results for the sum of the second moments (see the extended paper on ar Xiv). Table 1 shows the ranges of the time taken by the Markov chains in Figure 6, and the ranges of the estimates from the chains.

Figure 5a shows the R values from HMCours, IPMCMC, and LMH. Three out of ﬁve experiments with HMCours were satisfactory in the sense that the R statistics for all the latent (z1:m, τ1:m 1) were between 0.9 and 1.1. Though the R statistics for some of the latent were over 1.1 in the other two experiments, most of the R values were less than or close to 1.1. On the other hand, none of the IPMCMC and LMH experiments placed R values for all the latent, within the interval. Also the values were farther from the interval.

Figure 5b shows ln(ESS) from HMCours, IPMCMC, and LMH in a similar manner. HMCours produced signiﬁcantly higher ESS values than LMH, demonstrating that HMCours draws samples more effectively than LMH within a ﬁxed amount of time. However, HMCours was not superior in ESS to IPMCMC despite the excellence in R. We conjecture that this is due to IPMCMC running eight parallel nodes independently, each with two particles to propose samples.

For well-log data, HMCours similarly outperformed the other two in terms of convergence, R, and ESS; R for all the latent (z1:m, τ1:m 1) were between 0.9 and 1.1 in all ﬁve experiments. One exception is that IPMCMC showed much higher ESS than HMCours, although it failed to converge (Figure 6). We think that this is again due to IPMCMC s tendency of generating independent samples. The full results are in the extended paper on ar Xiv.

We remark that HMCours performed poorly on well-log data with the same model but smaller m (e.g., 6 instead of 13). According to our manual inspection, this was because HMCours in this case got stuck at some modes, failing to generate samples from other modes. We think that increasing m (up to some point) lessens the barriers between modes in the marginal space; for large enough m, only a small amount of P(z1:m | x1:n, w1:n) should be reduced to change some of z1:m. One practical lesson is that having enough changepoints may help analyse real-world data using Bayesian inference and changepoint models.

μ1 μ2 μ3 μ4 μ5 μ6 σ1 σ2 σ3 σ4 σ5 σ6 τ1 τ2 τ3 τ4 τ5 1

HMCours IPMCMC LMH

μ1 μ2 μ3 μ4 μ5 μ6 σ1 σ2 σ3 σ4 σ5 σ6 τ1 τ2 τ3 τ4 τ5

HMCours IPMCMC LMH

(b) ln(ESS).

Figure 5: R and ESS from HMCours (red), IPMCMC (blue), and LMH (green) for synthetic data. The x-axis denotes the latent parameters and changepoints, and the y-axis R or ln(ESS) values.

0 6000 12000 18000 24000 30000 Number of samples (HMCours)

0 400000 800000 1200000 1600000 Number of samples (IPMCMC)

0 400000 800000 1200000 1600000 Number of samples (LMH)

Figure 6: Convergence plots for estimating the sum of the ﬁrst moments for well-log data. See the caption of Figure 4 for details.

5 Related Work and Conclusion

Related work. Modelling and reasoning about time-series data with changepoints is a well-established topic in statistics and machine learning (Eckley, Fearnhead, and Killick 2011; Truong, Oudre, and Vayatis 2018). We discuss two lines of research most relevant to ours. The ﬁrst is the work by Fearnhead and his colleagues (Fearnhead 2006; 2005; Fearnhead and Liu 2007), which is further extended to multi-dimensional time-series data (Xuan 2007). Fearnhead (2006) proposed an O(n2)-time algorithm for generating changepoint positions from the posterior of a given changepoint model in a particular form, where n is the number of time steps. Their algorithm also uses a form of dynamic programming on certain recursive formulas, but it does not target at marginalisation. Its conversion for marginalisation is possible, but inherits this O(n2) time complexity. The other work is Chib (1995) s technique for estimating the model evidence of changepoint models (Chib 1998), whose prop-

erties, such as sensitivity on chosen parameters, is analysed by Bauwens and Rombouts (2012). The technique is based on Gibbs sampling, and it is unclear whether the technique leads to a differentiable algorithmic component that can be used in the context of gradient-based algorithms. The observation that the summation version of dynamic programming is differentiable is a folklore. For instance, Eisner (2016) points out the differentiability of the inside algorithm, which is a classic dynamic-programming-based algorithm in natural language processing (NLP). He then explains how to derive several well-known NLP algorithms by differentiating the inside algorithm or its variants. However, we do not know of existing work that uses such dynamic programming algorithms for the type of application we consider in the paper: converting non-differentiable models to differentiable models via marginalisation in the context of posterior inference and model learning. The optimisation version of dynamic programming is not differentiable, and its differentiable relaxation has been studied recently (Corro and Titov 2019; Mensch and Blondel 2018).

Conclusion. We presented a differentiable O(mn)-time algorithm for marginalising changepoints in time-series models, where m is the number of changepoints and n the number of time steps. The algorithm can be used to convert a class of non-differentiable time-series models to differentiable ones, so that the resulting models can be analysed by gradient-based techniques. We described two applications of this conversion, posterior inference with HMC and modelparameter learning with reparameterisation gradient estimator, and experimentally showed the beneﬁts of using the algorithm in the former posterior-inference application.

Acknowledgements. The authors were supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), and also by Next-Generation Information Computing Develop-

ment Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (2017M3C4A7068177).

References Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P. A.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X. 2016. Tensorﬂow: A system for large-scale machine learning. In OSDI, 265 283. Bauwens, L., and Rombouts, J. V. 2012. On marginal likelihood computation in change-point models. Computational Statistics & Data Analysis 56(11):3415 3429. Brooks, S. P., and Gelman, A. 1998. General methods for monitoring convergence of iterative simulations. J. Comput. Graph. Stat. 7(4):434 455. Carpenter, B.; Gelman, A.; Hoffman, M. D.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.; Guo, J.; Li, P.; and Riddell, A. 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76(1). Chib, S. 1995. Marginal likelihood from the gibbs output. J. Am. Stat. Assoc. 90(432):1313 1321. Chib, S. 1998. Estimation and comparison of multiple change-point models. J. Econometrics. 86(2):221 241. Corro, C., and Titov, I. 2019. Differentiable perturb-andparse: Semi-supervised parsing with a structured variational autoencoder. In ICLR. Duane, S.; Kennedy, A. D.; Pendleton, B. J.; and Roweth, D. 1987. Hybrid Monte Carlo. Physics Letters B 195:216 222. Eckley, I. A.; Fearnhead, P.; and Killick, R. 2011. Analysis of changepoint models. In Barber, D.; Cemgil, A. T.; and Chiappa, S., eds., Bayesian Time Series Models. Cambridge University Press. chapter 10, 205 224. Eisner, J. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Workshop on Structured Prediction for NLP@EMNLP, 1 17. Erdman, C., and Emerson, J. W. 2008. A fast bayesian change point analysis for the segmentation of microarray data. Bioinformatics 24(19):2143 2148. Fearnhead, P., and Liu, Z. 2007. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(4):589 605. Fearnhead, P. 2005. Exact bayesian curve ﬁtting and signal segmentation. IEEE Transactions on Signal Processing 53(6):2160 2166. Fearnhead, P. 2006. Exact and efﬁcient bayesian inference for multiple changepoint problems. Statistics and Computing 16(2):203 213. Gelman, A.; Rubin, D. B.; et al. 1992. Inference from iterative simulation using multiple sequences. Statistical Science 7(4):457 472. Haynes, K.; Eckley, I. A.; and Fearnhead, P. 2017. Computationally efﬁcient changepoint detection for a range of penalties. J. Comput. Graph. Stat. 26(1):134 143.

Hoffman, M. D., and Gelman, A. 2014. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. JMLR 15(1):1593 1623. Kingma, D. P., and Welling, M. 2014. Auto-Encoding Variational Bayes. In ICLR. Kucukelbir, A.; Tran, D.; Ranganath, R.; Gelman, A.; and Blei, D. M. 2017. Automatic differentiation variational inference. JMLR 18(1):430 474. Lio, P., and Vannucci, M. 2000. Wavelet changepoint prediction of transmembrane proteins. Bioinformatics 16(4):376 382. Lung-Yut-Fong, A.; L evy-Leduc, C.; and Capp e, O. 2012. Distributed detection/localization of change-points in highdimensional network trafﬁc data. Statistics and Computing 22(2):485 496. Mensch, A., and Blondel, M. 2018. Differentiable dynamic programming for structured prediction and attention. In ICML, 3459 3468. Neal, R. M. 2011. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2(11):2. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; De Vito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in Py Torch. In NIPS Autodiff Workshop. Rainforth, T.; Naesseth, C.; Lindsten, F.; Paige, B.; Vandemeent, J.-W.; Doucet, A.; and Wood, F. 2016. Interacting particle markov chain monte carlo. In ICML, 2616 2625. Reeves, J.; Chen, J.; Wang, X. L.; Lund, R.; and Lu, Q. Q. 2007. A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology 46(6):900 915. Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, 1278 1286. Spokoiny, V., et al. 2009. Multiscale local change point detection with applications to value-at-risk. The Annals of Statistics 37(3):1405 1436. Stan Development Team. 2018. Stan Modeling Language Users Guide and Reference Manual. Version 2.18.0. Tolpin, D.; van de Meent, J.-W.; Yang, H.; and Wood, F. 2016. Design and implementation of probabilistic programming language anglican. In IFL, 6:1 6:12. Truong, C.; Oudre, L.; and Vayatis, N. 2018. Selective review of ofﬂine change point detection methods. Ar Xiv abs/1801.00718. Wingate, D.; Stuhlm uller, A.; and Goodman, N. 2011. Lightweight implementations of probabilistic programming languages via transformational compilation. In AISTATS. Wood, F.; van de Meent, J. W.; and Mansinghka, V. 2014. A new approach to probabilistic programming inference. In AISTATS, 1024 1032. Xuan, X. 2007. Bayesian inference on change point problems. Master s thesis, The University of British Columbia, Vancouver, Canada.