# Nonstationary Sparse Spectral Permanental Process

Zicheng Sun1, Yixuan Zhang2, Zenan Ling3, Xuhui Fan4, Feng Zhou1,5
1Center for Applied Statistics and School of Statistics, Renmin University of China
2School of Statistics and Data Science, Southeast University
3School of EIC, Huazhong University of Science and Technology
4School of Computing, Macquarie University
5Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
{sunzicheng2020, feng.zhou}@ruc.edu.cn, zh1xuan@hotmail.com

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Existing permanental processes often impose constraints on kernel types or stationarity, limiting the model's expressiveness. To overcome these limitations, we propose a novel approach utilizing the sparse spectral representation of nonstationary kernels. This technique relaxes the constraints on kernel types and stationarity, allowing for more flexible modeling while reducing computational complexity to the linear level. Additionally, we introduce a deep kernel variant by hierarchically stacking multiple spectral feature mappings, further enhancing the model's expressiveness to capture complex patterns in data. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach, particularly in scenarios with pronounced data nonstationarity. Additionally, ablation studies are conducted to provide insights into the impact of various hyperparameters on model performance. Code is publicly available at https://github.com/SZC20/DNSSPP.

1 Introduction

Many application domains involve point process data, which records the times or locations of events occurring within a region. Point process models are used to analyze these event data, aiming to uncover patterns of event occurrences. The Poisson process [15], as an important model in the field of point processes, plays a significant role in neuroscience [4], finance [11], criminology [31], epidemiology [5], and seismology [8]. Traditional Poisson processes assume parameterized intensity functions, which severely restricts the flexibility of the model. To address this issue, a viable solution is to employ Bayesian nonparametric methods, imposing a nonparametric Gaussian process (GP) prior on the intensity function and resulting in the Gaussian Cox process [3]. This greatly enhances the flexibility of the model, while also endowing it with uncertainty quantification capabilities.

In Gaussian Cox processes, posterior inference of the intensity is challenging because the GP prior is not conjugate to the Poisson process likelihood, which includes an intensity integral. Furthermore, to ensure the non-negativity of the intensity, we need to use a link function to transform the GP prior. Commonly used link functions include exponential [20; 12], sigmoid [1; 9; 6], square [18; 7; 29], ReLU [16], and softplus [24]. Among these, using the square link function, i.e., the permanental process [19], allows for the analytical computation of the intensity integral [18; 33; 29], and has thus received widespread attention in recent years. For this reason, this work focuses on the permanental process.

Currently, the permanental process faces three issues: (1) The permanental process inherits the notorious cubic computational complexity of GPs, making it impractical for use with a large amount
of data [18; 14]. (2) Existing works on permanental processes either require certain standard types of kernels, such as the squared exponential kernel, to ensure that the intensity integral has an analytical solution [18; 7; 33], or they require the kernels to be stationary [14; 29]. These constraints limit the expressive power of the model. (3) Furthermore, existing works have predominantly utilized simple shallow kernels, which restricts the flexibility of the model. Although some shallow kernels [34; 28] are flexible and theoretically capable of approximating any bounded kernel, they are limited in representing complex kernels due to computational constraints in practical usage.

In this study, we utilize the sparse spectral representation of nonstationary kernels to address these limitations in modeling the permanental process. The sparse spectral representation provides a low-rank approximation of the kernel, effectively reducing the computational complexity from the cubic to the linear level. The nonstationary sparse spectral kernel overcomes the limitation of the stationary assumption. By treating the frequencies as kernel parameters for optimization, we can directly learn the nonstationary kernel from the data in a flexible manner without restricting the kernel's form. We further extend the shallow nonstationary sparse spectral kernel by hierarchically stacking multiple spectral feature mappings to construct a deep kernel, which exhibits significantly enhanced expressive power compared to shallow ones. We term the constructed model the Nonstationary Sparse Spectral Permanental Process (NSSPP) and the corresponding deep kernel variant DNSSPP. We conduct experiments on both synthetic and real-world datasets. The results indicate that when the data is (approximately) stationary, (D)NSSPP achieves similar performance to stationary baselines. However, when the nonstationarity in the data is pronounced, (D)NSSPP can outperform baseline models, demonstrating its superiority. Additionally, we perform ablation studies to assess the impact of various model hyperparameters.

2 Related Work

In this section, we present related works on reducing the computational complexity of the stationary permanental process, as well as research on nonstationary kernels.

Efficient Permanental Process  Due to the adoption of the GP prior in the permanental process, posterior inference involves computing the inverse of the kernel matrix. This results in a computational complexity of O(N^3), where N is the number of data points. To reduce the computational complexity, several works have introduced low-rank approximation methods from the GP domain into the permanental process, such as inducing points [18], the Nyström approximation [7; 33], and spectral representations [14; 29]. These methods essentially involve low-rank approximations of the kernel matrix, thereby reducing the computational complexity from O(N^3) to O(NR^2), where R ≪ N is the rank, e.g., the number of inducing points or frequencies. In this work, we utilize the spectral representation method, reducing the computational complexity from the cubic to the linear level. Another reason for adopting the spectral representation is that it facilitates the construction of the subsequent deep nonstationary kernel.

Nonstationary Kernel  Stationary kernels assume that the similarity between different locations depends only on their relative distance.
However, previous studies have indicated that the stationary assumption is not suitable for dynamic, complex tasks, as the similarity is not consistent across the input space [23]. To address this issue, some works proposed nonstationary kernels by transforming the input space [30] or using input-dependent parameters [10]. Additionally, other works leveraged the generalized Fourier transform of kernels [37] to propose nonstationary spectral kernels [26; 32]. Although the aforementioned shallow nonstationary kernels are flexible and theoretically capable of approximating any bounded kernel, they are limited in representing complex kernels in practical usage. In recent years, many deep architectures have been introduced into kernels, significantly enhancing their expressive power [35; 36]. Despite the development of (deep) nonstationary kernels for GPs, interestingly, to the best of our knowledge, there has been no prior work applying them to Gaussian Cox processes. This gap is what this work seeks to address.

3 Preliminaries

In this section, we provide some fundamental knowledge about the permanental process and the sparse spectral kernel.

3.1 Permanental Process

For a Poisson process with x ∈ X ⊆ R^D, the intensity function λ(x) = lim_{δx→0} E[N([x, x + δx])]/δx quantifies the rate of events occurring at location x. A Cox process can be viewed as a Poisson process whose intensity function is itself a random process. The Gaussian Cox process employs a GP to model the intensity function: λ(x) = l ∘ f(x), where f is a GP function and l : R → R_+ is a link function that ensures the non-negativity of the intensity function. The nonparametric nature of the GP enables the Gaussian Cox process to be flexible in modeling events in space or time. However, the posterior inference of the Gaussian Cox process is challenging. According to the likelihood function of a Poisson process:

$$p\big(\{x_i\}_{i=1}^N \mid \lambda(x) = l \circ f(x)\big) = \prod_{i=1}^N \lambda(x_i)\,\exp\Big(-\int_{\mathcal{X}} \lambda(x)\,\mathrm{d}x\Big), \tag{1}$$

and the Bayesian framework, the posterior of the latent function f can be written as:

$$p\big(f \mid \{x_i\}_{i=1}^N\big) = \frac{\prod_{i=1}^N \lambda(x_i)\,\exp\big(-\int_{\mathcal{X}} \lambda(x)\,\mathrm{d}x\big)\,\mathcal{GP}(f \mid 0, k)}{\int \prod_{i=1}^N \lambda(x_i)\,\exp\big(-\int_{\mathcal{X}} \lambda(x)\,\mathrm{d}x\big)\,\mathcal{GP}(f \mid 0, k)\,\mathrm{d}f}. \tag{2}$$

In Eq. (2), there are two integrals. The first one, ∫_X λ(x)dx, cannot be solved due to the stochastic nature of f. Additionally, the integral over f in the denominator is a functional integral in the function space, which is also infeasible. This issue is the well-known doubly-intractable problem. The link function l : R → R_+ has many choices, such as exponential, sigmoid, square, ReLU, and softplus. This work focuses on the square link function, i.e., the permanental process. Furthermore, if we set the kernel in the GP prior to be stationary, k(x_1, x_2) = k(x_1 − x_2), then the model is referred to as a stationary permanental process.

3.2 Sparse Spectral Kernel

The sparse spectral representation is a common method for the low-rank approximation of kernels. This approach is based on Bochner's theorem.

Theorem 3.1. [2] A stationary kernel function k(x_1, x_2) = k(x_1 − x_2) : R^D → R is bounded, continuous, and positive definite if and only if it can be represented as

$$k(x_1 - x_2) = \int_{\mathbb{R}^D} \exp\big(i\omega^\top (x_1 - x_2)\big)\,\mathrm{d}\mu(\omega),$$

where µ(ω) is a bounded non-negative measure associated with the spectral density p(ω) = µ(ω)/µ(R^D). Defining ϕ(x) = exp(iω^⊤x), we have k(x_1 − x_2) = σ² E_{p(ω)}[ϕ(x_1)ϕ(x_2)*], where * denotes the complex conjugate and σ² = µ(R^D).
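To make Theorem 3.1 concrete, the short sketch below (an illustration, not the authors' code) numerically checks the spectral representation of a squared exponential kernel by Monte Carlo sampling of frequencies from its Gaussian spectral density, anticipating the random Fourier feature construction formalized in Eqs. (3) and (4) below; the dimension, lengthscale, and number of frequencies are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, ell, sigma2 = 2, 2000, 1.0, 1.0        # dimension, #frequencies, lengthscale, variance

# Spectral density of k(t) = sigma2 * exp(-||t||^2 / (2 ell^2)) is Gaussian with
# covariance (1/ell^2) I, so frequencies are sampled from that Gaussian.
omega = rng.normal(scale=1.0 / ell, size=(R, D))

def phi(x):
    """2R-dimensional trigonometric feature map, cf. Eq. (4)."""
    proj = x @ omega.T                       # (n, R) projections omega_r^T x
    return np.sqrt(sigma2 / R) * np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

x1, x2 = rng.normal(size=(1, D)), rng.normal(size=(1, D))
k_exact = sigma2 * np.exp(-np.sum((x1 - x2) ** 2) / (2 * ell ** 2))
k_rff = float(phi(x1) @ phi(x2).T)           # Monte Carlo estimate of the kernel
print(f"exact {k_exact:.4f}  vs  RFF estimate {k_rff:.4f}")
```

With enough sampled frequencies the inner product of the feature maps concentrates around the exact kernel value, which is exactly the low-rank approximation exploited in the following.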
If we utilize Monte Carlo to sample {ω_r}_{r=1}^R independently from p(ω), then the kernel can be approximated by

$$k(x_1 - x_2) \approx \frac{\sigma^2}{R}\sum_{r=1}^{R} \phi_r(x_1)\,\phi_r(x_2)^{*} = \Phi^{(R)}(x_1)^\top\, \overline{\Phi^{(R)}(x_2)}, \tag{3}$$

where ϕ_r(x) = exp(iω_r^⊤x) and Φ^(R)(x) = (σ/√R)[ϕ_1(x), …, ϕ_R(x)]^⊤, which corresponds to the random Fourier features (RFF) proposed by [25]. It can be proved that Eq. (3) is equivalent to a 2R-sized trigonometric representation (the proof is provided in Appendix A):

$$\Phi^{(R)}(x) = \frac{\sigma}{\sqrt{R}}\big[\cos(\omega_1^\top x), \ldots, \cos(\omega_R^\top x), \sin(\omega_1^\top x), \ldots, \sin(\omega_R^\top x)\big]^\top. \tag{4}$$

When we specify the functional form of the kernel (spectral measure µ), the RFF method can learn the parameters of the kernel (spectral measure µ) from the data. However, when the functional form of the kernel (spectral measure µ) is unknown, RFF cannot flexibly learn the kernel (spectral measure µ) from the data. Another approach is to treat the frequencies {ω_r}_{r=1}^R as kernel parameters. By optimizing these frequencies, we can directly learn the kernel from the data in a flexible manner, which corresponds to the sparse spectrum kernel method proposed by [17].

4 Our Model

The permanental process inherits the cubic computational complexity of GPs, rendering it impractical for large-scale datasets. To efficiently conduct posterior inference, many works employed various low-rank approximation methods [18; 7; 33]. These methods require certain standard types of kernels, such as the squared exponential kernel or the polynomial kernel, to ensure that the intensity integral has analytical solutions. This limitation severely affects the flexibility of the model. To address this issue, sparse spectral permanental processes have been proposed in recent years [14; 29]. The advantage of this method lies in its ability to analytically compute the intensity integral while not requiring restrictions on the kernel's functional form. However, the drawback is that it is only applicable to stationary kernels. Therefore, a natural question arises: can we further generalize sparse spectral permanental processes to not only remove restrictions on the kernel's functional form but also make them applicable to nonstationary kernels?

4.1 Nonstationary Sparse Spectral Permanental Process

Existing sparse spectral permanental processes rely on the prerequisite of Bochner's theorem, which limits the extension of the sparse spectral method to nonstationary kernels. Fortunately, for more general nonstationary kernels, a spectral representation similar to Bochner's theorem also exists.

Theorem 4.1. [37] A nonstationary kernel function k(x_1, x_2) : R^D × R^D → R is bounded, continuous, and positive definite if and only if it can be represented as

$$k(x_1, x_2) = \int_{\mathbb{R}^D \times \mathbb{R}^D} \exp\big(i(\omega_1^\top x_1 - \omega_2^\top x_2)\big)\,\mathrm{d}\mu(\omega_1, \omega_2),$$

where µ(ω_1, ω_2) is a Lebesgue-Stieltjes measure associated with some bounded positive semi-definite spectral density p(ω_1, ω_2) = µ(ω_1, ω_2)/µ(R^D, R^D).

Similarly, we can use the Monte Carlo method to approximate it. However, it is worth noting that, to ensure that the spectral density p(ω_1, ω_2) is positive semi-definite, we must first symmetrize it, i.e., p(ω_1, ω_2) = p(ω_2, ω_1), and secondly introduce the diagonal components p(ω_1, ω_1) and p(ω_2, ω_2). We can take a general density function g on the product space and then parameterize p(ω_1, ω_2) in the following form to ensure its positive semi-definiteness [26; 32]:

$$p(\omega_1, \omega_2) = \tfrac{1}{4}\big(g(\omega_1, \omega_2) + g(\omega_2, \omega_1) + g(\omega_1, \omega_1) + g(\omega_2, \omega_2)\big). \tag{5}$$
Then, we can obtain the sparse spectral feature representation of any nonstationary kernel:

$$
\begin{aligned}
k(x_1, x_2) &= \frac{\sigma^2}{4}\,\mathbb{E}_{g(\omega_1, \omega_2)}\Big[\exp\big(i(\omega_1^\top x_1 - \omega_2^\top x_2)\big) + \exp\big(i(\omega_2^\top x_1 - \omega_1^\top x_2)\big) \\
&\qquad\qquad\qquad + \exp\big(i(\omega_1^\top x_1 - \omega_1^\top x_2)\big) + \exp\big(i(\omega_2^\top x_1 - \omega_2^\top x_2)\big)\Big] \\
&\approx \frac{\sigma^2}{4R}\sum_{r=1}^{R}\Big[\exp\big(i(\omega_{1r}^\top x_1 - \omega_{2r}^\top x_2)\big) + \ldots + \exp\big(i(\omega_{2r}^\top x_1 - \omega_{2r}^\top x_2)\big)\Big] \\
&= \Phi_1^{(R)}(x_1)^\top \Phi_2^{(R)}(x_2) + \Phi_2^{(R)}(x_1)^\top \Phi_1^{(R)}(x_2) + \Phi_1^{(R)}(x_1)^\top \Phi_1^{(R)}(x_2) + \Phi_2^{(R)}(x_1)^\top \Phi_2^{(R)}(x_2) \\
&= \varphi^{(R)}(x_1)^\top \varphi^{(R)}(x_2),
\end{aligned} \tag{6}
$$

where σ² = µ(R^D, R^D), {ω_{1r}, ω_{2r}}_{r=1}^R are independent samples from g(ω_1, ω_2), and

$$
\begin{aligned}
\Phi_1^{(R)}(x) &= \frac{\sigma}{2\sqrt{R}}\big[\cos(\omega_{11}^\top x), \ldots, \cos(\omega_{1R}^\top x), \sin(\omega_{11}^\top x), \ldots, \sin(\omega_{1R}^\top x)\big]^\top, \\
\Phi_2^{(R)}(x) &= \frac{\sigma}{2\sqrt{R}}\big[\cos(\omega_{21}^\top x), \ldots, \cos(\omega_{2R}^\top x), \sin(\omega_{21}^\top x), \ldots, \sin(\omega_{2R}^\top x)\big]^\top, \\
\varphi^{(R)}(x) &= \Phi_1^{(R)}(x) + \Phi_2^{(R)}(x).
\end{aligned} \tag{7}
$$

If the GP's kernel in the permanental process (Eq. (2)) adopts the above sparse spectral nonstationary kernel, then we refer to this model as NSSPP.

4.2 Deep Nonstationary Sparse Spectral Permanental Process

We can further enhance the expressive power of the nonstationary kernel by stacking multiple spectral feature mappings hierarchically to construct a deep kernel. The spectral feature mapping φ^(R)(x) has another equivalent form:

$$\varphi^{(R)}(x) = \frac{\sigma}{\sqrt{2R}}\big[\cos(\omega_{11}^\top x + b_{11}) + \cos(\omega_{21}^\top x + b_{21}),\; \ldots,\; \cos(\omega_{1R}^\top x + b_{1R}) + \cos(\omega_{2R}^\top x + b_{2R})\big]^\top, \tag{8}$$

where {b_{1r}, b_{2r}}_{r=1}^R are uniformly sampled from [0, 2π]. The derivation of Eq. (8) is provided in Appendix B. The spectral feature mapping in Eq. (8) can be viewed as a more complex single-layer neural network: it employs two sets of weights (ω) and biases (b) to linearly transform the input, applies cosine activation functions, and then adds the results to obtain a single output dimension. By stacking multiple layers on top of each other, we naturally construct a deep nonstationary kernel:

$$\Psi^{(R)}(x) = \varphi_L^{(R)} \circ \varphi_{L-1}^{(R)} \circ \cdots \circ \varphi_1^{(R)}(x), \qquad k(x_1, x_2) \approx \Psi^{(R)}(x_1)^\top \Psi^{(R)}(x_2), \tag{9}$$

where Ψ^(R)(x) is the spectral feature mapping with L layers and φ_l^(R) denotes the l-th layer. This actually recovers the deep spectral kernel proposed by [36]. If the GP's kernel in the permanental process (Eq. (2)) adopts the above deep sparse spectral nonstationary kernel, then we refer to this model as DNSSPP. DNSSPP exhibits enhanced expressiveness compared to NSSPP due to its deep architecture. In subsequent sections, we consistently use Ψ^(R) as the spectral feature mapping: when it has a single layer, the model corresponds to NSSPP; when it has multiple layers, the model corresponds to DNSSPP.
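To make Eqs. (8) and (9) concrete, below is a minimal sketch of how a (deep) nonstationary spectral feature mapping might look. It uses randomly initialized frequencies and biases (in (D)NSSPP these are learned), fixed widths, and plain NumPy, so it is an illustration rather than the released implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

class SpectralLayer:
    """One spectral feature mapping phi^(R): R^{d_in} -> R^{d_out}, cf. Eq. (8)."""
    def __init__(self, d_in, d_out, sigma=1.0):
        # Two sets of frequencies (omega_1, omega_2) and biases (b_1, b_2); here they
        # are random, whereas the model would optimize them as kernel parameters.
        self.w1 = rng.normal(size=(d_in, d_out))
        self.w2 = rng.normal(size=(d_in, d_out))
        self.b1 = rng.uniform(0, 2 * np.pi, size=d_out)
        self.b2 = rng.uniform(0, 2 * np.pi, size=d_out)
        self.scale = sigma / np.sqrt(2 * d_out)

    def __call__(self, x):                       # x: (n, d_in)
        return self.scale * (np.cos(x @ self.w1 + self.b1) + np.cos(x @ self.w2 + self.b2))

def deep_feature_map(layers, x):
    """Psi^(R)(x) = phi_L o ... o phi_1(x), cf. Eq. (9)."""
    h = x
    for layer in layers:
        h = layer(h)
    return h

# A two-layer mapping with widths 50 and 30 and the induced kernel approximation.
layers = [SpectralLayer(1, 50), SpectralLayer(50, 30)]
x = np.linspace(0, 10, 5).reshape(-1, 1)
Psi = deep_feature_map(layers, x)                # (5, 30)
K_approx = Psi @ Psi.T                           # k(x_i, x_j) ~= Psi(x_i)^T Psi(x_j)
print(K_approx.shape)
```

A single layer corresponds to the NSSPP feature map; stacking several layers gives the DNSSPP mapping used below.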
5 Inference

For the posterior inference of (D)NSSPP, this work employs a fast Laplace approximation exploiting the sparse spectral representation of nonstationary kernels. Other inference methods, such as variational inference, are also feasible. Specifically, we extend the method proposed in [29] from stationary kernels to nonstationary kernels.

5.1 Joint Distribution

When we approximate a kernel using Eq. (6) or Eq. (9), the GP prior can also be approximated as:

$$f(x) \approx \beta^\top \Psi^{(R)}(x), \qquad \beta \sim \mathcal{N}(0, I). \tag{10}$$

Additionally, we set the intensity function as λ(x) = (f(x) + α)². Following [14], we introduce an offset α to mitigate the nodal line problem caused by the non-injectiveness of λ = f². This issue results in the intensity between positive and negative values of f being forced to zero, despite the underlying intensity being positive. Combining the likelihood in Eq. (1) and the equivalent GP prior in Eq. (10), we obtain the joint density:

$$
\begin{aligned}
\log p(\{x_i\}_{i=1}^N, \beta \mid \Theta) &= \log p(\{x_i\}_{i=1}^N \mid \beta, \Theta) + \log p(\beta) \\
&= \sum_{i=1}^N \log\big((\beta^\top \Psi^{(R)}(x_i) + \alpha)^2\big) - \int_{\mathcal{X}} \lambda(x)\,\mathrm{d}x - \frac{1}{2}\beta^\top \beta + C,
\end{aligned} \tag{11}
$$

where C is a constant and Θ denotes the model hyperparameters. The intensity integral term can be further expressed as:

$$\int_{\mathcal{X}} \lambda(x)\,\mathrm{d}x = \beta^\top M \beta + 2\alpha\,\beta^\top m + \alpha^2 |\mathcal{X}|, \tag{12}$$

where |X| is the window size, M is an R × R matrix, and m is an R-sized vector with entries

$$M_{ij} = \int_{\mathcal{X}} \Psi^{(R)}_i(x)\,\Psi^{(R)}_j(x)\,\mathrm{d}x, \qquad m_i = \int_{\mathcal{X}} \Psi^{(R)}_i(x)\,\mathrm{d}x, \qquad i, j \in \{1, \ldots, R\}. \tag{13}$$

(For simplicity, here we assume all layers have the same width R, but it can also vary.) It is worth noting that the intensity integral is potentially analytically solvable. For a single-layer spectral feature mapping (NSSPP), analytical solutions for M and m exist (see Appendix C). However, for a multi-layer mapping (DNSSPP), analytical solutions are not possible. In the following, we analytically solve M and m for NSSPP, while using numerical integration for DNSSPP.

5.2 Laplace Approximation

We use Laplace's method to approximate the posterior p(β | {x_i}_{i=1}^N, Θ). Laplace's method approximates the posterior with a Gaussian distribution by performing a second-order Taylor expansion around the maximum of the log posterior. Following the common practice:

$$\log p(\beta \mid \{x_i\}, \Theta) \approx \log \mathcal{N}(\beta \mid \hat{\beta}, Q) = -\frac{1}{2}(\beta - \hat{\beta})^\top Q^{-1}(\beta - \hat{\beta}) - \frac{1}{2}\log\lvert 2\pi Q\rvert, \tag{14}$$

where β̂ is the mode of the true posterior and Q is the negative inverse Hessian of the true posterior evaluated at the mode:

$$\nabla_\beta \log p(\{x_i\}_{i=1}^N, \beta \mid \Theta)\big|_{\beta = \hat{\beta}} = 0, \qquad Q^{-1} = -\nabla^2_\beta \log p(\{x_i\}_{i=1}^N, \beta \mid \Theta)\big|_{\beta = \hat{\beta}}. \tag{15}$$

The mode β̂ is typically not directly solvable from Eq. (15), so we employ optimization methods to find it. Then, the precision matrix Q^{-1} can be computed analytically (see Appendix D).

5.3 Hyperparameter and Complexity

The model hyperparameters Θ consist of the kernel parameters σ, ω, b (if there are multiple layers, these include the parameters of all layers) and the bias of the intensity function α. We can learn these hyperparameters from the data by maximizing the marginal likelihood:

$$\log p(\{x_i\}_{i=1}^N \mid \Theta) = \log p(\{x_i\}_{i=1}^N, \beta \mid \Theta) - \log p(\beta \mid \{x_i\}_{i=1}^N, \Theta) \approx \log p(\{x_i\}_{i=1}^N, \beta \mid \Theta) - \log \mathcal{N}(\beta \mid \hat{\beta}, Q). \tag{16}$$

Substituting Eqs. (11) and (14) into Eq. (16), we obtain an approximation of the marginal likelihood. Optimizing it yields the optimal hyperparameters Θ. Similar to common low-rank approximation methods for kernels, the computational complexity of the proposed inference method is reduced from O(N^3) to O(NR^2), where R ≪ N, i.e., it scales linearly with N. The complete algorithm pseudocode is provided in Appendix E.

5.4 Predictive Distribution

After obtaining the posterior β | {x_i}_{i=1}^N, Θ ∼ N(β̂, Q), we can compute the posterior of the intensity function. For any x* ∈ X, the predictive distribution of f(x*) + α is

$$f(x^*) + \alpha \mid \{x_i\}_{i=1}^N, \Theta \sim \mathcal{N}\big(\mu(x^*), \sigma^2(x^*)\big), \quad \mu(x^*) = \hat{\beta}^\top \Psi^{(R)}(x^*) + \alpha, \quad \sigma^2(x^*) = \Psi^{(R)}(x^*)^\top Q\, \Psi^{(R)}(x^*).$$

Then, the intensity λ(x*) = (f(x*) + α)² can be proven to have the following mean and variance:

$$\mathbb{E}[\lambda(x^*)] = \mu^2(x^*) + \sigma^2(x^*), \qquad \mathrm{Var}[\lambda(x^*)] = 2\sigma^4(x^*) + 4\mu^2(x^*)\,\sigma^2(x^*). \tag{17}$$
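The inference procedure of Sections 5.1-5.4 can be summarized in a few dozen lines. The sketch below is schematic and rests on several simplifying assumptions: the feature map `Psi` is treated as a black box, M and m are computed by crude numerical quadrature on a one-dimensional grid rather than the analytical expressions of Appendix C, the mode is found by plain gradient ascent, and hyperparameter learning (Eq. (16)) is omitted.

```python
import numpy as np

def laplace_fit(Psi, events, X_grid, alpha=1.0, n_steps=2000, lr=1e-2):
    """Laplace approximation for beta; Psi maps (n, d) inputs to (n, R) features."""
    Phi_ev = Psi(events)                                   # features at observed events, (N, R)
    Phi_gr = Psi(X_grid)                                   # features on a quadrature grid, (G, R)
    w = (X_grid[-1, 0] - X_grid[0, 0]) / len(X_grid)       # 1-D rectangle-rule weight (placeholder)
    M = w * Phi_gr.T @ Phi_gr                              # numerical version of Eq. (13)
    m = w * Phi_gr.sum(axis=0)

    beta = np.zeros(Phi_ev.shape[1])
    for _ in range(n_steps):                               # gradient ascent on Eq. (11)
        g = Phi_ev.T @ (2.0 / (Phi_ev @ beta + alpha))     # data term
        g -= 2 * M @ beta + 2 * alpha * m + beta           # integral term and Gaussian prior
        beta += lr * g
    # Negative Hessian at the mode, cf. Eq. (15) and Appendix D.
    r = (Phi_ev @ beta + alpha) ** 2
    Q_inv = 2 * M + np.eye(len(beta)) + 2 * Phi_ev.T @ (Phi_ev / r[:, None])
    return beta, np.linalg.inv(Q_inv), M, m

def predict_intensity(Psi, x_star, beta_hat, Q, alpha=1.0):
    """Predictive mean and variance of lambda(x*), cf. Eq. (17)."""
    phi = Psi(x_star)
    mu = phi @ beta_hat + alpha
    var = np.einsum("nr,rs,ns->n", phi, Q, phi)
    return mu ** 2 + var, 2 * var ** 2 + 4 * mu ** 2 * var
```

In the actual model the hyperparameters would additionally be updated in an outer loop by maximizing the approximate marginal likelihood of Eq. (16).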
6 Experiments

In this section, we mainly analyze the performance advantage of (D)NSSPP over baseline models on both synthetic and real-world datasets, as well as the impact of various hyperparameters. We perform all experiments on a server with two GPUs (NVIDIA TITAN V with 12GB memory), two CPUs (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, 8 cores each), and 251GB memory.

Figure 1: (a) The fitting results of the intensity functions for all models on the stationary synthetic data; (b) those on the nonstationary synthetic data. The impact of (c) network width and depth, and (d) the number of epochs and learning rate, on the Ltest of DNSSPP on the nonstationary data.

6.1 Baselines

We employ several previous works as baselines for comparison. These baselines include VBPP, which utilizes the inducing points method [18], LBPP, which employs the Nyström approximation [33], as well as SSPP and GSSPP, which utilize the Fourier representation [29]. We use GSSPP with a Gaussian kernel and with Matérn kernels with parameters 1/2 and 5/2, denoted as GSSPP, GSSPP-M12, and GSSPP-M52, respectively. Besides, we offer a baseline that employs stacked mappings of stationary kernels, which we name the Deep Sparse Spectral Permanental Process (DSSPP). All of these methods utilize stationary kernels. As far as we know, there is currently no work employing nonstationary kernels in permanental processes. To compare with nonstationary baselines, we implement a nonstationary permanental process using the nonstationary spectral mixture kernel [26], referred to as NSMPP.

6.2 Metrics

We employ two metrics to evaluate the performance of the various baselines. One is the expected log-likelihood on the test data, denoted as Ltest, with details provided in Appendix F. Additionally, when the ground-truth intensity is available, e.g., for the synthetic data, the root mean square error (RMSE) between the expected posterior intensity and the ground truth serves as another measure.

6.3 Synthetic Data

Datasets  To better compare the performance of (D)NSSPP with baseline models on point process data in both stationary and nonstationary scenarios, we simulated a stationary and a nonstationary permanental process on the interval [0, 10], respectively. Specifically, we assume two kernels:

$$k_1(x_1, x_2) = \exp\Big(-\tfrac{1}{2}\lVert x_1 - x_2\rVert^2\Big), \qquad k_2(x_1, x_2) = \Big(\tfrac{x_1^\top x_2}{100} + 1\Big)^3 \exp\Big(-\tfrac{1}{2}\lVert x_1 - x_2\rVert^2\Big),$$

where k_1 is a stationary Gaussian kernel and k_2 is a nonstationary kernel obtained by multiplying a Gaussian kernel with a polynomial kernel. We use these two kernels to construct a stationary GP and a nonstationary GP, respectively, and randomly sample two corresponding latent functions. Based on λ(x) = (f(x) + 2)², we construct two corresponding intensity functions. Finally, we use the thinning algorithm [22] to simulate two sets of synthetic data. For each intensity, we simulate ten datasets and use each dataset alternately as the training set and the remaining ones as the test sets.
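For reference, a minimal sketch of the thinning procedure [22] for simulating such data is given below; the latent function f here is a simple hypothetical stand-in for the GP sample drawn from k1 or k2, and the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_by_thinning(intensity, lo=0.0, hi=10.0, lam_max=None):
    """Simulate an inhomogeneous Poisson process on [lo, hi] by thinning."""
    if lam_max is None:                       # crude upper bound on the intensity
        grid = np.linspace(lo, hi, 2000)
        lam_max = 1.1 * intensity(grid).max()
    n = rng.poisson(lam_max * (hi - lo))      # homogeneous candidates with rate lam_max
    cand = rng.uniform(lo, hi, size=n)
    keep = rng.uniform(0, lam_max, size=n) < intensity(cand)   # accept w.p. lambda(x)/lam_max
    return np.sort(cand[keep])

# Placeholder latent function f(x); in the paper f is a GP sample drawn from k1 or k2.
f = lambda x: np.sin(1.5 * x)
lam = lambda x: (f(x) + 2.0) ** 2             # lambda(x) = (f(x) + 2)^2
events = simulate_by_thinning(lam)
print(len(events), "events")
```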
Setup  For NSSPP, the number of frequencies (network width) is set to 50. For DNSSPP, we experimented with three different configurations: DNSSPP-[50,30], DNSSPP-[100,50], and DNSSPP-[30,50,30], where each number represents the width of the corresponding network layer. Therefore, DNSSPP-[50,30] denotes a model with two layers, where the first layer has a width of 50 and the second a width of 30, and so on. For the baseline models, we need to set the rank of the kernel low-rank approximation. We set the number of inducing points to 50 for VBPP, the number of eigenvalues to 50 for LBPP, and the number of frequencies to 50 for SSPP, GSSPP, and NSMPP. For DSSPP, to facilitate comparison with DNSSPP, we adopt the same layer configurations.

Results  The fitting results of the intensity functions for all models are depicted in Fig. 1. To quantitatively compare the fitting performance of the different methods in both the stationary and nonstationary scenarios, we employ Ltest and RMSE as metrics. The comparison results are presented in Table 1. DNSSPP with three different configurations exhibits significant advantages on both the stationary and nonstationary datasets. NSSPP and NSMPP show better performance in the nonstationary scenario compared to most stationary baselines. This is reasonable, as stationary baselines face challenges in accurately modeling nonstationary data. Because stationary kernels are a subset of nonstationary kernels, nonstationary models typically perform well on stationary data as well. Compared to relatively shallow models, DNSSPP achieves better performance but also incurs a longer running time.

Table 1: The performance of Ltest, RMSE, and runtime for DNSSPP, DSSPP, NSSPP, and other baselines on the two synthetic datasets. For Ltest, higher is better; for RMSE and runtime, lower is better. The upper half corresponds to nonstationary models, the lower half to stationary models. On the stationary dataset, DSSPP performs comparably to DNSSPP. On the nonstationary dataset, DSSPP does not outperform DNSSPP due to the severe nonstationarity in the data.

| Model | Ltest (Stat.) | RMSE (Stat.) | Runtime(s) (Stat.) | Ltest (Nonstat.) | RMSE (Nonstat.) | Runtime(s) (Nonstat.) |
|---|---|---|---|---|---|---|
| DNSSPP-[100,50] | 156.43 (±37.52) | 0.061 (±0.020) | 9.27 | 175.70 (±26.89) | 0.076 (±0.015) | 9.40 |
| DNSSPP-[50,30] | 156.55 (±37.44) | 0.061 (±0.017) | 9.00 | 173.68 (±26.70) | 0.094 (±0.012) | 9.11 |
| DNSSPP-[30,50,30] | 156.55 (±37.66) | 0.068 (±0.017) | 10.32 | 174.47 (±26.57) | 0.086 (±0.013) | 10.07 |
| NSSPP | 153.56 (±36.82) | 0.070 (±0.009) | 8.39 | 172.87 (±26.86) | 0.10 (±0.012) | 8.96 |
| NSMPP | 153.05 (±36.96) | 0.065 (±0.018) | 3.19 | 172.17 (±26.42) | 0.079 (±0.019) | 4.63 |
| DSSPP-[100,50] | 155.62 (±37.31) | 0.060 (±0.015) | 7.64 | 173.91 (±10.27) | 0.090 (±0.013) | 8.27 |
| DSSPP-[50,30] | 155.51 (±36.98) | 0.065 (±0.012) | 7.56 | 172.54 (±26.74) | 0.103 (±0.015) | 7.54 |
| DSSPP-[30,50,30] | 153.75 (±37.34) | 0.073 (±0.016) | 8.95 | 174.35 (±26.70) | 0.086 (±0.013) | 9.74 |
| SSPP | 153.91 (±36.70) | 0.071 (±0.016) | 1.95 | 165.94 (±30.00) | 0.140 (±0.009) | 2.48 |
| GSSPP | 154.12 (±37.09) | 0.073 (±0.014) | 7.72 | 170.13 (±26.43) | 0.101 (±0.022) | 14.19 |
| GSSPP-M12 | 150.65 (±36.84) | 0.071 (±0.018) | 9.30 | 168.49 (±26.67) | 0.083 (±0.018) | 15.33 |
| GSSPP-M52 | 154.46 (±37.24) | 0.075 (±0.022) | 9.63 | 169.33 (±26.61) | 0.107 (±0.026) | 14.49 |
| LBPP | 150.80 (±36.13) | 0.082 (±0.008) | 0.31 | 168.97 (±26.70) | 0.126 (±0.006) | 0.37 |
| VBPP | 155.29 (±36.52) | 0.072 (±0.021) | 1.68 | 172.95 (±30.00) | 0.087 (±0.016) | 1.83 |

Figure 2: The fitting results of the intensity functions from LBPP and DNSSPP on the Redwoods and Taxi datasets. Additional results for various baselines on the three datasets are provided in Appendix G.

6.4 Real Data

Datasets  In this section, we use three common real-world datasets to evaluate the performance of our method and the baselines.
Coal Mining Disaster [13]: This is a 1-dimensional dataset containing 191 incidents that occurred between March 15, 1875, and March 22, 1962. Each event represents a mining accident that killed more than 10 people.

Redwoods [27]: This is a 2-dimensional dataset that describes the distribution of redwoods in an area, consisting of 195 events.

Porto Taxi [21]: This is a large 2-dimensional dataset containing the tracks of 7,000 taxis in Porto during 2013/2014. We focus on the area with coordinates between (41.147, 8.58) and (41.18, 8.65), and extract 3,000 pick-up points as our dataset.

For each dataset, we randomly partition the data into a training set and a test set of approximately equal size.

Setup  For the real datasets, because we do not know the ground-truth intensity, we use Ltest as the evaluation metric. Since NSSPP generally performs worse than DNSSPP, we report only the results of DNSSPP in this section. For DNSSPP, we consistently use the same three configurations as for the synthetic data. For the baseline models, we choose different ranks for the kernel low-rank approximation for the 1-dimensional and 2-dimensional datasets. Specifically, for the 1-dimensional dataset (Coal), we set the rank, i.e., the number of inducing points for VBPP, the number of eigenvalues for LBPP, and the number of frequencies for SSPP, GSSPP, and NSMPP, to 10. For the 2-dimensional datasets (Redwoods and Taxi), we set the rank to 50. We discuss the reasonableness of this setup in Appendix H.

Results  Fig. 2 shows the fitting results of the intensity functions from LBPP and DNSSPP on the Redwoods and Taxi datasets. The quantitative comparison results are shown in Table 2. Since real-world data is nonstationary to varying degrees, nonstationary models generally perform better on real-world data. DNSSPP secures both first and second places on the Coal and Taxi datasets and achieves second place on the Redwoods dataset. For simpler datasets, such as Coal and Redwoods, which have relatively simple data patterns, the performance of DNSSPP is comparable to that of the simpler baselines. However, for the more complex Taxi dataset, DNSSPP demonstrates significant advantages over the simpler baselines. This indicates that DNSSPP has a clear advantage in handling data with more complex patterns, such as large-scale or high-dimensional datasets.

Table 2: The performance of Ltest and runtime for DNSSPP and other baselines on the three real datasets. The upper half corresponds to nonstationary models, the lower half to stationary models.

| Model | Ltest (Coal) | Runtime(s) | Ltest (Redwoods) | Runtime(s) | Ltest (Taxi) | Runtime(s) |
|---|---|---|---|---|---|---|
| DNSSPP-[100,50] | 225.73 (±2.96) | 10.54 | 79.56 (±0.014) | 11.09 | 7171 (±73) | 73.3 |
| DNSSPP-[50,30] | 225.85 (±2.61) | 11.38 | 79.50 (±0.020) | 10.45 | 6682 (±37) | 70.1 |
| DNSSPP-[30,50,30] | 225.79 (±4.49) | 11.76 | 79.55 (±0.025) | 12.72 | 7246 (±37) | 81.0 |
| NSMPP | 223.28 (±3.60) | 2.88 | 77.64 (±6.21) | 4.43 | 6492 (±62) | 94.6 |
| SSPP | 221.42 (±1.87) | 1.72 | 78.57 (±2.83) | 2.45 | 6245 (±45) | 58.4 |
| GSSPP | 221.08 (±6.32) | 5.05 | 76.97 (±5.80) | 5.94 | 6445 (±97) | 110.64 |
| GSSPP-M12 | 223.11 (±5.02) | 5.61 | 72.96 (±11.89) | 5.90 | 6599 (±76) | 104.75 |
| GSSPP-M52 | 221.89 (±3.06) | 5.17 | 77.98 (±6.05) | 5.23 | 6526 (±113) | 112.54 |
| LBPP | 218.30 (±4.12) | 0.33 | 80.40 (±0.72) | 0.34 | 6096 (±25) | 3.16 |
| VBPP | 219.15 (±4.54) | 1.69 | 77.06 (±0.88) | 2.14 | 6156 (±34) | 9.10 |

6.5 Ablation Studies

In this section, we analyze the impact of four configuration hyperparameters on the performance of DNSSPP: the network width and depth, as well as the number of epochs and learning rate during the update of the hyperparameters Θ.
6.5.1 Width and Depth of Network

We investigate DNSSPP's performance with varying network sizes by adjusting the width and depth on the nonstationary synthetic data. In our experiments, we maintain a consistent width across layers. Results are illustrated in Fig. 1c. As the number of layers and the width of the network increase, DNSSPP's performance initially improves and then declines, reflecting the phenomenon of overfitting as the network size increases. Regarding the number of layers, DNSSPP generally exhibits underfitting when there is only a single layer, while overfitting tends to occur with four layers. In terms of network width, due to the relatively small size of the synthetic data, overfitting becomes apparent when the width is large. However, for larger datasets, a more complex network structure should be chosen.

6.5.2 Epoch and Learning Rate of Θ

During the inference process, we need to perform numerical optimization on the hyperparameters Θ, which requires determining the number of epochs and the learning rate for this optimization. We conduct experiments on the nonstationary synthetic data to investigate the coordination of the epoch and learning rate. Results are illustrated in Fig. 1d. When the learning rate is too low or the number of epochs is insufficient, the update of Θ is sluggish, leading to inferior performance. However, when the learning rate is too high or the number of epochs is excessive, model performance generally declines as well. This is because our model training is a bi-level optimization: the inner level is the Laplace approximation, and the outer level updates the hyperparameters Θ. An excessively high learning rate or too many epochs can cause Θ to converge too quickly, making the overall model training more prone to getting stuck in local optima, thus resulting in suboptimal performance.

7 Limitations

NSSPP not only removes restrictions on the kernel's functional form but also makes it applicable to nonstationary kernels. DNSSPP further enhances its expressive power through a deep architecture. However, the deep architecture of DNSSPP prevents analytical computation of the intensity integral, necessitating numerical integration. This reliance on numerical integration imposes a limitation on the model's computational speed.

8 Conclusions

In this study, we introduced NSSPP and its deep kernel variant, DNSSPP, to address limitations in modeling the permanental process. This approach overcomes the stationary assumption, allowing for flexible learning of nonstationary kernels directly from data without restricting their form. By leveraging the sparse spectral representation of nonstationary kernels, we achieved a low-rank approximation, effectively reducing computational complexity. Our experiments on synthetic and real-world datasets demonstrated that (D)NSSPP achieved competitive performance on stationary data while excelling on nonstationary datasets. This study underscores the importance of considering nonstationarity in the permanental process and demonstrates the efficacy of (D)NSSPP in this context.

Acknowledgments and Disclosure of Funding

This work was supported by NSFC Projects (Nos.
62106121, 62406119), the MOE Project of Key Research Institute of Humanities and Social Sciences (22JJD110001), the fundamental research funds for the central universities, and the research funds of Renmin University of China (24XNKJ13), the Natural Science Foundation of Hubei Province (2024AFB074), and the Guangdong Provincial Key Laboratory of Mathematical Foundations for Artificial Intelligence (2023B1212010001).

References

[1] Adams, R. P., Murray, I., and MacKay, D. J. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 9-16. ACM, 2009.
[2] Bochner, S. Vorlesungen über Fouriersche Integrale. Akademische Verlagsgesellschaft, 1932.
[3] Cox, D. R. Some statistical methods connected with series of events. Journal of the Royal Statistical Society. Series B (Methodological), pp. 129-164, 1955.
[4] Dezfouli, A., Nock, R., Arabzadeh, E., and Dayan, P. Neural network Poisson models for behavioural and neural spike train data. bioRxiv, 2020.
[5] Diggle, P., Rowlingson, B., and Su, T.-l. Point process methodology for on-line spatio-temporal disease surveillance. Environmetrics: The Official Journal of the International Environmetrics Society, 16(5):423-434, 2005.
[6] Donner, C. and Opper, M. Efficient Bayesian inference of sigmoidal Gaussian Cox processes. Journal of Machine Learning Research, 19(1):2710-2743, 2018.
[7] Flaxman, S., Teh, Y. W., Sejdinovic, D., et al. Poisson intensity estimation with reproducing kernels. Electronic Journal of Statistics, 11(2):5081-5104, 2017.
[8] Geng, J., Shi, W., and Hu, G. Bayesian nonparametric nonhomogeneous Poisson process with applications to USGS earthquake data. Spatial Statistics, 41:100495, 2021.
[9] Gunter, T., Lloyd, C., Osborne, M. A., and Roberts, S. J. Efficient Bayesian nonparametric modelling of structured point processes. In Conference on Uncertainty in Artificial Intelligence, pp. 310-319, 2014.
[10] Heinonen, M., Mannerström, H., Rousu, J., Kaski, S., and Lähdesmäki, H. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Artificial Intelligence and Statistics, pp. 732-740. PMLR, 2016.
[11] Ilalan, D. A Poisson process with random intensity for modeling financial stability. The Spanish Review of Financial Economics, 14(2):43-50, 2016.
[12] Illian, J. B., Sørbye, S. H., and Rue, H. A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA). 2012.
[13] Jarrett, R. G. A note on the intervals between coal-mining disasters. Biometrika, 66(1):191-193, 1979.
[14] John, S. and Hensman, J. Large-scale Cox process inference using variational Fourier features. In International Conference on Machine Learning, pp. 2362-2370. PMLR, 2018.
[15] Kingman, J. F. C. Poisson Processes, volume 3. Clarendon Press, 1992.
[16] Ko, Y.-J. and Seeger, M. W. Expectation propagation for rectified linear Poisson regression. In Asian Conference on Machine Learning, pp. 253-268. PMLR, 2016.
[17] Lázaro-Gredilla, M., Quinonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research, 11:1865-1881, 2010.
[18] Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pp. 1814-1822, 2015.
[19] McCullagh, P. and Møller, J. The permanental process.
Advances in Applied Probability, 38(4):873-888, 2006.
[20] Møller, J., Syversveen, A. R., and Waagepetersen, R. P. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451-482, 1998.
[21] Moreira-Matias, L., Gama, J., Ferreira, M., Mendes-Moreira, J., and Damas, L. Predicting taxi passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393-1402, 2013.
[22] Ogata, Y. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379-402, 1998.
[23] Paciorek, C. J. and Schervish, M. J. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics: The Official Journal of the International Environmetrics Society, 17(5):483-506, 2006.
[24] Park, M., Weller, J. P., Horwitz, G. D., and Pillow, J. W. Bayesian active learning of neural firing rate maps with transformed Gaussian process priors. Neural Computation, 26(8):1519-1541, 2014.
[25] Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.
[26] Remes, S., Heinonen, M., and Kaski, S. Non-stationary spectral kernels. Advances in Neural Information Processing Systems, 30, 2017.
[27] Ripley, B. D. Modelling spatial patterns. Journal of the Royal Statistical Society: Series B (Methodological), 39(2):172-192, 1977.
[28] Samo, Y.-L. K. and Roberts, S. Generalized spectral kernels. arXiv preprint arXiv:1506.02236, 2015.
[29] Sellier, J. and Dellaportas, P. Sparse spectral Bayesian permanental process with generalized kernel. In International Conference on Artificial Intelligence and Statistics, pp. 2769-2791. PMLR, 2023.
[30] Snoek, J., Swersky, K., Zemel, R., and Adams, R. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pp. 1674-1682. PMLR, 2014.
[31] Taddy, M. A. Autoregressive mixture models for dynamic spatial Poisson processes: application to tracking intensity of violent crime. Journal of the American Statistical Association, 105(492):1403-1417, 2010.
[32] Ton, J.-F., Flaxman, S., Sejdinovic, D., and Bhatt, S. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statistics, 28:59-78, 2018.
[33] Walder, C. J. and Bishop, A. N. Fast Bayesian intensity estimation for the permanental process. In International Conference on Machine Learning, pp. 3579-3588. PMLR, 2017.
[34] Wilson, A. and Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pp. 1067-1075. PMLR, 2013.
[35] Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370-378. PMLR, 2016.
[36] Xue, H., Wu, Z.-F., and Sun, W.-X. Deep spectral kernel learning. In International Joint Conference on Artificial Intelligence, pp. 4019-4025, 2019.
[37] Yaglom, A. M. Correlation Theory of Stationary and Related Random Functions, Volume I: Basic Results, volume 131. Springer, 1987.

A Proof of Equation (4)

The derivation of Eq. (4) is provided below:

$$
\begin{aligned}
k(x_1 - x_2) &\approx \frac{\sigma^2}{R}\sum_{i=1}^{R} \phi_i(x_1)\,\phi_i(x_2)^{*} \\
&= \frac{\sigma^2}{R}\sum_{i=1}^{R} \exp(i\omega_i^\top x_1)\exp(-i\omega_i^\top x_2) \\
&= \frac{\sigma^2}{R}\sum_{i=1}^{R} \big(\cos(\omega_i^\top x_1) + i\sin(\omega_i^\top x_1)\big)\big(\cos(\omega_i^\top x_2) - i\sin(\omega_i^\top x_2)\big) \\
&\approx \frac{\sigma^2}{R}\sum_{i=1}^{R} \big(\cos(\omega_i^\top x_1)\cos(\omega_i^\top x_2) + \sin(\omega_i^\top x_1)\sin(\omega_i^\top x_2)\big) \\
&= \Phi^{(R)}(x_1)^\top \Phi^{(R)}(x_2).
\end{aligned} \tag{18}
$$
In the derivation, due to the symmetry of the spectral distribution, it can be assumed that 2R points are symmetrically sampled, which allows the elimination of the imaginary part in the equation without altering the expectation.

B Proof of Equation (8)

The derivation of Eq. (8) is provided below:

$$
\begin{aligned}
k(x_1, x_2) &\approx \frac{\sigma^2}{4R}\sum_{r=1}^{R}\Big[\exp\big(i(\omega_{1r}^\top x_1 - \omega_{2r}^\top x_2)\big) + \ldots + \exp\big(i(\omega_{2r}^\top x_1 - \omega_{2r}^\top x_2)\big)\Big] \\
&= \frac{\sigma^2}{4R}\sum_{r=1}^{R}\Big[\cos(\omega_{1r}^\top x_1 - \omega_{2r}^\top x_2) + \ldots + \cos(\omega_{2r}^\top x_1 - \omega_{2r}^\top x_2)\Big] \\
&= \frac{\sigma^2}{4R}\sum_{r=1}^{R}\sum_{i,j \in \{1,2\}} \cos(\omega_{ir}^\top x_1 - \omega_{jr}^\top x_2),
\end{aligned} \tag{19}
$$

where the second line holds because the kernel is real-valued, so we can safely eliminate the imaginary part. Additionally, considering the law of total expectation, we have

$$\mathbb{E}_{\omega_i, \omega_j}\big[\cos(\omega_i^\top x_1 + \omega_j^\top x_2 + 2b)\big] = \mathbb{E}_{\omega_i, \omega_j}\Big[\mathbb{E}_b\big[\cos(\omega_i^\top x_1 + \omega_j^\top x_2 + 2b) \mid \omega_i, \omega_j\big]\Big], \tag{20}$$

where b ∼ Uniform(0, 2π). Considering the periodicity of the cosine function, we have

$$\mathbb{E}_b\big[\cos(\omega_i^\top x_1 + \omega_j^\top x_2 + 2b) \mid \omega_i, \omega_j\big] = 0, \tag{21}$$

so Eq. (20) is finally equal to 0. Therefore, we can further obtain another equivalent form of the spectral feature mapping, through the following procedure:

$$
\begin{aligned}
k(x_1, x_2) &\approx \frac{\sigma^2}{4R}\sum_{r=1}^{R}\sum_{(i,j)\in\{1,2\}^2}\Big[\cos(\omega_{ir}^\top x_1 - \omega_{jr}^\top x_2) + \cos(\omega_{ir}^\top x_1 + \omega_{jr}^\top x_2 + b_{ir} + b_{jr})\Big] \\
&= \frac{\sigma^2}{4R}\sum_{r=1}^{R}\sum_{(i,j)\in\{1,2\}^2} 2\cos(\omega_{ir}^\top x_1 + b_{ir})\cos(\omega_{jr}^\top x_2 + b_{jr}) \\
&= \varphi^{(R)}(x_1)^\top \varphi^{(R)}(x_2),
\end{aligned} \tag{22}
$$

where {b_{1r}, b_{2r}}_{r=1}^R are uniformly sampled from [0, 2π].

C Analytical Computation of Intensity Integral

The intensity function of the permanental process has the expression λ(x) = (β^⊤Ψ^(R)(x) + α)². For ease of derivation, we adopt the form of Eq. (7) for the spectral feature mapping Ψ^(R)(x) in the subsequent steps, i.e., a 2R-sized vector. When Ψ^(R)(x) takes the form of Eq. (8), i.e., an R-sized vector, the calculation result of the intensity integral remains the same. The intensity integral on X is:

$$
\begin{aligned}
\int_{\mathcal{X}} \lambda(x)\,\mathrm{d}x &= \int_{\mathcal{X}} \big(\beta^\top \Psi^{(R)}(x) + \alpha\big)^2\,\mathrm{d}x \\
&= \beta^\top\! \int_{\mathcal{X}} \Psi^{(R)}(x)\Psi^{(R)}(x)^\top \mathrm{d}x\, \beta + 2\alpha\,\beta^\top\! \int_{\mathcal{X}} \Psi^{(R)}(x)\,\mathrm{d}x + \alpha^2|\mathcal{X}| \\
&= \beta^\top M \beta + 2\alpha\,\beta^\top m + \alpha^2|\mathcal{X}|, \\
M_{ij} &= \int_{\mathcal{X}} \Psi^{(R)}_i(x)\Psi^{(R)}_j(x)\,\mathrm{d}x, \qquad m_i = \int_{\mathcal{X}} \Psi^{(R)}_i(x)\,\mathrm{d}x, \qquad i, j \in \{1, \ldots, 2R\}.
\end{aligned} \tag{23}
$$

It is worth noting that M and m are analytically solvable. The analytical solution for the stationary case is provided in Appendix C.2 of [29]; we provide the analytical expression for the nonstationary case here. For M, the product of two spectral feature mappings is:

$$\Psi^{(R)}_i(x)\,\Psi^{(R)}_j(x) = \frac{\sigma^2}{4R}
\begin{cases}
\sum_{(a,b)\in\{1,2\}^2} \cos(\omega_{ai}^\top x)\cos(\omega_{bj}^\top x), & \text{if } i, j = 1, \ldots, R, \\
\sum_{(a,b)\in\{1,2\}^2} \sin(\omega_{ai}^\top x)\sin(\omega_{bj}^\top x), & \text{if } i, j = R+1, \ldots, 2R, \\
\sum_{(a,b)\in\{1,2\}^2} \cos(\omega_{ai}^\top x)\sin(\omega_{bj}^\top x), & \text{if } i = 1, \ldots, R,\ j = R+1, \ldots, 2R, \\
\sum_{(a,b)\in\{1,2\}^2} \sin(\omega_{ai}^\top x)\cos(\omega_{bj}^\top x), & \text{if } i = R+1, \ldots, 2R,\ j = 1, \ldots, R.
\end{cases} \tag{24}$$

We can further transform the products of trigonometric functions:

$$
\begin{aligned}
\cos(\omega_{ai}^\top x)\cos(\omega_{bj}^\top x) &= \tfrac{1}{2}\big[\cos\big((\omega_{ai} - \omega_{bj})^\top x\big) + \cos\big((\omega_{ai} + \omega_{bj})^\top x\big)\big], \\
\sin(\omega_{ai}^\top x)\sin(\omega_{bj}^\top x) &= \tfrac{1}{2}\big[\cos\big((\omega_{ai} - \omega_{bj})^\top x\big) - \cos\big((\omega_{ai} + \omega_{bj})^\top x\big)\big], \\
\cos(\omega_{ai}^\top x)\sin(\omega_{bj}^\top x) &= \tfrac{1}{2}\big[-\sin\big((\omega_{ai} - \omega_{bj})^\top x\big) + \sin\big((\omega_{ai} + \omega_{bj})^\top x\big)\big].
\end{aligned} \tag{25}
$$

Without loss of generality, we consider the integral domain X = [−d, d]^D. Then we can compute the integrals of the above expressions analytically. Since we only consider D = 1 and D = 2 point process data in this paper, we specifically provide the analytical integral expressions for the one-dimensional and two-dimensional cases.

One dimension:

$$\int_{[-d,d]} \cos(\eta x)\,\mathrm{d}x = \begin{cases} \frac{2}{\eta}\sin(\eta d) & \text{if } \eta \neq 0, \\ 2d & \text{if } \eta = 0, \end{cases} \tag{26}$$

$$\int_{[-d,d]} \sin(\eta x)\,\mathrm{d}x = 0. \tag{27}$$

Two dimensions:

$$\int_{[-d,d]^2} \cos(\eta^\top x)\,\mathrm{d}x = \begin{cases} \frac{2}{\eta_1 \eta_2}\big[\cos\big((\eta_1 - \eta_2)d\big) - \cos\big((\eta_1 + \eta_2)d\big)\big] & \text{if } \eta_1, \eta_2 \neq 0, \\ 4d^2 & \text{if } \eta_1 = \eta_2 = 0, \end{cases} \tag{28}$$

$$\int_{[-d,d]^2} \sin(\eta^\top x)\,\mathrm{d}x = 0, \tag{29}$$

where η_1 denotes the first coordinate of η and η_2 the second. It should be noted that, theoretically, the integral in Eq. (28) has only the two possible results given on the right-hand side. The analytical solution of the integral of Eq. (25) can be obtained by replacing η with ω_{ai} − ω_{bj} or ω_{ai} + ω_{bj}. Note that the case a = b and i = j corresponds to the second case in Eq. (26) and Eq. (28), which means the computation of the diagonal entries of M differs from that elsewhere. Next, we can compute m:

$$m_i = \begin{cases} \frac{\sigma}{2\sqrt{R}}\int_{[-d,d]^D} \big[\cos(\omega_{1i}^\top x) + \cos(\omega_{2i}^\top x)\big]\,\mathrm{d}x & \text{if } i = 1, \ldots, R, \\ \frac{\sigma}{2\sqrt{R}}\int_{[-d,d]^D} \big[\sin(\omega_{1i}^\top x) + \sin(\omega_{2i}^\top x)\big]\,\mathrm{d}x & \text{if } i = R+1, \ldots, 2R. \end{cases} \tag{30}$$

It is easy to see that if D = 1, m can be analytically computed using Eq. (26) and Eq. (27), while if D = 2, m can be analytically computed using Eq. (28) and Eq. (29). It is worth noting that the analytical expression of the intensity integral above applies only to NSSPP (single-layer network). For DNSSPP (multi-layer network), the nested structure of trigonometric functions means that M and m lack analytical solutions, and we need to resort to numerical integration.
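The closed forms in Eqs. (26) and (28) are easy to sanity-check numerically. The snippet below (not part of the paper) compares them against quadrature for arbitrarily chosen frequencies, assuming SciPy is available.

```python
import numpy as np
from scipy import integrate

d = 2.0

def cos_int_1d(eta):
    """Eq. (26): integral of cos(eta * x) over [-d, d]."""
    return 2 * d if eta == 0 else 2 * np.sin(eta * d) / eta

def cos_int_2d(eta1, eta2):
    """Eq. (28): integral of cos(eta1*x1 + eta2*x2) over [-d, d]^2."""
    if eta1 == 0 and eta2 == 0:
        return 4 * d ** 2
    return 2 / (eta1 * eta2) * (np.cos((eta1 - eta2) * d) - np.cos((eta1 + eta2) * d))

eta = 0.7
num_1d, _ = integrate.quad(lambda x: np.cos(eta * x), -d, d)
print(cos_int_1d(eta), num_1d)                          # the two values should agree

e1, e2 = 0.7, 1.3
num_2d, _ = integrate.dblquad(lambda y, x: np.cos(e1 * x + e2 * y), -d, d, -d, d)
print(cos_int_2d(e1, e2), num_2d)                       # the two values should agree
```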
D Derivation of Laplace Approximation

The gradient and the Hessian matrix of the posterior of β are:

$$\nabla_\beta \log p(\{x_i\}_{i=1}^N, \beta \mid \Theta) = -(2M + I)\beta - 2\alpha m + \sum_{i=1}^N \frac{2\,\Psi^{(R)}(x_i)}{\beta^\top \Psi^{(R)}(x_i) + \alpha}, \tag{31}$$

$$\nabla^2_\beta \log p(\{x_i\}_{i=1}^N, \beta \mid \Theta) = -(2M + I) - \sum_{i=1}^N \frac{2\,\Psi^{(R)}(x_i)\,\Psi^{(R)}(x_i)^\top}{\big(\beta^\top \Psi^{(R)}(x_i) + \alpha\big)^2}. \tag{32}$$

In theory, setting Eq. (31) to 0 solves for the mode of the posterior, denoted β̂. However, in practice, we cannot solve it analytically, so we use numerical optimization to obtain β̂. Then, according to Eq. (32), we can obtain the precision matrix Q^{-1} = −∇²_β log p({x_i}_{i=1}^N, β | Θ)|_{β=β̂}.

E Pseudocode

The pseudocode of our inference algorithm is provided in Algorithm 1.

Algorithm 1: Inference for (D)NSSPP
Define the network depth L and width R, and initialize the hyperparameters Θ;
for each iteration do
    Calculate M and m by Eq. (13);
    Update the mode β̂ by maximizing Eq. (14);
    Update the covariance matrix Q by Eq. (15);
    Update the hyperparameters Θ by maximizing Eq. (16);
end
For any x*, output the predicted mean and variance of λ(x*) by Eq. (17).

F Computation of Expected Test Log-likelihood

The computation of the expected test log-likelihood is provided in Appendix E.2 of [29]. For the completeness of the article, we restate the specific derivation here. Considering the posterior of the intensity function, we have the approximation of the expected test log-likelihood:

$$\mathbb{E}_\beta\big[\log p(\{x^*_i\}_{i=1}^{N^*} \mid \{x_i\}_{i=1}^N)\big] \approx -\,\mathbb{E}_\beta\Big[\int_{\mathcal{X}} \big(\beta^\top \Psi^{(R)}(x) + \alpha\big)^2\,\mathrm{d}x\Big] + \sum_{i=1}^{N^*} \mathbb{E}_\beta\Big[\log\big(\beta^\top \Psi^{(R)}(x^*_i) + \alpha\big)^2\Big], \tag{33}$$

derived from Eq. (1). First, we compute the first term in Eq. (33):

$$
\begin{aligned}
\mathbb{E}_\beta\Big[\int_{\mathcal{X}} \big(\beta^\top \Psi^{(R)}(x) + \alpha\big)^2 \mathrm{d}x\Big] &= \int_{\mathcal{X}} \mathbb{E}_\beta\big[\beta^\top \Psi^{(R)}(x) + \alpha\big]^2 \mathrm{d}x + \int_{\mathcal{X}} \mathrm{Var}\big[\beta^\top \Psi^{(R)}(x)\big]\,\mathrm{d}x \\
&= \int_{\mathcal{X}} \Psi^{(R)}(x)^\top \hat{\beta}\hat{\beta}^\top \Psi^{(R)}(x)\,\mathrm{d}x + 2\alpha \int_{\mathcal{X}} \hat{\beta}^\top \Psi^{(R)}(x)\,\mathrm{d}x + \alpha^2|\mathcal{X}| + \int_{\mathcal{X}} \Psi^{(R)}(x)^\top Q\,\Psi^{(R)}(x)\,\mathrm{d}x \\
&= \hat{\beta}^\top M \hat{\beta} + \mathrm{tr}(QM) + 2\alpha\,\hat{\beta}^\top m + \alpha^2|\mathcal{X}|,
\end{aligned} \tag{34}
$$

where M and m are defined as in Eq. (13). Then, we consider the second term in Eq. (33). Applying the computational method used by [18], we obtain the following expression:

$$\sum_{i=1}^{N^*} \mathbb{E}_\beta\Big[\log\big(\beta^\top \Psi^{(R)}(x^*_i) + \alpha\big)^2\Big] \tag{35}$$

$$= \sum_{i=1}^{N^*}\Big[-\tilde{G}\Big(-\frac{\big(\hat{\beta}^\top \Psi^{(R)}(x^*_i) + \alpha\big)^2}{2\,\Psi^{(R)}(x^*_i)^\top Q\,\Psi^{(R)}(x^*_i)}\Big) + \log\Big(\tfrac{1}{2}\,\Psi^{(R)}(x^*_i)^\top Q\,\Psi^{(R)}(x^*_i)\Big) - \mathrm{Const}\Big], \tag{36}$$

where Const ≈ 0.57721566 is the Euler-Mascheroni constant, and G̃ is defined through a series whose coefficients involve the rising Pochhammer symbols (2)_j and (1/2)_j, with (a)_0 = 1 and (a)_k = a(a+1)(a+2)⋯(a+k−1). In practical computation, we first prepare a lookup table for the G̃ function and then obtain numerical approximations through linear interpolation.
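As a rough illustration of Eqs. (33)-(36), the sketch below evaluates the expected test log-likelihood using the analytic first term from Eq. (34) and a plain Monte Carlo estimate of the second term over the Laplace posterior N(β̂, Q), in place of the G̃ lookup table used in the paper; all inputs (the feature map `Psi`, M, m, α, and the window size) are assumed to come from a fitted model.

```python
import numpy as np

def expected_test_loglik(Psi, test_events, beta_hat, Q, M, m, alpha, window, n_mc=500, seed=0):
    """Monte Carlo version of Eq. (33): analytic integral term, sampled data term."""
    rng = np.random.default_rng(seed)
    # First term, Eq. (34): E[ integral of (beta^T Psi(x) + alpha)^2 dx ].
    integral = (beta_hat @ M @ beta_hat + np.trace(Q @ M)
                + 2 * alpha * beta_hat @ m + alpha ** 2 * window)
    # Second term: E_beta[ sum_i log (beta^T Psi(x_i*) + alpha)^2 ], estimated by
    # sampling beta from the Laplace posterior instead of using the G-tilde lookup.
    Phi = Psi(test_events)                                     # (N*, R)
    betas = rng.multivariate_normal(beta_hat, Q, size=n_mc)    # posterior samples
    vals = np.log((betas @ Phi.T + alpha) ** 2)                # (n_mc, N*)
    return -integral + vals.sum(axis=1).mean()
```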
G Additional Experimental Results

In this section, we provide additional experimental results for the real data. Specifically, we present the fitting results of the intensity functions from all models on the Coal dataset in Fig. 3; the fitting results of the intensity functions from VBPP, SSPP, GSSPP, and NSMPP on the Redwoods dataset in Fig. 4; and the fitting results of the intensity functions from VBPP, SSPP, GSSPP, and NSMPP on the Taxi dataset in Fig. 5.

Figure 3: The fitting results of the intensity functions from all models on the Coal dataset.

Figure 4: The fitting results of the intensity functions from VBPP, SSPP, GSSPP, and NSMPP on the Redwoods dataset.

Figure 5: The fitting results of the intensity functions from VBPP, SSPP, GSSPP, and NSMPP on the Taxi dataset.

H Some Discussion about Experimental Setup for Real Data

For the real data, we chose different numbers of ranks for the 1-dimensional dataset (10 for Coal) and the 2-dimensional datasets (50 for Redwoods and Taxi) for all baseline methods, except for DNSSPP. The reason is that the Coal dataset is very small (only 191 events) and has a simple data pattern. A higher number of ranks would not further improve performance but would significantly increase the algorithm's running time. Therefore, we set the number of ranks to 10 for Coal, but to 50 for Redwoods and Taxi, because the latter two datasets are larger and more complex, requiring a higher number of ranks to improve performance. To demonstrate this, we further increased the number of ranks for all baseline models to 50 on the Coal dataset and compared the result to rank = 10. The result is shown in Table 3. As can be seen from the results, when we increased the number of ranks from 10 to 50, the performance of all baseline models did not improve significantly, but the algorithm's running time increased substantially. In summary, for the same dataset, the number of ranks for all baseline methods is kept consistent, except for DNSSPP because it is not a single-layer architecture. The choice of the number of ranks for the baseline methods was made considering the trade-off between performance and running time: further increasing the number of ranks would not significantly improve model performance but would result in a substantial increase in computation time.

Table 3: The performance of Ltest and runtime for all baselines on the Coal dataset, with the number of ranks set to 10 and 50. When we increase the number of ranks from 10 to 50, the performance of all baseline models does not improve significantly, but the algorithm's running time increases substantially.

| Model | Ltest (rank = 10) | Runtime(s) | Ltest (rank = 50) | Runtime(s) |
|---|---|---|---|---|
| NSMPP | 223.28 (±3.60) | 2.88 | 222.92 (±3.63) | 4.74 |
| SSPP | 221.42 (±1.87) | 1.72 | 221.77 (±2.92) | 3.16 |
| GSSPP | 221.08 (±6.32) | 5.05 | 221.67 (±4.28) | 19.08 |
| GSSPP-M12 | 223.11 (±5.02) | 5.61 | 219.33 (±4.25) | 18.88 |
| GSSPP-M52 | 221.89 (±3.06) | 5.17 | 217.80 (±5.25) | 16.94 |
| LBPP | 218.30 (±4.12) | 0.33 | 219.88 (±2.68) | 2.68 |
| VBPP | 219.15 (±4.54) | 1.69 | 219.25 (±4.51) | 6.26 |
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitation of the work is discussed in the Limitations section. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: For all theoretical results, the paper provides the corresponding proofs in the main text or appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. 
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The paper fully discloses all the information needed to reproduce the main experimental results in the paper. See the Experiments section. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the data and code in the supplemental material to reproduce the main experimental results. Guidelines: The answer NA means that paper does not include experiments requiring code. 
• Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
• While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments is reproducible, they should state which ones are omitted from the script and why.
• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We specify all the training and test details necessary to understand the results. See the experimental setup in the Experiments section.

Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We report the statistical significance of the experiments. In the tables presenting the experimental results, we provide the standard deviations across multiple runs, while in some figures, we also include confidence intervals.

Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
• The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error of the mean.
• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
• If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We provide sufficient details on the type of GPU, memory, and time of execution needed to reproduce the experiments. See the experimental setup in the Experiments section.

Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research conducted in the paper conforms with the NeurIPS Code of Ethics.

Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [No]

Justification: This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: All code, models, and datasets mentioned in the text are appropriately cited with their original papers.

Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: New assets introduced in the paper, such as code, are well documented. The documentation is provided alongside the assets in the supplementary material.

Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.